UC-IEG-I
From EGI Knowledge Base
Use Case title: Employment of remote farms for the ATLAS online monitoring, calibration and filtering system
Short description: The application spans a number of computers: some are located at CERN, close to the ATLAS TDAQ system, while the others are spread across Europe (plus Canada and the USA) at institutions collaborating in the experiment. This application has been included as a test of Network Quality of Service.
The part of the application located at CERN collects data (events) from the ATLAS Data Acquisition system (TDAQ). The data are then copied or moved to remote locations for processing. For monitoring and calibration, the data are discarded after processing (or saved locally in some pre-processed form). For filtering, the results of processing are sent back to CERN; in this case a copy of the data is kept at CERN, and the results are appended to the data and then sent to permanent storage at CERN.
The user of the application is an ATLAS experiment operator. For every ATLAS run, the operator defines the number of CPUs needed for analysis, prepares the configuration of the machines before the run starts, and then monitors whether events are processed without hang-ups. Some intervention from the operator should also be possible during the run as the accelerator conditions change (for example, to enable and engage more remote resources).
The general working scheme of the application is depicted in Figure 1. It shows the successive steps: data acquisition at the ATLAS detector, data distribution among the remote Processing Farms (PF), and data collection and storage in the local mass storage at CERN.
Actors involved:
Technical Requirements
If the application becomes part of the ATLAS filter system, each node will process events of 12 Mbit each. Processing one event is assumed to take about 1 second. Therefore, to keep it busy, every CPU should receive 12 Mb/s. The throughput required to a site is thus the number of CPUs at the site times 12 Mb/s.
The allocation of CPUs from int.eu.grid to this application has to be analyzed from the perspective of the bandwidth of the particular site to the NRE. For example, a site with a connectivity of 1 Gb/s has, assuming a bandwidth efficiency of 70%, about 700 Mb/s usable; at 12 Mb/s per CPU this supports about 58 CPUs, so the site can allocate no more than roughly 60 CPUs.
According to the figures given in the Technical Annex, as updated by all the partners, the maximum number of CPUs per site for this application would be approximately:
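The bandwidth constraint can be sketched in a few lines of Python. This is only an illustration of the arithmetic in the text: the function names and the rounding down to whole CPUs are assumptions, while the 12 Mbit event size, the 1 second processing time and the 70% efficiency come from the description above.

```python
import math

EVENT_SIZE_MBIT = 12.0   # size of one event, from the text
EVENT_TIME_S = 1.0       # assumed processing time per event, from the text


def per_cpu_rate_mbps(event_size_mbit=EVENT_SIZE_MBIT, event_time_s=EVENT_TIME_S):
    """Input rate (Mb/s) one CPU needs in order to stay busy."""
    return event_size_mbit / event_time_s


def site_throughput_mbps(n_cpus):
    """Throughput (Mb/s) a site needs for n_cpus busy CPUs."""
    return n_cpus * per_cpu_rate_mbps()


def max_cpus(link_mbps, efficiency=0.7):
    """Largest whole number of CPUs a site link can feed.

    link_mbps is the nominal connectivity of the site; efficiency is the
    assumed usable fraction of that bandwidth (70% in the text).
    """
    usable_mbps = link_mbps * efficiency
    return math.floor(usable_mbps / per_cpu_rate_mbps())


if __name__ == "__main__":
    cpus = max_cpus(1000)  # 1 Gb/s site
    print(cpus, "CPUs need", site_throughput_mbps(cpus), "Mb/s")
```

For a 1 Gb/s site this gives 58 CPUs, consistent with the rough limit of 60 quoted above.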
Interactive Grid added value
The idea of porting this application to the Grid is motivated by the possibility of gathering a large amount of resources at a given time. The particular interest in using the int.eu.grid resources arises because the application requires reaction times of the order of seconds, rather than the minutes typical of batch job submission.
The ATLAS operator is also required to interact with the application when single nodes malfunction. In this sense the application is fault tolerant, but failures have to be detected, and the operator should be allowed to add more resources when the analysis rate is not acceptable or the load is not balanced between the processing nodes.
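The checks an operator would rely on can be sketched as follows. This is a hypothetical illustration, not part of the application: the per-node event rates, the thresholds (`min_rate`, `tolerance`) and all function names are assumptions chosen only to show how stalled nodes, unbalanced load and an insufficient overall rate could be flagged.

```python
from statistics import mean


def stalled_nodes(rates, min_rate=0.1):
    """Nodes whose event rate (events/s) is below min_rate (assumed threshold)."""
    return sorted(node for node, rate in rates.items() if rate < min_rate)


def unbalanced_nodes(rates, tolerance=0.5):
    """Working nodes whose rate deviates from the mean by more than
    tolerance * mean (assumed definition of 'not balanced')."""
    active = {node: rate for node, rate in rates.items() if rate > 0}
    if not active:
        return []
    avg = mean(active.values())
    return sorted(node for node, rate in active.items()
                  if abs(rate - avg) > tolerance * avg)


def needs_more_resources(rates, target_rate):
    """True when the total analysis rate falls short of the target."""
    return sum(rates.values()) < target_rate


if __name__ == "__main__":
    # Hypothetical snapshot of per-node rates in events per second.
    rates = {"node-a": 1.0, "node-b": 0.95, "node-c": 0.0, "node-d": 0.3}
    print("stalled:", stalled_nodes(rates))
    print("unbalanced:", unbalanced_nodes(rates))
    print("add resources:", needs_more_resources(rates, target_rate=3.0))
```

In this snapshot node-c would be reported as stalled and node-d as out of balance, and the total rate of 2.25 events/s would fall short of a 3.0 events/s target.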
General Working Scheme in INT.EU.GRID
The key words of this application are monitoring and interactivity. A typical use case of the application can be described in three steps (see Figure 2):
1. Before the CERN run is started: The ATLAS operator specifies the total number of requested CPUs, the minimum individual CPU speed and the minimum amount of RAM, subject to the network bandwidth constraints described above. The operator then chooses from the list of possible sites that comply with the hardware and software requirements, and starts the run.
2. Running time: The operator monitors the data collected from the sites running the pre-processing tasks. The minimal information includes the rate of events being processed by each site, the status of the site and the status of the interconnection.
3. After the run is stopped: The operator should then be able to modify the list of active sites. Some currently active sites can be excluded, and new ones can be added.
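The site selection made before a run can be sketched as below. This is a minimal sketch under stated assumptions: the `Site` record, its fields, the greedy largest-first strategy and the function names are all hypothetical, and only the 12 Mb/s per-CPU rate and the 70% bandwidth efficiency are taken from the Technical Requirements.

```python
import math
from dataclasses import dataclass


@dataclass
class Site:
    """Hypothetical description of a candidate site."""
    name: str
    cpus: int          # CPUs the site can offer
    cpu_ghz: float     # individual CPU speed
    ram_gb: float      # RAM per node
    link_mbps: float   # nominal network connectivity


def usable_cpus(site, cpu_rate_mbps=12.0, efficiency=0.7):
    """CPUs the site can contribute, capped by its network bandwidth."""
    by_bandwidth = math.floor(site.link_mbps * efficiency / cpu_rate_mbps)
    return min(site.cpus, by_bandwidth)


def select_sites(sites, requested_cpus, min_ghz, min_ram_gb):
    """Greedily pick eligible sites until requested_cpus is covered.

    Returns (allocation, missing): CPUs taken per site name, and the
    number of requested CPUs that could not be allocated.
    """
    eligible = [s for s in sites
                if s.cpu_ghz >= min_ghz and s.ram_gb >= min_ram_gb]
    eligible.sort(key=usable_cpus, reverse=True)
    allocation, remaining = {}, requested_cpus
    for site in eligible:
        if remaining <= 0:
            break
        take = min(usable_cpus(site), remaining)
        if take > 0:
            allocation[site.name] = take
            remaining -= take
    return allocation, remaining


if __name__ == "__main__":
    candidates = [
        Site("site-A", cpus=100, cpu_ghz=2.0, ram_gb=2.0, link_mbps=1000),
        Site("site-B", cpus=20, cpu_ghz=3.0, ram_gb=4.0, link_mbps=1000),
        Site("site-C", cpus=200, cpu_ghz=1.0, ram_gb=1.0, link_mbps=10000),
    ]
    print(select_sites(candidates, requested_cpus=70, min_ghz=2.0, min_ram_gb=2.0))
```

The same function could be rerun after a run is stopped with a modified candidate list, which corresponds to excluding some sites and adding new ones.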
