Operation of tools and services
From EGI Knowledge Base
- O-E-1 and O-N-1
- Operation of the Grid configuration repositories (EGI.org and NGIs) - mandatory
Many aspects of operations rely on the availability of information from NGIs, as applicable, about service nodes, contact details, security contacts, certification status, sites in scheduled downtime, etc. The Grid repository provides all such information. Information input is devolved to regions and sites. The current central repository (known as GOCDB in EGEE) may need to be adapted to support a two-tier distributed model. This requires the definition and implementation of an exchange protocol between peer NGI repositories, or an alternative implementation technique.
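As an illustration only, the two-tier model described above can be sketched as regional repositories that each hold their own site records and a central view built by merging them. This is a minimal sketch and not the GOCDB schema or any agreed exchange protocol: the `SiteRecord` fields, the `merge_regional` function, and all names and values are hypothetical, chosen to mirror the kinds of information listed in the paragraph.

```python
from dataclasses import dataclass, field


@dataclass
class SiteRecord:
    """Hypothetical site entry; the fields mirror the information listed above."""
    name: str
    ngi: str
    certification_status: str
    security_contact: str
    service_nodes: list = field(default_factory=list)
    in_scheduled_downtime: bool = False


def merge_regional(regional):
    """Build a central view from per-NGI repositories.

    `regional` maps an NGI name to the list of SiteRecord objects it maintains.
    Data entry stays with the regions; the central view only aggregates.
    """
    central = {}
    for ngi, sites in regional.items():
        for site in sites:
            if site.name in central:
                # Two regions claiming the same site is treated as an error here.
                raise ValueError(f"duplicate site name: {site.name}")
            central[site.name] = site
    return central


# Illustrative data only.
regional_repos = {
    "NGI_A": [SiteRecord("SITE-A1", "NGI_A", "certified", "sec@a1.example",
                         ["ce.a1.example"])],
    "NGI_B": [SiteRecord("SITE-B1", "NGI_B", "uncertified", "sec@b1.example",
                         in_scheduled_downtime=True)],
}
central_view = merge_regional(regional_repos)
sites_in_downtime = sorted(
    name for name, site in central_view.items() if site.in_scheduled_downtime
)
```

Rejecting duplicate site names is one possible design choice; a real exchange protocol would also need to address versioning and conflict resolution, which this sketch leaves out.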
- O-E-2 and O-N-2
- Operation of accounting repositories for global VOs - mandatory
The accounting repository is responsible for keeping records about the usage of compute, storage, networking and other types of resources as required. It is the responsibility of an NGI to collect accounting data and to keep a permanent master copy of usage records. Global VOs need accounting information so that VO managers know the amount of IT resources consumed by the VO across different domains of the e-Infrastructure. For this reason, the deployment of standard interfaces between accounting systems in different NGIs is important to ensure the interoperable exchange of records between different domains. EGI.org is responsible for gathering accounting information for each NGI and making it publicly available (as applicable and according to local laws).
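To make the aggregation concrete, the following minimal sketch sums usage records collected by several NGIs into per-VO totals, which is the view a VO manager would need. It is not the format of any actual accounting system: the `UsageRecord` fields, the `aggregate_by_vo` function, and the sample values are assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class UsageRecord:
    """Hypothetical usage record as an NGI might publish it."""
    ngi: str
    vo: str
    resource_type: str  # e.g. "compute" or "storage"
    amount: float       # units depend on the resource type


def aggregate_by_vo(records):
    """Sum usage per VO and resource type across all NGIs."""
    totals = defaultdict(lambda: defaultdict(float))
    for record in records:
        totals[record.vo][record.resource_type] += record.amount
    # Convert to plain dicts so callers get an ordinary nested mapping.
    return {vo: dict(by_type) for vo, by_type in totals.items()}


# Illustrative records from two NGIs.
records = [
    UsageRecord("NGI_A", "vo.example", "compute", 120.0),
    UsageRecord("NGI_B", "vo.example", "compute", 80.0),
    UsageRecord("NGI_B", "vo.example", "storage", 5.0),
    UsageRecord("NGI_A", "vo.other", "compute", 10.0),
]
summary = aggregate_by_vo(records)
```

The master copy of each record stays with the originating NGI; the aggregation only reads the records, which is why the record type is declared immutable here.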
- O-E-3 and O-N-3
- Operation of the Grid repositories for SLA compliance and performance monitoring - mandatory
Availability and performance of Grid services and sites are important information for checking the health of the infrastructure and for verifying the Quality of Service delivered to VOs and other NGIs. As SLAs can be established between VOs and sites, VOs and NGIs, and NGIs and global VOs, tools need to be available to monitor the level of SLA conformance. This requires the maintenance of the available tools and of the schema for central publishing of site and service status information. EGI will help VOs, NGIs and resource centres to define their SLAs according to a common format recognized by EGI.org and NGIs. VOs, NGIs and resource centres are free to choose the most suitable SLAs. Performance information allows monitoring of the Quality of Service delivered to global VOs by NGIs and their related resource centres. Performance monitoring is also important for network quality assurance, reporting and metrics follow-up, to ensure that the underlying network infrastructure is working properly, that it is efficiently used by the project, and that network providers are respecting their contractual obligations when SLAs are in place. The EGI.org tasks are the publication of SLA-compliance statistics, the maintenance of tools and schema for central publishing of site and service status information, the preparation of reports on the performance of NGIs, and the maintenance of monitoring tools able to generate alarms in case of SLA violations and of a central dashboard tool.
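The SLA conformance check described above can be sketched as a comparison between measured availability and an agreed target, with an alarm flag raised on violation. This is a simplified sketch: the `Sla` type, the probe-based definition of availability, the function names, and the 97% target are all assumptions for illustration, not an EGI-defined metric or SLA format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sla:
    """Hypothetical SLA: a site and its minimum availability (fraction 0..1)."""
    site: str
    min_availability: float


def availability(probe_results):
    """Fraction of successful probes; each result is True (up) or False (down)."""
    if not probe_results:
        raise ValueError("no probe results to evaluate")
    return sum(1 for ok in probe_results if ok) / len(probe_results)


def check_sla(sla, probe_results):
    """Return a small compliance report, flagging a violation if below target."""
    measured = availability(probe_results)
    return {
        "site": sla.site,
        "availability": measured,
        "target": sla.min_availability,
        "violated": measured < sla.min_availability,
    }


# Illustrative data: 19 successful probes out of 20 against a 97% target.
sla = Sla(site="SITE-A1", min_availability=0.97)
probes = [True] * 19 + [False]
report = check_sla(sla, probes)
```

A dashboard or alarm tool could consume reports of this shape; how probes are scheduled and how scheduled downtime is excluded from the calculation are left out of the sketch.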
- O-E-4 and O-N-4
- Operation of the Grid Operations Portals - mandatory
The Grid operations portals provide an entry point for various actors to support their operational needs. Different "views" are necessary according to the role of the customer (Grid operators, VOs, Grid site managers, Region Operations Managers, etc.). The information on display is retrieved from several distributed sources (databases, Grid information systems, etc.). The portals provide static information about sites and VOs, and dynamic information about the status and allocation of resources and services. The central Operations portal is the aggregation point for regional information, which is also accessible via the regional operations portals.
- O-E-5 and O-N-5
- Grid operation and oversight of the e-Infrastructure - mandatory
Oversight activities over the NGI infrastructures are needed for detecting problems, coordinating the diagnosis, and monitoring the problems during the entire lifecycle until resolution. Oversight of the NGI Grid is based on monitoring the status of services operated by sites, opening tickets and following them up until problems are resolved, and first-line support for operations problems. This task includes all the work related to operation support, including managing and responding to problems reported by the grid operator, and running the required grid services at each site as well as services provided by the NGI and services required by virtual organizations, such as file catalogues and other VO-specific services. This is currently done in EGEE in cooperation with the relevant Regional Operations Centres (via rotating shifts) according to a two-level hierarchical model [12]. We foresee the possibility of evolving this model so that NGIs can autonomously run oversight activities in the region, or federate in order to share effort. Regardless of this distributed model, during the transition we foresee the need to perform quality checks of the services provided by NGIs and to take care of operational problems that cannot be successfully distributed to NGIs. EGI.org supports and actively oversees the overall status of Grid services and sites, opens tickets to request problem fixes, and tackles residual problems not successfully distributed to NGIs.
