UC-NGI-CZ-VII

From EGI Knowledge Base


Use Case title: Grid services development, tuning and deployment

Short description: In production deployment, Grid services are exposed to a scale of load and to non-deterministic behaviour of the infrastructure (delays, failures, and outages) that are virtually impossible to reproduce in any testing environment. The standard software release process (prepare a release candidate, test it independently, fix the problems found in testing, release, then gather and react to users' bug reports) is therefore rather inefficient. The most critical problems, e.g. race conditions and congestion, are not hit in testing because it is unaffordable to run independent tests at a scale comparable to the target production. On the other hand, because of the infrastructure's complexity and non-determinism, it is extremely difficult for users to provide problem reports that are reproducible, or at least precise enough to pinpoint the true problem.

A radical change in the approach to software release is required to address these conditions. Representatives of service developers, testers, infrastructure operators, and end users must form task forces, each focused on a specific software release. One or a few instances of the release candidate services are injected into the production environment, and the end-user representatives generate their load in a way that is controlled and reproducible while still resembling real production in both usage pattern and scale. The experimental services are continuously monitored by the infrastructure operators in close cooperation with the developers. Behavioural anomalies and malfunctions are then detected almost immediately, can be reproduced for further analysis with high probability and with less effort than in real production, and are fixed on the spot. Once the service operation has settled, the current codebase and configuration snapshot form the desired release, which is then distributed in the conventional way.
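One way to make the generated load both controlled and reproducible is to derive the submission times from a seeded random generator, so that the same run can be replayed when an anomaly needs further analysis. The following is only a minimal sketch: the function name `submission_schedule`, its parameters, and the choice of exponentially distributed gaps between submissions are assumptions made for illustration, not part of the exercise described here.

```python
import random


def submission_schedule(jobs_per_day, days, seed):
    """Return reproducible job submission times, in seconds from the start.

    Gaps between submissions are drawn from an exponential distribution.
    This is an assumed stand-in for a production-like usage pattern, not a
    model taken from the use case itself.
    """
    rng = random.Random(seed)        # fixed seed makes the run repeatable
    rate = jobs_per_day / 86400.0    # mean submissions per second
    horizon = days * 86400.0         # total length of the run in seconds
    times = []
    t = 0.0
    while True:
        t += rng.expovariate(rate)
        if t >= horizon:
            break
        times.append(t)
    return times
```

Calling the function twice with the same seed yields the same schedule, so a run that exposed a problem can be repeated with identical timing.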

This use case is supported by the experience with the experimental services operation of the new gLite Workload Management System (WMS) on the EGEE infrastructure in 2007. During this exercise, which lasted several months, the WMS services were successfully tuned to considerably outperform the acceptance criteria specified by HEP applications (10k jobs/day sustained submission over 5 days, without manual intervention, and with less than 0.5% failure rate). The resulting code and configuration form the first gLite 3.1 WMS release (aka patch #1251). The following CHEP'07 presentations provide further details (to appear in J. Phys.: Conference Series):

  • S. Campana et al.: Experience with the gLite Workload Management System in ATLAS Monte Carlo Production on LCG
  • M. Checchi et al.: The gLite Workload Management System
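
The numeric part of the acceptance criteria quoted above (10k jobs/day sustained over 5 days, less than 0.5% failure rate) can be written as a simple check on per-day counters. This is a sketch under stated assumptions: the function name `meets_acceptance` and its inputs are hypothetical, "sustained" is read as every single day reaching the target, and the failure rate is computed over the whole period. The "without manual intervention" criterion cannot be derived from job counts and is not covered.

```python
def meets_acceptance(daily_submitted, daily_failed,
                     min_jobs_per_day=10_000, min_days=5,
                     max_failure_rate=0.005):
    """Check per-day job counters against the quoted acceptance criteria.

    daily_submitted and daily_failed are lists with one entry per day.
    Both the per-day reading of "sustained" and the whole-period failure
    rate are assumptions of this sketch.
    """
    if len(daily_submitted) != len(daily_failed):
        return False
    if len(daily_submitted) < min_days:
        return False                 # the run was too short
    if any(n < min_jobs_per_day for n in daily_submitted):
        return False                 # submission rate was not sustained
    total = sum(daily_submitted)
    if total == 0:
        return False                 # nothing submitted, nothing to judge
    failure_rate = sum(daily_failed) / total
    return failure_rate < max_failure_rate   # strictly less than 0.5%
```

For example, five days of 10,000 jobs each with 40 failures per day give 200 failures in 50,000 jobs, a rate of 0.4%, which passes; 50 failures per day give exactly 0.5%, which does not.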


Actors involved: local research community, VO operator, NGI operator
