UC-UPV-metagenomics
From EGI Knowledge Base
Use Case title: Metagenomics Analysis on the Grid
Short description: In recent years, sequencing the DNA of different species has become commonplace. However, many organisms remain understudied because they cannot be grown in isolation in sufficient quantity to sequence their genome. The alternative, analysing a sample that mixes different specimens, makes the problem intractable.
Actors involved:
- Users: Biologists who own the data samples and define the targets.
- Application Developers: Integrators of the biocomputing tools in the Grid Environment.
- Operators: In charge of splitting the databases and jobs, submitting and monitoring the jobs, and retrieving the results.
Related Requirements: The objective is to reach 30,000 sequences per day. We will use EGEE and EELA resources. This experiment will require about 3 CPU years. We will run about 10,000 jobs of less than 4 hours each. Each computing resource will temporarily need 500 MB of local storage. The space required on the Storage Elements (SEs) will be on the order of a few GB in total. There are no special network or security requirements.
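As a rough consistency check on the figures above, the following sketch computes the average job length implied by the CPU budget. It assumes a 365-day year and that the work is spread evenly across the jobs, neither of which is stated in the requirements; the function names are illustrative only.

```python
HOURS_PER_CPU_YEAR = 365 * 24  # assumption: one CPU year = 365 days of one CPU


def average_job_hours(cpu_years: float, n_jobs: int) -> float:
    """Average CPU hours per job if the total work is spread evenly."""
    return cpu_years * HOURS_PER_CPU_YEAR / n_jobs


def worst_case_cpu_years(n_jobs: int, max_hours_per_job: float) -> float:
    """Upper bound on CPU years if every job ran for the full time limit."""
    return n_jobs * max_hours_per_job / HOURS_PER_CPU_YEAR


if __name__ == "__main__":
    # Figures taken from the requirements: 3 CPU years, 10,000 jobs, 4 h limit.
    print(f"{average_job_hours(3, 10_000):.2f} h per job on average")
    print(f"{worst_case_cpu_years(10_000, 4):.2f} CPU years at most")
```

Under these assumptions the average job takes about 2.6 hours, which is consistent with the stated limit of less than 4 hours per job.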
Pre-Conditions: In some cases, bioinformatics tools must be preinstalled (such as MrBayes, ClustalW or MPIBlast). Other tools can be submitted without prior installation (e.g. BLAST).
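A tool that is not preinstalled can travel with the job in its input sandbox. The following is a minimal sketch of generating a gLite JDL description for one such job. The wrapper script `run_blast.sh`, the shipped binary name `blastall`, the chunk file names and the helper `build_jdl` are hypothetical; only the standard JDL attributes Executable, Arguments, StdOutput, StdError, InputSandbox and OutputSandbox are used.

```python
from typing import Sequence


def _jdl_list(items: Sequence[str]) -> str:
    """Render a Python sequence as a JDL list of quoted strings."""
    return "{" + ", ".join(f'"{item}"' for item in items) + "}"


def build_jdl(executable: str, arguments: str,
              input_files: Sequence[str], output_files: Sequence[str]) -> str:
    """Build a JDL description that ships the executable with the job.

    The executable and all input files go into the input sandbox, so nothing
    has to be installed on the worker node beforehand.
    """
    sandbox_out = ["std.out", "std.err", *output_files]
    lines = [
        f'Executable = "{executable}";',
        f'Arguments = "{arguments}";',
        'StdOutput = "std.out";',
        'StdError = "std.err";',
        f"InputSandbox = {_jdl_list([executable, *input_files])};",
        f"OutputSandbox = {_jdl_list(sandbox_out)};",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical job: a wrapper script runs a shipped BLAST binary on one chunk.
    print(build_jdl("run_blast.sh", "chunk_0001.fasta",
                    ["blastall", "chunk_0001.fasta"], ["chunk_0001.out"]))
```

Tools that must be preinstalled (such as the MPI-based ones) would instead be referenced by their installed path, which depends on the site and is not sketched here.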
Steps:
- Retrieve the databases and discard entries that will not produce any hit.
- Split the database and the jobs according to the availability of the resources, the size of the sequences and the tools available.
- Submit and monitor the multiple-alignment jobs.
- Retrieve the results and post-process them.
- Split the results and jobs for the phylogenetic stage.
- Submit and monitor the phylogenetic jobs.
- Retrieve the results and decide whether a new iteration (from the third step, the submission of the multiple-alignment jobs, onwards) is required.
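The splitting step can be sketched as follows, assuming the database is in FASTA format and that chunks are bounded only by a total residue count. The use case also takes resource availability and the available tools into account, which this sketch does not model. The names `parse_fasta`, `split_records` and `max_residues` are illustrative, not part of the original workflow.

```python
from typing import Iterable, Iterator, List, Tuple

Record = Tuple[str, str]  # (header without the leading '>', sequence)


def parse_fasta(text: str) -> Iterator[Record]:
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, parts = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(parts)
            header, parts = line[1:], []
        elif header is not None:
            parts.append(line)
    if header is not None:
        yield header, "".join(parts)


def split_records(records: Iterable[Record], max_residues: int) -> List[List[Record]]:
    """Group records into chunks whose total sequence length stays within budget.

    A single record longer than the budget is placed in a chunk of its own.
    """
    chunks: List[List[Record]] = []
    current: List[Record] = []
    size = 0
    for header, seq in records:
        if current and size + len(seq) > max_residues:
            chunks.append(current)
            current, size = [], 0
        current.append((header, seq))
        size += len(seq)
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    sample = ">a\nACGT\n>b\nAC\nGT\n>c\nACGTACGT\n"
    for i, chunk in enumerate(split_records(parse_fasta(sample), max_residues=8)):
        print(i, [header for header, _ in chunk])
```

Each chunk would then become the input of one job, so the residue budget is the knob that keeps individual jobs under the time limit.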
Post-conditions: A group of operators update the structured report templates and ontologies, used in the indexing of the cases, as well as the groups and roles of the users.
Projects involved: Partially EELA, but mainly our own resources.
Middleware: gLite 3.0.
Application: This experiment comprises different applications and tools. The main outcomes are the identification of new genes for the microorganisms in the sample, the characterisation of families and the construction of phylogenetic trees. This will help to identify new treatments and actions on these microorganisms and their effects. For example, changes in the bacteria living in the human digestive tract are a major symptom of disease. Better knowledge of the genes of these bacteria will help to preserve a sound balance of the microorganisms and thus human health.
