Like other large-scale genome projects, the ICGC will require a Data Coordination Center (DCC) that is well integrated with ICGC operations at participating centers and with the ICGC governance and scientific coordination bodies. This requires a comprehensive management system that is designed to:
Figure 2: ICGC data coordination as a franchise system
The community will obtain access to the ICGC data via one or more front ends (e.g., websites) that provide an interface to the coordination back end. In addition, all project data will be submitted to the appropriate public repositories. The exact path that the data takes from the group that generates it to the public repository will be flexible. For some data types, it would be appropriate for the ICGC participant to submit the information directly from its internal workflow system. In other cases, it might be appropriate for the information to be submitted from the franchise database or from the coordination database itself. This architecture allows certain specialized ICGC data types (microarray CEL files, raw short sequence reads, details of tumor-specific treatment regimens, histology slide images) to be submitted directly to the appropriate archive without bottlenecking through a central coordinating center or generic data model. Nevertheless, whichever path the detailed data takes to the repository, the tracking information needed to connect the sample data to that detailed information will be captured by the franchise database and made available to researchers via the coordination back end.
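The tracking idea in the preceding paragraph can be illustrated with a minimal sketch. It assumes a simple in-memory store; the names `TrackingRecord`, `FranchiseTracker` and `SubmissionPath`, and all field names, are hypothetical and do not come from any ICGC specification. The point it shows is that a sample can be resolved to its archived data regardless of which route the submission took.

```python
from dataclasses import dataclass
from enum import Enum


class SubmissionPath(Enum):
    """Route by which detailed data reached the public archive (labels assumed)."""
    PARTICIPANT_WORKFLOW = "participant_workflow"
    FRANCHISE_DB = "franchise_db"
    COORDINATION_DB = "coordination_db"


@dataclass(frozen=True)
class TrackingRecord:
    sample_id: str        # sample identifier (format is an assumption)
    data_type: str        # e.g. "microarray_cel" or "short_reads"
    repository: str       # public archive holding the detailed data
    accession: str        # identifier assigned by that archive
    path: SubmissionPath  # recorded for audit; lookups do not depend on it


class FranchiseTracker:
    """In-memory stand-in for the tracking part of a franchise database."""

    def __init__(self):
        self._by_sample = {}

    def register(self, record):
        # Every submission is recorded, whichever path it took.
        self._by_sample.setdefault(record.sample_id, []).append(record)

    def locate(self, sample_id, data_type=None):
        # Return the archive locations known for a sample, optionally
        # restricted to a single data type.
        records = self._by_sample.get(sample_id, [])
        if data_type is None:
            return list(records)
        return [r for r in records if r.data_type == data_type]
```

In a sketch like this, a front end would call `locate` with a sample identifier and follow the returned repository and accession pairs to the detailed data.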
Box 8 gives additional recommendations on data storage, analysis, distribution and protection.
- provide secure and reliable mechanisms for the sequencing centers, biorepositories, histopathology groups, and other ICGC participants to upload their data;
- track data sets as they are uploaded and processed, and perform basic integrity checks on those sets;
- allow regular audit of the project in order to provide high-level snapshots of the consortium's status;
- perform more sophisticated quality control checks of the data itself, such as checks that the expected sequencing coverage was achieved, or that when a somatic mutation is reported in a tumor, the sequence at the reported position differs in the matched normal tissue;
- enable the distribution of the data to the long-lived public repositories of genome-scale data, including sequence trace repositories, microarray repositories and the genome browsers;
- provide each public repository with the essential meta-data needed to make the data understandable;
- facilitate the integration of the data with other public resources, by using widely-accepted ontologies, file formats and data models;
- manage an ICGC data portal that provides researchers with access to the contents of all franchise databases and provides project-wide search and retrieval services;
- support hypothesis-driven research: The system should support small-scale queries that involve a single gene at a time, a short list of genes, a single specimen, or a short list of specimens. It must give researchers an interactive interface for identifying specimens of interest, finding what data sets are available for those specimens, selecting data slices across those specimens (e.g., counts of somatic mutations observed in a region within the UTR of a gene of interest), and running basic analytic tests on those data slices;
- support computational biologists: The system should allow large subsets, or even the entire ICGC dataset, to be downloaded;
- enforce ICGC policies and applicable legislation for protecting the confidentiality of tissue donors, by denying access to protected data to users who are not duly authorized.
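The two quality control checks named in the list above (expected sequencing coverage, and a reported somatic mutation differing from the matched normal tissue) can be sketched as follows. This is only an illustration under stated assumptions: "expected coverage" is interpreted here as a mean read depth over the target region, and the names `SomaticCall`, `coverage_ok`, `somatic_call_ok` and `qc_report` are hypothetical. A real pipeline would work from alignment and variant files, not in-memory lists.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SomaticCall:
    position: int     # coordinate of the reported mutation
    tumor_base: str   # base reported in the tumor
    normal_base: str  # base at the same position in the matched normal


def coverage_ok(depths, expected_mean):
    """True when the mean per-base depth reaches the expected value.

    Treating coverage as a mean depth is an assumption of this sketch.
    """
    if not depths:
        return False
    return sum(depths) / len(depths) >= expected_mean


def somatic_call_ok(call):
    """A somatic call must differ from the matched normal at that position."""
    return call.tumor_base.upper() != call.normal_base.upper()


def qc_report(depths, expected_mean, calls):
    """Combine both checks into one summary for a data set."""
    cov = coverage_ok(depths, expected_mean)
    failed = [c for c in calls if not somatic_call_ok(c)]
    return {
        "coverage_ok": cov,
        "failed_calls": failed,  # calls identical in tumor and normal
        "passed": cov and not failed,
    }
```

A data set whose report has `passed` set to false would be held back for review, while the list of failed calls indicates which reported mutations to re-examine.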
Box 8. Additional guidelines for ICGC data management and security
Quality standards: Periodic quality assurance exercises, such as round-robin validation experiments, should be coordinated and interpreted by the DCC. The results of these validation exercises will be made available via the ICGC data portal.
Public and protected tiers: A binary system shall apply to portions of the data such that a datum is either public, meaning that all end-users can gain access to it, or protected, meaning that access is only available to authorized researchers who have agreed to protect patient confidentiality.
Multilateral authorization: The ICGC should have multiple bodies that can authorize a researcher to gain access to protected data as per IDAC Policies. Once a researcher is authorized by any of these bodies, he or she should be granted access to all protected ICGC data, regardless of which collaborator generated it or which country the data resides in.
Other portals: The ICGC should encourage the redistribution, integration and visualization of the data by community bioinformatics portals. However, portals that provide access to protected data sets must agree to respect and to implement ICGC's authentication and authorization standards for protection of patient confidentiality.
Submission to archival repositories: The unprotected portion of the data should be submitted to public data repositories as rapidly as possible after passing QC and other verification tests.
Use of community standards: Whenever possible, the ICGC coordinating center and participating data acquisition groups should represent data sets using existing community file formats, ontologies and other standards.
Analysis services: Analysis and data aggregation services, which may be deployed against the ICGC data sets, will sometimes need to be co-located with the primary data in order to provide acceptable performance. In the event that a primary data set resides in a public archive, such as the short read archive, this will require coordination between the ICGC and the archive managers.
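The "Public and protected tiers" and "Multilateral authorization" guidelines in Box 8 can be illustrated with a small access-check sketch. The names `Tier`, `Researcher`, `RECOGNIZED_BODIES` and `can_access`, and the placeholder body names, are assumptions made for illustration; the actual authorizing bodies and their procedures are defined by ICGC policy, not by this code.

```python
from dataclasses import dataclass, field
from enum import Enum


class Tier(Enum):
    PUBLIC = "public"
    PROTECTED = "protected"


# Placeholder names for the bodies allowed to authorize researchers.
RECOGNIZED_BODIES = frozenset({"body_a", "body_b"})


@dataclass
class Researcher:
    name: str
    # Names of the bodies that have authorized this researcher.
    authorized_by: set = field(default_factory=set)


def can_access(tier, researcher=None):
    """Decide access to a datum from its tier alone.

    The generating collaborator and the hosting country are deliberately
    not inputs: authorization by any one recognized body opens all
    protected data.
    """
    if tier is Tier.PUBLIC:
        return True  # public data is open to all end-users
    if researcher is None:
        return False  # anonymous users never see protected data
    return bool(researcher.authorized_by & RECOGNIZED_BODIES)
```

Keeping the decision to a single function of tier and authorization status is one way a third-party portal could implement the same rule consistently.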