Presentation at the Canadian Cancer Research Conference satellite bioinformatics.ca workshop. This one is an introduction to the TCGA, ICGC and COSMIC databases.
3. You are free to:
Copy, share, adapt, or re-mix;
Photograph, film, or broadcast;
Blog, live-blog, or post video of;
This presentation. Provided that:
You attribute the work to its author and
respect the rights and licenses associated
with its components.
Slide Concept by Cameron Neylon, who has waived all copyright and related or neighbouring rights. This slide only ccZero.
Social Media Icons adapted with permission from originals by Christopher Ross. Original images are available under GPL at:
http://www.thisismyurl.com/free-downloads/15-free-speech-bubble-icons-for-popular-websites
Module 1: Cancer Genomic Databases
bioinformatics.ca
6. Schedule for Module 1
Cancer Genomic Databases
• The Databases:
– The International Cancer Genome Consortium (ICGC)
– The Cancer Genome Atlas (TCGA)
– The Catalogue of Somatic Mutations in Cancer (COSMIC)
• Data Access: human genomes and security and
privacy issues, Open vs. Controlled Access data
10. Workshops planned for 2014:
http://bioinformatics.ca/workshops
1. Exploratory Analysis of Biological Data using R
2. Bioinformatics for Cancer Genomics
3. Informatics for RNA-sequence Analysis
4. Informatics on High Throughput Sequencing Data
5. Pathway and Network Analysis of -omics Data
6. Flow Cytometry Data Analysis using R
7. Microarray Data Analysis
8. Informatics and Statistics for Metabolomics
13. Soap-Box time!
• Open Access, Open Data and Open Source are essential for good
Science.
• Openness is a responsibility, an obligation, and something that comes
with the privilege of doing publicly funded work.
Open Source
Open Access
Open Data
Opencourseware
15. Cancer therapy is like
beating the dog with
a stick to get rid of
his fleas.
- Anna Deavere Smith,
Let Me Down Easy
17. The revolution in cancer
research can be summed up
in a single sentence:
cancer is, in essence,
a genetic disease.
- Bert Vogelstein
18. Cancer: a Disease of the Genome
Challenge in Treating Cancer:
Every tumour is different
Every cancer patient is different
19. Cancer Genomic Databases
Chin et al., Genes Dev. 2011 March 15; 25(6): 534-555
http://www.ncbi.nlm.nih.gov/pubmed/?term=21406553
20. TCGA
The Cancer Genome Atlas is a
comprehensive and coordinated
effort to accelerate our
understanding of the molecular
basis of cancer through the
application of genome analysis
technologies, including large-scale genome sequencing.
21. About the TCGA
• National Cancer Institute (NCI)
• National Human Genome Research Institute (NHGRI)
• Phased Structure:
– Three-year pilot in 2006 with an investment of $50 million
from each
– TCGA will collect and characterize more than 20 additional
tumour types (now at 16)
22. Where to start with the TCGA?
Wiki: https://wiki.nci.nih.gov/display/TCGA/About+TCGA
23. Division of Labour
• Biospecimen Core Resource (BCR)
– centre where samples are carefully catalogued, processed, quality-checked
and stored along with participant clinical information
• Genome Sequencing Centre (GSC)
– uses high-throughput methods to identify changes to DNA sequences that are
associated with specific cancer types
• Genome Characterization Centre (GCC)
– uses high-throughput technologies to analyze genomic changes involved in cancer
• Genome Data Analysis Centre (GDAC)
– provides novel informatics tools to the research community
– provides analysis results using TCGA data.
• Data Coordinating Centre (DCC)
– Central provider of TCGA data.
– Standardizes data formats and validates submitted data.
24. TCGA Data
• Sequence reads from newer sequencing
technologies are available at the Cancer Genome
Hub: https://cghub.ucsc.edu/
• Higher level sequence data (variation calls and
abundance measures) are available at the TCGA
Portal: http://cancergenome.nih.gov/
26. Data Coordinating Centre
• Plays a central role
– Receiving data from BCR, GSC and GCC sites
– Providing access to users
– Performing analysis of data
• Responsibilities:
– Protecting participant privacy and confidentiality
– Developing data standards and controlled vocabularies
– Establishing informatics pipelines for data flow
– Developing new analytical and visualization technologies
to facilitate data analysis, for all audiences
27. TCGA DCC Data Portal
• Provides a platform to search, download and
analyze TCGA data sets
• Two data access tiers: Open and Controlled
• Analytic tools include: Cancer Molecular Analysis
and Cancer Genome Workbench (NCBIB),
Integrative Genomics Viewer (Broad) and
CancerGenomics Analysis (MSKCC).
29. The International Cancer Genome Consortium
(ICGC)
• http://www.icgc.org/
• “ICGC was launched
to coordinate large-scale cancer genome
studies in tumours
from 50 different
cancer types and/or
subtypes that are of
clinical and societal
importance across
the globe”
31. ICGC Map – November 2013
67 projects launched
32. ICGC datasets to date
[Figure: ICGC Data Portal Cumulative Donor Count for Member Projects. Y-axis: number of donors (1,000 to 10,000); x-axis: months from Dec 2011 to Sept 2013; data points labelled Release 7 to Release 14. Credit: Hardeep Nahal]
33. ICGC dataset version 14
September 2013
Hardeep Nahal
• Cancer types: 41
• Donors: 8,532 (18,056 specimens)
• Simple somatic mutations: 1,995,134
• Copy number mutations: 18,526,593
• Structural rearrangements: 18,614
• Genes affected* by simple somatic mutations: 22,074
• Genes affected* by non-synonymous coding mutations: 19,150
• Genes affected* by copy number mutations: 20,341
• Genes affected* by structural rearrangements: 1,884
• Open tier and controlled data currently available
*out of 22,259 protein coding genes annotated in Ensembl Human release 69
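The gene counts on this slide are easier to compare as fractions of the 22,259 annotated protein-coding genes. A minimal sketch of that arithmetic follows; the figures are copied from the slide, while the variable and function names are illustrative only.

```python
# Figures copied from the ICGC release 14 slide; names are illustrative.
TOTAL_PROTEIN_CODING_GENES = 22_259  # Ensembl Human release 69

genes_affected = {
    "simple somatic mutations": 22_074,
    "non-synonymous coding mutations": 19_150,
    "copy number mutations": 20_341,
    "structural rearrangements": 1_884,
}


def percent_affected(count, total=TOTAL_PROTEIN_CODING_GENES):
    """Return the share of annotated protein-coding genes, in percent."""
    return 100.0 * count / total


percentages = {
    mutation_type: percent_affected(count)
    for mutation_type, count in genes_affected.items()
}

for mutation_type, pct in percentages.items():
    print(f"{mutation_type}: {pct:.1f}% of protein-coding genes")
```

Read this way, roughly 99% of protein-coding genes carry at least one simple somatic mutation somewhere in the dataset, while fewer than 9% are affected by structural rearrangements.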
46. ICGC Controlled Access Datasets
• Detailed Phenotype and Outcome data
– Region of residence
– Risk factors
– Examination
– Surgery
– Radiation
– Sample
– Slide
– Specific histological features
– Analyte
– Aliquot
– Donor notes
• Gene Expression (probe-level data)
• Raw genotype calls
• Gene-sample identifier links
• Genome sequence files
ICGC Open Access (OA) Datasets
• Cancer Pathology
– Histologic type or subtype
– Histologic nuclear grade
• Patient/Person
– Gender, Age range, Vital status, Survival time
– Relapse type, Status at follow-up
• Gene Expression (normalized)
• DNA methylation
• Computed Copy Number and Loss of Heterozygosity
• Newly discovered somatic variants
http://goo.gl/w4mrV
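The split between the two tiers can be captured as a simple lookup table. The sketch below is only an illustration of the slide's content: the data type names are taken from the slide, but the `Tier` enum, the dictionary and the helper function are hypothetical and are not part of any ICGC software.

```python
from enum import Enum


class Tier(Enum):
    OPEN = "open"
    CONTROLLED = "controlled"


# Data types as listed on the slide; this table layout is an assumption.
TIER_BY_DATA_TYPE = {
    "cancer pathology": Tier.OPEN,
    "gene expression (normalized)": Tier.OPEN,
    "dna methylation": Tier.OPEN,
    "computed copy number and loss of heterozygosity": Tier.OPEN,
    "newly discovered somatic variants": Tier.OPEN,
    "detailed phenotype and outcome data": Tier.CONTROLLED,
    "gene expression (probe-level data)": Tier.CONTROLLED,
    "raw genotype calls": Tier.CONTROLLED,
    "gene-sample identifier links": Tier.CONTROLLED,
    "genome sequence files": Tier.CONTROLLED,
}


def needs_controlled_access(data_type):
    """Return True when the slide lists this data type as controlled access."""
    return TIER_BY_DATA_TYPE[data_type.lower()] is Tier.CONTROLLED


print(needs_controlled_access("Raw genotype calls"))
print(needs_controlled_access("DNA methylation"))
```

Note that gene expression appears in both tiers: normalized values are open, while probe-level data are controlled.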
47. Identify yourself
Fill out detail form which includes:
• Contact and Project Information
• Information Technology details and procedures for keeping data secure
• Data Access Agreement
All of these documents are put into a PDF file that you print and get your institution to sign off on your behalf
55. DACO/DCC User Data Access Process
• Users approved through DACO are now automatically granted access to
ICGC controlled access datasets available through the ICGC Data Portal and
the EBI’s EGA repository
[Diagram: boxes labelled DACO Web Application, DCC User Registry, DCC Data Portal and EBI EGA; arrows labelled "application approved by DACO" and "user accounts activated"]
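The point of this slide is that a single DACO approval activates access in both places at once. A toy model of that behaviour is sketched below; the class, function and field names are hypothetical and do not correspond to the real DACO or DCC systems.

```python
from dataclasses import dataclass, field

# Repositories named on the slide; everything else here is hypothetical.
CONTROLLED_REPOSITORIES = ("ICGC Data Portal", "EBI EGA")


@dataclass
class Applicant:
    name: str
    daco_approved: bool = False
    access: set = field(default_factory=set)


def approve(applicant):
    """Record DACO approval and activate access to every repository."""
    applicant.daco_approved = True
    applicant.access.update(CONTROLLED_REPOSITORIES)
    return applicant


def can_download(applicant, repository):
    """Return True if the applicant's account is active for the repository."""
    return repository in applicant.access


researcher = Applicant(name="example researcher")
print(can_download(researcher, "EBI EGA"))  # before approval
approve(researcher)
print(can_download(researcher, "EBI EGA"))  # after approval
```

In this model there is no separate step per repository: approval is the only event, and access to both repositories follows from it.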
56. Catalogue of Somatic Mutations in Cancer
(COSMIC)
• http://cancer.sanger.ac.uk/cancergenome/projects/cosmic/
• COSMIC is designed
to store and display
somatic mutation
information and
related details and
contains information
relating to human
cancers.
57. COSMIC
• Somatic Mutations Only
• Diverse sources
– Literature (Arrays, Next-Gen, PCR...)
– TCGA
– ICGC
• Diverse ways to look at data
– Gene
– Variation
– Tumour type
– Cell line
– Experiment
62. In closing
• Remember all these sites have great amounts of
documentation
• The field is changing quickly, and so are the portals.
• New features are planned as we speak, and so you
need to use the sites, and keep coming back.
• Don’t be afraid to explore
• Interested in learning more after today? Consider
one of the bioinformatics.ca workshops!
Speaker notes:
Slide 19: Consecutive basepairs
Slide 29: A few notes on ICGC
Slide 33: Ensembl 61 Hs has 53,515 gene loci annotated, which explains the high affected-gene numbers for SSMs (I’ve double-checked these numbers)
Slide 59: Summary page with basic gene description and list of curated pubs. Click on Histogram to view the distribution of mutations.