A short tutorial on how to use the web-based tool DAVID (Database for Annotation, Visualization and Integrated Discovery) - http://david.abcc.ncifcrf.gov/
DAVID provides a comprehensive set of functional annotation tools that help investigators understand the biological meaning behind large lists of genes.
Extracting biological meaning from large gene lists with DAVID
1. Extracting biological meaning from large gene lists with DAVID Huang et al., Curr Protoc Bioinformatics (2009) http://david.abcc.ncifcrf.gov/home.jsp Francesco Mattia Mancuso (francesco.mancuso@crg.es) Bioinformatics Core Facility Short Tutorial
15. Main objectives of the GO project Compile and provide GO terms; Use structured vocabularies in the annotation of gene products; Provide open access to the GO database and Web resource. Independent sets of vocabularies Molecular Function (MF) – elemental activity or task performed, or potentially performed, by individual gene products (e.g. “DNA binding” and “catalytic activity”); Cellular Component (CC) – location of action for a gene product (e.g. “organelle membrane” and “cytoskeleton”); Biological Process (BP) – broad biological objective or goal in which a gene product participates (e.g. “DNA replication” and “response to stimulus”).
20. Enrichment and p-values calculated with a hypergeometric distribution N = all genes (universe); M = all genes belonging to a pathway; n = your gene list; m = genes of your gene list that belong to the pathway. Other well-known statistical methods: χ2, Fisher’s exact test, binomial probability
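The hypergeometric calculation on this slide can be sketched in a few lines of Python. This is a minimal illustration, assuming the enrichment p-value is the upper tail P(X ≥ m) when n genes are drawn without replacement from a universe of N genes, M of which belong to the pathway. The function name and the toy counts are hypothetical and are not part of DAVID.

```python
from math import comb

def hypergeom_enrichment_p(N, M, n, m):
    """Upper-tail hypergeometric probability P(X >= m).

    N: all genes (universe), M: genes in the pathway,
    n: size of the gene list, m: list genes that are in the pathway.
    """
    total = comb(N, n)            # number of ways to pick any n genes
    upper = min(n, M)             # at most this many pathway genes can be drawn
    tail = sum(
        comb(M, k) * comb(N - M, n - k)   # lists with exactly k pathway genes
        for k in range(m, upper + 1)
    )
    return tail / total

# Toy counts (hypothetical): 10 genes, 5 in the pathway, list of 5, all 5 in the pathway.
p = hypergeom_enrichment_p(N=10, M=5, n=5, m=5)
print(p)
```

With exact integer arithmetic (`math.comb`) the sketch avoids rounding problems for large universes, although a statistics library would be the more usual choice in practice.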
21. A 'good' gene list Contains many important genes (marker genes) as expected; Reasonable number of genes, ranging from hundreds to thousands (e.g., 100–2,000 genes), not extremely low or high; Most of the genes significantly pass the statistical threshold; A notable portion of the up- or down-regulated genes is involved in certain interesting biological processes, rather than being randomly spread throughout all possible biological processes; Consistently contains more enriched biology than a random list in the same size range; High reproducibility in generating a similar gene list under the same conditions; High data quality can be confirmed by other independent experiments.
30. Exercise 1 Submit data and convert the IDs Cicala, C. et al. HIV envelope induces a cascade of cell signals in non-proliferating target cells that favor virus replication. Proc. Natl. Acad. Sci. USA 99, 9380–9385 (2002). “Freshly isolated peripheral blood mononuclear cells were treated with an HIV envelope protein (gp120) and genome-wide gene expression changes were observed using Affymetrix U95A microarray chips. The aim of the experiment was to investigate cellular responses to viral envelope protein infection, which may help in understanding the mechanisms for HIV replication in resting or sub-optimally activated peripheral blood mononuclear cells.” Download the dataset from: http://www.nature.com/nprot/journal/v4/n1/suppinfo/nprot.2008.211_S1.html (Supplementary Data 2)
45. Attention!!!!! DAVID enrichment analysis is more of an exploratory procedure than a pure statistical solution. “The final interpretation and analytic result decisions (in terms of accepting the results that make sense biologically in the context of the study, or rejecting ones that do not) should be made by the biologists/analysts themselves, rather than by any of the tools.” (Huang et al., 2009)
48. EASE Score Threshold (Maximum Probability): the threshold of EASE Score, a modified Fisher Exact P-value, for gene-enrichment analysis. It ranges from 0 to 1. Fisher Exact P-Value = 0 represents perfect enrichment.
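To make the phrase "modified Fisher Exact P-value" more concrete, here is a hedged sketch. It assumes the commonly described form of the modification, in which one gene is removed from the list hits before the one-sided test is computed, and it models the test as a hypergeometric upper tail. DAVID's exact contingency-table layout may differ, and the function names and counts below are hypothetical.

```python
from math import comb

def upper_tail(pop_total, pop_hits, list_total, k_min):
    """P(X >= k_min) for a hypergeometric draw of list_total genes."""
    total = comb(pop_total, list_total)
    top = min(list_total, pop_hits)
    tail = sum(
        comb(pop_hits, k) * comb(pop_total - pop_hits, list_total - k)
        for k in range(max(k_min, 0), top + 1)
    )
    return tail / total

def plain_score(list_hits, list_total, pop_hits, pop_total):
    # Unmodified one-sided test on the observed number of list hits.
    return upper_tail(pop_total, pop_hits, list_total, list_hits)

def ease_like_score(list_hits, list_total, pop_hits, pop_total):
    # Assumption: the modification penalises the list by one hit (list_hits - 1),
    # which makes the score more conservative than the plain test.
    return upper_tail(pop_total, pop_hits, list_total, list_hits - 1)

# Hypothetical counts: 3 of 300 list genes and 40 of 30000 background genes carry the term.
print(plain_score(3, 300, 40, 30000))
print(ease_like_score(3, 300, 40, 30000))
```

Under this assumption the modified score is never smaller than the plain one, and a term supported by a single gene gets a score of 1, so it can never pass the threshold.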
49. The Fold Enrichment is defined as the ratio of two proportions: the fraction of your input genes annotated with a term, divided by the fraction of background genes annotated with the same term. For example, if 40/400 (i.e. 10%) of your input genes are involved in "kinase activity" and 300/30000 (i.e. 1%) of the background genes are associated with "kinase activity", the fold enrichment is roughly 10% / 1% = 10.
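The arithmetic in this example can be written as a small Python function. This is only an illustrative sketch; the function and parameter names are hypothetical, and the two proportions are cross-multiplied so that the division happens once.

```python
def fold_enrichment(list_hits, list_total, pop_hits, pop_total):
    """Ratio of (list_hits / list_total) to (pop_hits / pop_total)."""
    if list_total == 0 or pop_hits == 0:
        raise ValueError("list_total and pop_hits must be non-zero")
    # Cross-multiplied form of (list_hits / list_total) / (pop_hits / pop_total)
    return (list_hits * pop_total) / (list_total * pop_hits)

# Numbers from the slide: 40/400 input genes vs 300/30000 background genes.
print(fold_enrichment(40, 400, 300, 30000))  # 10.0
```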
50. In the DAVID annotation system, the Fisher Exact test is adopted to measure gene enrichment in annotation terms. When members of two independent groups can fall into one of two mutually exclusive categories, the Fisher Exact test is used to determine whether the proportions of those falling into each category differ by group.
51. Benjamini-Hochberg, Bonferroni and FDR (False Discovery Rate) are different 'standard' statistics for multiple comparison corrections. They adjust P-values to be more conservative, in order to control the family-wise error rate (Bonferroni) or the false discovery rate (Benjamini-Hochberg, FDR).
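As an illustration of how two of these corrections adjust a set of P-values, here is a minimal Python sketch of the textbook Bonferroni and Benjamini-Hochberg procedures. It is not DAVID's own implementation, and the function names and example P-values are hypothetical.

```python
def bonferroni(pvalues):
    """Multiply each P-value by the number of tests, capped at 1."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]

def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted P-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices, smallest P first
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

pvals = [0.01, 0.04, 0.03, 0.005]   # hypothetical raw P-values for four terms
print(bonferroni(pvals))
print(benjamini_hochberg(pvals))
```

Bonferroni scales every P-value by the number of tests, while Benjamini-Hochberg scales by the number of tests divided by the rank, so it is the less conservative of the two.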
52. LT (list total): number of genes in your gene list mapped to any term in this ontology ("system")
53. PH (population hits): number of genes with this GO term on the background list (the whole chip)
54. PT (population total): number of genes on the background list (the whole chip) mapped to any term in this ontology ("system")
Editor's notes
GoMiner, GOstat, Onto-express, GoToolBox, FatiGO, GFINDer and GSEA
3 - e.g., selecting genes by comparing gene expression between control and experimental cells with t-test statistics: fold changes greater than or equal to 2 and P-values less than or equal to 0.05
6 - e.g., by independent experiments under the same conditions or by a leave-one-out statistical test
Functional classification: lets investigators explore and view functionally related genes together, as a unit, to concentrate on the larger biological network rather than on individual genes.
Functional annotation chart: provides typical gene–term enrichment (overrepresentation) analysis to identify the most relevant (overrepresented) biological terms associated with a given gene list.
Functional annotation clustering: uses a fuzzy clustering concept similar to that of functional classification, measuring relationships among the annotation terms on the basis of the degree of their coassociation with genes within the user’s list, in order to cluster somewhat heterogeneous, yet highly similar, annotations into functional annotation groups.
Functional annotation table: a query engine for the DAVID knowledgebase, without statistical calculations. For a given gene list, the tool can quickly query the corresponding annotation for each gene and present it in a table format.