Chapter 8 Methods
8.1 Data processing
Public dataset. The public datasets that are accessible through OmicsView are a subset of the DiseaseLand https://www.qiagenbioinformatics.com/diseaseland/ database. The DiseaseLand data service uses common analysis pipelines to quantify and normalize publicly available microarray and RNA-seq expression data from raw data. For each project, and each sample, metadata are curated to apply controlled vocabularies and ensure consistent formatting of terms. Information on the data processing pipeline and the DiseaseLand product is available at the following links ( http://www.arrayserver.com/wiki/index.php?title=DiseaseLand_Curation_Pipeline, http://www.arrayserver.com/wiki/index.php?title=Omicsoft_Affymetrix_Microarray_Preprocessing, http://www.arrayserver.com/wiki/index.php?title=RNA-Seq_Normalized_FPKM_Values_in_Land ). The data in OmicsView was exported from DiseaseLand prior to 8/27/2019. No OmicsView data can be exported from the portal, distributed with a software package, or otherwise redistributed without express permission from QIAGEN.
8.2 Functional and Pathway Enrichment
The function enrichment algorithms implemented in OmicsView come in two flavors.
The first is based on hypergeometric enrichment of differentially expressed genes (DEGs) against several knowledge databases, based on the HOMER package (http://homer.salk.edu/homer/microarray/go.html). This requires a significance cutoff to be applied to the two group comparison expression data such that we can select a set of genes that are upregulated and downregulated, i.e., differentially expressed genes. The HOMER package comes with some collections of gene knowledge databases that we use for enrichment.
The second approach is based on a variant of the Gene Set Enrichment Analysis (GSEA) approach called PAGE (Parametric Analysis of Gene Set Enrichment) which is faster than the original GSEA approach and more sensitive [1]. We use an implementation of PAGE that is available in the R piano package (https://www.bioconductor.org/packages/release/bioc/html/piano.html). GSEA based approaches just require a ranked list of genes, i.e., genes with a numeric quantity for ranking them that shows how different they are between groups in the comparison, usually log Fold Change (logFC) values. HOMER and PAGE and are typically more sensitive than traditional GSEA in terms of revealing pathway enrichment. Both packages come pre-built with a set of known functions/pathways in the form of gene sets, against which we look for enrichment in the comparison of interest.
8.2.1 HOMER workflow
8.2.1.1 Selection of Differentially Expressed Genes (DEGs) for enrichment analysis
For each comparison, we generated the up- and down-regulated gene lists as input for functional enrichment analysis. We used a dynamic cutoff of logFC and AdjustedPValue / PValue to aim for 200-2000 genes in each list. To achieve this, we start with a stringent cutoff of adjusted p-value of 0.05, and 2-fold change up or down. Getting a list of greater than 200 genes is ideal for the downstream analysis. If there are fewer, we drop the nominal p-value to 0.01 (with 2-fold up or down), and subsequently if 200 is not reached we drop the fold change to 1.2 in either direction or increase the adjusted p-value to 0.1. If the most relaxing cutoff doesn’t generate a list of at least 50 dysregulated genes, we select the top 50 genes ranked by absolute logFC values to generate the lists for enrichment analysis.
8.2.1.2 Functional Enrichment of the DEG list against the genome
The findGo.pl program from Homer package ( http://homer.salk.edu/homer/microarray/go.html ) is used to analyze the functional enrichment for each DEG list.
There are several different “ontologies,” or libraries of gene groupings that came with the Homer package. We used Homer version v4.8.3, human-o v5.8 and mouse-o v5.8 libraries (current as of Dec 6, 2016). These include the following categories:
Gene Ontology: Biological Process, Molecular Function, Cellular Component
Chromosome Location: Genes with similar chromosome localization (NCBI Entrez Gene)
KEGG Pathways: Groups of proteins in the same pathways (From KEGG)
Protein-Protein Interactions: Groups of proteins interacting with the same protein (From NCBI Entrez Gene) Interpro: Proteins with similar domains and features (Interpro)
Pfam: Proteins with similar domains and features (Pfam)
SMART: Proteins with similar domains and features (SMART)
Gene3D: Proteins with similar domains and features (Gene3D Database)
Prosite: Proteins with similar domains and features (Prosite Database)
PRINTS: Proteins with similar domains and features (PRINTS Database)
MSigDB: Lists of genes maintained by the Molecular Signature Database (includes many different categories of genes (MSigDB)
BIOCYC: Groups of proteins in the same pathway (NCBI Biosystems/BIOCYC)
COSMIC: Human proteins that are mutated in the same cancers (COSMIC)
GWAS Catalog: Human genes with risk SNPs identified in their vicinity for the same disease (GWAS Catalog)
Lipid Maps: Mouse proteins found in the same lipid processing pathways (NCBI Biosystems/LIPID MAPS)
Pathway Interaction Database: Proteins in the same pathway (NCBI Biosystems/PID)
REACTOME: Proteins in the same biochemical pathways (NCBI Biosystems/REACTOME)
SMPDB: Proteins in the same pathway (SMPDB)
WikiPathways: Protein in the same pathway (WikiPathways)
The HTML output file of HOMER was further enhanced by a custom php script to make it more user friendly.
8.2.2 Gene Set Enrichment Analysis (GSEA) workflow (PAGE)
For our second enrichment tool, we used a variation of the GSEA method, PAGE (Parametric Analysis of Gene Set Enrichment) [1] to process the comparison data as PAGE is much faster and more sensitive than GSEA.
For each comparison, we produced a rank file with gene symbols and corresponding logFC values. If a gene symbol appeared multiple times in the same comparison, the average logFC was used. Human gene sets were downloaded from MsigDB (http://software.broadinstitute.org/gsea/msigdb) version 5.2 (msigdb.v5.2.symbols.gmt).
Mouse gene sets were downloaded from the Bader Lab website from Univ. of Toronto (http://baderlab.org/GeneSets). This version is labeled as December_01_2016 (Mouse_GO_AllPathways_with_GO_iea_December_01_2016_symbol.gmt). This gmt file contains special characters that cannot be used by R piano package, therefore we manually replaced special characters to “/” or “ _” . In addition, some mouse gene sets have the same name, so we added suffix “_altSetX” to make all the names unique.
8.2.2.1 PAGE Analysis using the R piano package
The R package piano (https://bioconductor.org/packages/release/bioc/html/piano.html) was used to run the PAGE analysis for all the rank files. To simplify the piano output, we combined down-regulated and up-regulated gene sets into a single table for each gene set. From the piano results, we extracted out the p-value, FDR, and Z-score.
8.3 Meta-Analysis
The meta-analysis functions allow a user to combine expression data across multiple studies to find changes that are robust. DiseaseAtlas offers two ways to perform meta-analysis:
The first method works on comparison data. The system uses the comparison data (logFC, p-value) to compute combined p-value and rank product. This method is fast and can be applied to any type of comparison data. However, it does not use the individual sample data, nor does it consider the number of samples in each comparison.
The second method uses per sample Gene Expression data. The user is expected to input a list of factors that indicate comparisons across or within studies, and then gene level significant changes are recomputed by extracting expression data from all samples for each comparison. The RankProd [2] and/or MetaDE [3] packages are then applied to perform the meta-analysis. Limma is applied to get the statistics for each individual comparison. This analysis takes much longer (10 minutes to an hour for a typical analysis, even longer if number of samples are very large), and it has more strict sample requirements (e.g., no samples can occur in two different comparisons).
Statistically, the second method is more robust. However, both methods should detect consistently changed genes.
8.3.1 Meta-analysis statistics
For comparison data, the combined p-value is computed using Fisher’s method, that is -2*(sum of ln(p-value)) is compared against a Chi-squared distribution with N degrees of freedom, where N is the number of p-values being combined. This is carried out for every gene and yields a combined p-value that is reported. Another simpler approach is to report the maximum p-value for the gene across all the comparisons – a much more stringent measure of overall significance. This approach to combining p-values is implemented in the MetaDE R package. Note MetaDE will not produce results if more than 30% of the comparisons for a gene have missing p-values. In addition, with this approach, the p-value combination does not account for the direction of the fold change (up or down), so the up regulated and down regulated percentage summaries need to be referred to for interpretation.
The RankProd method (https://bioconductor.org/packages/release/bioc/html/RankProd.html) converts log2 fold changes across all genes in a comparison to ranks and then computes a meta-statistic per gene which is the geometric mean of the ranks across comparisons. It is a non-parametric approach, and computes statistical significance based on a permutation approach which also addresses the multiple testing aspect of looking for significance within the set of all genes in the transcriptome.
References:
- Kim, S.Y., & Volsky, D.J. (2005). PAGE: parametric analysis of gene set enrichment. BMC bioinformatics, 6, 144 (2005).
- Del Carratore, D et al. RankProd 2.0: a refactored Bioconductor package for detecting differentially expressed features in molecular profiling datasets. Bioinformatics 33 (17), 2774-2775 (2017).
- Ma, T. et al. Metaomics: analysis pipeline and browser-based software suite for transcriptomic meta-analysis. Bioinformatics 35 (9), 1597-1599 (2019).