Chapter 4 Data preparation

4.1 Retrieving data from NCBI GEO

ExpressionAnalysis requires several input files, each organized in a specific format. Here, we present an example of the data processing steps using a public NCBI GEO dataset: SRP199678. The R script recount3.example.R can be run from the terminal inside the conda environment created earlier.

conda activate ExpressionAnalysis
cd RNASequest/example

Rscript recount3.example.R

4.2 File hierarchy

After running recount3.example.R, we will have all input files for ExpressionAnalysis with the following hierarchy:

SRP199678/
    ├── comparison.csv
    ├── counts.tsv
    ├── geneAnnotation.csv
    ├── geneLength.tsv
    ├── meta.csv
    └── project.yml
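
Before starting the pipeline, it can be useful to confirm that all six files are in place. The following is a minimal sketch in Python, used here only for illustration; the function name missing_input_files is hypothetical and is not part of RNASequest.

```python
from pathlib import Path

# File names taken from the hierarchy listed above.
REQUIRED_FILES = [
    "comparison.csv",
    "counts.tsv",
    "geneAnnotation.csv",
    "geneLength.tsv",
    "meta.csv",
    "project.yml",
]


def missing_input_files(project_dir):
    """Return the required file names that are absent from project_dir."""
    root = Path(project_dir)
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]


# Hypothetical helper, not part of RNASequest: report what is missing.
missing = missing_input_files("SRP199678")
if missing:
    print("Missing input files:", ", ".join(missing))
else:
    print("All input files are present.")
```

The check only looks at file names; it does not inspect the contents of any file.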

4.3 Self-prepared files

Users who want to run ExpressionAnalysis (EA) on their own dataset can prepare the input files from scratch. The files required to run the pipeline are described in detail below.

After creating the files, please arrange them in the same hierarchy as in Section 4.2.

  • comparison.csv

    This file lists the comparison groups, each of which must be defined in meta.csv.

    Here is an example of the comparison.csv file:

    Group_name,Group_test,Group_ctrl
    group,grp2,grp1
    group,grp3,grp2

    Group_name: this column specifies the sample meta variable of interest for the comparison.

    Group_test: this column specifies the “case” group of the sample meta variable used for comparison.

    Group_ctrl: this column specifies the control group of the sample meta variable used for comparison.

    Additional information can be defined in this comparison file:

    CompareName: this is the comparison name column. Every entry in this column must be unique. No limma or DESeq2 tag is needed as they will be added automatically in the result output.

    Subsetting_group: this column specifies the subsetting information if the user wants to run the DEG analysis on a subset of the samples.

    The entry format is:

    Covariate1: Covariate1_level; Covariate2: Covariate2_level; Covariate3: Covariate3_level…

    Different covariates are separated by “;”, and the covariate name and covariate value are separated by “:”.

    e.g. Tissue:brain. In this case, Tissue needs to be a column in the sample meta table that has a level “brain”. The DESeq2 (or limma) counts object will be subset to these samples only.

    Note: this is an “AND” subsetting operation if using multiple covariates. e.g. Tissue:brain;Sex:male means pull out samples from brain that are also male. This means that it cannot be used to pool together (union) subsets of samples under the same covariate, e.g. Tissue:brain;Tissue:spinal_cord will not return a union of brain and spinal_cord samples. To achieve this operation the user needs to define a new grouping variable in the sample meta file, say Tissue2, that includes a level “brain_spinal_cord” and includes all of those samples.
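
The “AND” behaviour described above can be illustrated with a small Python sketch. This is not the pipeline's own code: the function names and the sample table are made up for illustration, and the sketch assumes that each covariate value is compared as an exact string.

```python
def parse_covariate_spec(spec):
    """Parse 'Tissue:brain;Sex:male' into [('Tissue', 'brain'), ('Sex', 'male')]."""
    pairs = []
    for entry in spec.split(";"):
        entry = entry.strip()
        if not entry:
            continue
        name, sep, level = entry.partition(":")
        if not sep:
            raise ValueError(f"Expected 'covariate:level', got {entry!r}")
        pairs.append((name.strip(), level.strip()))
    return pairs


def subset_samples(meta, spec):
    """Keep the samples that match every covariate:level pair (logical AND).

    meta maps a sample ID to a dict of covariate values.
    """
    pairs = parse_covariate_spec(spec)
    return [
        sample
        for sample, covariates in meta.items()
        if all(covariates.get(name) == level for name, level in pairs)
    ]


# Made-up sample table, for illustration only.
meta = {
    "s1": {"Tissue": "brain", "Sex": "male"},
    "s2": {"Tissue": "brain", "Sex": "female"},
    "s3": {"Tissue": "spinal_cord", "Sex": "male"},
}

print(subset_samples(meta, "Tissue:brain;Sex:male"))            # ['s1']
print(subset_samples(meta, "Tissue:brain;Tissue:spinal_cord"))  # []
```

In this sketch, the second call returns nothing because no sample can be “brain” and “spinal_cord” at the same time, which is why a combined level in a new grouping variable is needed to pool samples.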

    Model: the model formulas are stored in this column. The formula can be a combination of the following formats:

    Additive: var1 + var2

    Nested: var1: var2

    Interaction: var1*var2

    Covariate_levels: this column specifies the covariate levels of interest when the variable of interest interacts with or is nested by other covariate(s). In this case, the covariate levels are required to identify the covariate group in which the comparison is made. The entry format is the same as for Subsetting_group:

    Covariate1: Covariate1_level; Covariate2: Covariate2_level; Covariate3: Covariate3_level…

    Different covariates are separated by “;”, and the covariate name and covariate value are separated by “:”.

    e.g. Sex:male;Treatment:placebo. This is an advanced parameter; normally you can leave it blank.

    Analysis_method: this column specifies the tool to be used for the DEG analysis. There are two options: DESeq2 and limma.

    Shrink_logFC: this column specifies whether the user wants to apply logFC shrinkage during the DEG analysis. The entry can be: yes or no. Please note that this option only applies to comparisons analyzed by DESeq2. The limma package has a built-in shrinkage function, so no additional shrinkage is needed.

    LFC_cutoff: this column sets the log fold change value for the null hypothesis of the test. The default value is 0. Setting it away from 0 makes the calls more stringent.

    Additional notes:

    Continuous/Numeric variable as the comparison group:

    If the user is interested in using a numeric variable (such as Age) as the comparison variable, please provide the header of that numeric column in the sample meta table in the “Group_name” column of the comparison.csv table, and leave the “Group_test” and “Group_ctrl” columns of that row empty. Currently, this is only supported if the DE analysis is run using the DESeq2 method.
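
The rules stated above for comparison.csv (Group_name must be a column of the sample meta table, Group_test and Group_ctrl must be levels of that column unless both are left empty for a numeric variable, and CompareName entries must be unique) can be checked before running the pipeline. The following Python sketch is one possible way to do this; check_comparisons and read_rows are hypothetical helpers, not RNASequest functions, and the example tables are the ones shown in this section.

```python
import csv
import io


def read_rows(handle):
    """Read CSV text into a list of dicts keyed by column header."""
    return list(csv.DictReader(handle))


def check_comparisons(comparisons, meta):
    """Return a list of problems found in the comparison rows."""
    problems = []
    seen_names = set()
    for i, row in enumerate(comparisons, start=1):
        variable = (row.get("Group_name") or "").strip()
        test = (row.get("Group_test") or "").strip()
        ctrl = (row.get("Group_ctrl") or "").strip()
        if not meta or not variable or variable not in meta[0]:
            problems.append(f"row {i}: '{variable}' is not a column in meta.csv")
            continue
        # Both group columns empty: treated as the numeric-variable case.
        if test or ctrl:
            levels = {sample[variable] for sample in meta}
            for label, level in (("Group_test", test), ("Group_ctrl", ctrl)):
                if level not in levels:
                    problems.append(f"row {i}: {label} '{level}' is not a level of '{variable}'")
        name = (row.get("CompareName") or "").strip()
        if name:
            if name in seen_names:
                problems.append(f"row {i}: CompareName '{name}' is not unique")
            seen_names.add(name)
    return problems


# Example tables copied from this section.
COMPARISON_CSV = "Group_name,Group_test,Group_ctrl\ngroup,grp2,grp1\ngroup,grp3,grp2\n"
META_CSV = '"","group"\n"SRR9139048","grp1"\n"SRR9139049","grp2"\n"SRR9139050","grp3"\n'

comparisons = read_rows(io.StringIO(COMPARISON_CSV))
meta = read_rows(io.StringIO(META_CSV))
print(check_comparisons(comparisons, meta))  # [] means no problems were found
```

To check real files, the two io.StringIO objects can be replaced with files opened via open(path, newline="").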

  • counts.tsv

    This file stores the raw counts of each sample in a tab-delimited format (tsv).

    $ head -4 counts.tsv
    
    ""      "SRR9139048"     "SRR9139049"        "SRR9139050"
    "ENSMUSG00000079800.2"      7       9       11
    "ENSMUSG00000095092.1"      2       1       10
    "ENSMUSG00000079192.2"      7       7       1

  • geneAnnotation.csv

    This file contains the detailed annotation of each gene listed in the counts.tsv table.

    Here is an example:

    $ head -4 geneAnnotation.csv
    
    "","seqnames","start","end","width","strand","source","type","bp_length","phase","UniqueID","gene_type","Gene.Name","level","mgi_id","havana_gene","tag","id"
    "ENSMUSG00000079800.2","GL456210.1",9124,58882,49759,"-","ENSEMBL","gene",1271,NA,"ENSMUSG00000079800.2","protein_coding","AC125149.3","3",NA,NA,NA,1
    "ENSMUSG00000095092.1","GL456210.1",108390,110303,1914,"-","ENSEMBL","gene",366,NA,"ENSMUSG00000095092.1","protein_coding","AC125149.5","3",NA,NA,NA,2
    "ENSMUSG00000079192.2","GL456210.1",123792,124928,1137,"+","ENSEMBL","gene",255,NA,"ENSMUSG00000079192.2","protein_coding","AC125149.1","3",NA,NA,NA,3

  • geneLength.tsv

    This file contains the gene length information. It must have the same dimensions as the counts.tsv file.

    $ head -4 geneLength.tsv
    
    ""      "SRR9139048"     "SRR9139049"        "SRR9139050"
    "ENSMUSG00000079800.2"        1271      1271        1271
    "ENSMUSG00000095092.1"        366       366     366
    "ENSMUSG00000079192.2"        255       255     255

  • meta.csv

    This meta.csv file can contain many annotation columns for the dataset. For example, this is a meta file containing group information:

    $ head -4 meta.csv
    
    "","group"
    "SRR9139048","grp1"
    "SRR9139049","grp2"
    "SRR9139050","grp3"

  • project.yml

    The project.yml file contains a high-level summary of the project, including project name, species, etc.

    $ cat project.yml
    
    project: SRP199678
    species: mouse
    file_source: sra
    project_home: data_sources_sra
    project_type: data_sources
    number_samples: 155
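
Two of the requirements stated in this section relate the tables to each other: geneLength.tsv must match the dimensions of counts.tsv, and geneAnnotation.csv annotates the genes in counts.tsv. The Python sketch below shows one way to check both. The helper names are hypothetical, the sketch assumes that the first column of each table holds the gene ID, and the annotation table is shortened to three columns for readability.

```python
import csv
import io


def read_labels(handle, delimiter):
    """Return (row IDs, column names) of a delimited table with a header line."""
    reader = csv.reader(handle, delimiter=delimiter)
    header = next(reader)
    row_ids = [row[0] for row in reader if row]
    return row_ids, header[1:]  # the first header field is the empty row-name column


def check_tables(counts, lengths, annotation):
    """Return a list of problems; each argument is a (row IDs, columns) pair."""
    problems = []
    count_shape = (len(counts[0]), len(counts[1]))
    length_shape = (len(lengths[0]), len(lengths[1]))
    if count_shape != length_shape:
        problems.append(f"counts.tsv is {count_shape} but geneLength.tsv is {length_shape}")
    annotated = set(annotation[0])
    missing = [gene for gene in counts[0] if gene not in annotated]
    if missing:
        problems.append("genes without annotation: " + ", ".join(missing))
    return problems


# Values copied from the examples in this section; the annotation is shortened.
COUNTS_TSV = (
    '""\t"SRR9139048"\t"SRR9139049"\t"SRR9139050"\n'
    '"ENSMUSG00000079800.2"\t7\t9\t11\n'
    '"ENSMUSG00000095092.1"\t2\t1\t10\n'
    '"ENSMUSG00000079192.2"\t7\t7\t1\n'
)
GENE_LENGTH_TSV = (
    '""\t"SRR9139048"\t"SRR9139049"\t"SRR9139050"\n'
    '"ENSMUSG00000079800.2"\t1271\t1271\t1271\n'
    '"ENSMUSG00000095092.1"\t366\t366\t366\n'
    '"ENSMUSG00000079192.2"\t255\t255\t255\n'
)
ANNOTATION_CSV = (
    '"","seqnames","gene_type"\n'
    '"ENSMUSG00000079800.2","GL456210.1","protein_coding"\n'
    '"ENSMUSG00000095092.1","GL456210.1","protein_coding"\n'
    '"ENSMUSG00000079192.2","GL456210.1","protein_coding"\n'
)

counts = read_labels(io.StringIO(COUNTS_TSV), "\t")
lengths = read_labels(io.StringIO(GENE_LENGTH_TSV), "\t")
annotation = read_labels(io.StringIO(ANNOTATION_CSV), ",")
print(check_tables(counts, lengths, annotation))  # [] means no problems were found
```

The sketch compares table shapes only; a stricter check could also require identical gene and sample labels in both tab-delimited files.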