Chapter 6 Input configuration

First, make sure the .env has been created in the src folder under the pipeline directory:

ls ~/scRNAsequest/src/.env

6.1 Generate template files

Now we are ready to run the pipeline. Since the main pipeline script scAnalyzer requires a config.yml file, a sample metadata file sampleMeta.csv, and a DEGinfo.csv file for DE analysis (can be filled in later), we can first generate templates for these files.

We run scAnalyzer by providing a path to our working directory. This can be any directory on you system:

scAnalyzer /path/to/working/dir

Then you will see the following files generate:

config.yml          #This is the pipeline configuration file
DEGinfo.csv         #This is for DE analysis, see the 'Differential expression (DE) analysis' section for more details
                    #This file can be left blank, which means skip DE analysis
sampleMeta.csv      #Sample metadata file, storing sample level information
sc20230320.init.log #Log file

In the log file, we can also see the following messages:

*****
External data, please fill the config file and provide sample sheet with two mandatory columns:
Two columns with names 'Sample_Name' and 'h5path' indicating sample name and UMI path
Option columns for sample annotation
Option column 'metapath' can be provided for cell level annotation first column cell bar code
*****

Please following the instructions below to set up these files.

6.2 Prepare the config.yml file

To run the scRNAsequest pipeline, a config.yml file is required to be filled in. Please use the following template as an example to prepare this file:

config.yml

## Project info and filtering parameters
prj_name: E-MTAB-11115            #required, project name
prj_title: "Cell2location maps fine-grained cell types in spatial transcriptomics" #required, and quotes might be needed
ref_name:                         #optional. You can choose one from scAnalyzer call without an argument
output: /home/ysun4/E-MTAB-11115/ #required, output path
sample_name: Sample_Name
sample_meta: ~/E-MTAB-11115/sampleMeta.csv   #required, sample meta information, see section 5.1
gene_group:
  MT:
    startwith: ["MT-","Mt-","mt-"]
    cutoff: 20           # percentage cutoff to filter out the cells (larger than this cutoff)
    rm: False            # this means the genes specified "startwith" will be REMOVED from the downstream analysis
  # The following is an example (Ribosomal genes) to add additional gene set/list to be quantified the percentage over total UMI per cell
  # additional gene set/list can be added as the format
  RP:
    startwith: [MRPL,MRPS,RPL,RPS]
    cutoff: 50
    rm: False
filter_step: True        #if False, the above 'gene_group' filtering will be skipped as well
dbl_filter: True         #if True, the dbl finding class will be used to remove the predicted doublets, if a numeric (0~1), cells with larger predicted doublet scores will be removed
min.cells: 3             #filtering genes by minimal cell
min.features: 50         #filtering cells by minimal genes
highCount.cutoff: 10000  #any cells with higher total counts to be removed
highGene.cutoff: 3000    #any cells with a higher number of detected genes to be removed
group:                   #if provided, a 10X QC box plot will be ploted in QC plot
rasterizeFig: True       #should image in pdf be rasterized

reRunQC: True            #if satisfied with the above setting, please set this to False to save time 
runAnalysis: False       #False means only run QC steps before data normalization and harmonization
                         #If QC looks good, turn this to True to run the full pipeline
newProcess: False
methods: [SCT,Liger,SeuratRef,SeuratRPCA,sctHarmony]    #SCT method (including log1p) is required expression normalization
                         #The listed methods include all available analyses. If you want to ignore Liger, for example, you can delete it here
expScaler: 10000         #integer value, 0: SCT (along with following option); 1-100: logNormal with scale.factor to be the specified quantile
                         #>100: logNormal with scale.factor to be the specified value                                         
PrepSCTFindMarkers: True # should the "PrepSCTFindMarkers" applied on SCT to be exported for visualization expression                                                             
parallel: False          #"sge" or "slurm"
core: 2
overwrite: False         #if you are rerunning the same project, set this to True to overwrite previous results
major_cluster_rate: 0.7  #the proportion of cells of an integration cluster to be assigned to a seurat reference label

## DEG analysis for an annotation (such as disease vs health) within a cell type annotation
DEG_desp:  ~/E-MTAB-11115/DEGinfo.csv         #required for DEG analysis, but you can leave this empty if not going to run DEG
NAstring: [] #provide a list of strings which should be considered NA and associated cells to be removed
# Please be causion of changing the following default filtering
# More details can be found: section 2.4 in https://pubmed.ncbi.nlm.nih.gov/35743881/
# Applies the 1st round of biostats filtering pipeline. Note that this filter is applied to all cells of the experiment
min.cells.per.gene: 3                         # if `perc_filter` is FALSE, then keep only genes that have expression in at least min.cells.per.gene
min.genes.per.cell: 250                       # keep cells with expression in at least min.genes.per.cell genes.
min.perc.cells.per.gene: 0.00 #if 'perc_filter'`' is TRUE, then keep only genes that have expression in at least min.per.cells.per.gene * 100 percent of cells
perc_filter: TRUE                             # if TRUE, apply the cells.per.gene filter using percentages (expressed as a decimal) rather than an absolute threshold
# Apply the 2nd round of biostats filtering.  For "group" mode, the filtering is applied to `ref_group` and `alt_group` for the given cell type of interest.
R6_min.cells.per.gene: 3                      # minimum cells expressed per gene.  This filter is applied if `R6_perc.cells.filter` is FALSE
R6_min.perc.cells.per.gene: 0.1               # minimum % cells expressed per gene filtering (use decimal form of percentage).  This threshold is applied if 'R6_perc.cells.filter' is TRUE and 'R6_cells.per.gene.filter' is TRUE
R6_min.cells.per.gene.type: "or"              # The type of cell per gene filtering.  If it has the value "and" then it requires the gene be expressed in both reference and non-reference groups. If it has the value "or" then it requires the gene be expressed in either group
R6_cells.per.gene.filter: TRUE                # TRUE means apply cells per gene filtering
R6_perc.cells.filter: TRUE                    # TRUE means apply cell.per.gene filtering by use of a percentage rather than absolute threshold.  If the percentage results in a number less than R6_min.cells.per.gene, the code will automatically switch to min.cells.per.gene absolute thresholding
R6_perc.filter: FALSE                         # If TRUE, then apply the 75th percentile gene filtering
R6_perc.filter.type: "and"                    # The type of percentile gene filtering.  If it has the value "and" then any gene that has 75th percentile of zero in both groups will be filtered out.  If it has the value "or" then any gene that has a 75th percentile of zero in either group will be filtered out.
R6_perc_threshold: 0.90                       # Percentile threshold, 75th percentile is default.  Express percentile as a decimal value.
R6_min.ave.pseudo.bulk.cpm: 1 #cpm filtering threshold
R6_pseudo.bulk.cpm.filter: FALSE              # if TRUE, then apply a cpm filter on the pseudo-bulk counts
R6_min.cells.per.subj: 3                      # Minimum cells required per subject, must be a nonzero number

## publish to celldepot
publish: False

In the first run (see the below section 7.1 scAnalyzer section on how to run the pipeline), the user can set runAnalysis: False and use the above default parameters to perform cell filtering and QC.

Then this pipeline will only run basic QC checking and output a Bookdown report with QC figures. This step is semi-automated because different datasets may need distinct filtering criteria. The design of this pipeline is to pause here to make sure the filtering step is adequate before running through the whole analysis.

For HPC clusters, if you set parallel to be “sge” or “slurm” based on your cluster, the pipeline will automatically submit separate jobs and monitor the run. Finally, it will collect results and summarize them.

Once the filtering step is validated by checking the QC report, the user can change the following settings and run the full analysis:

runAnalysis: True
overwrite: True

6.3 Prepare the sample meta file

The sample meta file, usually named as sampleMeta.csv, is a csv file storing data information. The minimal columns for this file are Sample_Name (This name can be changed, but it must be consistent with the sample_name in the config.yml file) and h5path. If you have the cell type annotation information, you can provide is by adding a column called metapath, which is an optional column. For other information, you can add as many columns in this csv file and the column names can be defined by yourself.

Another usage of the metapath column in sampleMeta.csv is to subset the cells. The pipeline will only use the cells included in the metapath file to run analysis. Thus, you can adjust this information by selecting only a subset of barcodes.

6.3.1 h5 input

Here is an example with minimal information for the public dataset E-MTAB-11115 (10X h5 format):

Sample_Name,h5path
5705STDY8058280,~/E-MTAB-11115/data/5705STDY8058280.filtered_feature_bc_matrix.h5
5705STDY8058281,~/E-MTAB-11115/data/5705STDY8058281.filtered_feature_bc_matrix.h5
5705STDY8058282,~/E-MTAB-11115/data/5705STDY8058282.filtered_feature_bc_matrix.h5
5705STDY8058283,~/E-MTAB-11115/data/5705STDY8058283.filtered_feature_bc_matrix.h5
5705STDY8058284,~/E-MTAB-11115/data/5705STDY8058284.filtered_feature_bc_matrix.h5
5705STDY8058285,~/E-MTAB-11115/data/5705STDY8058285.filtered_feature_bc_matrix.h5

Another example if you would like to use the cell type annotation defined by their analysis, which can be included in the metapath column. Moreover, the user has the freedom to provide further information such as sex and age:

Sample_Name,h5path,metapath,sex,age
5705STDY8058280,~/E-MTAB-11115/data/5705STDY8058280_filtered_feature_bc_matrix.h5,~/E-MTAB-11115/data/5705STDY8058280_annotation.csv,Female,56d
5705STDY8058281,~/E-MTAB-11115/data/5705STDY8058281_filtered_feature_bc_matrix.h5,~/E-MTAB-11115/data/5705STDY8058281_annotation.csv,Female,56d
5705STDY8058282,~/E-MTAB-11115/data/5705STDY8058282_filtered_feature_bc_matrix.h5,~/E-MTAB-11115/data/5705STDY8058282_annotation.csv,Female,56d
5705STDY8058283,~/E-MTAB-11115/data/5705STDY8058283_filtered_feature_bc_matrix.h5,~/E-MTAB-11115/data/5705STDY8058283_annotation.csv,Male,56d
5705STDY8058284,~/E-MTAB-11115/data/5705STDY8058284_filtered_feature_bc_matrix.h5,~/E-MTAB-11115/data/5705STDY8058284_annotation.csv,Male,56d
5705STDY8058285,~/E-MTAB-11115/data/5705STDY8058285_filtered_feature_bc_matrix.h5,~/E-MTAB-11115/data/5705STDY8058285_annotation.csv,Male,56d

#A brief look at the annotation file used in the metapath column:
$ head -3 5705STDY8058280_annotation.csv
Cell.ID,sample,annotation_1,annotation_1_print
AAACCCAAGGAAGTAG-1,5705STDY8058280,Ext_L25,23_Ext_L25
AAACCCAAGGGCAGTT-1,5705STDY8058280,Ext_L56,24_Ext_L56

6.3.2 MEX input

Since each data has three MEX files (barcodes.tsv.gz, features.tsv.gz and matrix.mtx.gz), here we provide the path to each MEX result directory. The minimal information in the sample meta file:

Sample_Name,h5path
FCtr,~/10XGenomicsData/GSM5617891_snRNA_FCtr
FEcig,~/10XGenomicsData/GSM5617892_snRNA_FEcig
MCtr,~/10XGenomicsData/GSM5617893_snRNA_MCtr
MEcig,~/10XGenomicsData/GSM5617894_snRNA_MEcig

Another example with additional information:

Sample_Name,h5path,Treatment,Sex
FCtr,~/10XGenomicsData/GSM5617891_snRNA_FCtr,Control,Female
FEcig,~/10XGenomicsData/GSM5617892_snRNA_FEcig,EcTreated,Female
MCtr,~/10XGenomicsData/GSM5617893_snRNA_MCtr,Control,Male
MEcig,~/10XGenomicsData/GSM5617894_snRNA_MEcig,EcTreated,Male

6.4 File hierarchy

Before running the pipeline, please check the file hierarchy. Here are examples for h5 and MEX inputs, and we suggest creating a separate folder (such as processing) to store the config.yml and sampleMeta.csv files. The output files will be generated in the directory specified in config.yml.

For data downloaded in 10X h5 format:

E-MTAB-11115/
    ├── data
        ├── 5705STDY8058280.annotation.csv           (optional)
        ├── 5705STDY8058280.metrics_summary.csv      (optional)
        ├── 5705STDY8058280.raw_feature_bc_matrix.h5
        ├── 5705STDY8058281.annotation.csv           (optional)
        ├── 5705STDY8058281.metrics_summary.csv      (optional)
        ├── 5705STDY8058281.raw_feature_bc_matrix.h5
        ...
    ├── processing
        ├── config.yml
        └── sampleMeta.csv

For data downloaded in 10X MEX format:

GSE185538/
    ├── GSM5617891_snRNA_FCtr
        ├── barcodes.tsv.gz
        ├── features.tsv.gz
        └── matrix.mtx.gz
    ├── GSM5617892_snRNA_FEcig
        ├── barcodes.tsv.gz
        ├── features.tsv.gz
        └── matrix.mtx.gz
    ...
    ├── processing
        ├── config.yml
        └── sampleMeta.csv