Chapter 9 Reference building

Reference building is critical for label transfer. Before a reference dataset can be added to the scRNASequest pipeline, its expression matrix must be SCTransformed.

By running the following command, we can see the manual page of the reference generator, scRef:

$ scRef

***** 2023-01-26 15:29:54 *****
###########
## scRNAsequest: https://github.com/interactivereport/scRNAsequest.git
## Pipeline Path: /mnt/depts/dept04/compbio/edge_tools/scRNAsequest
## Pipeline Date: 2023-01-13 17:41:49 -0500
## git HEAD: 3d463e0b127af499942b7adc2fc5af6ddfc6f11e
###########

Loading resources

scRef /path/to/a/output/folder === or === scRef /path/to/a/Ref/config/file

The folder has to be existed.
The Ref config file will be generated automatically when a path is provided
===== CAUTION =====
    1. This process will add a seurat reference data into the scRNAsequest pipeline PERMANENTLY!
    2. Make sure the data provided for reference building is SCT transformed!

Powered by the Research Data Sciences group [zhengyu.ouyang@biogen.com;kejie.li@biogen.com]
------------

9.1 Initialize scRef

This pipeline can be initialized using an empty directory. For example, we first create a directory called ‘Reference_data’, then initialize the pipeline by pointing scRef to this directory:

scRef /path/to/the/directory

#Example:
scRef ~/Reference_data
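
Because scRef requires the folder to exist before initialization, it can be convenient to create it in the same step. The following is a minimal sketch assuming a bash shell; init_ref_dir is a hypothetical helper name, not part of scRNASequest, and it only calls scRef when the command is found on the PATH.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of scRNASequest): make sure the folder exists,
# then initialize scRef in it if the scRef command is available.
init_ref_dir() {
    local dir="$1"
    # scRef expects an existing folder, so create it (and any parents) first
    mkdir -p "$dir" || return 1
    if command -v scRef >/dev/null 2>&1; then
        scRef "$dir"
    else
        echo "scRef not found on PATH; created $dir only" >&2
    fi
}

# Usage:
# init_ref_dir "$HOME/Reference_data"
```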

After the run, a config file (refConfig.yml) and a log file are generated in the directory. The refConfig.yml file is a template for the subsequent scRef run and passes critical parameters to the pipeline. The generated file matches the template shown here, except that output points to your own directory. Here is an example after the configuration file has been filled in:

output: ~/Reference_data
# the following is normally located in the same folder of celldepot hosting h5ad files
ref_h5ad_raw: /path/to/ProjectName_raw_added.h5ad # full path to the h5ad file contains raw UMI along with cell annotation and layout
ref_batch: library_id
# above two parameters are ignored, if a seurat object can be located in the project folder
ref_rds:                                          # full path to the processed seurat object including SCT assay, cell annotation and layout

# All information below are required
ref_name: Reference_data                # Please provide a unique name prefer to include species and tissue (check existed by calling scAnalyzer without argument)
ref_link:                               # The web link to the information of this reference. For scAnalyzer processed data, you could provide a Cellxgene VIP link here.
ref_src: sn                             # sc/"single cell" or sn/"single nuclei"
ref_platform: 10X                       # Single cell/nuclei technology e.g. 10X, SNARE-seq2, dropSeq, ...
#list a reduction to be used (at least 50 dimensions full name from VIP or one from seurat 'reductions' ) 
# details: https://github.com/satijalab/azimuth/wiki/Azimuth-Reference-Format
# this reduction is NOT directly used, but used to find the neighbors which is then used for computing sPCA reduction which is used in the reference
# For instance, if harmony was the preferred layout, provide either 'harmony' (50 dimensions) or 'harmony-PCA'
ref_reduction: pca 
ref_label: [predicted.celltype1]        # List the annotations (case sensitive) to be used for transferring. Please check this in the data. For our case, the header is called 'predicted.celltype1'
                                        # The cell type label/header name you would like to transfer
publish: False                          # Should this reference be published (added permanently) into scAnalyzer
overwrite: False                        # Overwrite the existing scAnalyzer reference

For the input data, scRef can take either an h5ad file containing raw UMI counts and annotation information, or an R data file in RDS format. If you have finished running a dataset through scAnalyzer, you can directly use ProjectName_raw_added.h5ad (not ProjectName.h5ad) as input by assigning its path to ref_h5ad_raw. Alternatively, you can provide an RDS file using ref_rds. Providing either ref_h5ad_raw or ref_rds is sufficient; as noted in the template, ref_h5ad_raw and ref_batch are ignored when a Seurat object is available.
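
To confirm that at least one of the two inputs is filled in, a small shell check along the following lines may help. This is only a sketch: yaml_value and has_ref_input are hypothetical helpers, not part of scRNASequest, and they assume the flat "key: value  # comment" layout of the refConfig.yml template rather than general YAML.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for a flat "key: value  # comment" config such as refConfig.yml.

# Print the value of a top-level key, with any trailing comment removed.
yaml_value() {
    local key="$1" file="$2"
    sed -n "s/^${key}:[[:space:]]*//p" "$file" | head -n 1 |
        sed -e 's/^#.*$//' -e 's/[[:space:]][[:space:]]*#.*$//' -e 's/[[:space:]]*$//'
}

# Succeed if at least one input (ref_rds or ref_h5ad_raw) is filled in.
# ref_rds is reported first because the template says the h5ad settings
# are ignored when a Seurat object is available.
has_ref_input() {
    local file="$1" rds h5ad
    rds=$(yaml_value ref_rds "$file")
    h5ad=$(yaml_value ref_h5ad_raw "$file")
    if [ -n "$rds" ]; then
        echo "input: ref_rds=$rds"
    elif [ -n "$h5ad" ]; then
        echo "input: ref_h5ad_raw=$h5ad"
    else
        echo "neither ref_rds nor ref_h5ad_raw is set in $file" >&2
        return 1
    fi
}

# Usage:
# has_ref_input "$HOME/Reference_data/refConfig.yml"
```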

The ref_label parameter is critical for label transfer if you would like to use this data as a reference in the future. It tells the program which headers (one or more) contain the cell type information to use for label transfer. For an h5ad file, the easiest way is to open it in Cellxgene VIP and identify the header that contains the cell type labels, e.g. predicted.celltype1 in our case. For an RDS file, you can read it into R with the readRDS function and identify the column names that contain the cell type annotation.

For an RDS file passed through ref_rds, please make sure it has been SCTransformed and includes UMAP embeddings based on the SCT values. Here is some code to prepare it:

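library(Seurat)

# Load the annotated object; its meta.data table already contains the cell type labels
Data <- readRDS("Annotated.rds")
# Run SCTransform and return values for all genes, not only the variable ones
Data <- SCTransform(Data, return.only.var.genes = FALSE)
# PCA on the SCT assay (RunPCA computes 50 PCs by default)
Data <- RunPCA(Data, verbose = FALSE)
# UMAP embedding, neighbor graph and clusters from the first 30 PCs
Data <- RunUMAP(Data, dims = 1:30, verbose = FALSE)
Data <- FindNeighbors(Data, dims = 1:30, verbose = FALSE)
Data <- FindClusters(Data, verbose = FALSE)

# Save the prepared object; ref_rds should point to this file
saveRDS(Data, file = "Annotated.ForRuningscRef.rds")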
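
Before passing the saved file to scRef, it may be worth confirming that it really contains an SCT assay, the chosen reduction, and the annotation columns. The sketch below wraps a few Seurat calls (Assays, Reductions) in a shell function. check_seurat_rds is a hypothetical helper, not part of scRNASequest, and it assumes that Rscript and the Seurat package are installed.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report whether a Seurat RDS file looks ready for scRef.
check_seurat_rds() {
    local rds="$1"
    if [ ! -f "$rds" ]; then
        echo "file not found: $rds" >&2
        return 1
    fi
    if ! command -v Rscript >/dev/null 2>&1; then
        echo "Rscript not found on PATH; cannot inspect $rds" >&2
        return 2
    fi
    # The R code lists assays, reductions and metadata columns,
    # and stops with an error if there is no SCT assay.
    Rscript -e '
        suppressPackageStartupMessages(library(Seurat))
        obj <- readRDS(commandArgs(trailingOnly = TRUE)[1])
        cat("assays:", Assays(obj), "\n")
        cat("reductions:", Reductions(obj), "\n")
        cat("metadata columns:", colnames(obj@meta.data), "\n")
        if (!"SCT" %in% Assays(obj)) stop("no SCT assay found")
    ' "$rds"
}

# Usage:
# check_seurat_rds Annotated.ForRuningscRef.rds
```

The printed reductions and metadata columns can then be compared by eye against the ref_reduction and ref_label entries in refConfig.yml.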

9.2 Submit scRef

After filling in the information in the refConfig.yml file, we are ready to submit the full pipeline and build the reference data for label transfer:

scRef /path/to/a/Ref/config/file

#Example:
scRef ~/Reference_data/refConfig.yml
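
Since the template marks the lower block of the config as required, a quick check for empty entries before submission can save a failed run. The following is a sketch only: missing_required is a hypothetical helper, the list of keys is our reading of the template (ref_link is left out because the example above leaves it empty), and the parsing assumes the flat "key: value  # comment" layout with no "#" inside values.

```shell
#!/usr/bin/env bash
# Hypothetical pre-submission check (not part of scRNASequest): list the
# required keys of a flat refConfig.yml that are still empty.
missing_required() {
    local file="$1" key value missing=""
    for key in ref_name ref_src ref_platform ref_reduction ref_label; do
        # keep the text after "key:", then drop a trailing "# comment"
        value=$(awk -v k="$key" '
            index($0, k ":") == 1 {
                v = substr($0, length(k) + 2)
                sub(/[ \t]*#.*$/, "", v)        # assumes values contain no "#"
                gsub(/^[ \t]+|[ \t]+$/, "", v)  # trim surrounding blanks
                print v
                exit
            }' "$file")
        [ -n "$value" ] || missing="$missing $key"
    done
    # print the empty keys (if any); succeed only when none are missing
    echo "${missing# }"
    [ -z "$missing" ]
}

# Usage:
# missing_required "$HOME/Reference_data/refConfig.yml" && scRef "$HOME/Reference_data/refConfig.yml"
```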

Output files:

Reference_data
    ├── init_20220630.log
    ├── refConfig.yml
    ├── refConfig.yml.20220630.log               # Output log information
    ├── Reference_data_for_scAnalyzer.rds        # The data to use for label transfer and downstream analysis
    ├── ref_notFor_scAnalyzer.rds                # Output rds file NOT for label transfer
    ...

Please also check the information in the refConfig.yml.20220630.log file, and pay attention to the last few lines:

...  #log of the running process omitted

The private reference could be used by provide the following full path to 'ref_name' in scAnalyzer config file:
    ~/Reference_data/Reference_data_for_scAnalyzer.rds
...

This indicates that the reference has been successfully generated and that it can be passed to scAnalyzer through the config.yml file. If you set “publish: True” in refConfig.yml, the reference is added to scAnalyzer, and you can refer to it by name when running scAnalyzer.
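
The check of the last log lines can also be scripted. The sketch below assumes that the reference path is printed on the line directly after the "The private reference could be used" message, as in the excerpt above; ref_path_from_log is a hypothetical helper, not part of scRNASequest.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the reference path reported at the end of a scRef
# log, or fail if the success message is absent.
ref_path_from_log() {
    local log="$1"
    awk '
        # the line after the success message holds the path: trim and print it
        found { gsub(/^[ \t]+|[ \t]+$/, ""); print; ok = 1; exit }
        /The private reference could be used/ { found = 1 }
        END { exit (ok ? 0 : 1) }
    ' "$log"
}

# Usage:
# ref_path_from_log "$HOME/Reference_data/refConfig.yml.20220630.log"
```

A non-zero exit status means the message was not found, in which case the full log should be read for errors.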

9.3 Demo scRef

In this demo, we use a previously processed RDS file that carries the cell type annotation from a public project, GSE172462. The RDS file is GSE172462.SCT.For_scRef.rds.

We first run scRef on our working directory to generate the necessary template files:

scRef ~/demo_scRef

Then we fill in the refConfig.yml file:

# please check wiki page for the details
output: ~/demo_scRef
# the following is normally located in the same folder of celldepot hosting h5ad files
ref_h5ad_raw:                                       # full path to the h5ad file contains raw UMI along with cell annotation and layout
ref_batch: library_id
# above two parameters are ignored, if a seurat object can be located in the project folder
ref_rds: ~/demo_scRef/GSE172462.SCT.For_scRef.rds   #full path to the processed seurat object including SCT assay, cell annotation and layout

# All information below are required
ref_name: demo                                      # please provide a unique name prefer to include species and tissue (check existed by calling scAnalyzer without argument)
ref_link: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE172462 # The web link to the information of this reference
ref_src: single nuclei # sc/"single cell" or sn/"single nuclei"
ref_platform: 10X # which single cell/nuclei technology e.g. 10X, SNARE-seq2, dropSeq, ...
#list a reduction to be used (at least 50 dimensions full name from VIP or one from seurat 'reductions' ) 
# details: https://github.com/satijalab/azimuth/wiki/Azimuth-Reference-Format
# this reduction is NOT directly used, but used to find the neighbors which is then used for computing sPCA reduction which is used in the reference
# For instance, if harmony was the preferred layout, provide either 'harmony' (50 dimensions) or 'harmony-PCA'
ref_reduction: pca 
ref_label: [celltype, neuron_subtype, manual_anno]  # list the annotations (case sensitive) to be used for transferring
publish: False                                      # should this reference be published into scAnalyzer
overwrite: False                                    # overwrite the existing scAnalyzer reference

In this demo, publish and overwrite are both turned off because we only want to test scRef. The input is a reduced dataset (only 1000 cells) intended for testing, so it is not suitable as a real reference in the pipeline.

Finally, we can run scRef:

scRef ~/demo_scRef/refConfig.yml

This pipeline will finish in ~1 minute and the output file will be:

demo_for_scAnalyzer.rds