Chapter 10 CellDepot publishing

CellDepot is a comprehensive data management platform for single-cell RNA-seq datasets. Publishing to CellDepot allows easy navigation and visualization of the data, and facilitates big data deposition.

If you have already run the scRNASequest pipeline using scAnalyzer, you can add the resulting h5ad file to CellDepot directly by following ‘4.6 Import projects’ and ‘4.7 Create project’ in the full instruction here. In this case, you won’t need to run sc2celldepot. However, running the full scAnalyzer pipeline just to obtain h5ad files can be cumbersome. Sometimes you have a public dataset (such as an RDS file, or raw MEX/h5 UMI count files with UMAP coordinates) that already includes cell annotation information, and you only need to visualize it rather than run through the whole analysis.

For this purpose, the scRNASequest pipeline offers sc2celldepot, a function for publishing a dataset to CellDepot. This workflow uses the existing cell-level information (cell type, cluster ID, etc.) to create an h5ad file, which can then be uploaded to the CellDepot platform. In sum, there are two use cases for this pipeline: 1) If your input is an RDS file, you can view this pipeline as a converter from RDS to h5ad. 2) If you have raw UMI counts together with annotated cell barcode information (e.g. the cell type of each barcode) and, optionally, UMAP embedding coordinates, you can use this pipeline to assemble them into an h5ad output.

Running the command without arguments prints its usage message:

$ sc2celldepot

***** 2023-01-26 15:25:32 *****
###########
## ExpressionAnalysis: https://github.com/interactivereport/scRNAsequest.git
## Pipeline Path: /mnt/depts/dept04/compbio/edge_tools/scRNAsequest
## Pipeline Date: 2023-01-13 17:41:49 -0500
## git HEAD: 3d463e0b127af499942b7adc2fc5af6ddfc6f11e
###########

Loading resources

sc2celldepot /path/to/a/output/folder === or === sc2celldepot /path/to/a/config/file

Please create the folder before running sc2celldepot.
The data config file will be generated automatically when a path is provided

Powered by the Research Data Sciences Group [zhengyu.ouyang@biogen.com;yuhenry.sun@biogen.com]
------------

10.1 Initialize sc2celldepot

Running sc2celldepot with a working directory as its argument initiates the project and generates a config file template. The config file template can be found here: template.

First, we create a new directory under the project. Then we run the sc2celldepot command:

mkdir ~/E-MTAB-11115/CellDepot_publish
sc2celldepot ~/E-MTAB-11115/CellDepot_publish

E-MTAB-11115/
    ├── E-MTAB-11115.rds
    └── CellDepot_publish
        └── sc2celldepot.yml

Here is an example of the sc2celldepot.yml configuration file after it has been filled in.

In this simple example, we only provide a prefix and an RDS file path in the yml file. The pipeline will read the RDS file, including its metadata, and convert it to an h5ad file.

## The config file to process public sc/sn RNAseq and generate h5ad for celldepot
output: ~/E-MTAB-11115/CellDepot_publish
prefix: E-MTAB-11115                                        # the prefix of the file name of the h5ad

# seurat RDS is avaiable, otherwise please move to next section
seuratObj: ~/E-MTAB-11115/E-MTAB-11115.rds                  # the full path to the seurat RDS file with SCT & RNA assay along with meta.data and reduction
seuratUMI: RNA                                              # the name of the assay stores raw UMI
seuratSCT: SCT                                              # the name of the assay stores SCT
seuratMeta: []                                              # the list of cell annotations to be stored in h5ad, empty list means all meta.data entry from seurat rds

# Expression when the seurat RDS is not available (row gene/column cell)
# if the annotation files are seperated the same as expression files, they should be the same order, other wise cell ID will be used to match
expression: []                                              # full path the gene expression file/folder (h5/csv/txt/mtx), if multiple files, please provide the list separated by ','
dataUMI: True                                               # if the above expression is UMI, if the value in expression file should be used directly, please set "False"

# cell annotation (cell intersection will be used, first column is the cell ID)
annotation: []                                              # full path to the cell annotation file, first column is the cell ID which should match cell ID in expression
annotationUse: []                                           # the column names in the annotation file to be extracted for h5ad, empty list means all columns
sample_column:                                              # one column header from annotation file, if the one expression file needs to be splited into each sample

# cell layout: tSNE, UMAP, PCA, if separated the same as expression files, should be the same order
reduction:                                                  # optional (if missing UMAP will be created), other keys can be removed or added new ones, keys will be used in h5ad
  files: []                                                 # full path to the cell layout file (contains all layouts of a set of cells), first column is the cell ID which should match cell ID in expression
  umap: []                                                  # column headers from layout file to be used, please use quote for each column header
  tsne: []                                                  # column headers from layout file to be used, please use quote for each column header
  pca: []                                                   # column headers from layout file to be used, please use quote for each column header (can be more than 2 dimentions though only first two will be shown in VIP)

Alternatively, if you don’t have an RDS file, you can provide raw UMI files together with cell type annotation files and, optionally, reduction embeddings (if none are provided, the pipeline generates a UMAP by default).

In the example below, besides the required output and prefix parameters, expression points to several h5 files or MEX folders, and annotation points to csv files with cell type annotation (and other cell-level information). Both are required in this case.

## The config file to process public sc/sn RNAseq and generate h5ad for celldepot
output: ~/E-MTAB-11115/CellDepot_publish
prefix: E-MTAB-11115                                        # the prefix of the file name of the h5ad

# seurat RDS is avaiable, otherwise please move to next section
seuratObj:                                                  # the full path to the seurat RDS file with SCT & RNA assay along with meta.data and reduction
seuratUMI: RNA                                              # the name of the assay stores raw UMI
seuratSCT: SCT                                              # the name of the assay stores SCT
seuratMeta: []                                              # the list of cell annotations to be stored in h5ad, empty list means all meta.data entry from seurat rds

# Expression when the seurat RDS is not available (row gene/column cell)
# if the annotation files are seperated the same as expression files, they should be the same order, other wise cell ID will be used to match
expression: [~/E-MTAB-11115/data/5705STDY8058280_filtered_feature_bc_matrix.h5,~/E-MTAB-11115/data/5705STDY8058281_filtered_feature_bc_matrix.h5,~/E-MTAB-11115/data/5705STDY8058282_filtered_feature_bc_matrix.h5,~/E-MTAB-11115/data/5705STDY8058283_filtered_feature_bc_matrix.h5,~/E-MTAB-11115/data/5705STDY8058284_filtered_feature_bc_matrix.h5,~/E-MTAB-11115/data/5705STDY8058285_filtered_feature_bc_matrix.h5] # full path the gene expression file/folder (h5/csv/txt/mtx), if multiple files, please provide the list separated by ','
dataUMI: True                                               # if the above expression is UMI, if the value in expression file should be used directly, please set "False"

# cell annotation (cell intersection will be used, first column is the cell ID)
annotation: [~/E-MTAB-11115/data/5705STDY8058280_annotation.csv,~/E-MTAB-11115/data/5705STDY8058281_annotation.csv,~/E-MTAB-11115/data/5705STDY8058282_annotation.csv,~/E-MTAB-11115/data/5705STDY8058283_annotation.csv,~/E-MTAB-11115/data/5705STDY8058284_annotation.csv,~/E-MTAB-11115/data/5705STDY8058285_annotation.csv] # full path to the cell annotation file, first column is the cell ID which should match cell ID in expression
annotationUse: []                                           # the column names in the annotation file to be extracted for h5ad, empty list means all columns
sample_column:                                              # one column header from annotation file, if the one expression file needs to be splited into each sample

# cell layout: tSNE, UMAP, PCA, if separated the same as expression files, should be the same order
reduction: #optional (if missing UMAP will be created), other keys can be removed or added new ones, keys will be used in h5ad
  files: []                                                 # full path to the cell layout file (contains all layouts of a set of cells), first column is the cell ID which should match cell ID in expression
  umap: []                                                  # column headers from layout file to be used, please use quote for each column header
  tsne: []                                                  # column headers from layout file to be used, please use quote for each column header
  pca: []                                                   # column headers from layout file to be used, please use quote for each column header (can be more than 2 dimentions though only first two will be shown in VIP)
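Typing long expression and annotation lists by hand makes it easy to get the two lists out of order, which matters because the template comments ask for the same order in both. As a minimal sketch, the two lines can be generated from the sample IDs instead. The helper `yaml_list` is hypothetical, and the file naming pattern is simply the one used in the example above.

```python
# Sample IDs taken from the example config above (5705STDY8058280 ... 5705STDY8058285).
sample_ids = [f"5705STDY80582{n}" for n in range(80, 86)]
data_dir = "~/E-MTAB-11115/data"


def yaml_list(paths):
    """Format paths as the bracketed, comma-separated list used in the config."""
    return "[" + ",".join(paths) + "]"


# Both lists are built from the same sample_ids, so their order always agrees.
expression = [f"{data_dir}/{s}_filtered_feature_bc_matrix.h5" for s in sample_ids]
annotation = [f"{data_dir}/{s}_annotation.csv" for s in sample_ids]

print("expression:", yaml_list(expression))
print("annotation:", yaml_list(annotation))
```

The two printed lines can be pasted over the corresponding lines in sc2celldepot.yml.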

10.2 Run sc2celldepot

After preparing the sc2celldepot.yml file, we are ready to run the pipeline with the following command:

sc2celldepot ~/E-MTAB-11115/CellDepot_publish/sc2celldepot.yml

This workflow generates the output files in the working directory, CellDepot_publish. The prefix of the output file names is determined by the prefix parameter in the sc2celldepot.yml file:

E-MTAB-11115/
    ├── E-MTAB-11115.rds
    └── CellDepot_publish
        ├── E-MTAB-11115.h5ad
        ├── E-MTAB-11115.raw_added.h5ad
        └── sc2celldepot.yml

Finally, copy E-MTAB-11115.h5ad to the CellDepot folder defined in sys.yml (celldepotDir) and follow the instructions for publishing it.
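The copy step can be scripted so that a missing file or folder is reported before anything is moved. This is only a sketch: `copy_to_celldepot` is a hypothetical helper, and the destination folder must be replaced by the celldepotDir value from your own sys.yml. The demonstration uses temporary folders as stand-ins so that it runs anywhere.

```python
import shutil
import tempfile
from pathlib import Path


def copy_to_celldepot(h5ad_path, celldepot_dir):
    """Copy one h5ad file into the CellDepot folder and return the new path."""
    src = Path(h5ad_path).expanduser()
    dest_dir = Path(celldepot_dir).expanduser()
    if not src.is_file():
        raise FileNotFoundError(f"h5ad file not found: {src}")
    if not dest_dir.is_dir():
        raise NotADirectoryError(f"CellDepot folder not found: {dest_dir}")
    return Path(shutil.copy2(src, dest_dir))


# Demonstration with temporary folders standing in for the real locations.
with tempfile.TemporaryDirectory() as tmp:
    work = Path(tmp, "CellDepot_publish")
    depot = Path(tmp, "celldepot")          # stand-in for celldepotDir
    work.mkdir()
    depot.mkdir()
    (work / "E-MTAB-11115.h5ad").write_bytes(b"placeholder")
    copied = copy_to_celldepot(work / "E-MTAB-11115.h5ad", depot)
    copied_ok = copied.is_file()
    print(copied.name, copied_ok)
```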

10.3 Demo sc2celldepot

To demo sc2celldepot, we use the following RDS file (GSE172462.SCT.For_scRef.rds), which we also used to demo scRef. This RDS file contains cell type labels and embedding information. Please download it to the working directory, for example, ~/demo_celldepot.

Then we run the sc2celldepot program by providing a working directory, in our case:

sc2celldepot ~/demo_celldepot

File hierarchy after running the above command:

~/demo_celldepot
    ├── GSE172462.SCT.For_scRef.rds
    └── sc2celldepot.yml

We set up the sc2celldepot.yml in the following way, by just adding information to prefix and seuratObj:

## The config file to process public sc/sn RNAseq and generate h5ad for celldepot
output: ~/demo_celldepot
prefix: demo_sc2celldepot                                   # the prefix of the file name of the h5ad
# seurat RDS is avaiable, otherwise please move to next section
seuratObj: ~/demo_celldepot/GSE172462.SCT.For_scRef.rds     # the full path to the seurat RDS file with SCT & RNA assay along with meta.data and reduction
seuratUMI: RNA                                              # the name of the assay stores raw UMI
seuratSCT: SCT                                              # the name of the assay stores SCT
seuratMeta: []                                              # the list of cell annotations to be stored in h5ad, empty list means all meta.data entry from seurat rds
# Expression when the seurat RDS is not available (row gene/column cell)
# if the annotation files are seperated the same as expression files, they should be the same order, other wise cell ID will be used to match
expression: []                                              # full path the gene expression file/folder (h5/csv/txt/mtx), if multiple files, please provide the list separated by ','
dataUMI: True                                               #if the above expression is UMI, if the value in expression file should be used directly, please set "False"
# cell annotation (cell intersection will be used, first column is the cell ID)
annotation: []                                              # full path to the cell annotation file, first column is the cell ID which should match cell ID in expression
annotationUse: []                                           # the column names in the annotation file to be extracted for h5ad, empty list means all columns
sample_column:                                              # one column header from annotation file, if the one expression file needs to be splited into each sample
# cell layout: tSNE, UMAP, PCA, if separated the same as expression files, should be the same order
reduction: #optional (if missing UMAP will be created), other keys can be removed or added new ones, keys will be used in h5ad
  files: []                                                 # full path to the cell layout file (contains all layouts of a set of cells), first column is the cell ID which should match cell ID in expression
  umap: []                                                  # column headers from layout file to be used, please use quote for each column header
  tsne: []                                                  # column headers from layout file to be used, please use quote for each column header
  pca: []                                                   # column headers from layout file to be used, please use quote for each column header (can be more than 2 dimentions though only first two will be shown in VIP)

Then run the pipeline:

sc2celldepot ~/demo_celldepot/sc2celldepot.yml

The results will be new h5ad files as below:

~/demo_celldepot
    ├── demo_sc2celldepot_rds.h5ad
    ├── demo_sc2celldepot_rds.raw_added.h5ad
    ├── GSE172462.SCT.For_scRef.rds
    ├── sc2celldepot.20230321.log
    └── sc2celldepot.yml

We provide another demo using two h5 data files from our previous demo dataset. In addition, we need two cell-level annotation files, which we have prepared: RatMaleCigarette.annotation.csv and RatFemaleCigarette.annotation.csv.

Take a quick look at the csv files (usually, the cell barcodes listed are only a subset of all barcodes in the h5 file, due to filtering in a previous analysis):

$ head -3 RatMaleCigarette.annotation.csv

,library_id,predicted.celltype
AAACCCACAACTGAAA-1,RatFemaleCigarette,OPC
AAACCCACACTTTATC-1,RatFemaleCigarette,Neuron
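For a slightly fuller check than `head`, an annotation file in this layout can be summarized with Python's standard csv module. The sketch below is illustrative only: `summarize_annotation` is a hypothetical helper, the column name predicted.celltype is taken from the excerpt above, and the input is an in-memory copy of those three lines rather than the real file.

```python
import csv
import io
from collections import Counter


def summarize_annotation(handle, type_column="predicted.celltype"):
    """Return (annotation column names, cell IDs, cell counts per cell type)."""
    reader = csv.reader(handle)
    header = next(reader)
    columns = header[1:]                 # the first header field (cell ID) is unnamed
    type_idx = header.index(type_column)
    cell_ids, counts = [], Counter()
    for row in reader:
        if not row:
            continue                     # skip blank lines
        cell_ids.append(row[0])          # first column is the cell ID
        counts[row[type_idx]] += 1
    return columns, cell_ids, counts


# In-memory copy of the excerpt shown above.
sample = io.StringIO(
    ",library_id,predicted.celltype\n"
    "AAACCCACAACTGAAA-1,RatFemaleCigarette,OPC\n"
    "AAACCCACACTTTATC-1,RatFemaleCigarette,Neuron\n"
)
columns, cell_ids, counts = summarize_annotation(sample)
print(columns)
print(len(cell_ids), "cells;", dict(counts))
```

To run it on a real file, pass an open file handle, for example `open("RatMaleCigarette.annotation.csv", newline="")`, in place of `sample`.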

Please place these four files (2 h5 files, 2 csv files) in the working directory (in this demo: ~/demo_celldepot_2), then run the script:

sc2celldepot ~/demo_celldepot_2

The above run generated sc2celldepot.yml, and the current file hierarchy is:

~/demo_celldepot_2
    ├── RatFemaleCigarette.annotation.csv
    ├── RatFemaleCigarette.filtered_feature_bc_matrix.h5
    ├── RatMaleCigarette.annotation.csv
    ├── RatMaleCigarette.filtered_feature_bc_matrix.h5
    └── sc2celldepot.yml

Then we set up the sc2celldepot.yml file in this way:

## The config file to process public sc/sn RNAseq and generate h5ad for celldepot
output: ~/demo_celldepot_2
prefix: demo_sc2celldepot_2                                 # the prefix of the file name of the h5ad
# seurat RDS is avaiable, otherwise please move to next section
seuratObj:                                                  # the full path to the seurat RDS file with SCT & RNA assay along with meta.data and reduction
seuratUMI: RNA                                              # the name of the assay stores raw UMI
seuratSCT: SCT                                              # the name of the assay stores SCT
seuratMeta: []                                              # the list of cell annotations to be stored in h5ad, empty list means all meta.data entry from seurat rds
# Expression when the seurat RDS is not available (row gene/column cell)
# if the annotation files are seperated the same as expression files, they should be the same order, other wise cell ID will be used to match
expression: [~/demo_celldepot_2/RatMaleCigarette.filtered_feature_bc_matrix.h5,~/demo_celldepot_2/RatFemaleCigarette.filtered_feature_bc_matrix.h5]
                                                            # full path the gene expression file/folder (h5/csv/txt/mtx), if multiple files, please provide the list separated by ','
dataUMI: True                                               #if the above expression is UMI, if the value in expression file should be used directly, please set "False"
# cell annotation (cell intersection will be used, first column is the cell ID)
annotation: [~/demo_celldepot_2/RatMaleCigarette.annotation.csv,~/demo_celldepot_2/RatFemaleCigarette.annotation.csv]
                                                            # full path to the cell annotation file, first column is the cell ID which should match cell ID in expression
annotationUse: []                                           # the column names in the annotation file to be extracted for h5ad, empty list means all columns
sample_column:                                              # one column header from annotation file, if the one expression file needs to be splited into each sample

# cell layout: tSNE, UMAP, PCA, if separated the same as expression files, should be the same order
reduction: #optional (if missing UMAP will be created), other keys can be removed or added new ones, keys will be used in h5ad
  files: []                                                 # full path to the cell layout file (contains all layouts of a set of cells), first column is the cell ID which should match cell ID in expression
  umap: []                                                  # column headers from layout file to be used, please use quote for each column header
  tsne: []                                                  # column headers from layout file to be used, please use quote for each column header
  pca: []                                                   # column headers from layout file to be used, please use quote for each column header (can be more than 2 dimentions though only first two will be shown in VIP)

With this configuration, the pipeline processes the h5 files by running normalization, and matches the labeled cell types to the UMAP.
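Because the template comments state that separate annotation files should be listed in the same order as the expression files, a quick positional check of the two lists can catch a swapped pair before the run. This is a sketch under a naming assumption: the sample label is taken to be the file name up to the first dot, which holds for the demo files but not necessarily for other datasets. Both helper functions are hypothetical.

```python
from pathlib import PurePosixPath


def sample_name(path):
    """Sample label = file name up to the first dot (naming assumption of this demo)."""
    return PurePosixPath(path).name.split(".", 1)[0]


def check_pairing(expression, annotation):
    """Return the positions where the i-th expression and annotation files disagree."""
    if len(expression) != len(annotation):
        raise ValueError("expression and annotation lists differ in length")
    return [i for i, (e, a) in enumerate(zip(expression, annotation))
            if sample_name(e) != sample_name(a)]


expression = ["~/demo_celldepot_2/RatMaleCigarette.filtered_feature_bc_matrix.h5",
              "~/demo_celldepot_2/RatFemaleCigarette.filtered_feature_bc_matrix.h5"]
annotation = ["~/demo_celldepot_2/RatMaleCigarette.annotation.csv",
              "~/demo_celldepot_2/RatFemaleCigarette.annotation.csv"]

print(check_pairing(expression, annotation))        # an empty list means the order agrees
print(check_pairing(expression, annotation[::-1]))  # a swapped order is flagged
```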

Files after running the pipeline:

~/demo_celldepot_2
    ├── demo_sc2celldepot.h5ad
    ├── demo_sc2celldepot.raw_added.h5ad
    ├── demo_sc2celldepot.rds
    ├── RatFemaleCigarette.annotation.csv
    ├── RatFemaleCigarette.filtered_feature_bc_matrix.h5
    ├── RatMaleCigarette.annotation.csv
    ├── RatMaleCigarette.filtered_feature_bc_matrix.h5
    └── sc2celldepot.yml