Chapter 5 Data preparation
The scRNASequest pipeline is compatible with different single-cell experiment outputs, including the 10X MEX format and the 10X h5 format from Cell Ranger. The user may need to separate the annotation file manually, so that the input matrices have had no cell filtering applied. In this chapter, we walk through how to prepare the data before running the pipeline.
A bit more information about these two formats:
10X MEX format: This format consists of three files: an mtx file, a barcodes file, and a features file. According to the 10X official website, MEX stands for Market Exchange format, and the .mtx file is a plain text file that stores the matrix in MatrixMarket coordinate format, like this:
%%MatrixMarket matrix coordinate integer general
%
154 1 21
...
This line means that the gene on line 154 of the features (genes) file has a UMI count of 21 in the cell with the 1st barcode in the barcodes file.
10X h5 format: This is an HDF5 file that stores the UMI count matrix, as described in the 10X official tutorial here.
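To make the MEX coordinate lookup concrete, here is a minimal sketch that decodes each entry of a tiny, made-up matrix into a gene name, a barcode, and a count. The gene IDs, gene names, and temporary file paths are invented for illustration. The sketch assumes the usual MatrixMarket layout, in which the first non-comment line holds the matrix dimensions (rows, columns, number of entries) and is followed by the coordinate entries.

```shell
set -eu

# Build a tiny, made-up MEX triplet in a temporary directory (illustrative data only).
demo=$(mktemp -d)
printf 'ENSG0001\tGeneA\tGene Expression\nENSG0002\tGeneB\tGene Expression\nENSG0003\tGeneC\tGene Expression\n' > "$demo/features.tsv"
printf 'AAACCCAAGGAAGTAG-1\nAAACCCAAGGGCAGTT-1\n' > "$demo/barcodes.tsv"
cat > "$demo/matrix.mtx" <<'EOF'
%%MatrixMarket matrix coordinate integer general
%
3 2 3
1 1 5
3 1 21
2 2 7
EOF

# Resolve every coordinate entry (row, column, count) to gene name, barcode, count.
decoded=$(awk -F'\t' '
    FILENAME == ARGV[1] { gene[FNR] = $2; next }      # line number -> gene name
    FILENAME == ARGV[2] { barcode[FNR] = $1; next }   # line number -> barcode
    /^%/                { next }                      # skip header and comment lines
    !seen_size++        { next }                      # skip the dimensions line
    { split($0, f, " "); print gene[f[1]], barcode[f[2]], f[3] }
' "$demo/features.tsv" "$demo/barcodes.tsv" "$demo/matrix.mtx")

echo "$decoded"
```

In this toy matrix, the entry `3 1 21` is reported as GeneC in the first barcode with a count of 21, which mirrors how `154 1 21` is read in the example above.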
5.1 Public data in h5 format
Here, we first present an example of the data processing steps using a public EMBL-EBI dataset: E-MTAB-11115. The nine processed zip files were downloaded and unzipped.
There are 6 samples in total, each with a corresponding *raw_feature_bc_matrix.h5 file. These files are required by the pipeline:
#h5 matrix files
-rwxrwxr--. 1 zouyang ngs 165M Oct 25 2021 5705STDY8058280.raw_feature_bc_matrix.h5
-rwxrwxr--. 1 zouyang ngs 165M Oct 25 2021 5705STDY8058281.raw_feature_bc_matrix.h5
-rwxrwxr--. 1 zouyang ngs 156M Oct 25 2021 5705STDY8058282.raw_feature_bc_matrix.h5
-rwxrwxr--. 1 zouyang ngs 162M Oct 25 2021 5705STDY8058283.raw_feature_bc_matrix.h5
-rwxrwxr--. 1 zouyang ngs 149M Oct 25 2021 5705STDY8058284.raw_feature_bc_matrix.h5
-rwxrwxr--. 1 zouyang ngs 177M Oct 25 2021 5705STDY8058285.raw_feature_bc_matrix.h5
This dataset also has a cell type annotation file for each sample. These files are optional for the pipeline, but if you would like to use their cell type labels, it is better to include them in the sample meta file (see section 6.3).
#Annotation files (optional to the pipeline)
-rwxrwx---. 1 zouyang ngs 440K Apr 21 16:59 5705STDY8058280.annotation.csv
-rwxrwx---. 1 zouyang ngs 446K Apr 21 16:59 5705STDY8058281.annotation.csv
-rwxrwx---. 1 zouyang ngs 310K Apr 21 16:59 5705STDY8058282.annotation.csv
-rwxrwx---. 1 zouyang ngs 282K Apr 21 16:59 5705STDY8058283.annotation.csv
-rwxrwx---. 1 zouyang ngs 157K Apr 21 16:59 5705STDY8058284.annotation.csv
-rwxrwx---. 1 zouyang ngs 555K Apr 21 16:59 5705STDY8058285.annotation.csv
#A brief look at the annotation file:
$ head -3 5705STDY8058280.annotation.csv
Cell.ID,sample,annotation_1,annotation_1_print
AAACCCAAGGAAGTAG-1,5705STDY8058280,Ext_L25,23_Ext_L25
AAACCCAAGGGCAGTT-1,5705STDY8058280,Ext_L56,24_Ext_L56
If SampleName.metrics_summary.csv files (QC files generated by Cell Ranger) are available, please place them in the same directory as the h5 files, and the pipeline will use them to generate QC plots. They are not required by the pipeline. The expected file names are:
5705STDY8058280.metrics_summary.csv
5705STDY8058281.metrics_summary.csv
5705STDY8058282.metrics_summary.csv
5705STDY8058283.metrics_summary.csv
5705STDY8058284.metrics_summary.csv
5705STDY8058285.metrics_summary.csv
!!! Important Since the paths of these metrics_summary.csv files are not provided to the pipeline, their SampleName prefix must match the Sample_Name column in the sample meta file (See section 5.2), so that the pipeline can recognize them automatically. You can rename the files, together with the corresponding entries in the Sample_Name column, if you would like to change the sample names and how they appear in the final results. In addition, the separator between the Sample_Name and “metrics_summary.csv” must be “.” so that the pipeline can read the files automatically. This follows the naming convention of Cell Ranger.
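The naming rule can be checked before launching the pipeline. The sketch below is only an illustration: it assumes the sample meta file is a plain CSV (here called sampleMeta.csv, a hypothetical name) whose header contains a Sample_Name column, and it uses empty placeholder files in a temporary directory. A real sample meta file has more columns than shown here.

```shell
set -eu

# Demo setup with empty placeholder files (hypothetical layout, for illustration only).
demo=$(mktemp -d)
touch "$demo/5705STDY8058280.metrics_summary.csv"
touch "$demo/5705STDY8058281_metrics_summary.csv"   # wrong separator on purpose
printf 'Sample_Name\n5705STDY8058280\n5705STDY8058281\n' > "$demo/sampleMeta.csv"

# Locate the Sample_Name column by its header, then print its values.
names=$(awk -F',' '
    NR == 1 {
        for (i = 1; i <= NF; i++) { if ($i == "Sample_Name") col = i }
        next
    }
    col { print $col }
' "$demo/sampleMeta.csv")

# Report whether <Sample_Name>.metrics_summary.csv exists for each sample.
missing=""
for name in $names; do
    if [ -f "$demo/$name.metrics_summary.csv" ]; then
        echo "OK       $name"
    else
        echo "MISSING  $name.metrics_summary.csv"
        missing="$missing $name"
    fi
done
```

The second sample is reported as missing because its file uses “_” rather than “.” as the separator, so the expected file name does not exist.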
5.2 Public data in 10X MEX format
Another popular format for single-cell RNA-seq data is the MEX format. Here, we use a public dataset from NCBI/GEO, GSE185538, to walk through the pipeline setup.
There are 4 single-nucleus RNA-seq samples in total, and all the processed data can be downloaded from the Supplementary file section as a tarball, GSE185538_RAW.tar:
#Untar the file
tar -xvf GSE185538_RAW.tar
$ ls -l #Only show the MTX-related files
-rw-rw-r-- 1 ysun4 compbio 97631 Sep 29 2021 GSM5617891_snRNA_FCtr_barcodes.tsv.gz
-rw-rw-r-- 1 ysun4 compbio 206935 Sep 29 2021 GSM5617891_snRNA_FCtr_features.tsv.gz
-rw-rw-r-- 1 ysun4 compbio 116235040 Sep 29 2021 GSM5617891_snRNA_FCtr_matrix.mtx.gz
-rw-rw-r-- 1 ysun4 compbio 89425 Sep 29 2021 GSM5617892_snRNA_FEcig_barcodes.tsv.gz
-rw-rw-r-- 1 ysun4 compbio 206935 Sep 29 2021 GSM5617892_snRNA_FEcig_features.tsv.gz
-rw-rw-r-- 1 ysun4 compbio 101349173 Sep 29 2021 GSM5617892_snRNA_FEcig_matrix.mtx.gz
-rw-rw-r-- 1 ysun4 compbio 283865 Sep 29 2021 GSM5617893_snRNA_MCtr_barcodes.tsv.gz
-rw-rw-r-- 1 ysun4 compbio 206935 Sep 29 2021 GSM5617893_snRNA_MCtr_features.tsv.gz
-rw-rw-r-- 1 ysun4 compbio 440019486 Sep 29 2021 GSM5617893_snRNA_MCtr_matrix.mtx.gz
-rw-rw-r-- 1 ysun4 compbio 117119 Sep 29 2021 GSM5617894_snRNA_MEcig_barcodes.tsv.gz
-rw-rw-r-- 1 ysun4 compbio 229853 Sep 29 2021 GSM5617894_snRNA_MEcig_features.tsv.gz
-rw-rw-r-- 1 ysun4 compbio 159326641 Sep 29 2021 GSM5617894_snRNA_MEcig_matrix.mtx.gz
Next, these files need to be organized into separate folders for the pipeline to read. Specifically, we need to create a separate folder for each sample and rename its three files to barcodes.tsv.gz, features.tsv.gz, and matrix.mtx.gz (uncompressed files can also be used to run the pipeline, and the pipeline will compress them). The organized file hierarchy is shown below:
GSE185538/
├── GSM5617891_snRNA_FCtr
│   ├── barcodes.tsv.gz
│   ├── features.tsv.gz
│   └── matrix.mtx.gz
├── GSM5617892_snRNA_FEcig
│   ├── barcodes.tsv.gz
│   ├── features.tsv.gz
│   └── matrix.mtx.gz
├── GSM5617893_snRNA_MCtr
│   ├── barcodes.tsv.gz
│   ├── features.tsv.gz
│   └── matrix.mtx.gz
└── GSM5617894_snRNA_MEcig
    ├── barcodes.tsv.gz
    ├── features.tsv.gz
    └── matrix.mtx.gz
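The reorganization into per-sample folders can be scripted. The following is a minimal sketch, not part of scRNASequest itself: it creates empty placeholder files named like the GEO download in a temporary directory, then moves each triplet into GSE185538/<sample>/ under the three standard names. On real data, skip the setup step and run the loop in the directory where the tarball was extracted.

```shell
set -eu

# Demo setup: empty placeholder files named like the GEO download (no real data).
work=$(mktemp -d)
cd "$work"
for s in GSM5617891_snRNA_FCtr GSM5617892_snRNA_FEcig; do
    touch "${s}_barcodes.tsv.gz" "${s}_features.tsv.gz" "${s}_matrix.mtx.gz"
done

# Move each <sample>_* triplet into GSE185538/<sample>/ under the standard names.
for mtx in *_matrix.mtx.gz; do
    sample=${mtx%_matrix.mtx.gz}          # strip the suffix to get the sample name
    mkdir -p "GSE185538/$sample"
    mv "${sample}_barcodes.tsv.gz" "GSE185538/$sample/barcodes.tsv.gz"
    mv "${sample}_features.tsv.gz" "GSE185538/$sample/features.tsv.gz"
    mv "${sample}_matrix.mtx.gz"   "GSE185538/$sample/matrix.mtx.gz"
done

# List the resulting layout.
find GSE185538 -type f | sort
```

Deriving the folder name from the matrix file name keeps the sample names identical to the GEO prefixes, so they can be reused as-is in the sample meta file.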
5.3 Self-prepared files
If you have raw data in FASTQ format, please process it with the Cell Ranger pipeline to generate the raw_feature_bc_matrix.h5 and metrics_summary.csv files. Please see the “Output files” section of the Cell Ranger website here for more details about the outputs.
In short, Cell Ranger outputs filtered_feature_bc_matrix.h5 (cells after filtering; recommended), raw_feature_bc_matrix.h5 (without filtering), and two folders with MEX format output: filtered_feature_bc_matrix and raw_feature_bc_matrix. Cell Ranger also generates a metrics_summary.csv file, which can be provided to scRNASequest. Please reorganize and, if necessary, rename the Cell Ranger output files into the following file structures before running scRNASequest.
Below is the file hierarchy for h5 input files (including the metrics_summary.csv files is suggested but not required; if cell type annotation.csv files are available, it is better to include them as well):
Project/
├── Data1.filtered_feature_bc_matrix.h5
├── Data1.annotation.csv (optional)
├── Data1.metrics_summary.csv (optional)
├── Data2.filtered_feature_bc_matrix.h5
├── Data2.annotation.csv (optional)
├── Data2.metrics_summary.csv (optional)
...
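Collecting the Cell Ranger outputs into this layout can be sketched as follows. The runs/<sample>/outs directory layout, the sample names Data1 and Data2, and the empty placeholder files are assumptions made for the demonstration; adapt the paths to wherever your Cell Ranger runs were written.

```shell
set -eu

# Demo setup: fake Cell Ranger run folders with empty placeholder outputs.
# The runs/<sample>/outs layout and the names Data1/Data2 are assumptions.
work=$(mktemp -d)
for s in Data1 Data2; do
    mkdir -p "$work/runs/$s/outs"
    touch "$work/runs/$s/outs/filtered_feature_bc_matrix.h5"
done
touch "$work/runs/Data1/outs/metrics_summary.csv"   # Data2 has no QC file

# Copy each run's outputs into Project/ with the sample name as prefix.
mkdir -p "$work/Project"
for outs in "$work"/runs/*/outs; do
    sample=$(basename "$(dirname "$outs")")
    cp "$outs/filtered_feature_bc_matrix.h5" \
       "$work/Project/$sample.filtered_feature_bc_matrix.h5"
    # metrics_summary.csv is optional, so copy it only when present.
    if [ -f "$outs/metrics_summary.csv" ]; then
        cp "$outs/metrics_summary.csv" "$work/Project/$sample.metrics_summary.csv"
    fi
done

ls "$work/Project"
```

Using “.” between the sample name and the original file name keeps the prefix consistent with the naming rule for metrics_summary.csv files described in section 5.1.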
The file hierarchy for MEX files:
Project/
├── Data1
│   ├── barcodes.tsv.gz
│   ├── features.tsv.gz
│   └── matrix.mtx.gz
├── Data2
│   ├── barcodes.tsv.gz
│   ├── features.tsv.gz
│   └── matrix.mtx.gz
...