module pp

Global Variables

  • get_mean_var_major_kernel
  • get_mean_var_minor_kernel
  • find_indices_kernel

class ScaleSC

ScaleSC integrated pipeline in a scanpy-like style.

It automatically loads the dataset in chunks (see scalesc.util.AnnDataBatchReader for details), and all methods in this class operate on this chunked data.

Args:

  • data_dir (str): Data folder of the dataset.
  • max_cell_batch (int): Maximum number of cells in a single batch. Default: 100000.
  • preload_on_cpu (bool): Whether to load the entire chunked data on CPU. Default: True.
  • preload_on_gpu (bool): Whether to load the entire chunked data on GPU. When this is set to True, preload_on_cpu is overridden to True. Default: True.
  • save_raw_counts (bool): Whether to save adata_X to disk after QC filtering. Default: False.
  • save_norm_counts (bool): Whether to save adata_X to disk after normalization. Default: False.
  • save_after_each_step (bool): Whether to save adata (without .X) to disk after each step. Default: False.
  • output_dir (str): Output folder. Default: 'results'.
  • gpus (list): List of GPU ids; [0] is used if this is None. Default: None.
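
To make the role of max_cell_batch concrete, here is a minimal sketch of how a dataset could be split into row ranges that never exceed the batch limit. The helper batch_ranges is hypothetical and is not part of ScaleSC; the actual splitting logic lives in scalesc.util.AnnDataBatchReader and may differ.

```python
def batch_ranges(n_cells, max_cell_batch=100000):
    """Yield (start, stop) row ranges, each covering at most max_cell_batch cells."""
    max_cell_batch = int(max_cell_batch)  # the documented default may arrive as a float
    if max_cell_batch <= 0:
        raise ValueError("max_cell_batch must be positive")
    for start in range(0, n_cells, max_cell_batch):
        yield start, min(start + max_cell_batch, n_cells)


# 250,000 cells with a 100,000-cell limit give two full batches and one partial batch.
ranges = list(batch_ranges(250_000, max_cell_batch=100_000))
print(ranges)
```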

method __init__

__init__(
    data_dir,
    max_cell_batch=100000.0,
    preload_on_cpu=True,
    preload_on_gpu=True,
    save_raw_counts=False,
    save_norm_counts=False,
    save_after_each_step=False,
    output_dir='results',
    gpus=None
)
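
The methods documented for this class can be chained into a scanpy-like workflow. The sketch below is one plausible ordering, not an official recipe: the import path scalesc.pp is inferred from the module name, the function run_pipeline is hypothetical, and the filtering thresholds are illustrative values only. Highly variable genes are selected before normalization because seurat_v3 expects count data.

```python
def run_pipeline(data_dir, sample_col_name, output_dir="results"):
    """Chain the documented ScaleSC steps; argument values are illustrative only."""
    # Imported lazily so that this sketch can be defined without a GPU environment.
    from scalesc.pp import ScaleSC  # assumed import path: module `pp` in package `scalesc`

    ssc = ScaleSC(
        data_dir=data_dir,
        max_cell_batch=100000,
        output_dir=output_dir,
        gpus=[0],
    )
    ssc.calculate_qc_metrics()
    # Hypothetical thresholds; choose values appropriate for your dataset.
    ssc.filter_genes_and_cells(min_counts_per_gene=3, min_counts_per_cell=200)
    ssc.highly_variable_genes(n_top_genes=4000, method="seurat_v3")  # needs raw counts
    ssc.normalize_log1p(target_sum=1e4)
    ssc.pca(n_components=50, hvg_var="highly_variable")
    ssc.harmony(sample_col_name=sample_col_name)
    ssc.neighbors(n_neighbors=20, n_pcs=50, use_rep="X_pca_harmony")
    ssc.leiden(resolution=0.5)
    ssc.umap()
    ssc.save()
    return ssc.adata
```

Calling run_pipeline requires a working ScaleSC installation and a GPU; defining it does not.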

property adata

AnnData: An AnnData object that stores all intermediate results, without the count matrix.

Note: This is always on CPU.


property adata_X

AnnData: An AnnData object that stores all intermediate results, including the count matrix. Internally, all chunks are merged on CPU to avoid high GPU memory consumption, so make sure to call to_CPU() before accessing this property.


method calculate_qc_metrics

calculate_qc_metrics()

Calculate quality control metrics.
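
The two metric names used as defaults elsewhere in this class, n_genes_by_counts and n_cells_by_counts, follow scanpy's naming: the number of genes detected in each cell and the number of cells in which each gene is detected. A minimal pure-Python sketch of accumulating both over row chunks in a single pass is shown here. The helper qc_metrics is hypothetical, and the real method runs on GPU and may compute additional metrics.

```python
def qc_metrics(chunks, n_genes):
    """Accumulate per-cell and per-gene detection counts over row chunks."""
    n_genes_by_counts = []               # one entry per cell
    n_cells_by_counts = [0] * n_genes    # one entry per gene
    for chunk in chunks:                 # each chunk is a list of cells (rows of raw counts)
        for row in chunk:
            detected = [j for j, count in enumerate(row) if count > 0]
            n_genes_by_counts.append(len(detected))
            for j in detected:
                n_cells_by_counts[j] += 1
    return n_genes_by_counts, n_cells_by_counts


chunks = [
    [[0, 2, 1], [5, 0, 0]],  # chunk 0: two cells, three genes
    [[0, 0, 3]],             # chunk 1: one cell
]
per_cell, per_gene = qc_metrics(chunks, n_genes=3)
print(per_cell, per_gene)
```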


method clear

clear()

Clear the memory.


method filter_cells

filter_cells(min_count=0, max_count=None, qc_var='n_genes_by_counts', qc=False)

Filter cells based on a QC metric.

Args:

  • min_count (int): Minimum number of counts required for a cell to pass filtering.
  • max_count (int): Maximum number of counts allowed for a cell to pass filtering.
  • qc_var (str='n_genes_by_counts'): QC metric used to filter cells.
  • qc (bool=False): Whether to call calculate_qc_metrics before filtering.
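
The thresholding described by min_count and max_count can be sketched as a boolean mask over the chosen QC metric. The helper keep_mask is hypothetical, and treating both bounds as inclusive is an assumption; the library may use strict comparisons.

```python
def keep_mask(values, min_count=0, max_count=None):
    """Return True for entries whose QC value lies within [min_count, max_count]."""
    return [
        value >= min_count and (max_count is None or value <= max_count)
        for value in values
    ]


# Example QC values (e.g. n_genes_by_counts) for four cells.
n_genes_by_counts = [150, 900, 4200, 7800]
mask = keep_mask(n_genes_by_counts, min_count=200, max_count=6000)
print(mask)
```

The same mask logic applies to genes when the metric is n_cells_by_counts.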

method filter_genes

filter_genes(min_count=0, max_count=None, qc_var='n_cells_by_counts', qc=False)

Filter genes based on a QC metric.

Args:

  • min_count (int): Minimum number of counts required for a gene to pass filtering.
  • max_count (int): Maximum number of counts allowed for a gene to pass filtering.
  • qc_var (str='n_cells_by_counts'): QC metric used to filter genes.
  • qc (bool=False): Whether to call calculate_qc_metrics before filtering.

method filter_genes_and_cells

filter_genes_and_cells(
    min_counts_per_gene=0,
    min_counts_per_cell=0,
    max_counts_per_gene=None,
    max_counts_per_cell=None,
    qc_var_gene='n_cells_by_counts',
    qc_var_cell='n_genes_by_counts',
    qc=False
)

Filter genes and cells based on QC metrics.

Note:

This is an efficient way to apply standard filtering to both genes and cells without iterating over the chunks repeatedly.

Args:

  • min_counts_per_gene (int): Minimum number of counts required for a gene to pass filtering.
  • max_counts_per_gene (int): Maximum number of counts allowed for a gene to pass filtering.
  • qc_var_gene (str='n_cells_by_counts'): QC metric used to filter genes.
  • min_counts_per_cell (int): Minimum number of counts required for a cell to pass filtering.
  • max_counts_per_cell (int): Maximum number of counts allowed for a cell to pass filtering.
  • qc_var_cell (str='n_genes_by_counts'): QC metric used to filter cells.
  • qc (bool=False): Whether to call calculate_qc_metrics before filtering.

method harmony

harmony(sample_col_name, n_init=10, max_iter_harmony=20)

Use Harmony to integrate different experiments.

Note:

This modified Harmony function scales to 15M cells with 50 PCs on a single GPU (A100 80G). The result is stored in adata.obsm['X_pca_harmony'].

Args:

  • sample_col_name (str): Column of sample ID.
  • n_init (int=10): Number of times the k-means algorithm is run with different centroid seeds.
  • max_iter_harmony (int=20): Maximum number of Harmony iterations.

method highly_variable_genes

highly_variable_genes(n_top_genes=4000, method='seurat_v3')

Annotate highly variable genes.

Note:

Only seurat_v3 is implemented, and it expects raw count data. Highly variable genes are marked True in adata.var['highly_variable'].

Args:

  • n_top_genes (int=4000): Number of highly-variable genes to keep.
  • method (str='seurat_v3'): Flavor used to identify highly variable genes.
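
Assuming the seurat_v3 flavor here follows scanpy's, it starts from the per-gene mean and variance of the raw counts. On chunked data these can be accumulated from running sums without holding the full matrix in memory. The sketch below shows that idea in pure Python; chunked_mean_var is a hypothetical helper, not the library's GPU kernel, and the later trend-fitting and ranking steps are omitted.

```python
def chunked_mean_var(chunks, n_genes):
    """Per-gene mean and unbiased variance from running sums over row chunks."""
    n = 0
    sum_x = [0.0] * n_genes    # running sum of counts per gene
    sum_x2 = [0.0] * n_genes   # running sum of squared counts per gene
    for chunk in chunks:       # each chunk is a list of cells (rows of raw counts)
        for row in chunk:
            n += 1
            for j, x in enumerate(row):
                sum_x[j] += x
                sum_x2[j] += x * x
    if n < 2:
        raise ValueError("need at least two cells to compute a variance")
    mean = [s / n for s in sum_x]
    # Unbiased estimator: (sum(x^2) - n * mean^2) / (n - 1)
    var = [(q - n * m * m) / (n - 1) for q, m in zip(sum_x2, mean)]
    return mean, var


chunks = [
    [[1, 0], [3, 0]],  # chunk 0: two cells, two genes
    [[5, 6]],          # chunk 1: one cell
]
mean, var = chunked_mean_var(chunks, n_genes=2)
print(mean, var)
```

The sum-of-squares form is simple but can lose precision for very large counts; a numerically stable alternative is Welford's algorithm.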

method leiden

leiden(resolution=0.5, random_state=42)

Performs Leiden clustering using rapids-singlecell.

Args:

  • resolution (float=0.5): Parameter controlling the coarseness of the clustering (called gamma in the modularity formula). Higher values lead to more clusters.
  • random_state (int=42): Random seed.

method neighbors

neighbors(n_neighbors=20, n_pcs=50, use_rep='X_pca_harmony', algorithm='cagra')

Compute a neighborhood graph of observations using rapids-singlecell.

Args:

  • n_neighbors (int=20): The size of local neighborhood (in terms of number of neighboring data points) used for manifold approximation.
  • n_pcs (int=50): Use this many PCs.
  • use_rep (str='X_pca_harmony'): Use the indicated representation.
  • algorithm (str='cagra'): The query algorithm to use.

method normalize_log1p

normalize_log1p(target_sum=10000.0)

Normalize counts per cell then log1p.

Note:

If save_raw_counts or save_norm_counts is set, adata_X is written to disk automatically at this step.

Args:

  • target_sum (int=1e4): Total count that each cell is scaled to before the log1p transform. If None, after normalization each observation (cell) has a total count equal to the median of the total counts of all observations (cells) before normalization.
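
For a single cell, the transformation amounts to scaling the counts so that they sum to target_sum and then applying log1p. A minimal sketch follows; normalize_log1p_row is a hypothetical helper, the median fallback for target_sum=None is not shown, and returning zeros for an empty cell is an assumption.

```python
import math


def normalize_log1p_row(counts, target_sum=1e4):
    """Scale one cell's counts to sum to target_sum, then apply log(1 + x)."""
    total = sum(counts)
    if total == 0:
        # Assumption: a cell with no counts stays all-zero.
        return [0.0 for _ in counts]
    scale = target_sum / total
    return [math.log1p(count * scale) for count in counts]


row = [1, 3, 0, 6]
out = normalize_log1p_row(row, target_sum=10)
print(out)
```

Undoing the log with expm1 and summing recovers target_sum, which is a quick sanity check on normalized data.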

method normalize_log1p_pca

normalize_log1p_pca(
    target_sum=10000.0,
    n_components=50,
    hvg_var='highly_variable'
)

An alternative to calling normalize_log1p and pca separately.

Note:

Intended for use when preload_on_cpu is False.


method pca

pca(n_components=50, hvg_var='highly_variable')

Principal component analysis.

Computes PCA coordinates, loadings and variance decomposition. Uses the implementation of scikit-learn.

Note:

The directions of the principal components are flipped according to the largest values in the loadings, so that the results match scanpy's. The computed PCA matrix is stored in adata.obsm['X_pca'].

Args:

  • n_components (int=50): Number of principal components to compute.
  • hvg_var (str='highly_variable'): Use highly variable genes only.
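
Principal components are only defined up to sign, so a deterministic flipping rule is needed for results to be comparable across implementations. The sketch below shows one common convention: make the loading with the largest magnitude positive in each component and flip the matching score column. The helper flip_signs is hypothetical, and the exact rule used by ScaleSC is not spelled out here, so treat this as an illustration of the idea only.

```python
def flip_signs(components, scores):
    """Flip each component so that its largest-magnitude loading is positive.

    components: one loading vector per principal component.
    scores: one row per cell, one column per principal component.
    """
    signs = []
    for component in components:
        pivot = max(component, key=abs)  # loading with the largest magnitude
        signs.append(-1.0 if pivot < 0 else 1.0)
    flipped_components = [
        [sign * value for value in component]
        for sign, component in zip(signs, components)
    ]
    flipped_scores = [
        [sign * value for sign, value in zip(signs, row)]
        for row in scores
    ]
    return flipped_components, flipped_scores


components = [[0.6, -0.8], [0.8, 0.6]]
scores = [[1.0, 2.0], [-3.0, 4.0]]
new_components, new_scores = flip_signs(components, scores)
print(new_components, new_scores)
```

Flipping a component and its score column together leaves the reconstructed data unchanged.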

method save

save(data_name=None)

Save adata to disk.

Note:

Save to 'output_dir/data_name.h5ad'.

Args:

  • data_name (str): Name of the output file. If None, defaults to data_dir.

method savex

savex(name, data_name=None)

Save adata to disk in chunks.

Note:

Each chunk will be saved individually in a subfolder under output_dir. Save to 'output_dir/name/data_name_i.h5ad'.

Args:

  • name (str): Subfolder name.
  • data_name (str): Prefix of the chunk file names. If None, defaults to data_dir.
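
The output layout 'output_dir/name/data_name_i.h5ad' can be illustrated by building the expected paths. The helper chunk_paths is hypothetical, and starting the chunk index i at 0 is an assumption.

```python
import os


def chunk_paths(output_dir, name, data_name, n_chunks):
    """Build one .h5ad path per chunk under output_dir/name/."""
    return [
        os.path.join(output_dir, name, f"{data_name}_{i}.h5ad")  # i assumed to start at 0
        for i in range(n_chunks)
    ]


paths = chunk_paths("results", "norm_counts", "pbmc", n_chunks=3)
print(paths)
```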

method to_CPU

to_CPU()

Move all chunks to CPU.


method to_GPU

to_GPU()

Move all chunks to GPU.


method umap

umap(random_state=42)

Embed the neighborhood graph using rapids-singlecell.

Args:

  • random_state (int=42): Random seed.

This file was automatically generated via lazydocs.