scbiot.pp.create_gene_activity

scbiot.pp.create_gene_activity(atac, rna, *, gtf_file, promoter_up=2000, promoter_down=500, batch_key=None, top_peaks=30000, var_features_key='var_features', normalize_var_features_output=False, make_binary=True, lsi_key='X_lsi', lsi_n_iter=2, lsi_components=51, drop_first_lsi_component=True, per_cluster_union=False, sample_cells_pre=None, include_gene_body=True, weight_by_distance=True, tss_decay_bp=2000, promoter_priority=True, ga_layer='ga', knn_neighbors=50, rna_name_col='gene_name', verbose=True, copy_atac=False)

Build a gene-activity (GA) matrix from ATAC peaks and harmonize the GA gene names with those of the RNA data.

Parameters

atac:

ATAC AnnData with peak counts in .X and peak metadata in .var.

rna:

RNA AnnData used to harmonize gene names for the GA output.

gtf_file:

Path to the GTF annotation used to map peaks to genes.

promoter_up / promoter_down:

Promoter window around each transcription start site (TSS), in bp: promoter_up bases upstream and promoter_down bases downstream.
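As a rough illustration of how such a window is commonly built, the sketch below assumes that "up" and "down" are measured relative to the gene's strand. The helper `promoter_window` is hypothetical and not part of scbiot; the library's exact convention is not stated here.

```python
def promoter_window(tss, strand, promoter_up=2000, promoter_down=500):
    """Return (start, end) of a strand-aware promoter window around a TSS.

    Hypothetical helper for illustration only; not part of scbiot.
    """
    if strand == "+":
        start, end = tss - promoter_up, tss + promoter_down
    elif strand == "-":
        # On the minus strand, "upstream" means larger coordinates.
        start, end = tss - promoter_down, tss + promoter_up
    else:
        raise ValueError(f"unknown strand: {strand!r}")
    # Clamp at 0 so windows near a chromosome start stay valid.
    return max(start, 0), end


plus_window = promoter_window(10_000, "+")
minus_window = promoter_window(10_000, "-")
```

With the defaults, a plus-strand TSS at position 10000 gives a window that reaches further to the left, and a minus-strand TSS gives the mirror image.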

batch_key:

atac.obs column used when selecting variable peaks. Set to None to skip per-batch filtering.

top_peaks:

Number of variable peaks to retain.

var_features_key:

Key under which variable-peak annotations are stored in .var.

normalize_var_features_output:

Normalize peak scores produced by variable-feature selection.

make_binary:

Binarize peak counts before LSI.

lsi_key:

Key used to store the LSI embedding in atac.obsm.

lsi_n_iter:

Number of LSI iterations during TF-IDF/LSI preprocessing.

lsi_components:

Number of LSI components to compute.

drop_first_lsi_component:

Drop the first LSI component (often depth-associated).

per_cluster_union:

Use a per-cluster union strategy for peak selection in LSI.

sample_cells_pre:

Fit the SVD on this many cells before projecting all cells.

include_gene_body:

Include gene-body peaks when computing gene activity.

weight_by_distance:

Weight peak contributions by distance to TSS.

tss_decay_bp:

Distance (bp) used for TSS-based decay weights.

promoter_priority:

Prefer promoter peaks when both promoter and gene-body peaks overlap.
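The interaction of weight_by_distance, tss_decay_bp and promoter_priority can be pictured with the sketch below. It assumes an exponential decay kernel and a full weight of 1 for promoter peaks; both are assumptions made for illustration, since the kernel scbiot actually uses is not documented in this section. The helper `peak_weight` is hypothetical.

```python
import math


def peak_weight(distance_to_tss, in_promoter, in_gene_body,
                tss_decay_bp=2000, promoter_priority=True):
    """Weight of one peak for one gene (hypothetical, not scbiot's code)."""
    # A peak overlapping both regions is treated as a promoter peak
    # only when promoter_priority is on.
    if in_promoter and (promoter_priority or not in_gene_body):
        return 1.0
    if in_gene_body:
        # Assumed kernel: exponential decay with distance to the TSS.
        return math.exp(-distance_to_tss / tss_decay_bp)
    return 0.0


promoter_w = peak_weight(0, in_promoter=True, in_gene_body=True)
body_w = peak_weight(2000, in_promoter=False, in_gene_body=True)
no_priority_w = peak_weight(300, True, True, promoter_priority=False)
unlinked_w = peak_weight(5000, in_promoter=False, in_gene_body=False)
```

Under these assumptions, a gene-body peak one tss_decay_bp away from the TSS contributes exp(-1) of what a promoter peak contributes.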

ga_layer:

Name of the layer to store the GA matrix.

knn_neighbors:

Number of neighbors for KNN smoothing in ATAC space.
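KNN smoothing in general replaces each cell's value with an average over its nearest neighbors in some embedding. The pure-Python sketch below shows the idea on toy data; `knn_smooth` is a hypothetical helper, and the use of Euclidean distance and an unweighted mean that includes the cell itself are assumptions, not a description of scbiot's implementation.

```python
import math


def knn_smooth(embedding, values, k):
    """Average each cell's value over itself and its k nearest neighbors.

    embedding: list of per-cell coordinate lists (e.g. LSI components).
    values:    list of per-cell numbers (e.g. one gene's activity).
    Hypothetical helper for illustration; not scbiot's implementation.
    """
    smoothed = []
    for point in embedding:
        # Rank all cells by Euclidean distance to this cell
        # (the cell itself is at distance 0).
        order = sorted(range(len(embedding)),
                       key=lambda j: math.dist(point, embedding[j]))
        neighborhood = order[:k + 1]
        smoothed.append(sum(values[j] for j in neighborhood) / len(neighborhood))
    return smoothed


# Two tight pairs of cells; each cell's nearest neighbor is its partner.
embedding = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]]
activity = [4.0, 0.0, 10.0, 6.0]
smoothed = knn_smooth(embedding, activity, k=1)
```

Larger values of k average over more cells, which reduces sparsity at the cost of blurring differences between neighboring cell states.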

rna_name_col:

Column in rna.var with gene names (falls back to rna.var_names).
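The fallback and the name matching can be sketched with plain Python containers standing in for rna.var and rna.var_names. The sketch assumes that "harmonize" means keeping the genes present in both modalities, in RNA order; that reading, the helper names and the toy identifiers are all assumptions.

```python
def rna_gene_names(rna_var_columns, rna_var_names, rna_name_col="gene_name"):
    """Pick gene names from a column, falling back to the index names."""
    return list(rna_var_columns.get(rna_name_col, rna_var_names))


def shared_genes(ga_genes, rna_genes):
    """Genes present in both lists, in RNA order (assumed meaning of 'harmonize')."""
    ga_set = set(ga_genes)
    return [gene for gene in rna_genes if gene in ga_set]


# Toy stand-ins for rna.var_names, rna.var and the GA gene names.
rna_var_names = ["id_1", "id_2", "id_3"]
rna_var_columns = {"gene_name": ["Gata1", "Sox2", "Pax6"]}
ga_genes = ["Pax6", "Gata1", "Myc"]

names = rna_gene_names(rna_var_columns, rna_var_names)
common = shared_genes(ga_genes, names)
fallback = rna_gene_names({}, rna_var_names)  # column missing
```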

verbose:

Emit progress logging when True.

copy_atac:

Copy atac before modifying it (the LSI embedding is written to atac.obsm).

Notes

  • Side effect: writes LSI embedding to atac.obsm[lsi_key].

  • Set copy_atac=True if you don’t want atac modified in place.
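The copy-before-write pattern behind these two notes can be shown with a stand-in object. `FakeAtac` and `embed` below are hypothetical and only mimic the documented side effect on `.obsm`; they are not scbiot code.

```python
import copy


class FakeAtac:
    """Minimal stand-in for an AnnData-like object with an .obsm mapping."""

    def __init__(self):
        self.obsm = {}


def embed(atac, lsi_key="X_lsi", copy_atac=False):
    """Write a placeholder embedding, optionally on a copy of the input."""
    target = copy.deepcopy(atac) if copy_atac else atac
    target.obsm[lsi_key] = [[0.0, 0.0]]  # placeholder values
    return target


original = FakeAtac()

embed(original, copy_atac=True)
untouched = "X_lsi" not in original.obsm  # the write went to the copy

embed(original, copy_atac=False)
modified = "X_lsi" in original.obsm  # the write went to the input itself
```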

Examples

Basic usage:

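>>> import scbiot as scb
>>> # Download a GTF from GENCODE (https://www.gencodegenes.org/); vM25 is a mouse release, so pick the release that matches your data.
>>> data_dir = "."  # assumption: folder that contains inputs/
>>> gtf_file = f"{data_dir}/inputs/gencode.vM25.chr_patch_hapl_scaff.annotation.gtf.gz"
>>> # adata_atac and adata_gex are assumed to be already-loaded ATAC and RNA AnnData objects.
>>> adata_ga = scb.pp.create_gene_activity(adata_atac, adata_gex, gtf_file=gtf_file, verbose=True)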

Parameters:
  • gtf_file (str)

  • promoter_up (int)

  • promoter_down (int)

  • batch_key (str | None)

  • top_peaks (int)

  • var_features_key (str)

  • normalize_var_features_output (bool)

  • make_binary (bool)

  • lsi_key (str)

  • lsi_n_iter (int)

  • lsi_components (int)

  • drop_first_lsi_component (bool)

  • per_cluster_union (bool)

  • sample_cells_pre (int | None)

  • include_gene_body (bool)

  • weight_by_distance (bool)

  • tss_decay_bp (int)

  • promoter_priority (bool)

  • ga_layer (str)

  • knn_neighbors (int)

  • rna_name_col (str)

  • verbose (bool)

  • copy_atac (bool)