scbiot.pp.create_gene_activity#
- scbiot.pp.create_gene_activity(atac, rna, *, gtf_file, promoter_up=2000, promoter_down=500, batch_key=None, top_peaks=30000, var_features_key='var_features', normalize_var_features_output=False, make_binary=True, lsi_key='X_lsi', lsi_n_iter=2, lsi_components=51, drop_first_lsi_component=True, per_cluster_union=False, sample_cells_pre=None, include_gene_body=True, weight_by_distance=True, tss_decay_bp=2000, promoter_priority=True, ga_layer='ga', knn_neighbors=50, rna_name_col='gene_name', verbose=True, copy_atac=False)#
Build gene-activity (GA) from ATAC peaks and harmonize GA gene names to RNA.
Parameters#
- atac:
ATAC AnnData with peak counts in .X and peak metadata in .var.
- rna:
RNA AnnData used to harmonize gene names for the GA output.
- gtf_file:
Path to the GTF annotation used to map peaks to genes.
- promoter_up / promoter_down:
TSS window (bp) used for promoter definition.
- batch_key:
atac.obs column used when selecting variable peaks. Set to None to skip per-batch filtering.
- top_peaks:
Number of variable peaks to retain.
- var_features_key:
Key under which variable peak annotations are stored in .var.
- normalize_var_features_output:
Normalize peak scores produced by variable-feature selection.
- make_binary:
Binarize peak counts before LSI.
- lsi_key:
Key used to store the LSI embedding in atac.obsm.
- lsi_n_iter:
Number of LSI iterations during TF-IDF/LSI preprocessing.
- lsi_components:
Number of LSI components to compute.
- drop_first_lsi_component:
Drop the first LSI component (often depth-associated).
- per_cluster_union:
Use a per-cluster union strategy for peak selection in LSI.
- sample_cells_pre:
Fit the SVD on this many cells before projecting all cells.
- include_gene_body:
Include gene-body peaks when computing gene activity.
- weight_by_distance:
Weight peak contributions by distance to TSS.
- tss_decay_bp:
Distance (bp) used for TSS-based decay weights.
- promoter_priority:
Prefer promoter peaks when both promoter and gene-body peaks overlap.
- ga_layer:
Name of the layer to store the GA matrix.
- knn_neighbors:
Number of neighbors for KNN smoothing in ATAC space.
- rna_name_col:
Column in rna.var with gene names (falls back to rna.var_names).
- verbose:
Emit progress logging when True.
- copy_atac:
Copy atac before mutation (LSI embedding writes).
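To make promoter_up / promoter_down and tss_decay_bp more concrete, the following is a minimal sketch of one common way such quantities are computed. It is an illustration under stated assumptions, not scbiot's implementation: the helper names promoter_window and tss_decay_weight are hypothetical, the window is assumed to be strand-aware, and the exponential form of the decay is an assumption (the parameter description only states that tss_decay_bp is a distance in bp).

```python
import math


def promoter_window(tss, strand, promoter_up=2000, promoter_down=500):
    """Return (start, end) of a strand-aware promoter window around a TSS.

    Hypothetical helper: assumes "up" means upstream relative to the
    direction of transcription, so the window flips on the minus strand.
    """
    if strand == "+":
        start, end = tss - promoter_up, tss + promoter_down
    elif strand == "-":
        start, end = tss - promoter_down, tss + promoter_up
    else:
        raise ValueError(f"strand must be '+' or '-', got {strand!r}")
    # Genomic coordinates cannot be negative.
    return max(0, start), end


def tss_decay_weight(peak_center, tss, tss_decay_bp=2000):
    """Weight a peak by its distance to the TSS (assumed exponential decay)."""
    distance = abs(peak_center - tss)
    return math.exp(-distance / tss_decay_bp)


window_plus = promoter_window(10_000, "+")
window_minus = promoter_window(10_000, "-")
weight_at_tss = tss_decay_weight(10_000, 10_000)
weight_one_decay_away = tss_decay_weight(12_000, 10_000)
```

Under these assumptions a peak centered on the TSS gets weight 1, and a peak tss_decay_bp away gets weight exp(-1), roughly 0.37. The actual weighting used by the function may differ.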
Notes#
Side effect: writes LSI embedding to atac.obsm[lsi_key].
Set copy_atac=True if you don’t want atac modified in-place.
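The in-place versus copy behavior described in these notes can be illustrated with a toy stand-in. This is only a sketch of the semantics: ToyAtac and toy_embed are hypothetical names, the placeholder embedding is made up, and nothing here calls scbiot or AnnData.

```python
import copy


class ToyAtac:
    """Stand-in for an AnnData-like object; only `obsm` is modelled."""

    def __init__(self):
        self.obsm = {}


def toy_embed(atac, lsi_key="X_lsi", copy_atac=False):
    """Hypothetical function that mimics the documented side effect."""
    # Work on a deep copy only when the caller asks for it.
    target = copy.deepcopy(atac) if copy_atac else atac
    target.obsm[lsi_key] = [[0.0, 0.0]]  # placeholder embedding
    return target


in_place = ToyAtac()
toy_embed(in_place)                   # default: the caller's object is modified

untouched = ToyAtac()
toy_embed(untouched, copy_atac=True)  # the caller's object is left alone

in_place_keys = sorted(in_place.obsm)
untouched_keys = sorted(untouched.obsm)
```

With the default, the caller's object gains the lsi_key entry; with copy_atac=True it does not.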
Examples#
Basic usage:
>>> import scbiot as scb
>>> # download gtf from GENCODE: https://www.gencodegenes.org/human/
>>> gtf_file = f"{dir}/inputs/gencode.vM25.chr_patch_hapl_scaff.annotation.gtf.gz"
>>> adata_ga = scb.pp.create_gene_activity(adata_atac, adata_gex, gtf_file=gtf_file, verbose=True)
- Parameters:
gtf_file (str)
promoter_up (int)
promoter_down (int)
batch_key (str | None)
top_peaks (int)
var_features_key (str)
normalize_var_features_output (bool)
make_binary (bool)
lsi_key (str)
lsi_n_iter (int)
lsi_components (int)
drop_first_lsi_component (bool)
per_cluster_union (bool)
sample_cells_pre (int | None)
include_gene_body (bool)
weight_by_distance (bool)
tss_decay_bp (int)
promoter_priority (bool)
ga_layer (str)
knn_neighbors (int)
rna_name_col (str)
verbose (bool)
copy_atac (bool)
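The knn_neighbors parameter controls KNN smoothing in ATAC space. As a rough illustration of what such smoothing does, the sketch below averages a per-cell value over each cell's nearest neighbors in a toy embedding. The function name knn_smooth is hypothetical, and the choices of Euclidean distance, equal neighbor weights, and counting each cell as its own neighbor are assumptions, not a description of scbiot's internals.

```python
import math


def knn_smooth(embedding, values, k):
    """Replace each cell's value by the mean over its k nearest cells.

    Assumptions: Euclidean distance in the embedding, each cell counts as
    its own nearest neighbor, and neighbors are weighted equally.
    """
    smoothed = []
    for query in embedding:
        # Rank all cells by distance to the query cell.
        order = sorted(
            range(len(embedding)),
            key=lambda j: math.dist(query, embedding[j]),
        )
        neighbors = order[:k]
        smoothed.append(sum(values[j] for j in neighbors) / len(neighbors))
    return smoothed


# Two tight groups of cells in a toy 2-D embedding.
embedding = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
gene_activity = [0.0, 2.0, 10.0, 14.0]
smoothed = knn_smooth(embedding, gene_activity, k=2)
```

In this toy case each cell is averaged with the other member of its group, which reduces noise within a group without mixing the two groups. A larger k smooths more aggressively and can blur distinct populations.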