scbiot.pp.coembed_pca

scbiot.pp.coembed_pca(adata_reference, adata_query, *, out_key='X_pca_shared', label='modality', mode=None, keys=None, reference_layer=None, query_layer=None, n_top_genes=4000, n_components=50, flavor='cell_ranger', reference_norm_layer='rna_log1p', query_norm_layer='ga_log1p', batch_key=None, genes=None, pca_solver='randomized', projection_chunk_size=4096, label_key=None, unlabeled_category='Unknown', flag_outlier=True, outlier_k=30, outlier_z=3.0, min_cluster_size=None, store_qc_cols=False, verbose=True)

Shared PCA coembedding (reference-fitted PCA, query projection) and optional per-cluster outlier flagging.
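The reference-fitted PCA and query projection can be illustrated with a minimal sketch. It is not the library's implementation: it uses a plain NumPy SVD in place of the scikit-learn solver selected by pca_solver, the helper names fit_reference_pca and project are hypothetical, and the rule used to cap the number of components is an assumption.

```python
import numpy as np

def fit_reference_pca(X_ref, n_components):
    """Fit PCA on the reference matrix (cells x genes) with a plain SVD."""
    mean = X_ref.mean(axis=0)
    # Assumed cap: k cannot exceed what the cells/genes can support.
    k = min(n_components, X_ref.shape[0] - 1, X_ref.shape[1])
    _, _, vt = np.linalg.svd(X_ref - mean, full_matrices=False)
    return mean, vt[:k].T  # mean: (genes,), components: (genes, k)

def project(X, mean, components, chunk_size=None):
    """Project rows of X into the reference PCA space, optionally in chunks."""
    if chunk_size is None:
        return (X - mean) @ components
    out = np.empty((X.shape[0], components.shape[1]))
    for start in range(0, X.shape[0], chunk_size):
        stop = start + chunk_size
        # Only one chunk is centered and multiplied at a time.
        out[start:stop] = (X[start:stop] - mean) @ components
    return out

rng = np.random.default_rng(0)
X_ref = rng.normal(size=(200, 30))    # stand-in for normalized reference data
X_query = rng.normal(size=(150, 30))  # stand-in for normalized query data

mean, comps = fit_reference_pca(X_ref, n_components=10)
Z_ref = project(X_ref, mean, comps)
Z_query = project(X_query, mean, comps, chunk_size=64)
Z_joint = np.vstack([Z_ref, Z_query])  # stacked shared coordinates
```

The query is centered with the reference mean, not its own, so both modalities land in the same coordinate system. Chunking changes peak memory use only; the projected values are the same.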

Parameters

adata_reference

Reference modality AnnData (e.g., scRNA / GEX). PCA is fitted on this object.

adata_query

Query modality AnnData (e.g., gene-activity / scATAC GA). This object is projected into the PCA space fitted on the reference.

out_key

.obsm key for the shared PCA coordinates written into both input objects and the returned joint AnnData.

label

Column name created in the joint AnnData (ad.concat) indicating modality (e.g., “modality”).

mode

If "paired", use join="outer" when concatenating; otherwise use join="inner".

keys

Two strings naming the modalities for ad.concat (default: ("reference", "query")).

reference_layer

Input layer for reference counts. If None, uses .X. If provided but missing, falls back to .X.

query_layer

Input layer for query counts. If None, auto-picks the first available of "ga_smooth", "ga", then .X.

n_top_genes

Number of highly variable genes (HVGs) to select per modality before intersecting (shared genes only).

n_components

Target number of PCA components (actual k may be smaller if limited by cells/genes).

reference_norm_layer

Output layer name for normalized+log1p reference counts (cached).

query_norm_layer

Output layer name for normalized+log1p query counts (cached).

batch_key

Optional adata_reference.obs column used for batch-aware HVG selection (Scanpy HVG batch_key).

genes

Optional explicit gene list. If provided, skips HVG selection and uses these genes (must exist in both).

pca_solver

scikit-learn PCA solver (e.g., “randomized”, “full”, “auto”).

projection_chunk_size

If set, project query in chunks of this size to reduce memory spikes.

label_key

adata.obs column containing cell labels (e.g., “cell_type”) used for reference labels and outlier flagging.

unlabeled_category

Label value treated as “unlabeled” query and used to relabel outliers (excluded from per-cluster flagging).

flag_outlier

If True, run per-cluster kNN outlier detection on the final joint embedding and set outliers to unlabeled_category.

outlier_k

Number of neighbors for within-cluster kNN distance (uses k+1 internally to exclude self).

outlier_z

Robust cutoff multiplier for MAD threshold: thr = median + outlier_z * 1.4826 * MAD.

min_cluster_size

Minimum labeled cells per cluster required to run outlier detection. If None, uses max(outlier_k+2, 10).

store_qc_cols

If True, store the QC columns {label_key}__knn_mean and {label_key}__knn_outlier in adata_joint.obs, and a summary in adata_joint.uns.

verbose

If True, print per-cluster outlier stats.
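The outlier rule described by outlier_k, outlier_z, and min_cluster_size can be sketched as follows. This is a hedged illustration, not the library's code: the function name flag_cluster_outliers is hypothetical, distances are computed by brute force with NumPy, and the strict ">" comparison against the threshold is an assumption.

```python
import numpy as np

def flag_cluster_outliers(Z, labels, unlabeled="Unknown",
                          k=30, z=3.0, min_cluster_size=None):
    """Return a boolean mask of cells flagged as within-cluster outliers."""
    labels = np.asarray(labels, dtype=object)
    if min_cluster_size is None:
        min_cluster_size = max(k + 2, 10)  # default stated in the docs
    is_outlier = np.zeros(len(labels), dtype=bool)
    for cluster in set(labels) - {unlabeled}:  # unlabeled cells are skipped
        idx = np.flatnonzero(labels == cluster)
        if len(idx) < min_cluster_size:
            continue
        pts = Z[idx]
        # Pairwise Euclidean distances within the cluster.
        dist = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
        # Column 0 of each sorted row is the self-distance (0); skip it.
        knn_mean = np.sort(dist, axis=1)[:, 1:k + 1].mean(axis=1)
        med = np.median(knn_mean)
        mad = np.median(np.abs(knn_mean - med))
        thr = med + z * 1.4826 * mad  # robust MAD threshold
        is_outlier[idx] = knn_mean > thr
    return is_outlier

rng = np.random.default_rng(0)
Z = np.vstack([
    rng.normal(size=(50, 5)),   # a compact labeled cluster "A"
    np.full((1, 5), 25.0),      # one labeled cell far from its cluster
    rng.normal(size=(20, 5)),   # unlabeled (query-like) cells
])
labels = ["A"] * 51 + ["Unknown"] * 20

flags = flag_cluster_outliers(Z, labels, k=10)
new_labels = np.where(flags, "Unknown", labels)  # relabel only; Z is untouched
```

The factor 1.4826 scales the MAD so that it is comparable to a standard deviation for normally distributed data, which makes outlier_z read like a z-score cutoff.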

Returns

AnnData

Joint AnnData concatenating reference and query, with the shared PCA coordinates of both stacked in adata_joint.obsm[out_key].

Notes

  • This function DOES NOT clip/modify embeddings for outliers; it only relabels them when flag_outlier=True.

  • By default, this sets adata_query.obs[label_key] = unlabeled_category (so outlier flagging targets reference clusters).

Examples

>>> import scbiot as scb
>>> adata = scb.pp.coembed_pca(
...     adata_gex, adata_ga,
...     label="modality",
...     keys=("reference", "query"),
...     reference_layer="counts",
...     query_layer="ga_smooth",
...     out_key="X_shared_pca",
...     label_key="cell_type",
...     unlabeled_category="Unknown",
...     flag_outlier=True,
... )

Parameters:
  • adata_reference (anndata.AnnData)

  • adata_query (anndata.AnnData)

  • out_key (str)

  • label (str)

  • mode (Literal['paired', 'unpaired'] | None)

  • keys (Sequence[str] | None)

  • reference_layer (str | None)

  • query_layer (str | None)

  • n_top_genes (int)

  • n_components (int)

  • flavor (str | None)

  • reference_norm_layer (str)

  • query_norm_layer (str)

  • batch_key (str | None)

  • genes (Sequence[str] | None)

  • pca_solver (str)

  • projection_chunk_size (int | None)

  • label_key (str | None)

  • unlabeled_category (str)

  • flag_outlier (bool)

  • outlier_k (int)

  • outlier_z (float)

  • store_qc_cols (bool)

  • verbose (bool)

Return type:

anndata.AnnData