scbiot.pp.coembed_pca
- scbiot.pp.coembed_pca(adata_reference, adata_query, *, out_key='X_pca_shared', label='modality', mode=None, keys=None, reference_layer=None, query_layer=None, n_top_genes=4000, n_components=50, flavor='cell_ranger', reference_norm_layer='rna_log1p', query_norm_layer='ga_log1p', batch_key=None, genes=None, pca_solver='randomized', projection_chunk_size=4096, label_key=None, unlabeled_category='Unknown', flag_outlier=True, outlier_k=30, outlier_z=3.0, min_cluster_size=None, store_qc_cols=False, verbose=True)
Shared PCA coembedding (reference-fitted PCA, query projection) and optional per-cluster outlier flagging.
Parameters
- adata_reference
Reference modality AnnData (e.g., scRNA / GEX). PCA is fitted on this object.
- adata_query
Query modality AnnData (e.g., gene-activity / scATAC GA). Its cells are projected into the PCA space fitted on the reference.
- out_key
.obsm key for the shared PCA coordinates written into both input objects and the returned joint AnnData.
- label
Column name created in the joint AnnData (ad.concat) indicating modality (e.g., “modality”).
- mode
If "paired", use join="outer" when concatenating; otherwise use join="inner".
- keys
Two strings naming the modalities for ad.concat (default: ("reference", "query")).
- reference_layer
Input layer for reference counts. If None, uses .X. If provided but missing, falls back to .X.
- query_layer
Input layer for query counts. If None, auto-picks the first available of "ga_smooth", then "ga", then .X.
- n_top_genes
Number of HVGs to select per modality before intersecting (shared genes only).
- n_components
Target number of PCA components (actual k may be smaller if limited by cells/genes).
- reference_norm_layer
Output layer name for normalized+log1p reference counts (cached).
- query_norm_layer
Output layer name for normalized+log1p query counts (cached).
- batch_key
Optional adata_reference.obs column used for batch-aware HVG selection (Scanpy HVG batch_key).
- genes
Optional explicit gene list. If provided, skips HVG selection and uses these genes (must exist in both).
- pca_solver
scikit-learn PCA solver (e.g., “randomized”, “full”, “auto”).
- projection_chunk_size
If set, project query in chunks of this size to reduce memory spikes.
- label_key
adata.obs column containing cell labels (e.g., “cell_type”) used for reference labels and outlier flagging.
- unlabeled_category
Label value that marks query cells as unlabeled and that flagged outliers are relabeled to. Cells carrying this value are excluded from per-cluster flagging.
- flag_outlier
If True, run per-cluster kNN outlier detection on the final joint embedding and set outliers to unlabeled_category.
- outlier_k
Number of neighbors for within-cluster kNN distance (uses k+1 internally to exclude self).
- outlier_z
Robust cutoff multiplier for MAD threshold: thr = median + outlier_z * 1.4826 * MAD.
- min_cluster_size
Minimum labeled cells per cluster required to run outlier detection. If None, uses max(outlier_k+2, 10).
- store_qc_cols
If True, store the QC columns {label_key}__knn_mean and {label_key}__knn_outlier in adata_joint.obs, and a summary in adata_joint.uns.
- verbose
If True, print per-cluster outlier stats.
Returns
- AnnData
Joint AnnData concatenating reference and query with adata_joint.obsm[out_key] stacked.
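The fit-on-reference, project-query, stack pattern described above can be illustrated with a minimal NumPy sketch. This is not the library's implementation: it swaps scikit-learn's PCA for a plain SVD, the helper names `fit_pca` and `project_in_chunks` are hypothetical, and the rule used to cap the number of components is an assumption.

```python
import numpy as np


def fit_pca(X_ref, n_components):
    """Fit PCA on the reference matrix via SVD; return (mean, components)."""
    mean = X_ref.mean(axis=0)
    # Assumed cap: k cannot exceed n_cells - 1 or n_genes.
    k = min(n_components, X_ref.shape[0] - 1, X_ref.shape[1])
    _, _, vt = np.linalg.svd(X_ref - mean, full_matrices=False)
    return mean, vt[:k]


def project_in_chunks(X, mean, components, chunk_size=None):
    """Project rows of X into the fitted space, optionally chunk by chunk."""
    if chunk_size is None:
        return (X - mean) @ components.T
    out = np.empty((X.shape[0], components.shape[0]), dtype=float)
    for start in range(0, X.shape[0], chunk_size):
        stop = start + chunk_size
        # Only one chunk is centered and multiplied at a time.
        out[start:stop] = (X[start:stop] - mean) @ components.T
    return out


rng = np.random.default_rng(0)
X_ref = rng.normal(size=(200, 30))    # stand-in for normalized reference genes
X_query = rng.normal(size=(120, 30))  # stand-in for normalized query genes

mean, comps = fit_pca(X_ref, n_components=10)
Z_ref = project_in_chunks(X_ref, mean, comps)
Z_query = project_in_chunks(X_query, mean, comps, chunk_size=50)

# Reference rows first, then query rows, mirroring the stacked obsm[out_key].
Z_joint = np.vstack([Z_ref, Z_query])
```

Because the query is centered with the reference mean and multiplied by the reference components, chunked and unchunked projection give the same coordinates; chunking only bounds peak memory.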
Notes
This function DOES NOT clip/modify embeddings for outliers; it only relabels them when flag_outlier=True.
By default, this sets adata_query.obs[label_key] = unlabeled_category (so outlier flagging targets reference clusters).
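As a rough illustration of the relabel-only behavior, the sketch below flags cells whose mean within-cluster kNN distance exceeds median + outlier_z * 1.4826 * MAD and sets their label to the unlabeled category, leaving the embedding untouched. The function name `flag_knn_outliers`, the Euclidean metric, and the brute-force distance computation are assumptions made for the example, not the library's actual code.

```python
import numpy as np


def flag_knn_outliers(Z, labels, k=30, z=3.0, unlabeled="Unknown",
                      min_cluster_size=None):
    """Return (new_labels, is_outlier); Z itself is never modified."""
    labels = np.asarray(labels, dtype=object)
    if min_cluster_size is None:
        min_cluster_size = max(k + 2, 10)
    is_outlier = np.zeros(len(labels), dtype=bool)

    for cluster in np.unique(labels[labels != unlabeled]):
        idx = np.flatnonzero(labels == cluster)
        if idx.size < min_cluster_size:
            continue  # too few labeled cells for a stable threshold
        pts = Z[idx]
        # Brute-force pairwise Euclidean distances (assumed metric).
        d = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
        d.sort(axis=1)
        # Column 0 is the self-distance, so take the next k columns.
        kk = min(k, idx.size - 1)
        knn_mean = d[:, 1:kk + 1].mean(axis=1)
        med = np.median(knn_mean)
        mad = np.median(np.abs(knn_mean - med))
        thr = med + z * 1.4826 * mad
        is_outlier[idx] = knn_mean > thr

    new_labels = labels.copy()
    new_labels[is_outlier] = unlabeled
    return new_labels, is_outlier


rng = np.random.default_rng(0)
Z = np.vstack([
    rng.normal(scale=0.1, size=(60, 2)),         # tight cluster "A"
    [[10.0, 10.0]],                              # one far-away "A" cell
    rng.normal(loc=5.0, scale=0.1, size=(5, 2)), # cluster "B", too small
    rng.normal(size=(10, 2)),                    # unlabeled query cells
])
labels = ["A"] * 61 + ["B"] * 5 + ["Unknown"] * 10
Z_before = Z.copy()

new_labels, is_outlier = flag_knn_outliers(Z, labels, k=10, z=3.0)
```

In this toy data the far-away "A" cell is relabeled, cluster "B" is skipped because it falls below the default minimum size, and cells already marked "Unknown" are never evaluated.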
Examples
>>> adata = scb.pp.coembed_pca(
...     adata_gex, adata_ga,
...     label="modality",
...     keys=("reference", "query"),
...     reference_layer="counts",
...     query_layer="ga_smooth",
...     out_key="X_shared_pca",
...     label_key="cell_type",
...     unlabeled_category="Unknown",
...     flag_outlier=True,
... )
- Parameters:
adata_reference (anndata.AnnData)
adata_query (anndata.AnnData)
out_key (str)
label (str)
mode (Literal['paired', 'unpaired'] | None)
reference_layer (str | None)
query_layer (str | None)
n_top_genes (int)
n_components (int)
flavor (str | None)
reference_norm_layer (str)
query_norm_layer (str)
batch_key (str | None)
pca_solver (str)
projection_chunk_size (int | None)
label_key (str | None)
unlabeled_category (str)
flag_outlier (bool)
outlier_k (int)
outlier_z (float)
store_qc_cols (bool)
verbose (bool)
- Return type:
anndata.AnnData