scbiot.ot.integrate_centroids
- scbiot.ot.integrate_centroids(adata_full, obsm_key='X_pca', batch_key='batch', out_key='scBIOT', modality='auto', strength=0.5, conservation=0.5, prototypes=0.5, sharpen=0.5, supervision=0.5, projector=0.5, approximate=False, reference='auto', label_key=None, unlabeled_category='unknown', random_state=0, n_centroids_per_batch=2048, max_samples_per_batch=500000, k_interp=8, chunk_size=500000, use_gpu=True, gpu_device=0, ot_backend='torch', max_iter=15, verbose=True, tmp_path=None)
Centroid-level OT plus FAISS interpolation for very large datasets (e.g. Tahoe 100M), implemented in a memory-friendly way:
- Does not materialize the full obsm[obsm_key] matrix (e.g. X_pca) as a single NumPy array.
- Builds centroids by reading only the subset of rows needed per batch.
- Interpolates back to all cells in chunks, reading each chunk directly from adata_full.obsm.
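The per-batch row selection described above can be sketched in plain NumPy. This is an illustrative sketch, not the library's implementation: the helper names `sample_rows_per_batch` and `n_centroids_for` are hypothetical, and the assumption is that each batch is subsampled uniformly at random up to `max_samples_per_batch` before clustering.

```python
import numpy as np

def sample_rows_per_batch(batch_labels, max_samples_per_batch, random_state=0):
    """Return {batch: sorted row indices}, capped at max_samples_per_batch each.

    Hypothetical helper: only these rows would need to be read from
    obsm[obsm_key] when fitting centroids for a batch.
    """
    rng = np.random.default_rng(random_state)
    batch_labels = np.asarray(batch_labels)
    picked = {}
    for batch in np.unique(batch_labels):
        rows = np.flatnonzero(batch_labels == batch)
        if rows.size > max_samples_per_batch:
            # Assumed strategy: uniform subsampling without replacement.
            rows = rng.choice(rows, size=max_samples_per_batch, replace=False)
        # Sorted indices keep reads sequential, which suits on-disk arrays.
        picked[batch] = np.sort(rows)
    return picked

def n_centroids_for(n_rows, n_centroids_per_batch):
    """Treat n_centroids_per_batch as an upper bound: never more centroids than rows."""
    return int(min(n_centroids_per_batch, n_rows))
```

With a backed AnnData, the returned index arrays could then be used to slice the embedding one batch at a time, so the full matrix never has to sit in memory at once.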
Parameters
- adata_full
Full AnnData with all cells (e.g. Tahoe 100M). Must have obsm[obsm_key] and obs[batch_key].
- obsm_key
Embedding key to use as the OT space (typically “X_pca”).
- batch_key
Batch column in adata_full.obs.
- out_key
Output key to store the integrated coordinates in adata_full.obsm[out_key].
- modality
Modality hint (“auto”, “rna”, “atac”) passed through to normalization and OT.
- n_centroids_per_batch
Target number of centroids per batch (upper bound; smaller batches get fewer).
- max_samples_per_batch
Maximum number of raw cells per batch used to fit KMeans (for speed).
- k_interp
Number of nearest centroids used when interpolating the displacement field.
- chunk_size
Number of cells per chunk when interpolating over the full dataset.
- use_gpu / gpu_device
Controls FAISS GPU usage (if available) and OT backend choice inside integrate_ot.
- tmp_path
If provided, the full integrated embedding is stored as a memmap file at this path to limit peak RAM usage. If None, a regular in-memory numpy array is used.
- strength / conservation / prototypes / sharpen / supervision / projector
Tuning knobs in the range 0–1, forwarded to integrate_ot, which runs on the centroid embeddings.
- approximate / reference
OT solver and reference selection controls forwarded to integrate_ot.
- max_iter
Maximum number of outer optimization iterations (forwarded to integrate_ot).
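To make the roles of `k_interp`, `chunk_size`, and `tmp_path` concrete, here is a minimal NumPy sketch of chunked displacement interpolation. It rests on several assumptions: the function name `interpolate_chunked` is hypothetical, a brute-force distance computation stands in for FAISS, and inverse squared-distance weighting is assumed because the documentation does not state the weighting scheme. `displacements` is taken to hold, for each centroid, its integrated position minus its original position.

```python
import numpy as np

def interpolate_chunked(X, centroids, displacements,
                        k_interp=8, chunk_size=500_000, tmp_path=None):
    """Shift every row of X by a weighted mean of its nearest centroids' displacements."""
    n, d = X.shape
    if tmp_path is None:
        # Regular in-memory output array.
        out = np.empty((n, d), dtype=np.float32)
    else:
        # Disk-backed output to limit peak RAM usage.
        out = np.memmap(tmp_path, dtype=np.float32, mode="w+", shape=(n, d))

    k = min(k_interp, centroids.shape[0])
    for start in range(0, n, chunk_size):
        stop = min(start + chunk_size, n)
        # Only this slice is materialized; X may be a backed or memmapped array.
        chunk = np.asarray(X[start:stop], dtype=np.float32)
        # Squared distances, shape (chunk rows, number of centroids).
        d2 = ((chunk[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        # Indices of the k nearest centroids per row (unordered within the k).
        nn = np.argpartition(d2, k - 1, axis=1)[:, :k]
        # Assumed weighting: inverse squared distance, normalized per row.
        w = 1.0 / (np.take_along_axis(d2, nn, axis=1) + 1e-8)
        w /= w.sum(axis=1, keepdims=True)
        shift = (w[:, :, None] * displacements[nn]).sum(axis=1)
        out[start:stop] = chunk + shift

    if tmp_path is not None:
        out.flush()
    return out
```

The brute-force distance matrix costs memory proportional to chunk rows × centroids × dimensions, so a real implementation at this scale would rely on an approximate or GPU nearest-neighbor index, which is where FAISS comes in.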
Returns
- adata_full
The same AnnData, with integrated coordinates stored in adata_full.obsm[out_key].
- metrics
Metrics dictionary returned by integrate_ot, augmented with n_centroids.
- Parameters:
adata_full (Any)
obsm_key (str)
batch_key (str)
out_key (str)
modality (str)
strength (float)
conservation (float)
prototypes (float)
sharpen (float)
supervision (float)
projector (float)
approximate (bool)
reference (str)
label_key (str | None)
unlabeled_category (Any)
random_state (int)
n_centroids_per_batch (int)
max_samples_per_batch (int)
k_interp (int)
chunk_size (int)
use_gpu (bool)
gpu_device (int)
ot_backend (str)
max_iter (int)
verbose (bool)
tmp_path (str | None)
- Return type: