scbiot.ot.integrate_centroids

scbiot.ot.integrate_centroids(adata_full, obsm_key='X_pca', batch_key='batch', out_key='scBIOT', modality='auto', strength=0.5, conservation=0.5, prototypes=0.5, sharpen=0.5, supervision=0.5, projector=0.5, approximate=False, reference='auto', label_key=None, unlabeled_category='unknown', random_state=0, n_centroids_per_batch=2048, max_samples_per_batch=500000, k_interp=8, chunk_size=500000, use_gpu=True, gpu_device=0, ot_backend='torch', max_iter=15, verbose=True, tmp_path=None)

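Centroid-level optimal transport (OT) followed by FAISS-based interpolation, intended for very large datasets (e.g. Tahoe 100M). OT is run on per-batch centroids rather than on individual cells, and the result is interpolated back to all cells. The implementation is memory-friendly: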

  • Does NOT materialize the full X_pca as a single numpy array.

  • Builds centroids by reading only the subset of rows needed per batch.

  • Interpolates back to all cells in chunks, reading each chunk directly from adata_full.obsm.

Parameters

adata_full

Full AnnData with all cells (e.g. Tahoe 100M). Must have obsm[obsm_key] and obs[batch_key].

obsm_key

Embedding key to use as the OT space (typically “X_pca”).

batch_key

Batch column in adata_full.obs.

out_key

Output key to store the integrated coordinates in adata_full.obsm[out_key].

modality

Modality hint (“auto”, “rna”, “atac”) passed through to normalization and OT.

n_centroids_per_batch

Target number of centroids per batch (upper bound; smaller batches get fewer).

max_samples_per_batch

Maximum number of raw cells per batch used to fit KMeans (for speed).
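The interplay between n_centroids_per_batch and max_samples_per_batch can be illustrated with a small NumPy sketch. This is not the library's code: the helper name sample_rows_per_batch, the uniform subsampling, and the rule "centroid count = min(target, sampled cells)" are assumptions chosen to match the descriptions above.

```python
import numpy as np

def sample_rows_per_batch(batch_labels, max_samples_per_batch,
                          n_centroids_per_batch, random_state=0):
    """Hypothetical helper: pick the rows used to fit centroids in each batch.

    Returns {batch: (sorted row indices, effective centroid count)}.
    """
    rng = np.random.default_rng(random_state)
    batch_labels = np.asarray(batch_labels)
    plan = {}
    for batch in np.unique(batch_labels).tolist():
        rows = np.flatnonzero(batch_labels == batch)
        if rows.size > max_samples_per_batch:
            # Assumed strategy: uniform subsampling without replacement.
            rows = rng.choice(rows, size=max_samples_per_batch, replace=False)
        rows.sort()  # sorted indices keep reads from on-disk arrays sequential
        # Assumed cap: never request more centroids than sampled cells.
        k = min(n_centroids_per_batch, rows.size)
        plan[batch] = (rows, k)
    return plan

labels = np.array(["a"] * 10 + ["b"] * 3)
plan = sample_rows_per_batch(labels, max_samples_per_batch=5,
                             n_centroids_per_batch=4)
```

Here batch "a" is subsampled to 5 rows and gets 4 centroids, while the smaller batch "b" keeps its 3 rows and gets only 3 centroids.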

k_interp

Number of nearest centroids used when interpolating the displacement field.
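To make the role of k_interp concrete, the following sketch interpolates a centroid displacement field onto cells. It is an illustration only: it uses a brute-force NumPy neighbor search in place of FAISS, and the inverse-squared-distance weighting is an assumption, since the weighting scheme is not stated here. The function name interpolate_displacement and the eps parameter are hypothetical.

```python
import numpy as np

def interpolate_displacement(cells, centroids, displacements,
                             k_interp=8, eps=1e-8):
    """Shift each cell by a weighted mean of its nearest centroids' displacements.

    cells:         (n_cells, d) embedding coordinates
    centroids:     (n_centroids, d) centroid coordinates before OT
    displacements: (n_centroids, d) integrated centroid minus original centroid
    """
    k = min(k_interp, centroids.shape[0])
    # Brute-force squared distances, shape (n_cells, n_centroids).
    d2 = ((cells[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    # Indices of the k nearest centroids per cell (order within the k is arbitrary).
    nn = np.argpartition(d2, k - 1, axis=1)[:, :k]
    nn_d2 = np.take_along_axis(d2, nn, axis=1)
    # Assumed weighting: inverse squared distance, normalized to sum to 1.
    w = 1.0 / (nn_d2 + eps)
    w /= w.sum(axis=1, keepdims=True)
    shift = (w[:, :, None] * displacements[nn]).sum(axis=1)
    return cells + shift

rng = np.random.default_rng(0)
centroids = rng.normal(size=(20, 2))
displacements = np.tile([1.0, -1.0], (20, 1))  # uniform field for checking
cells = rng.normal(size=(5, 2))
moved = interpolate_displacement(cells, centroids, displacements, k_interp=8)
```

With a uniform displacement field every cell moves by exactly that displacement, which is a convenient sanity check. Larger k_interp values give a smoother field at the cost of more neighbor lookups.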

chunk_size

Number of cells per chunk when interpolating over the full dataset.
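The effect of chunk_size on memory can be sketched as a simple loop that reads, transforms, and writes one slice at a time, so only chunk_size rows are held in memory at once. The helper names iter_chunks and transform_in_chunks are hypothetical, and the sketch assumes the source supports row slicing, as NumPy arrays and typical backed arrays do.

```python
import numpy as np

def iter_chunks(n_cells, chunk_size):
    """Yield (start, stop) row ranges that cover n_cells."""
    for start in range(0, n_cells, chunk_size):
        yield start, min(start + chunk_size, n_cells)

def transform_in_chunks(source, out, transform, chunk_size):
    """Apply `transform` to `source` one chunk at a time, writing into `out`."""
    for start, stop in iter_chunks(source.shape[0], chunk_size):
        # Only this slice is materialized as a dense in-memory array.
        block = np.asarray(source[start:stop], dtype=np.float32)
        out[start:stop] = transform(block)
    return out

source = np.arange(20).reshape(10, 2)
out = np.empty((10, 2), dtype=np.float32)
transform_in_chunks(source, out, lambda block: block + 1, chunk_size=4)
ranges = list(iter_chunks(10, 4))
```

For 10 rows and a chunk size of 4, the ranges are (0, 4), (4, 8), and (8, 10); the last chunk is simply shorter.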

use_gpu / gpu_device

Controls FAISS GPU usage (if available) and OT backend choice inside integrate_ot.

tmp_path

If provided, the full integrated embedding is stored as a memmap file at this path to limit peak RAM usage. If None, a regular in-memory numpy array is used.
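The two storage modes described for tmp_path can be pictured with numpy.memmap. This is a sketch of the general technique, not the library's implementation; the helper name allocate_output, the float32 dtype, and the file name are assumptions.

```python
import os
import tempfile
import numpy as np

def allocate_output(n_cells, n_dims, tmp_path=None, dtype=np.float32):
    """Return an output buffer: in-memory if tmp_path is None, else disk-backed."""
    if tmp_path is None:
        return np.empty((n_cells, n_dims), dtype=dtype)
    # mode="w+" creates (or overwrites) the file and maps it for read/write.
    return np.memmap(tmp_path, dtype=dtype, mode="w+", shape=(n_cells, n_dims))

in_memory = allocate_output(3, 2)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "integrated.dat")
    on_disk = allocate_output(1000, 50, tmp_path=path)
    on_disk[:] = 0.0
    on_disk.flush()                      # push pending writes to the file
    is_memmap = isinstance(on_disk, np.memmap)
    file_bytes = os.path.getsize(path)   # 1000 * 50 * 4 bytes for float32
    del on_disk                          # release the mapping before cleanup
```

A disk-backed buffer lets the operating system page data in and out, so peak RAM stays near the size of the chunks being written rather than the size of the full embedding.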

strength / conservation / prototypes / sharpen / supervision / projector

Semantic 0–1 knobs forwarded to integrate_ot on centroid embeddings.

approximate / reference

OT solver and reference selection controls forwarded to integrate_ot.

max_iter

Maximum number of outer optimization iterations (forwarded to integrate_ot).

Returns

adata_full

The same AnnData, with integrated coordinates stored in adata_full.obsm[out_key].

metrics

Metrics dictionary returned by integrate_ot, augmented with n_centroids.
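A call might look like the sketch below, which uses only the parameters and return values documented on this page. The wrapper name integrate_large_dataset is hypothetical, the import path is inferred from the qualified name scbiot.ot.integrate_centroids, and the argument values shown are simply the documented defaults.

```python
def integrate_large_dataset(adata_full, tmp_path=None):
    """Hypothetical wrapper around integrate_centroids.

    Expects adata_full.obsm["X_pca"] and adata_full.obs["batch"] to exist.
    """
    # Imported lazily so this sketch can be defined without scbiot installed.
    from scbiot.ot import integrate_centroids

    adata_full, metrics = integrate_centroids(
        adata_full,
        obsm_key="X_pca",
        batch_key="batch",
        out_key="scBIOT",
        n_centroids_per_batch=2048,
        k_interp=8,
        chunk_size=500_000,
        tmp_path=tmp_path,  # set a file path to keep the output on disk
    )
    # The integrated coordinates are stored under out_key.
    integrated = adata_full.obsm["scBIOT"]
    # n_centroids is added to the metrics returned by integrate_ot.
    return integrated, metrics["n_centroids"]
```

The function returns the same AnnData object it was given, so rebinding adata_full is optional; the integrated coordinates are available in adata_full.obsm[out_key] either way.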

Parameters:
  • adata_full (Any)

  • obsm_key (str)

  • batch_key (str)

  • out_key (str)

  • modality (str)

  • strength (float)

  • conservation (float)

  • prototypes (float)

  • sharpen (float)

  • supervision (float)

  • projector (float)

  • approximate (bool)

  • reference (str)

  • label_key (str | None)

  • unlabeled_category (Any)

  • random_state (int)

  • n_centroids_per_batch (int)

  • max_samples_per_batch (int)

  • k_interp (int)

  • chunk_size (int)

  • use_gpu (bool)

  • gpu_device (int)

  • ot_backend (str)

  • max_iter (int)

  • verbose (bool)

  • tmp_path (str | None)

Return type:

Tuple[Any, Dict[str, float | int]]