Maxspin

Installation

The python package can be installed with:

pip install maxspin

Basic Usage

This package operates on AnnData objects from the anndata package.

We assume the existence of a spatial neighborhood graph. A simple and effective way of doing this is Delaunay triangulation, for example using squidpy.

import squidpy as sq

sq.gr.spatial_neighbors(adata, delaunay=True, coord_type="generic")

Spatial information can then be measured using the spatial_information function.

from maxspin import spatial_information

spatial_information(adata, prior=None)

This adds a spatial_information column to the var metadata.

Similarly, pairwise spatial information can be computed with pairwise_spatial_information. This function will test every pair of genes, which is pretty impractical for large numbers of genes, so it’s a good idea to subset the AnnData object before calling this.

from maxspin import pairwise_spatial_information

pairwise_spatial_information(adata, prior=None)

For a more detailed example, check out the tutorial.

Interpreting the spatial information score

The method compute a score for every cell/spot that’s in [0,1], like a correlation but typically much smaller, and sums them to arrive at a spatial information score that is then in [0, ncells]. It’s possible to normalize for the number of cells by just dividing, but by default a pattern involving more cells is considered more spatially coherent, hence the sum.

Normalization

There are different ways spatial information can be computed. By default, no normalization is done and spatial information is computed on absolute counts. Uncertainty is incorporated using a Gamma-Poisson model.

If prior=None is used, the method makes no attempt to account for estimation uncertainty and computes spatial information directly on whatever is in adata.X.

The recommended way to run spatial_information is with some kind of normalized estimate of expression with some uncertainty estimation. There are two recommended ways of doing this: SCVI and Vanity.

SCVI

SCVI is a convenient and versatile probabilistic model of sequencing experiments, from which we can sample from the posterior to get normalized point estimates with uncertainty.

Using Maxspin with SCVI looks something like this:

import scvi
import numpy as np
from maxspin import spatial_information

scvi.model.SCVI.setup_anndata(adata)
model = scvi.model.SCVI(adata, n_latent=20)

# Sample log-expression values from the posterior.
posterior_samples = np.log(model.get_normalized_expression(return_numpy=True, return_mean=False, n_samples=20, library_size="latent"))
adata_scvi = adata.copy()
adata_scvi.X = np.mean(posterior_samples, axis=0)
adata_scvi.layers["std"] = np.std(posterior_samples, axis=0)

spatial_information(adata_scvi, prior="gaussian")

The tutorial has a more in depth example of using SCVI.

Vanity

I developed the normalization method vanity in part as convenient way to normalize spatial transcriptomics data in a way that provides uncertainty estimates. The preferred way of running vanity + maxspin is then:

from maxspin import spatial_information
from vanity import normalize_vanity

normalize_vanity(adata)
spatial_information(adata, prior="gaussian")

Compared to SCVI, this model more aggressively shrinks low expression genes, which might cause it to miss something very subtle, but is less likely to detect spurious patterns.

API

maxspin.spatial_information(adatas: ~anndata._core.anndata.AnnData | ~typing.List[~anndata._core.anndata.AnnData], layer: str | None = None, nwalksteps: int = 1, stepsize: int = 5, lr: float = 0.01, nepochs: int = 8000, binsizes: ~typing.List[int] = [4, 8, 16], binweights: ~typing.Callable | ~typing.List[float] = <function <lambda>>, max_unimproved_count: int | None = 50, seed: int = 0, prior: str | None = 'gamma', std_layer: str = 'std', prior_k: float = 0.01, prior_theta: float = 1.0, prior_a: float = 1.0, estimate_scales: bool = False, chunk_size: int | None = None, alpha_layer: str = 'alt', beta_layer: str = 'ref', receiver_signals: ~typing.Any | None = None, resample_frequency: int = 10, nevalsamples: int = 1000, preload: bool = True, quiet: bool = False)

Compute spatial information for each gene in an an AnnData, or list of AnnData objects.

If a list of of AnnData objects is used, the spatial information score is computed jointly across each.

Every adata must have a spatial neighborhood graph provided. The easiest way to do this is with:

squidpy.gr.spatial_neighbors(adata, delaunay=True, coord_type="generic")

Spatial information scores, which are added to the adata.var[“spatial_information”], represent a lower bound on spatial auto-mutual information. They are normalized so that 0 represents a total lack of spatial coherence, and increasing positive numbers more spatial coherence.

The bound on mutual information is computed by training extremely simple classifiers on pairs of nearby cells/spots. The classifier is trained to recognize when spatial arrangement has been shuffled. Informally, spatial information is then defined as how easy it is to tell when expression have been shuffled across spatial positions. In a highly spatially coherent expression pattern, the distribution of pairs of nearby values shifts dramatically. In the lack of any spatial coherence, this distribution does not change.

Nearby pairs of nodes are sampled by performing random walks on the neighborhood graph. The length of the walks partially controls the scale of the spatial patterns that are detected. Longer walks will tend to recognize only very broad spatial patterns, while short walks only very precise ones. In that way, spatial information in not necessarily comparable when two different walk lengths are used.

Parameters:
  • adatas – Either a single AnnData objects, or a list of AnnData objects. If a list is given, they must all have the same set of genes in the same order. Each AnnData must have a spatial neighborhood graph provided in obsp[“spatial_connectivities”] This can be done with the squidpy.gr.spatial_neighbors function.

  • layer – Name of layer to use. If None, the X matrix is used.

  • nwalksteps – Random walks take this many steps. Lengthening walks (by increasing this parameter or stepsize) will make the test less sensitive to smaller scale spatial variations, but more sensitive to large scale variations.

  • stepsize – Each random walk step follows this many edges on the neighborhood graph.

  • lr – Optimizer learning rate.

  • nepochs – Run the optimization step for this many iterations.

  • nevalsamples – Estimate MI bound by resampling expression and random walks this many times.

  • max_unimproved_count – Early stopping criteria for optimization. If the the MI lower bound has not been improved for any gene for this many iterations, stop iterating.

  • seed – Random number generator seed.

  • prior – Account for uncertainty in expression estimates by resampling expression while training. If None, do no resampling. If “gamma”, use a model appropriate for absolute counts, if “dirichlet” use a model appropriate for proprotional counts. If set to “beta” in conjunction with setting alpha_layer and beta_layer to two separate count layers, use a model suitable for testing for spatial patterns in allelic balance. If “gaussian”, the model expects a matrix nammed std in layers holding standard deviations for the estimates held in X.

  • std_layer – Name of layer containing standard deviation estimates for the expression values in X. Should be the same shape as X.

  • prior_k – Set the k in a Gamma(k, θ) if prior is “gamma”,

  • prior_theta – Set the θ in Gamma(k, θ) if prior is “gamma”,

  • prior_a – Use either a Beta(a, a) or Dirichlet(a, a, a, …) prior depending on the value of prior.

  • estimate_scales – When using a dirichlet prior, try to estimate scales to adjust proportions to capture relative absolute abundance

  • chunk_size – How many genes to score at a time, mainly affecting memory usage. When None, a reasonable number will be chosen to avoid using too much memory.

  • alpha_layer – When using a beta prior, gives the name of the layer (in adata.layers) for the α parameter.

  • beta_layer – When using a beta prior, gives the name of the layer (in adata.layers) for the β parameter.

  • receiver_signals – Instead of computing auto-spatial mutual information for each gene, compute the spatial mutual information between each gene and the given signal, which must be either 1-dimensional or the same number of dimensions as there are genes. This signal is not resampled during training.

  • resample_frequency – Resample expression values after this many iterations. This tends to be computationally expensive, so is not done every iteration during optimization.

  • preload – If multiple AnnData objects are used, load everything into GPU memory at once, rather than as needed. This is considerably faster when GPU memory is not an issue.

  • quiet – Don’t print stuff to stdout while running.

Returns:

  • anndata.AnnData.var[“spatial_information”]: Lower bound on spatial

    information for each gene.

  • anndata.AnnData.var[“spatial_information_pvalue”]: P-values from

    testing the null hypothesis of no spatial organization.

  • anndata.AnnData.var[“spatial_information_log_pvalue”]: Log transformed p-values.

  • anndata.AnnData.layers[“spatial_information_acc”]: Per spot/cell

    classifier accuracy. Useful for visualizing what regions were inferred to have high spatial coherence..

Return type:

Modifies each adata with with the following keys

maxspin.pairwise_spatial_information(adatas: ~anndata._core.anndata.AnnData | ~typing.List[~anndata._core.anndata.AnnData], layer: str | None = None, nwalksteps: int = 2, stepsize: int = 5, lr: float = 0.01, nepochs: int = 8000, binsizes: ~typing.List[int] = [4, 8, 16], binweights: ~typing.Callable | ~typing.List[float] = <function <lambda>>, nevalsamples: int = 1000, max_unimproved_count: int | None = 50, seed: int = 0, prior: str | None = 'gamma', std_layer: str = 'std', prior_k: float = 0.01, prior_theta: float = 10.0, prior_a: float = 1.0, chunk_size: int | None = None, alpha_layer: str = 'alt', beta_layer: str = 'ref', resample_frequency: int = 10, preload: bool = True, quiet: bool = False)

Compute pairwise spatial information for each pair of genes in an AnnData or list of AnnData objects.

Parameters:
  • adatas – Either a single AnnData objects, or a list of AnnData objects. If a list is given, they must all have the same set of genes in the same order. Each AnnData must have a spatial neighborhood graph provided in obsp[“spatial_connectivities”] This can be done with the squidpy.gr.spatial_neighbors function.

  • layer – Name of layer to use. If None, the X matrix is used.

  • nwalksteps – Random walks take this many steps. Lengthening walks (by increasing this parameter or stepsize) will make the test less sensitive to smaller scale spatial variations, but more sensitive to large scale variations.

  • stepsize – Each random walk step follows this many edges on the neighborhood graph.

  • lr – Optimizer learning rate.

  • nepochs – Run the optimization step for this many iterations.

  • nevalsamples – Estimate MI bound by resampling expression and random walks this many times.

  • max_unimproved_iount – Early stopping criteria for optimization. If the the MI lower bound has not been improved for any gene for this many iterations, stop iterating.

  • seed – Random number generator seed.

  • prior – Account for uncertainty in expression estimates by resampling expression while training. If None, do no resampling. If “gamma”, use a model appropriate for absolute counts. If set to “beta” in conjunction with setting alpha_layer and beta_layer to two separate count layers, use a model suitable for testing for spatial patterns in allelic balance. If “gaussian”, the model expects a matrix nammed std in obsm holding standard deviations for the estimates held in X.

  • prior_k – Set the k in a Gamma(k, θ) if prior is “gamma”,

  • prior_theta – Set the θ in Gamma(k, θ) if prior is “gamma”,

  • prior_a – Use a Beta(a, a) prior.

  • chunk_size – How many genes to score at a time, mainly affecting memory usage. When None, a reasonable number will be chosen to avoid using too much memory.

  • resample_frequency – Resample expression values after this many iterations. This tends to be computationally expensive, so is not done every iteration during optimization.

  • preload – If multiple AnnData objects are used, load everything into GPU memory at once, rather than as needed. This is considerably faster when GPU memory is not an issue.

  • quiet – Don’t print stuff to stdout while running.

Returns:

  • anndata.AnnData.varp[“pairwise_spatial_information”]: Pairwise

    spatial information matrix between genes.

Return type:

Modifies each adata adding

Indices and tables