CellRank 2: unified fate mapping in multiview single-cell data (2024)

CellRank 2: a unified framework to probabilistically model cellular state changes

To overcome the inherent limitations of RNA velocity and unify TI across different data views, we developed CellRank 2; our framework describes cellular dynamics probabilistically, as proposed in our earlier work¹⁴. Specifically, we introduced the first probabilistic modeling framework that automatically determines the direction of cellular state changes to extend TI beyond normal development. Generalizing this paradigm to different biological priors and guaranteeing applicability in many scenarios required us to rethink the CellRank structure entirely. To this end, we base our new version on three key principles:

1.
Robustness: fate restriction is a gradual, noisy process requiring probabilistic treatment. Therefore, we use Markov chains to describe stochastic fate transitions, with each state of the Markov chain representing one cell.
2.
Modularity: quantifying transition probabilities between cells is independent of analyzing them. Thus, we modularized the CellRank framework into kernels to compute transition probabilities and estimators to analyze transition probabilities. This structure guarantees flexibility in applications and is easily extensible.
3.
Scalability: we assume each cell can transition into a small set of possible descendant states. Consequently, transition matrices are sparse, and computations scale to vast cell numbers (Extended Data Fig. 2).

Innovations in CellRank 2

Our design principles allowed us to improve our original work in three major aspects:

1.
We design a modular interface that allows us to decouple the construction of a Markov chain from the process of formulating a hypothesis based on the Markov chain.
2.
We introduce the PseudotimeKernel, CytoTRACEKernel and RealTimeKernel, as well as a method to infer kinetic rates from metabolic-labeling data, to render CellRank 2 applicable beyond RNA velocity; we do so by using a pseudotime, a measure of developmental potential, time-course data and metabolic-labeling information, respectively.
3.
We make our framework faster by accelerating our main estimator by one order of magnitude, easier to use by refactoring our codebase and more interpretable by visualizing kernel dynamics via random walks.

Key outputs of CellRank 2

Although inputs to CellRank 2 are kernel-dependent (Extended Data Fig. 1), outputs are consistent across all kernels:

Initial, intermediate and terminal states of cellular trajectories.
Fate probabilities, quantifying how likely each cell is to reach each terminal (or intermediate) state.
Gene expression trends specific to each identified trajectory.
Putative driver genes of fate decisions through correlating gene expression with fate probability.
Dedicated visualization tools for all key outputs, for example, circular embeddings for fate probabilities, heatmaps for cascades of trajectory-specific gene expression and line plots for gene trends along different trajectories.

A conceptual overview of kernels in CellRank 2

Decoupling inference of transition probability from their analysis

The typical CellRank 2 workflow consists of two steps: (1) estimating cell–cell transition probabilities and (2) deriving biological insights based on these estimates. Previously, we tied these two steps together¹⁴ but realized that decoupling them yields a much more powerful and flexible modeling framework. Treating each step separately is possible as analyzing transition matrices is independent of their construction. For example, estimating transition probabilities based on RNA velocity or a pseudotime does not change how initial and terminal states are inferred or fate probabilities estimated. Consequently, modularizing our problem-specific framework generalizes the corresponding analysis tools to other data modalities. The two steps of our inference workflow are conceptualized by kernels and estimators, respectively.

Kernels estimate transition matrices $T\in {{\mathbb{R}}}^{{n}_{\rm{c}}\times {n}_{\rm{c}}}$ at a cellular resolution with n_c denoting the number of cells; row T_j,: represents the transition probabilities of cell j toward putative descendants. With CellRank 2, we provide means to quantify fate probabilities based on RNA velocity (VelocityKernel), pseudotime (PseudotimeKernel), a developmental potential (CytoTRACEKernel), experimental time points (RealTimeKernel) and metabolic labeling (metabolic-labeling-based vector field with the VelocityKernel).

Initial state identification

Kernel-derived transition matrices quantify probabilities of cell transitions to putative progenitor states. To estimate initial states, instead, we work with the transposed transition matrix, thereby quantifying transition probabilities from progenitor cells to their putative ancestors. Each kernel automatically row-normalizes the transposed transition matrix.

Kernel combination

Different data modalities may capture different aspects of biological processes. To take advantage of multiple data modalities, kernels can be combined to quantify the likely state change in a single, aggregated transition matrix. Consider two kernels k⁽¹⁾ and k⁽²⁾ with corresponding transition matrices T⁽¹⁾ and T⁽²⁾, respectively. CellRank 2 allows combining the two kernels into a joint kernel k defined as k = αk⁽¹⁾ + (1 − α)k⁽²⁾ with a weight parameter α ∈ [0, 1]. The corresponding normalized transition matrix T is computed automatically and is thus, given by

$${T}_{\!jk}=\alpha T_{jk}^{\,(1)}+(1-\alpha ){T}_{\!jk}^{\,(2)}.$$

The terminal state identification score for kernel comparison

If terminal states of the studied system are known a priori, kernels can be compared by considering how well the kernels identify terminal states with an increasing number of macrostates; an optimal strategy identifies a new terminal state with every added macrostate until all terminal states have been identified. We summarize the performance of an arbitrary kernel relative to such an optimal identification with the TSI score: consider a system containing m terminal states and the function f that assigns each number of macrostates n the corresponding number of identified terminal states. In the case of a strategy that identifies terminal states optimally, f_opt describes the step function

$${f}_{{{{\rm{opt}}}}}(n)=\left\{\begin{array}{ll}n,\quad &n < m\\ m,\quad &n\ge m.\end{array}\right.$$

We define the TSI score for an arbitrary kernel κ as the area under the curve f_κ relative to the area under the curve f_opt, that is

$$\mathrm {TSI}(\kappa )=\frac{\sum\nolimits_{n = 1}^{{N}_{\mathrm {max}}}{f}_{\kappa }(n)}{\sum\nolimits_{n = 1}^{{N}_{\mathrm {max}}}{f}_{{{{\rm{opt}}}}}}=\frac{2}{m(1+{N}_{\mathrm {max}}-m)}\sum\limits_{n=1}^{{N}_{\mathrm {max}}}{f}_{\kappa }(n),$$

with maximum number of macrostates assessed N_max.

Kernel comparison via the cross-boundary correctness score

While the CellRank 2 framework aims at quantifying cell trajectories, correct transitions between coarse cell states, such as cell types, are sometimes known a priori. In such cases, the CBC score²⁹ can be used to compare two kernels: Consider two cell states $C_{1}$ and $C_{2}$, where $C_{2}$ is a progenitor state of $C_{1}$, a precomputed nearest-neighbor graph with weights w_jk between observations j and k and denote the neighborhood of observation j by $N(j\,)$. The representation of observation j is denoted by x_j; all cell representations are collected in the matrix X. We define the boundary of $C_{1}$ to $C_{2}$ as all cells with at least one neighbor in $C_{2}$ and denote it by ${\partial }_{1\to 2}C_{1}$, that is

$${\partial }_{1\to 2}C_{1}=\{j\in C_{1}| \exists k\in N(j\,):k\in C_{2}\}$$

For every boundary cell, we empirically define the velocity v(j) of observation $j\in C_{1}$ as

$$v(j\,)=\mathop{\sum}\limits_{k\in N(j\,)\cap C_{2}}{w}_{\!jk}\left({x}_{k}-{x}_{\!j}\right).$$

Similarly, for a given kernel κ, we estimate the velocity of observation j via

$${v}^{(\kappa )}(j\,)={T}_{\!j,:}^{\,(\kappa )}X-{x}_{\,\!j},$$

where ${T}_{\!j,:}^{\,(\kappa )}$ denotes the jth row of the transition matrix computed with kernel κ. The CBC score β^(κ)(j) of cell j under kernel κ is then given by the Pearson correlation between v(j) and v^(κ)(j).

To compare two kernels κ₁ and κ₂, for each observation, we compute the log ratio of the corresponding CBC scores ${\beta }^{({\kappa }_{1})}(j\,)$ and ${\beta }^{({\kappa }_{2})}(j\,)$. If the velocity estimate based on kernel κ₁ aligns more with the empirical estimate, the log ratio is positive and negative otherwise. A one-sided Welch’s t-test can be used to test if kernel κ₁ significantly outperforms kernel κ₂.

Visualizing kernel dynamics: random walks and projections

Although inferred transition probabilities are predominantly used for more in-depth data analyses based on estimators, we also provide means to visualize cellular dynamics directly based on the kernel output. These visualizations are intended to provide a preliminary understanding of the underlying dynamics and serve as a starting point for further analyses. Here, we enable studying the evolution of cellular state change either based on random walks in the high-dimensional gene expression space or a projection of the high-dimensional vector field onto a low-dimensional latent space representation of the data.

A conceptual overview of estimators in CellRank 2

Based on transition matrices provided by kernels, we enable data-driven knowledge discovery. To this end, estimators first identify initial, intermediate and terminal states using the precomputed transition matrices. States are identified using concepts and results from the rich theory of Markov chains. Following this, we enable visualizing trajectory-specific gene expression trends and cascades of gene activation¹⁴, clustering expression trends¹⁴ or arranging cells in a circular embedding^14,74 to summarize fate probabilities. We provide the necessary tools for each step of the downstream analysis as part of CellRank 2.

The generalized Perron cluster cluster analysis estimator

As in our previous work, we compute macrostates and classify them as initial, intermediate and terminal by coarse-graining the cell–cell transition matrix. This approach is based on generalized Perron cluster cluster analysis^19,20 (GPCCA), a method initially developed to study conformational protein dynamics.

Performance improvements of CellRank 2

Faster computation of fate probabilities

After estimating cell–cell transition probabilities through a kernel and identifying terminal states through an estimator, we assess cellular fate toward these terminal states. For each cell, we quantify its fate probability, that is, how likely it is to differentiate into one of the terminal states. Given our Markov-chain-based framework, fate probabilities can be computed in closed form using absorption probabilities; however, calculating absorption probabilities directly scales cubically in the number of cells. To overcome this computational burden, in our previous work, we reformulated the underlying problem as a set of linear systems. These linear systems are then solved in parallel using a sparsity-optimized iterative algorithm⁷⁵; this reformulation scales near-linearly¹⁴.

Even though our previously proposed reformulation for computing absorption probabilities achieved a significant increase in performance compared to a naive implementation, we still encountered increased runtimes when analyzing larger datasets (Extended Data Fig. 2a). To reduce the runtime further, we devised an alternative but equivalent approach: Given a terminal state, we previously identified n_f representative cells, computed absorption probabilities toward them, and aggregated them across the n_f representative cells to assign a single, lineage-specific probability. In CellRank 2, we first combine the n_f representative cells into a single pseudo-state and compute absorption probabilities toward it instead. While the corresponding results are mathematically equivalent, ignoring parallelization, this new approach is n_f times faster. Therefore, with n_f = 30 by default, our improved implementation results in a 30-fold speed-up.

Extensibility of CellRank 2

While we already provide multiple kernels tailored to different data modalities, current and future technologies provide additional sources of information. Concrete examples include spatially resolved time-course studies^59,60,61 and genetic lineage-tracing data^55,56,57,58, previously already integrated in the CellRank 2 ecosystem^23,24. Our modular interface makes CellRank 2 easily extensible toward (1) alternative single-cell data modalities by including new kernels and (2) alternative trajectory descriptions generating different hypotheses through new estimators.

The PseudotimeKernel: incorporating previous knowledge on differentiation

Aligning cells along a continuous pseudotime mimicking the underlying differentiation process has been studied in many use cases. In particular, a pseudotime can be computed for systems where a single, known initial cellular state develops unidirectionally into a set of unknown terminal states. Based on the assigned pseudotime values, we quantify transition probabilities between cells using the PseudotimeKernel.

Given a similarity-based nearest-neighbor graph with a corresponding adjacency matrix $\tilde{C}$, the PseudotimeKernel biases graph edges toward increasing pseudotime: consider a reference cell j, one of its neighbors k, the corresponding edge weight ${\tilde{C}}_{\!jk}$ and the difference between their pseudotimes Δt_jk. To favor cellular transitions toward increasing pseudotime, the PseudotimeKernel downweighs graph edges pointing into the reference cell’s pseudotemporal past while leaving the remaining edges unchanged. Edge weights are updated according to

$$C_{\!jk}={\tilde{C}}_{\!jk\;}f(\Delta {t}_{\!jk}),$$

with a function f implementing the thresholding scheme. In CellRank 2, we implement soft and hard thresholding. The soft scheme continuously downweighs edge weights according to

$$f(\Delta t)=\left\{\begin{array}{ll}\frac{2}{\root \nu \of {1+{e}^{b\Delta t}}},\quad &\Delta t < 0\\ 1,\quad &\Delta t\ge 0.\end{array}\right.$$

By default, the parameters b and ν are set to 10 and 0.5, respectively. This concept is similar to the scheme proposed by the TI method VIA²⁵. In contrast to soft thresholds, hard thresholding follows a stricter policy inspired by Palantir⁵, discarding most edges that point into the pseudotemporal past.

The CytoTRACEKernel: inferring directionality from developmental potential

CytoTRACE assigns each cell in a given dataset a developmental potential¹⁸. Score values range from 0 to 1, with 0 and 1 identifying mature and naive cells, respectively. Inverting the score, thus, defines a pseudotime for developmental datasets. In CellRank 2, the CytoTRACEKernel computes the CytoTRACE score and constructs the corresponding pseudotime to calculate a transition matrix as described for the PseudotimeKernel.

Adaptation of the CytoTRACE score

When calculating the CytoTRACE score on larger datasets, we found the score construction either intractable due to long runtimes (40,000 to 80,000 cells) or failed to compute the score at all (more than 80,000 cells) (Extended Data Fig. 2b). Thus, to ensure computational efficiency when reconstructing the CytoTRACE score for larger datasets, we sought an alternative, computationally efficient and numerically highly correlated approach.

Conceptually, CytoTRACE proposes that the number of expressed genes decreases with cellular maturity. This assumption is biologically motivated by less-developed cells regulating their chromatin less tightly¹⁸. The computation of the CytoTRACE score c with CellRank 2 is composed of three main steps (Extended Data Fig. 4a). Consider the gene expression matrix X and the smoothed gene expression matrix X^(smoothed) found by nearest-neighbor smoothing as implemented in scVelo⁸ or MAGIC⁷⁶. For each cell j, we compute the number of genes it expresses (${{\rm{GEC}}}\,\in {{\mathbb{N}}}^{{n}_{\rm{c}}}$), that is

$${{{\rm{GEC}}}}_{\!\,j}=\mathop{\sum}\limits_{k=1}^{{n}_{\rm{g}}}{\mathbb{1}}({X}_{\!jk} > 0),$$

with indicator function ${\mathbb{1}}(\cdot )$. The indicator function equates to one if its argument holds true and zero otherwise. Next, for each gene, we compute its Pearson correlation with GEC, select the top L genes (default 200) and subset X^(smoothed) to the identified L genes. Finally, we mean-aggregate each cell’s gene expression

$${\tilde{c}}_{\!\,j}=\mathop{\sum }\limits_{k=1}^{L}{\tilde{X}}_{\!jk},$$

with $\tilde{X}\in {{\mathbb{R}}}^{{n}_{\rm{c}}\times L}$ denoting the subsetted, smoothed gene expression matrix X^(smoothed). The CytoTRACE score c is then given by scaling $\tilde{c}$ to the unit interval

$${{c}_{\!j}}=\frac{{{\tilde{c}}_{\!j}}-\min \tilde{c}}{\max \tilde{c}},$$

and the corresponding pseudotime p_cyt by inverting c, that is

$${p}_{{{{\rm{cyt}}}}}=1-c.$$

Comparison of the CytoTRACE score construction

Considering the nearest-neighbor-smoothed gene expression matrix, instead of an alternative, computationally more costly imputation scheme, is the main difference between our adaptation and the original CytoTRACE proposal. To impute gene expression, the original implementation solves a non-negative least squares regression problem and simulates a diffusion process¹⁸.

To confirm that our adapted scheme yields numerically similar results, we compared the CytoTRACE scores of the original and our approach to ground truth time or stage labels on six datasets previously used to validate CytoTRACE¹⁸ (Extended Data Fig. 4b,c). The considered datasets are bone marrow⁷⁷ (using 10x and SmartSeq2), C. elegans embryogenesis⁷⁸ (subsetted to ciliated neurons, hypodermis and seam, or muscle and mesoderm) and zebrafish embryogenesis⁷⁹. For each dataset, the original CytoTRACE study derived ground truth time labels using either embryo time, stages (C. elegans and zebrafish embryogenesis) or a manual assignment (bone marrow). The concordance of each approach with ground truth was confirmed by calculating the Spearman rank correlation between the CytoTRACE score and ground truth time or stage labels.

The RealTimeKernel: resolving non-equilibria systems through time-series data

Commonly used single-cell sequencing protocols are destructive by design and offer, thus, only a discrete temporal resolution. Recent advances allow reconstructing transcriptomic changes across experimental time points using OT¹⁵; however, these approaches focus only on inter-time-point information; conversely, the RealTimeKernel incorporates both inter- and intra-time-point transitions to draw a more complete picture of cellular dynamics.

To quantify inter-time-point transitions, the RealTimeKernel relies on WOT¹⁵. For each tuple of consecutive time points t_j and t_j+1, WOT identifies a transport map ${\pi }_{{t}_{j},{t}_{j+1}}$, assigning each cell at time t_j its likely future state at time t_j+1. In addition, we rely on transcriptomic similarity to study transcriptomic change within a single time point t_j. We combine WOT-based inter-time point transport maps ${\pi }_{{t}_{j},{t}_{j+1}}$ with similarity-based intra-time-point transition matrices ${\tilde{T}}_{{t}_{j},{t}_{j}}$ in a global transition matrix T which contains cells from all time points. In the global transition matrix T, we place WOT-computed transport maps on the first off-diagonal, modeling transitions between subsequent time points, and similarity-based transition matrices on the diagonal, modeling transitions within each time point (Fig. 4a). We normalize each row to sum to one, giving rise to a Markov chain description of the system.

Thresholding transport maps for scalability

In the single-cell domain, most OT-based approaches, including WOT, rely on entropic regularization³⁶ to speed up the computation of transport maps; however, entropic regularization leads to dense transport maps ${\pi }_{{t}_{j},{t}_{j+1}}$, rendering downstream computations based on the RealTimeKernel extremely expensive for larger datasets (Extended Data Fig. 6c); most of the entries found in ${\pi }_{{t}_{j},{t}_{j+1}}$ are extremely small, though. As a result, these entries contribute only marginally to the observed dynamics (Extended Data Fig. 6d).

To ensure fast RealTimeKernel-based computations, we devised an adaptive thresholding scheme resulting in sparse transition matrices. Transition probabilities falling below a certain threshold are set to zero, all others are kept unchanged. Per default, we identify the smallest threshold τ that does not remove all transitions for any cell, that is

$$\tau =\mathop{\min}\limits_{j,k\in \{1,\ldots ,{n}_{\rm{c}}\}}{T}_{\!jk},\,\,\,{{\mbox{s.t.}}}\,\forall j\in \{1,\ldots ,{n}_{\rm{c}}\}\mathop{\sum}\limits_{k\in \{1,\ldots ,{n}_{\rm{c}}\}}{\mathbb{1}}(\tau \ge T_{\!jk})\ge 1,$$

with indicator function ${\mathbb{1}}(\cdot )$. Alternatively, the same heuristic can be applied for each time point independently or a user-defined threshold may be used. Following thresholding, we re-normalize the transition matrix such that rows sum to one again.

To verify that thresholding the transition matrix does not alter biological findings, we compared fate probabilities derived from the original and the thresholded transition matrix on a dataset of MEF reprogramming¹⁵. For each terminal state, we computed the Pearson correlation between fate probabilities estimated by each approach (Extended Data Fig. 6d).

Estimating cellular fate from time-resolved single-cell RNA sequencing data

Traditional single-cell sequencing protocols include cell lysis and are, thus, destructive by nature. Consequently, the transcriptome can only be measured once, resulting in snapshot data. Recently, metabolic-labeling approaches have been extended to single-cell resolution, providing an opportunity to overcome this challenge by measuring newly synthesized mRNA in a given time window⁸⁰. To label transcripts, current protocols rely on the nucleoside analogs 4-thiouridine (4sU; scSLAM-seq¹², sci-fate¹⁰, NASC-seq⁸¹, scNT-seq¹¹, Well-TEMP-seq⁸² and others⁸³) or 5-ethynyl-uridine (5EU; scEU-seq⁹, spinDrop⁶⁵, TEMPOmap¹³ and others⁶⁶).

Our study considers two types of labeling experiments: pulse and chase⁹. Pulse experiments consist of labeling n cell cultures, starting at times t_j, j ∈ {1, …, n}. Conversely, in chase experiments, cells are exposed to nucleoside analogs for long enough (for example, more than 24 h), resulting in only labeled transcripts. Following, these labeled transcripts are washed out, starting at times t_j. Similar to the pulse experiment, chase experiments include, in general, washing out at n different times. Finally, in both types of experiments, all cells are sequenced at a time t_f, naturally defining the labeling time (or duration) by ${\tau }_{l}^{(j\,)}={t}_{f}-{t}_{j}$.

Pulse and chase experiments allow measuring the production of mRNA. Here, we estimate cell-specific transcription and degradation rates, similar to a previous proposal in the scEU-seq study⁹. Specifically, for a particular gene, we assume mRNA levels r to evolve according to

$$\dot{r}=\alpha -\gamma r,$$

with transcription rate α and degradation rate γ. The corresponding solution is given by

$$r(t)={r}_{0}{e}^{-\gamma t}+\frac{\alpha }{\gamma }\left(1-{e}^{-\gamma t}\right).$$

Note that here, we assume gene-specific models, that is, gene–gene interactions are neglected. In the following, we will identify mRNA measurements from pulse and chase experiments by the superscripts (p) and (c), respectively.

Pulse experiments

Pulse experiments study the production of labeled RNA. As labeling starts at t_k, and no labeled transcripts exist before, the abundance of labeled mRNA r_l at times t_k and ${\tau }_{l}^{(k)}$ is given by

$${r}_{l}^{({\mathrm{p}})}({t}_{k}| \alpha ,\gamma )=0,$$

and

$${r}_{l}^{({\mathrm{p}})}\left({\tau }_{l}^{(k)}| \alpha ,\gamma\right)=\frac{\alpha }{\gamma }\left(1-{e}^{-\gamma {\tau }_{l}^{(k)}}\right).$$

Chase experiments

In chase experiments, mRNA degradation is studied by washing out labeled transcripts. Thus, labeled mRNA ${r}_{l}^{({\mathrm{c}})}$ at time ${\tau }_{l}^{(k)}$ follows

$${r}_{l}^{({\mathrm{c}})}\left({\tau }_{l}^{(k)}| \alpha ,\gamma ,{r}_{0}\right)={r}_{0}-\frac{\alpha }{\gamma }\left(1-{e}^{-\gamma {\tau }_{l}^{(k)}}\right),$$

where r₀ corresponds to the mRNA level when starting to wash out labeled transcripts. Before washing out labeled mRNA, no unlabeled transcripts are present, and thus, their abundance at time ${\tau }_{l}^{(k)}$ is modeled as

$${r}_{u}^{({\mathrm{c}})}\left({\tau }_{l}^{(k)}| \alpha ,\gamma\right)=\frac{\alpha }{\gamma }\left(1-{e}^{-\gamma {\tau }_{l}^{(k)}}\right).$$

Parameter inference

Considering measurements from both chase and pulse experiments, we denote the respective set of cells by $C$ and $P$. To estimate cell j and gene g specific model parameters α^(j, g), γ^(j, g), and ${r}_{0}^{(j,g)}$, we proceed as follows:

1.
Consider cell j and its principal component analysis (PCA) representation ${z}_{j}^{({{{\rm{PCA}}}})}$. For each labeling duration k, we determine the distance in PCA space between the reference cell j to each cell with labeling duration ${\tau }_{l}^{(k)}$. For each gene g, we then identify the 20 nearest cells with non-trivial expression in g. These cells, as well as all closer neighbors (with zero counts), define the set $N_{g}^{(k)}$, which we consider for parameter inference.
2.
To estimate model parameters, we minimize the quadratic loss ℓ defined as
$$\begin{array}{l}\ell\left({r}_{0}^{(j,g)},{\alpha }^{(j,g)},{\gamma }^{(j,g)}\right)=\mathop{\sum}\limits_{k}\mathop{\sum}\limits_{j\in N_{g}^{(k)}}\left[{r}_{l,\,j}\left({\tau }_{l}^{(k)}\right)-{\mathbb{1}}(j\in C){r}_{l}^{({\mathrm{c}})}\left({\tau }_{l}^{(k)}| \alpha ,\gamma ,{r}_{0}\right)\right.\\\qquad\qquad\qquad\qquad\quad\left.-{\mathbb{1}}(j\in P){r}_{l}^{({\mathrm{p}})}\left({\tau }_{l}^{(k)}| \alpha ,\gamma\right)\right]^{2}.\end{array}$$

Here, ${\mathbb{1}}(x\in X)$ denotes the characteristic function equaling to 1 if $x\in X$, and 0 otherwise. We note that estimating the parameters of a pulse (chase) experiment requires at least two (three) labeling durations.

Our approach differs from the scEU-seq study⁹ mainly in two ways. First, we base our analysis on total RNA, not spliced RNA. We reasoned that this approach circumvents limitations of identifying unspliced and spliced counts. Second, we infer rates for all genes and not only those changing substantially during development.

Method comparison

To benchmark the performance of different approaches, we identified and ranked potential drivers of every lineage using each approach. We compared this ranking to a curated list of known lineage markers and regulators. If the literature-based gene set were complete, an optimal method would rank the corresponding genes highest. Consequently, for each method, we quantified its performance as follows. First, consider a lineage, a set of known drivers $D$ and a method m. Further, denote the set of genes by $G$, and for $g\in G$, identify its assigned rank by a superscript, for example, g^(j) for the jth ranked gene g. Next, for each threshold $N\in {\mathbb{N}}$ and $N\le |G|$, we computed how many known markers/regulators were ranked among the top N genes with

$${\varphi }^{(m)}(N\,)=\left|\left\{{\,g}^{\,(j\,)}| \;j\le N\wedge {g}^{\,(j\,)}\in D\right\}\right| .$$

Thus, we call an assignment optimal when

$${\varphi }^{({{{\rm{opt}}}})}(N\,)=\left\{\begin{array}{ll}{\varphi }^{(m)}(N\,)+1,\quad &N < |D| \\ |D| ,\quad &\,{{\mbox{otherwise}}}\,.\end{array}\right.$$

Next, for each method m, we computed the area under the curve, AUC(m) of φ^(m), that is

$$\,{{\rm{AUC}}}\,(m)=\mathop{\sum}\limits_{N=1}^{|G| }{\varphi}^{(m)}(N)$$

and its relative area under the curve, AUC_rel(m) as

$${{{\rm{AUC}}}}_{{{{\rm{rel}}}}}(m)=\frac{\,{{\rm{AUC}}}(m)}{{{{\rm{AUC}}}}^{* }},$$

with

$${{{\mbox{AUC}}}}^{* }=\frac{|D| (|D| +1)}{2}+(|G| - |D| )|D| ,$$

that is, the area under the curve of an optimal assignment.

Datasets

Unless stated otherwise, all functions were run with default parameters. We ran our analyses in Python, relying on the standard single-cell biology tools Scanpy⁸⁴ and AnnData⁸⁵; we specify other relevant packages where applicable. For Scanpy-based workflows⁸⁴, we computed PCA embeddings, neighbor graphs and UMAP embeddings⁸⁶ with the scanpy.tl.pca, scanpy.pp.neighbors and scanpy.tl.umap functions, respectively.

For analyses based on CellRank 2 kernels, the kernel method compute_transition_matrix computed transition probabilities, and the kernel method cbc the CBC score. We used the GPCCA estimator functions to compute macrostates (compute_macrostates)²⁰ and the TSI score (tsi), define terminal states (set_terminal_states), compute fate probabilities (compute_fate_probabilities) and identify lineage-correlated genes (compute_lineage_drivers). To order the putative regulators according to their peak expression in pseudotime, we first fitted generalized additive models to describe gene expression change over pseudotime with cellrank.models.GAM. Following this, we visualized the putative cascade of regulation with the cellrank.pl.heatmap function.

Human hematopoiesis

All analyses were conducted on the dataset preprocessed by the original study²⁶, subsetted to the normoblast, dendritic and monocyte lineages according to the provided cell type annotation (‘HSC’, ‘MK/E progenitors’, ‘Proerythroblast’, ‘Erythroblast’, ‘Normoblast’, ‘cDC2’, ‘pDC’, ‘G/M prog’ and ‘CD14⁺ Mono’).

Pseudotime-based analysis

After subsetting the data, we computed the nearest-neighbor graph on the precomputed MultiVI^84,87 latent space and the UMAP embedding with Scanpy. Following this, we computed 15 diffusion components⁸⁸ (scanpy.tl.diffmap) to then assign diffusion pseudotime values using Scanpy’s dpt^1,88 function with n_dcs=6. We identified the root cell as the hematopoietic stem cell with the largest fifth diffusion component.

We computed the transition matrix with CellRank 2’s PseudotimeKernel and thresholding_scheme=‘soft’ and computed six macrostates²⁰. We defined the terminal states ‘pDC’, ‘CD14⁺ Mono’, ‘Normoblast’ and ‘cDC2’ that corresponded to the four macrostates with the largest macrostate purity. After quantifying fate probabilities, we identified putative pDC lineage drivers with our correlation-based procedure, restricted to the hematopoietic stem cell (HSC) and pDC clusters (lineages=[‘pDC’] and clusters=[‘HSC’, ‘pDC’]). We quantified the corresponding gene trends in the same way as described in our previous work¹⁴.

RNA velocity-based analysis

To infer RNA velocity, we generally followed the instructions provided by scVelo’s⁸ tutorials. First, we filtered for genes expressed in at least 20 cells in both unspliced and spliced counts with scVelo’s scvelo.pp.filter_genes function and normalized counts with scvelo.pp.normalize_per_cell. The neighbor graph was again computed on the MultiVI latent space, followed by count imputation through first-order moments with scVelo’s scvelo.pp.moments function. We then inferred RNA velocity with the scvelo.tl.recover_dymanics function.

To quantify cellular fate, we computed transition matrices with the VelocityKernel and ConnectivityKernel and combined them with 0.8 and 0.2 weight, respectively, as proposed by the CellRank 1 workflow. We computed macrostates and fate probabilities using the GPCCA estimator as described for the pseudotime-based analysis. As the RNA-velocity-based analysis did not identify the cDC cluster as a macrostate (Extended Data Fig. 3b,f), we computed three macrostates corresponding to the terminal states normoblasts, monocytes and pDCs.

Embryoid body development

Data preprocessing

We followed Scanpy’s workflow to process the raw count matrix. As a first step, we filtered out genes expressed in fewer than ten cells (scanpy.pp.filter_genes with min_cells=10). Following, we removed cells with more than 17,500 counts, cells for which more than 15% of counts originate from mitochondrial genes and cells expressing more than 3,500 genes. Following, we size normalized cells to 10,000 (scanpy.pp.normalize_total with total_sum=1e4), applied a log1p-transformation (scanpy.pp.log1p) and annotated highly variable genes with scanpy.pp.highly_variable_genes. We based all further analyses on these highly variable genes and the marker genes identified by the study introducing the embryoid body development dataset³². The neighbor graph was computed for 30 neighbors using 30 principal components (PCs).

CytoTRACEKernel analysis

To compute the CytoTRACE score¹⁸, we first imputed the normalized count matrix by first-order moments with scVelo’s scvelo.pp.moments function; the score itself was calculated with the compute_cytotrace method of the CytoTRACEKernel. We computed the transition matrix with the soft thresholding scheme (thresholding_scheme=‘soft’) and nu=5. Putative drivers of the endoderm lineage were identified by focusing on the stem cell and endoderm clusters (lineages=[‘EN-1’] and clusters=[‘ESC’]).

Pseudotime construction

To compute DPT¹, we calculated diffusion components (scanpy.tl.diffmap) and identified the putative root cell as the minimum in the first diffusion component. We then assigned DPT values using Scanpy’s dpt function.

For the Palantir pseudotime⁵, we used the corresponding Python package and followed the steps outlined in its documentation. As a first step, we computed the first five diffusion components with palantir.utils.run_diffusion_maps with n_components=5. Following this, we identified the multi-scale space of the data (palantir.utils.determine_multiscale_space) and imputed the data using MAGIC⁷⁶ (palantir.utils.run_magic_imputation). Finally, we computed the Palantir pseudotime via palantir.core.run_palantir using the same root cell as for our DPT analysis, and num_waypoints=500.

Mouse embryonic fibroblast reprogramming

Data preprocessing

For analyzing the dataset of MEF reprogramming toward induced pluripotent stem cells¹⁵, we subsetted to the serum condition and added the category ‘MEF/other’ to the cell set annotations. Then, we computed the PCA embedding and nearest-neighbor graph.

WOT-based analysis

To construct transport maps, we used the wot package¹⁵ and followed the provided tutorials. First, we instantiated an OT model (wot.ot.OTModel with day_field=‘day’) and computed the transport maps next (compute_all_transport_maps). We defined the target cell sets based on the provided cell type annotation and quantified WOT-based fates toward the last experimental time point through the OT model’s fates function with at_time=18.

RealTimeKernel-based analysis

For our RealTimeKernel-based analysis, we relied on the transport maps computed with wot. When constructing the transition matrix, we considered within-time-point transitions for every experimental time point and weighed them by 0.2 (self_transitions=‘all’, conn_weight=0.2).

To construct the real-time-informed pseudotime, we symmetrized the global transition matrix and row-normalized it. The symmetrized matrix defined the _transitions_sym attribute of Scanpy’s DPT class. Following, we computed diffusion components with the DPT class’ compute_eigen function. The root cell for DPT was identified as an extremum of the most immature cell state within the first experimental time point in diffusion space. Here, we selected the maximum in the first diffusion component. Finally, we computed DPT itself with scanpy.tl.dpt.

We computed fate probabilities toward the four terminal states according to our canonical pipeline (identification of four macrostates followed by fate quantification).

Pseudotime construction

To assign each observation its DPT¹ value, irrespective of experimental time points, we computed diffusion maps (scanpy.tl.diffmap) and identified the root cell as the maximum value in the first diffusion component. Then, we computed DPT with scanpy.tl.dpt.

For constructing the Palantir pseudotime⁵, irrespective of experimental time points, we followed the same steps as described for the embryoid body development data.

Pharyngeal endoderm development

Data preprocessing

The pharyngeal endoderm development dataset provided by the original study³⁷ had already been filtered for high-quality cells and genes. Consequently, we directly quantified highly expressed genes using Scanpy’s highly_variable_genes function. Then, we computed the PCA embedding and nearest-neighbor graph based on 30 PCs and 30 neighbors (n_pcs=30, n_neighbors=30).

RealTimeKernel-based analysis

To study the pharyngeal endoderm development dataset with the RealTimeKernel, we followed the WOT tutorials to compute transport maps. First, we instantiated an OT model (wot.ot.OTModel with day_field=‘day’) and computed the transport maps next (compute_all_transport_maps). For the RealTimeKernel, we considered within-time-point transitions for every experimental time point and weighed them by 0.1 (self_transitions=‘all’, conn_weight=0.1).

We estimated terminal states using the GPCCA estimator with default settings by computing 13 macrostates and selecting the known terminal clusters. After calculating fate probabilities, for each lineage, we identified lineage-correlated genes as candidate driver genes with GPCCA.compute_lineage_drivers by restricting the analysis to progenitors of the corresponding lineage and excluding cell cycle, mitochondrial, ribosomal and hemoglobin genes⁸⁹.

To study mTEC development, we subsetted to the early thymus, ultimobranchial body (UBB), parathyroid, cTEC and mTEC clusters and processed the data as described for the entire dataset. We computed the UMAP embedding using Scanpy’s umap function. To compute the transition matrix, we proceeded in the same manner as described for the entire dataset. For TSI, fate quantification, and driver analysis, we followed the standard CellRank 2 pipeline.

WOT-based analysis

To identify putative drivers of the mTEC lineage with WOT, we used the same transport maps as in our RealTimeKernel-based analysis. We defined the target cell sets based on the provided cell type annotation considering only observations from the last time point and computed the pullback distribution from the mTEC cluster at embryonic day (E) 12.5 to E10.5 cells as it consists of progenitor cells (pull_back). The sequence of ancestor distributions was quantified with the transport model’s trajectories method. WOT identifies putative drivers of the mTEC lineage as genes differentially expressed in cells most fated toward the mTEC cluster. We used the wot.tmap.diff_exp function to construct the corresponding gene ranking.

Classical differential expression analysis

As an alternative means to identify putative drivers of the mTEC lineage based on the fate probabilities assigned by our RealTimeKernel-based analysis, we defined two groups of cells within the general progenitor pool, those with mTEC fate probability greater than 0.5 and all other progenitor cells. We then identified differentially expressed genes between putative mTEC progenitors and all others with Scanpy’s rank_genes_groups function.

Intestinal organoids

Data preprocessing

To preprocess the dataset of intestinal organoids, we first excluded dimethylsulfoxide control cells and cells labeled as tuft cells. Following, we removed genes with fewer than 50 counts, size normalized total and labeled counts, and identified the 2,000 most highly variable genes with scvelo.pp.filter_and_normalize. The neighbor graph was constructed based on 30 PCs and 30 neighbors. Finally, we computed first-order-smoothed labeled and total mRNA counts.

Parameter estimation

To estimate kinetic rate parameters, we made use of our new inference scheme for metabolic-labeling data implemented as part of the scVelo package. We first masked observations according to their labeling time with scvelo.inference.get_labeling_time_mask. Next, we computed pairwise distances between observations in PCA space and sorted observations in ascending order for each time point using scvelo.inference.get_obs_dist_argsort. This information allowed us to identify, for each cell and gene, how many neighbors to consider during parameter estimation to include 20 non-zero observations smoothed by first-order moments. This calculation was performed via scvelo.inference.get_n_neighbors. Finally, we estimated model parameters based on smoothed labeled counts with scvelo.inference.get_parameters.

Labeling velocity-based analysis

To quantify cell-specific fates, we first computed labeled velocities based on the estimated parameters. Then, we computed a transition matrix by combining the VelocityKernel and ConnectivityKernel with a 0.8 and 0.2 weight, respectively. We then inferred 12 macrostates and fate probabilities toward the known terminal states. Lineage-specific drivers were identified by restricting the correlation-based analysis to the corresponding terminal state and stem cell cluster. For putative driver gene ranking based on gene expression, we correlated fate probabilities with smoothed labeled counts.

Dynamo-based analysis

To analyze the intestinal organoid data with dynamo¹⁶, we followed the tutorials provided in the documentation of the Python package. As a first step, this required us to compute the ratio of new to total RNA with dynamo.preprocessing.utils.calc_new_to_total_ratio followed by first-order moment imputation of total and new RNA using dynamo.tl.moments with our connectivity matrix and group=‘time’. Dynamo’s dynamo.tl.dynamics estimated the velocities with function arguments model=‘deterministic’, tkey=‘time’ and assumption_mRNA=‘ss’.

Following velocity estimation, we quantified fixed points by following dynamo’s corresponding pipeline. First, we projected the high-dimensional velocity field of new RNA onto the UMAP embedding using dynamo.tl.cell_velocities with ekey=‘M_n’ and vkey=‘velocity_N’. Fixed points were then identified by calling dynamo.tl.VectorField with basis=‘umap’ and dynamo.vf.topography. As a final step, we identified all stable fixed points.

Given the stable fixed points of the system, we identified lineage-correlated genes regulating cell differentiation toward them using dynamo’s least action path analysis. As a first step, this workflow required us to compute a UMAP embedding based on new RNA with dynamo.tl.reduceDimension and layer=‘X_new’, followed by the projection of the velocity field onto the PCA space (dynamo.tl.cell_velocities with basis=‘pca’) and learning a vector field function based on this projection (dynamo.tl.VectorField with basis=‘pca’). Next, we defined terminal states as the 30 nearest neighbors in UMAP space of each stable fixed point. For initial states, we computed the 30 nearest neighbors of unstable fixed points of the stem cell cluster. To compute the least action paths and account for uncertainty in initial and terminal state assignment, we randomly sampled ten pairs of initial and terminal cells and estimated the paths between them with dynamo.pd.least_action. Dynamo’s dynamo.pd.GeneTrajectory class then identified genes associated with the emergence of a terminal state. The pairwise sampling of initial and terminal cells defined the confidence bands of dynamo’s gene rankings shown in Fig. 5d.

RNA velocity-based analysis

Conventional RNA velocity was estimated with scVelo’s dynamical model⁸ by running scvelo.tl.recover_dynamics. To execute this function, we first preprocessed the raw data with scvelo.pp.filter_and_normalize to remove genes expressed in fewer than 50 cells (min_counts=50), size-normalizing spliced and unspliced counts and subsetting to the 2,000 most highly variable genes (n_top_genes=2000). Then, we computed the PCA embedding, calculated the neighbor graph with 30 PCs and 30 neighbors and smoothed unspliced and spliced counts by first-order moments (scvelo.pp.moments).

We combined the VelocityKernel and ConnectivityKernel weighted by 0.8 and 0.2, respectively, to estimate the cell–cell transition matrix. Next, we identified terminal states and corresponding fates and lineage-correlated gene rankings following the canonical CellRank 2 pipeline.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.