Introducing mRNA bias into simulations with scaling factors
Alexander Dietrich
11/29/2021
Source:vignettes/simulator_scaling_factors.Rmd
simulator_scaling_factors.Rmd
Using scaling factors
This package allows the user to introduce an mRNA bias into
pseudo-bulk RNA-seq samples. Different cell-types contain different
amounts of mRNA (Dendritic cells for examples contain much more than
Neutrophils); this bias can be added into simulations artificially in
different ways.
The scaling factors will always be applied on the single-cell dataset
first, altering its expression profiles accordingly, and then the
pseudo-bulk samples are generated by summing up the count data from the
sampled cells.
# Example data
counts <- Matrix::Matrix(matrix(stats::rpois(3e5, 5), ncol = 300), sparse = TRUE)
tpm <- Matrix::Matrix(matrix(stats::rpois(3e5, 5), ncol = 300), sparse = TRUE)
tpm <- Matrix::t(1e6 * Matrix::t(tpm) / Matrix::colSums(tpm))
colnames(counts) <- paste0("cell_", rep(1:300))
colnames(tpm) <- paste0("cell_", rep(1:300))
rownames(counts) <- paste0("gene_", rep(1:1000))
rownames(tpm) <- paste0("gene_", rep(1:1000))
annotation <- data.frame(
"ID" = paste0("cell_", rep(1:300)),
"cell_type" = c(rep("T cells CD4", 150), rep("T cells CD8", 150)),
"spikes" = stats::runif(300),
"add_1" = stats::runif(300),
"add_2" = stats::runif(300)
)
ds <- SimBu::dataset(
annotation = annotation,
count_matrix = counts,
name = "test_dataset"
)
#> Filtering genes...
#> Created dataset.
Pre-defined scaling factors
Some studies have proposed scaling factors for immune cells, such as EPIC (Racle et al. 2017) or quanTIseq (Finotello et al. 2019), deconvolution tools which correct for the mRNA bias internally using these values:
epic <- data.frame(
type = c(
"B cells",
"Macrophages",
"Monocytes",
"Neutrophils",
"NK cells",
"T cells",
"T cells CD4",
"T cells CD8",
"T helper cells",
"T regulatory cells",
"otherCells",
"default"
),
mRNA = c(
0.4016,
1.4196,
1.4196,
0.1300,
0.4396,
0.3952,
0.3952,
0.3952,
0.3952,
0.3952,
0.4000,
0.4000
)
)
epic
#> type mRNA
#> 1 B cells 0.4016
#> 2 Macrophages 1.4196
#> 3 Monocytes 1.4196
#> 4 Neutrophils 0.1300
#> 5 NK cells 0.4396
#> 6 T cells 0.3952
#> 7 T cells CD4 0.3952
#> 8 T cells CD8 0.3952
#> 9 T helper cells 0.3952
#> 10 T regulatory cells 0.3952
#> 11 otherCells 0.4000
#> 12 default 0.4000
quantiseq <- data.frame(
type = c(
"B cells",
"Macrophages",
"MacrophagesM2",
"Monocytes",
"Neutrophils",
"NK cells",
"T cells CD4",
"T cells CD8",
"T regulatory cells",
"Dendritic cells",
"T cells"
),
mRNA = c(
65.66148,
138.11520,
119.35447,
130.65455,
27.73634,
117.71584,
63.87200,
70.25659,
72.55110,
140.76091,
68.89323
)
)
quantiseq
#> type mRNA
#> 1 B cells 65.66148
#> 2 Macrophages 138.11520
#> 3 MacrophagesM2 119.35447
#> 4 Monocytes 130.65455
#> 5 Neutrophils 27.73634
#> 6 NK cells 117.71584
#> 7 T cells CD4 63.87200
#> 8 T cells CD8 70.25659
#> 9 T regulatory cells 72.55110
#> 10 Dendritic cells 140.76091
#> 11 T cells 68.89323
When you want to apply one of these scaling factors into your
simulation (therefore in-/decreasing the expression signals for the
cell-types), we can use the scaling_factor
parameter. Note,
that these pre-defined scaling factors only offer values for a certain
number of cell types, and your annotation in the provided dataset has to
match these names 1:1. All cell types from your dataset which are not
present in this scaling factor remain unscaled and a warning message
will appear.
sim_epic <- SimBu::simulate_bulk(
data = ds,
scenario = "random",
scaling_factor = "epic",
nsamples = 10,
ncells = 100,
BPPARAM = BiocParallel::MulticoreParam(workers = 4),
run_parallel = TRUE
)
#> Using parallel generation of simulations.
#> Using EPIC scaling factors.
#> Finished simulation.
We can also try out some custom scaling factors, for example
increasing the expression levels for a single cell-type (T cells CD8) by
10-fold compared to the rest. All cell-types which are not mentioned in
the named list given to custom_scaling_vector
will be
transformed with a scaling factor of 1, meaning nothing changes for
them.
sim_extreme_b <- SimBu::simulate_bulk(
data = ds,
scenario = "random",
scaling_factor = "custom",
custom_scaling_vector = c("T cells CD8" = 10),
nsamples = 10,
ncells = 100,
BPPARAM = BiocParallel::MulticoreParam(workers = 4),
run_parallel = TRUE
)
#> Using parallel generation of simulations.
#> Using custom scaling factors.
#> Warning in merge_scaling_factor(data = data, scaling_factor_values =
#> custom_scaling_vector, : For some cell type(s) in the dataset, no scaling
#> factor is available when using custom: T cells CD4. This cell type will not be
#> re-scaled.
#> Finished simulation.
Important: Watch out that the cell-type annotation names in your dataset are the same as in the scaling factor! Otherwise the scaling factor will not be applied or even worse, applied to a different cell-type.
Dataset specific scaling factors
You can also choose to calculate scaling factors, which are depending on your provided single-cell dataset. Compared to the previous section, this will give a unique value for each cell rather than a cell-type, making it possibly more sensitive.
Reads and genes
Two straight forward approaches would be the number of reads or number of expressed genes/features. As these values are easily obtainable from the provided count data, SimBu already calculates them during dataset generation.
sim_reads <- SimBu::simulate_bulk(
data = ds,
scenario = "random",
scaling_factor = "read_number", # use number of reads as scaling factor
nsamples = 10,
ncells = 100,
BPPARAM = BiocParallel::MulticoreParam(workers = 4),
run_parallel = TRUE
)
#> Using parallel generation of simulations.
#> Finished simulation.
sim_genes <- SimBu::simulate_bulk(
data = ds,
scenario = "random",
scaling_factor = "expressed_genes", # use number of expressed genes column as scaling factor
nsamples = 10,
ncells = 100,
BPPARAM = BiocParallel::MulticoreParam(workers = 4),
run_parallel = TRUE
)
#> Using parallel generation of simulations.
#> Finished simulation.
These options would also allow you to use other numerical
measurements you have for the single cells as scaling factors, such as
weight or size for example. Lets pretend, add_1
and
add_2
are such measurements. With the
additional_cols
parameter, they can be added to the SimBu
dataset and we can use them as scaling factor as well:
utils::head(annotation)
#> ID cell_type spikes add_1 add_2
#> 1 cell_1 T cells CD4 0.3808018 0.58311363 0.36420542
#> 2 cell_2 T cells CD4 0.9840723 0.68522140 0.32524211
#> 3 cell_3 T cells CD4 0.3830169 0.06389669 0.79255977
#> 4 cell_4 T cells CD4 0.8189628 0.15752521 0.03187535
#> 5 cell_5 T cells CD4 0.1328689 0.06110139 0.95851031
#> 6 cell_6 T cells CD4 0.8095495 0.72865441 0.79412270
ds <- SimBu::dataset(
annotation = annotation,
count_matrix = counts,
name = "test_dataset",
additional_cols = c("add_1", "add_2")
) # add columns to dataset
#> Filtering genes...
#> Created dataset.
sim_genes <- SimBu::simulate_bulk(
data = ds,
scenario = "random",
scaling_factor = "add_1", # use add_1 column as scaling factor
nsamples = 10,
ncells = 100,
BPPARAM = BiocParallel::MulticoreParam(workers = 4),
run_parallel = TRUE
)
#> Using parallel generation of simulations.
#> Scaling by custom column in annotation table; if no scaling is wished instead, use 'NONE'.
#> Finished simulation.
Spike-ins
One other numerical measurement can be spike-ins. Usually the number
of reads mapped to spike-in molecules per cell is given in the cell
annotations. If this is the case, they can be stored in the dataset
annotation using the spike_in_col
parameter, where you
indicate the name of the column from the annotation
dataframe in which the spike-in information is stored. To calculate a
scaling factor from this, the number of reads are also necessary, so we
will add this information as well (as above using the
read_number_col
parameter).
The scaling factor with spike-ins is calculated as the “% of reads
NOT mapped to spike-in reads”, or:
(n_reads - n_spike_in)/n_reads
for each cell. We apply it
like this:
ds <- SimBu::dataset(
annotation = annotation,
count_matrix = counts,
name = "test_dataset",
spike_in_col = "spikes"
) # give the name in the annotation file, that contains spike-in information
sim_spike <- SimBu::simulate_bulk(
data = ds,
scenario = "random",
scaling_factor = "spike_in", # use spike-in scaling factor
nsamples = 10,
ncells = 100,
BPPARAM = BiocParallel::MulticoreParam(workers = 4),
run_parallel = TRUE
)
Census - estimate mRNA counts per cell
Census
is an approach which tries to convert TPM counts into relative
transcript counts. This basically means, you get the mRNA counts per
cell, which can differ between cell-types.
(Qiu
et al. 2017) state in their paper, that it should only be
applied to TPM/FPKM normalized data, but I tried it out with raw
expression counts as well, which worked as well.
Census calculates a vector with a scaling value for each cell in a
sample. You can switch this feature on, by setting the
scaling_factor
parameter to census
.
ds <- SimBu::dataset(
annotation = annotation,
count_matrix = counts,
tpm_matrix = tpm,
name = "test_dataset"
)
#> Filtering genes...
#> Created dataset.
sim_census <- SimBu::simulate_bulk(
data = ds,
scenario = "random",
scaling_factor = "census", # use census scaling factor
nsamples = 10,
ncells = 100,
BPPARAM = BiocParallel::MulticoreParam(workers = 4),
run_parallel = TRUE
)
#> Using parallel generation of simulations.
#> Finished simulation.
In our analysis we found, that Census is basically a complicated way of estimating the number of expressed genes per cell. It will remain to the user to decide if he/she wants to use census or simply the number of expressed genes (as shown above) as scaling factor.
utils::sessionInfo()
#> R version 4.2.1 (2022-06-23)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 22.04.1 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> attached base packages:
#> [1] base
#>
#> loaded via a namespace (and not attached):
#> [1] SummarizedExperiment_1.26.1 SimBu_1.5.2
#> [3] xfun_0.34 bslib_0.4.1
#> [5] purrr_1.0.1 lattice_0.20-45
#> [7] vctrs_0.6.3 htmltools_0.5.3
#> [9] stats4_4.2.1 yaml_2.3.6
#> [11] grDevices_4.2.1 rlang_1.1.1
#> [13] pkgdown_2.0.6 jquerylib_0.1.4
#> [15] glue_1.6.2 BiocParallel_1.30.4
#> [17] BiocGenerics_0.42.0 matrixStats_0.62.0
#> [19] GenomeInfoDbData_1.2.8 lifecycle_1.0.3
#> [21] stringr_1.5.0 MatrixGenerics_1.8.1
#> [23] zlibbioc_1.42.0 ragg_1.2.4
#> [25] proxyC_0.3.3 codetools_0.2-18
#> [27] memoise_2.0.1 evaluate_0.17
#> [29] Biobase_2.56.0 knitr_1.40
#> [31] IRanges_2.30.1 fastmap_1.1.1
#> [33] GenomeInfoDb_1.34.9 parallel_4.2.1
#> [35] methods_4.2.1 Rcpp_1.0.10
#> [37] cachem_1.0.7 DelayedArray_0.22.0
#> [39] desc_1.4.2 S4Vectors_0.34.0
#> [41] RcppParallel_5.1.5 jsonlite_1.8.4
#> [43] XVector_0.36.0 systemfonts_1.0.4
#> [45] fs_1.5.2 textshaping_0.3.6
#> [47] stats_4.2.1 graphics_4.2.1
#> [49] datasets_4.2.1 digest_0.6.30
#> [51] stringi_1.7.12 GenomicRanges_1.48.0
#> [53] grid_4.2.1 rprojroot_2.0.3
#> [55] cli_3.6.1 tools_4.2.1
#> [57] bitops_1.0-7 magrittr_2.0.3
#> [59] sass_0.4.2 RCurl_1.98-1.12
#> [61] Matrix_1.6-2 utils_4.2.1
#> [63] sparseMatrixStats_1.8.0 rmarkdown_2.17
#> [65] rstudioapi_0.14 R6_2.5.1
#> [67] compiler_4.2.1