As of TileDB-SOMA 1.15 we’re proud to support a more intuitive and extensible notion of shape.
Please also see the Academy tutorial.
Let’s load the bundled SOMAExperiment containing a subsetted version of the 10x Genomics PBMC dataset provided by SeuratObject. This will return a SOMAExperiment object.
library(tiledbsoma)
exp <- load_dataset("soma-exp-pbmc-small")
exp
#> <SOMAExperiment>
#> uri: /tmp/RtmpjwKO7K/soma-exp-pbmc-small
The obs dataframe has a domain, which is a soft limit on what values can be written to it. (You’ll get an error if you try to read or write soma_joinid values outside this range, which is an important data-integrity reassurance.)
The domain we see here matches the data populated inside of it. (This will usually be the case. It might not, if you’ve created the dataframe but not written any data to it yet; at that point it’s empty but it still has a shape.)
If you have more data (more cells) to add to the experiment later, you will be able to resize obs, up to the maxdomain, which is a hard limit.
head(as.data.frame(exp$obs$read()$concat()))
#> soma_joinid orig.ident nCount_RNA nFeature_RNA RNA_snn_res.0.8
#> 1 0 SeuratProject 70 47 0
#> 2 1 SeuratProject 85 52 0
#> 3 2 SeuratProject 87 50 1
#> 4 3 SeuratProject 127 56 0
#> 5 4 SeuratProject 173 53 0
#> 6 5 SeuratProject 70 48 0
#> letter.idents groups RNA_snn_res.1 obs_id
#> 1 A g2 0 ATGCCAGAACGACT
#> 2 A g1 0 CATGGCCTGTGCAT
#> 3 B g2 0 GAACCTGATGAACC
#> 4 A g2 0 TGACTGGATTCTCA
#> 5 A g2 0 AGTCAGACTGCACA
#> 6 A g1 0 TCTGATACACGTGT
The var dataframe’s domain is similar:
Likewise, the N-dimensional arrays within the experiment have their shapes as well.
There’s an important difference: while the dataframe domain gives you the inclusive lower and upper bounds for soma_joinid writes, the shape for the N-dimensional arrays is the upper bound plus 1.
Since there are 80 cells and 230 genes here, X’s shape reflects that.
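To make the off-by-one relationship concrete, here is a minimal base-R sketch. The domain values are assumptions chosen to match the 80 cells and 230 genes mentioned above (zero-based soma_joinid values), and shape_from_domain is a hypothetical helper, not part of the tiledbsoma API.

```r
# Hypothetical helper: an inclusive zero-based domain c(lo, hi) corresponds
# to an N-dimensional-array extent of hi + 1 along that dimension.
shape_from_domain <- function(domain) {
  stopifnot(length(domain) == 2, domain[1] <= domain[2])
  domain[2] + 1
}

# Assumed domains: soma_joinid runs 0..79 for obs and 0..229 for var.
obs_domain <- c(0, 79)
var_domain <- c(0, 229)

# The X array is nobs x nvar, so its shape is built from both upper bounds.
x_shape <- c(shape_from_domain(obs_domain), shape_from_domain(var_domain))
print(x_shape)
```

The point of the sketch is only the arithmetic: a dataframe domain names the last valid soma_joinid, while an array shape names the count of valid positions.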
The other N-dimensional arrays are similar:
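# Assumed definition: obsm is taken from the RNA measurement (not shown in the original).
obsm <- exp$ms$get("RNA")$obsm
list(
  obsm$get("X_pca")$shape(),
  obsm$get("X_pca")$maxshape()
)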
#> [[1]]
#> integer64
#> [1] 80 19
#>
#> [[2]]
#> integer64
#> [1] 9223372036854775728 9223372036854775789
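# Assumed definition: obsp is taken from the RNA measurement (not shown in the original).
obsp <- exp$ms$get("RNA")$obsp
list(
  obsp$get("RNA_snn")$shape(),
  obsp$get("RNA_snn")$maxshape()
)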
#> [[1]]
#> integer64
#> [1] 80 80
#>
#> [[2]]
#> integer64
#> [1] 9223372036854775728 9223372036854775728
In particular, the X array in this experiment, and in most experiments, is sparse. That means there needn’t be a number in every row or cell of the matrix. Nonetheless, the shape serves as a soft limit for reads and writes: you’ll get an exception if you try to read or write outside these bounds.
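As an illustration of what "outside" means here, the following base-R sketch checks zero-based coordinates against a shape. It does not call tiledbsoma; within_shape is a hypothetical helper, and the shape is the 80 x 230 value discussed above.

```r
# Hypothetical helper: TRUE when every zero-based coordinate lies inside
# the shape, i.e. 0 <= coord < shape along each dimension.
within_shape <- function(coords, shape) {
  stopifnot(length(coords) == length(shape))
  all(coords >= 0 & coords < shape)
}

x_shape <- c(80, 230)

# The last valid cell is (79, 229); row 80 is one past the end.
last_cell_ok <- within_shape(c(79, 229), x_shape)
past_end_ok <- within_shape(c(80, 0), x_shape)
print(c(last_cell_ok, past_end_ok))
```

A sparse array may leave most cells inside the shape empty; the check above concerns only whether a coordinate is addressable, not whether it holds a value.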
As a general rule you’ll see the following:
- X array’s shape is nobs x nvar
- obsm array’s shape is nobs x some number, maybe 20
- obsp array’s shape is nobs x nobs
- varm array’s shape is nvar x some number, maybe 20
- varp array’s shape is nvar x nvar
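The rules of thumb above can be written down as a small base-R function. This is only a sketch: expected_shapes is a hypothetical helper, and the default of 20 for the second obsm/varm dimension simply mirrors the "maybe 20" in the list.

```r
# Hypothetical helper: the shapes one would expect for each collection,
# given the number of observations (cells) and variables (genes).
expected_shapes <- function(nobs, nvar, n_obsm = 20, n_varm = 20) {
  list(
    X = c(nobs, nvar),
    obsm = c(nobs, n_obsm),
    obsp = c(nobs, nobs),
    varm = c(nvar, n_varm),
    varp = c(nvar, nvar)
  )
}

# For this dataset: 80 cells, 230 genes, and 19 PCA components in X_pca.
shapes <- expected_shapes(nobs = 80, nvar = 230, n_obsm = 19)
str(shapes)
```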
In the SOMA data model, the SOMASparseNDArray and SOMADenseNDArray objects always have int64 dimensions named soma_dim_0, soma_dim_1, and up, and they have a numeric soma_data attribute for the contents of the array. There are no exceptions to this.
exp$ms$get("RNA")$X$get("data")$schema()
#> Schema
#> soma_dim_0: int64 not null
#> soma_dim_1: int64 not null
#> soma_data: double not null
For dataframes, though, while there must be a soma_joinid column of type int64, you can have one or more other index columns in addition, or soma_joinid can be a non-index column.
exp$obs$schema()
#> Schema
#> soma_joinid: int64 not null
#> orig.ident: dictionary<values=string, indices=int8>
#> nCount_RNA: double
#> nFeature_RNA: int32
#> RNA_snn_res.0.8: dictionary<values=string, indices=int8>
#> letter.idents: dictionary<values=string, indices=int8>
#> groups: large_string
#> RNA_snn_res.1: dictionary<values=string, indices=int8>
#> obs_id: large_string
But really, dataframes are capable of more than that, via the index-column names you specify at creation time.
Let’s create a couple of dataframes, with the same data, but different choices of index-column names.
asch <- arrow::schema(
  arrow::field("soma_joinid", arrow::int64(), nullable = FALSE),
  arrow::field("mystring", arrow::large_utf8(), nullable = FALSE),
  arrow::field("myint", arrow::int32(), nullable = FALSE),
  arrow::field("myfloat", arrow::float32(), nullable = FALSE)
)
# Column values for two rows.
soma_joinid <- c(0, 1)
mystring <- c("hello", "world")
myint <- c(33, 44)
myfloat <- c(4.5, 5.5)
# schema = asch is added here so the column types match the schema above.
tbl <- arrow::arrow_table(
  soma_joinid = soma_joinid,
  mystring = mystring,
  myint = myint,
  myfloat = myfloat,
  schema = asch
)
# Assumed scratch location; the original does not show how sdfuri1 is set.
sdfuri1 <- tempfile()
sdf1 <- SOMADataFrameCreate(
  sdfuri1,
  asch,
  index_column_names = c("soma_joinid", "mystring"),
  domain = list(soma_joinid = c(0, 9), mystring = NULL)
)
sdf1$write(tbl)
sdf1$close()
Now let’s look at the domain and maxdomain for these dataframes.
Here we see the soma_joinid slot of the dataframe’s domain is as requested.
Another point is that a domain cannot be specified for string-type index columns. At creation time you can give the slot in one of two ways: mystring = NULL, as above, or mystring = c("", ""). In either case the domain slot for a string-typed index column will read back as ('', '').
sdf1$maxdomain()
#> $soma_joinid
#> integer64
#> [1] 0 9223372036854773759
#>
#> $mystring
#> [1] "" ""
Now let’s look at our other dataframe. Here soma_joinid is not an index column at all. This is fine, as long as within the data you write to it, the index-column values uniquely identify each row.
# Assumed scratch location; the original does not show how sdfuri2 is set.
sdfuri2 <- tempfile()
sdf2 <- SOMADataFrameCreate(
  sdfuri2,
  asch,
  index_column_names = c("myfloat", "myint"),
  domain = list(myfloat = c(0, 9999), myint = c(-1000, 1000))
)
sdf2$write(tbl)
sdf2$close()
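The uniqueness requirement on index-column values can be checked on the R side before writing. The sketch below uses base R only; index_is_unique is a hypothetical helper, not a tiledbsoma function, and the data.frame simply repeats the two rows used in this example.

```r
# Same values as the table written above, as a plain data.frame.
df <- data.frame(
  soma_joinid = c(0, 1),
  mystring = c("hello", "world"),
  myint = c(33L, 44L),
  myfloat = c(4.5, 5.5)
)

# Hypothetical helper: TRUE when no two rows share the same combination
# of values in the chosen index columns.
index_is_unique <- function(df, index_column_names) {
  anyDuplicated(df[, index_column_names, drop = FALSE]) == 0
}

# The two rows differ in (myfloat, myint), so this index choice is valid.
ok <- index_is_unique(df, c("myfloat", "myint"))
print(ok)
```

If the check returned FALSE, the chosen index columns would not identify rows uniquely, and a different set (for example one including soma_joinid) would be needed.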
The domain reads back as written:
In the TileDB-SOMA Python API, there is a method for resizing all the dataframes and arrays within an experiment. The TileDB-SOMA R API does not yet offer a corresponding method, because demand for it has been low. Nonetheless, for completeness, here is guidance on how to resize dataframes and arrays within a TileDB-SOMA experiment.
For N-dimensional arrays that have been upgraded, or that were created using TileDB-SOMA 1.15 or higher, simply do the following:
- If $tiledbsoma_has_upgraded_shape() reports FALSE, invoke the $tiledbsoma_upgrade_shape() method.
- Otherwise, invoke the $resize() method.

Let’s do a fresh unpack of a pre-1.15 experiment:
exp <- load_dataset("soma-exp-pbmc-small-pre-1.15")
exp
#> <SOMAExperiment>
#> uri: /tmp/RtmpjwKO7K/soma-exp-pbmc-small-pre-1.15
Here we see that the X array has not been upgraded, and that its shape reports the same as maxshape:
Given that pre-1.15 TileDB-SOMA-R arrays were created with a maxshape leaving no room for growth, these arrays cannot have their shape resized any further. From 1.15 onward, as we’ve seen above, arrays are created with room for growth and you can resize them upward.
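The upgrade-or-resize rule for N-dimensional arrays given earlier could be wrapped in a small helper, sketched here. The helper name grow_array is hypothetical, and passing the requested shape as the single argument to both methods is an assumption about their signatures. A stand-in object is used so the sketch runs without a SOMA array on disk.

```r
# Hypothetical helper: upgrade the array if it has not been upgraded yet,
# otherwise resize it. Returns which action was taken.
# Assumption: both methods accept the requested shape as their argument.
grow_array <- function(arr, new_shape) {
  if (!arr$tiledbsoma_has_upgraded_shape()) {
    arr$tiledbsoma_upgrade_shape(new_shape)
    "upgraded"
  } else {
    arr$resize(new_shape)
    "resized"
  }
}

# Stand-in object exposing the three method names used in the text.
mock_array <- function(upgraded) {
  list(
    tiledbsoma_has_upgraded_shape = function() upgraded,
    tiledbsoma_upgrade_shape = function(shape) invisible(NULL),
    resize = function(new_shape) invisible(NULL)
  )
}

action_old <- grow_array(mock_array(upgraded = FALSE), c(80, 230))
action_new <- grow_array(mock_array(upgraded = TRUE), c(100, 230))
print(c(action_old, action_new))
```

With a real array, the requested shape still has to fit within maxshape, so for the pre-1.15 arrays described above the upgrade step is the only one available.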