Quick overview

Attention

This repository has been archived. Please use xarray.DataTree instead.

DataTrees

DataTree is a tree-like container of xarray.DataArray objects, organised into multiple groups, each of which contains mutually alignable arrays. You can think of it as a (recursive) dict of xarray.Dataset objects.
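To make the "recursive dict" analogy concrete, here is a minimal pure-Python sketch. It uses plain nested dicts and lists as stand-ins for groups and arrays, and get_path is a hypothetical helper written for this illustration; it is not part of datatree or xarray, and it says nothing about how DataTree is implemented internally.

```python
def get_path(tree, path):
    """Walk a nested dict using a '/'-separated path (hypothetical helper)."""
    node = tree
    for part in path.strip("/").split("/"):
        if part:  # skip the empty piece produced by the root path "/"
            node = node[part]
    return node


# Plain lists stand in for arrays; nested dicts stand in for groups.
tree = {
    "heights": [1.57, 1.82],
    "simulation": {
        "coarse": {"foo": [1, 2]},
        "fine": {"foo": [1, 2, 3]},
    },
}

coarse_foo = get_path(tree, "simulation/coarse/foo")
root = get_path(tree, "/")
```

Here the path "/" resolves to the top-level dict itself, mirroring the way the root group of a tree is addressed.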

Let’s first make some example xarray datasets (following on from xarray’s quick overview page):

In [1]: import numpy as np

In [2]: import xarray as xr

In [3]: data = xr.DataArray(np.random.randn(2, 3), dims=("x", "y"), coords={"x": [10, 20]})

In [4]: ds = xr.Dataset(dict(foo=data, bar=("x", [1, 2]), baz=np.pi))

In [5]: ds
Out[5]: 
<xarray.Dataset> Size: 88B
Dimensions:  (x: 2, y: 3)
Coordinates:
  * x        (x) int64 16B 10 20
Dimensions without coordinates: y
Data variables:
    foo      (x, y) float64 48B 0.4691 -0.2829 -1.509 -1.136 1.212 -0.1732
    bar      (x) int64 16B 1 2
    baz      float64 8B 3.142

In [6]: ds2 = ds.interp(coords={"x": [10, 12, 14, 16, 18, 20]})

In [7]: ds2
Out[7]: 
<xarray.Dataset> Size: 248B
Dimensions:  (x: 6, y: 3)
Coordinates:
  * x        (x) int64 48B 10 12 14 16 18 20
Dimensions without coordinates: y
Data variables:
    foo      (x, y) float64 144B 0.4691 -0.2829 -1.509 ... -1.136 1.212 -0.1732
    bar      (x) float64 48B 1.0 1.2 1.4 1.6 1.8 2.0
    baz      float64 8B 3.142

In [8]: ds3 = xr.Dataset(
   ...:     dict(people=["alice", "bob"], heights=("people", [1.57, 1.82])),
   ...:     coords={"species": "human"},
   ...: )
   ...: 

In [9]: ds3
Out[9]: 
<xarray.Dataset> Size: 76B
Dimensions:  (people: 2)
Coordinates:
  * people   (people) <U5 40B 'alice' 'bob'
    species  <U5 20B 'human'
Data variables:
    heights  (people) float64 16B 1.57 1.82

Now we’ll put this data into a multi-group tree:

In [10]: from datatree import DataTree

In [11]: dt = DataTree.from_dict({"simulation/coarse": ds, "simulation/fine": ds2, "/": ds3})

In [12]: dt
Out[12]: 
DataTree('None', parent=None)
│   Dimensions:  (people: 2)
│   Coordinates:
│     * people   (people) <U5 40B 'alice' 'bob'
│       species  <U5 20B 'human'
│   Data variables:
│       heights  (people) float64 16B 1.57 1.82
└── DataTree('simulation')
    ├── DataTree('coarse')
    │       Dimensions:  (x: 2, y: 3)
    │       Coordinates:
    │         * x        (x) int64 16B 10 20
    │       Dimensions without coordinates: y
    │       Data variables:
    │           foo      (x, y) float64 48B 0.4691 -0.2829 -1.509 -1.136 1.212 -0.1732
    │           bar      (x) int64 16B 1 2
    │           baz      float64 8B 3.142
    └── DataTree('fine')
            Dimensions:  (x: 6, y: 3)
            Coordinates:
              * x        (x) int64 48B 10 12 14 16 18 20
            Dimensions without coordinates: y
            Data variables:
                foo      (x, y) float64 144B 0.4691 -0.2829 -1.509 ... -1.136 1.212 -0.1732
                bar      (x) float64 48B 1.0 1.2 1.4 1.6 1.8 2.0
                baz      float64 8B 3.142

This creates a datatree with several groups. We have one root group, containing information about individual people. (The root group can be named, but here it is unnamed, so it is referred to with "/", like the root of a unix-like filesystem.) The root group has one subgroup, simulation, which contains no data itself but does contain two further subgroups, named fine and coarse.

The (sub-)sub-groups fine and coarse contain two very similar datasets. They both have an "x" dimension, but the dimension has a different length in each group, which means the data in one group cannot be aligned with the data in the other. In the root group we placed some completely unrelated information, showing how we can use a tree to store heterogeneous data.

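The constraint on each group is therefore the same as the constraint on the dataarrays within a single dataset: the variables inside a group must be alignable with one another, but they need not align with the variables in other groups.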
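As a rough illustration of what "unalignable" means here, the sketch below builds two small stand-in datasets (not the ones from the session above) whose "x" coordinates differ, and asks xarray to align them exactly. The variable names and the zero-filled data are assumptions made for the example.

```python
import numpy as np
import xarray as xr

# Two stand-in datasets sharing a dimension name but not a grid.
coarse = xr.Dataset({"foo": ("x", np.zeros(2))}, coords={"x": [10, 20]})
fine = xr.Dataset(
    {"foo": ("x", np.zeros(6))}, coords={"x": [10, 12, 14, 16, 18, 20]}
)

# join="exact" refuses to reindex, so mismatched "x" labels raise an error.
try:
    xr.align(coarse, fine, join="exact")
    alignable = True
except ValueError:
    alignable = False

# A dataset on the same grid as `coarse` aligns without complaint.
same_grid = xr.Dataset({"bar": ("x", np.ones(2))}, coords={"x": [10, 20]})
aligned_coarse, aligned_same = xr.align(coarse, same_grid, join="exact")
```

Datasets like coarse and fine are the kind that cannot share one group but can sit in separate groups of the same tree.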

We created the sub-groups using a filesystem-like syntax, and accessing groups works the same way. We can access individual dataarrays in a similar fashion:

In [13]: dt["simulation/coarse/foo"]
Out[13]: 
<xarray.DataArray 'foo' (x: 2, y: 3)> Size: 48B
array([[ 0.4691123 , -0.28286334, -1.5090585 ],
       [-1.13563237,  1.21211203, -0.17321465]])
Coordinates:
  * x        (x) int64 16B 10 20
Dimensions without coordinates: y

and we can also pull out the data in a particular group as a Dataset object using .ds:

In [14]: dt["simulation/coarse"].ds
Out[14]: 
<xarray.DatasetView> Size: 88B
Dimensions:  (x: 2, y: 3)
Coordinates:
  * x        (x) int64 16B 10 20
Dimensions without coordinates: y
Data variables:
    foo      (x, y) float64 48B 0.4691 -0.2829 -1.509 -1.136 1.212 -0.1732
    bar      (x) int64 16B 1 2
    baz      float64 8B 3.142

Operations map over subtrees, so we can take a mean over the x dimension of both the fine and coarse groups with a single call:

In [15]: avg = dt["simulation"].mean(dim="x")

In [16]: avg
Out[16]: 
DataTree('simulation', parent=None)
├── DataTree('coarse')
│       Dimensions:  (y: 3)
│       Dimensions without coordinates: y
│       Data variables:
│           foo      (y) float64 24B -0.3333 0.4646 -0.8411
│           bar      float64 8B 1.5
│           baz      float64 8B 3.142
└── DataTree('fine')
        Dimensions:  (y: 3)
        Dimensions without coordinates: y
        Data variables:
            foo      (y) float64 24B -0.3333 0.4646 -0.8411
            bar      float64 8B 1.5
            baz      float64 8B 3.142

Here the "x" dimension used is always the one local to that sub-group.
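One way to picture this mapping is as the same Dataset operation applied to each group independently. The sketch below emulates that with a plain dict of datasets; it is an illustration of the idea under that assumption, not DataTree's actual implementation, and the datasets are small stand-ins invented for the example.

```python
import numpy as np
import xarray as xr

# A flat dict of groups; each group has its own local "x" dimension.
groups = {
    "coarse": xr.Dataset({"bar": ("x", [1.0, 2.0])}, coords={"x": [10, 20]}),
    "fine": xr.Dataset(
        {"bar": ("x", np.linspace(1.0, 2.0, 6))},
        coords={"x": np.arange(10, 22, 2)},
    ),
}

# Apply the same reduction to every group; "x" is resolved per group.
averaged = {name: ds.mean(dim="x") for name, ds in groups.items()}

coarse_bar = float(averaged["coarse"]["bar"])
fine_bar = float(averaged["fine"]["bar"])
```

Each group is reduced over its own "x", of length 2 or 6, and neither group needs to know about the other.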

You can do almost everything you can do with Dataset objects with DataTree objects (including indexing and arithmetic), as operations will be mapped over every sub-group in the tree. This allows you to work with multiple groups of non-alignable variables at once.

Note

If all of your variables are mutually alignable (i.e. they live on the same grid, such that every common dimension name maps to the same length), then you probably don’t need DataTree, and should consider just sticking with xarray.Dataset.