Data Structures#

Note

This page builds on the information given in xarray’s main page on data structures, so you should be familiar with those structures first.

DataTree#

DataTree is xarray’s highest-level data structure, able to organise heterogeneous data that could not be stored inside a single Dataset object. This includes representing the recursive structure of multiple groups within a netCDF file or Zarr store.

Each DataTree object (or “node”) contains the same data that a single xarray.Dataset would (i.e. DataArray objects stored under hashable keys), and so has the same key properties:

  • dims: a dictionary mapping of dimension names to lengths, for the variables in this node,

  • data_vars: a dict-like container of DataArrays corresponding to variables in this node,

  • coords: another dict-like container of DataArrays, corresponding to coordinate variables in this node,

  • attrs: a dict holding arbitrary metadata relevant to the data in this node.

A single DataTree object acts much like a single Dataset object, and has a similar set of dict-like methods defined upon it. However, DataTree objects can also contain other DataTree objects, so they can be thought of as nested dict-like containers of both xarray.DataArray objects and DataTree objects.

A single DataTree object is known as a “node”, and its position relative to other nodes is defined by two more key properties:

  • children: An ordered dictionary mapping from names to other DataTree objects, known as its “child nodes”.

  • parent: The single DataTree object whose children this node is a member of, known as its “parent node”.

Each child automatically knows about its parent node, and a node without a parent is known as a “root” node (represented by the parent attribute pointing to None). Nodes can have multiple children, but as each child node has at most one parent, there can only ever be one root node in a given tree.

The overall structure is technically a connected acyclic undirected rooted graph, otherwise known as a “Tree”.
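
The parent/children relationship described above can be sketched with a toy class in plain Python. This is only an illustration of the idea, not the DataTree implementation: `ToyNode`, `attach_to` and `root` are hypothetical names introduced here, and the node holds no data at all.

```python
from __future__ import annotations


class ToyNode:
    """Hypothetical stand-in for a tree node (not the DataTree class)."""

    def __init__(self, name: str, parent: ToyNode | None = None):
        self.name = name
        self.parent: ToyNode | None = None
        self.children: dict[str, ToyNode] = {}
        if parent is not None:
            self.attach_to(parent)

    def attach_to(self, parent: ToyNode) -> None:
        # Record the link on both sides so parent and children stay in sync.
        self.parent = parent
        parent.children[self.name] = self

    @property
    def root(self) -> ToyNode:
        # Follow parent links upwards until a node without a parent is found.
        node = self
        while node.parent is not None:
            node = node.parent
        return node


root = ToyNode("root")
a = ToyNode("a", parent=root)
b = ToyNode("b", parent=a)

print(b.root.name)          # root
print(list(root.children))  # ['a']
print(root.parent is None)  # True
```

Because every node has at most one parent, walking upwards from any node always ends at the same root, which is why a tree can only have one.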

Note

Technically a DataTree with more than one child node forms an “Ordered Tree”, because the children are stored in an Ordered Dictionary. However, this distinction only really matters for a few edge cases involving operations on multiple trees simultaneously, and can safely be ignored by most users.

Like a DataArray, a DataTree object can optionally have a name as well as attrs. Neither is normally used unless the user accesses it explicitly.

Creating a DataTree#

One way to create a DataTree from scratch is to create each node individually, specifying the nodes’ relationship to one another as you create each one.

The DataTree constructor takes:

  • data: The data that will be stored in this node, represented by a single xarray.Dataset, or a named xarray.DataArray.

  • parent: The parent node (if there is one), given as a DataTree object.

  • children: The various child nodes (if there are any), given as a mapping from string keys to DataTree objects.

  • name: A string to use as the name of this node.

Let’s make a single datatree node with some example data in it:

In [1]: import numpy as np
   ...: import xarray as xr
   ...: from datatree import DataTree

In [2]: ds1 = xr.Dataset({"foo": "orange"})

In [3]: dt = DataTree(name="root", data=ds1)  # create root node

In [4]: dt
Out[4]: 
DataTree('root', parent=None)
    Dimensions:  ()
    Data variables:
        foo      <U6 'orange'

At this point our single node is also the root node of the tree, since it has no parent.

We can add a second node to this tree either by referring to the first node in the constructor of the second:

In [5]: ds2 = xr.Dataset({"bar": 0}, coords={"y": ("y", [0, 1, 2])})

# add a child by referring to the parent node
In [6]: node2 = DataTree(name="a", parent=dt, data=ds2)

or by dynamically updating the attributes of one node to refer to another:

# add a second child by first creating a new node ...
In [7]: ds3 = xr.Dataset({"zed": np.nan})

In [8]: node3 = DataTree(name="b", data=ds3)

# ... then updating its .parent property
In [9]: node3.parent = dt

Our tree now has three nodes within it:

In [10]: dt
Out[10]: 
DataTree('root', parent=None)
│   Dimensions:  ()
│   Data variables:
│       foo      <U6 'orange'
├── DataTree('a')
│       Dimensions:  (y: 3)
│       Coordinates:
│         * y        (y) int64 0 1 2
│       Data variables:
│           bar      int64 0
└── DataTree('b')
        Dimensions:  ()
        Data variables:
            zed      float64 nan

Consistency checks are enforced as the tree is constructed. For instance, if we try to create a cycle, an error will be raised:

In [11]: dt.parent = node3
InvalidTreeError: Cannot set parent, as intended parent is already a descendant of this node.
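
One way such a check could work is to walk upwards from the intended parent and refuse the assignment if the node itself turns up among the ancestors. The sketch below is a toy model under that assumption, not the library's code: `CheckedNode` is a hypothetical class, and it raises a plain `ValueError` rather than `InvalidTreeError`.

```python
class CheckedNode:
    """Hypothetical toy node whose parent setter rejects cycles."""

    def __init__(self, name):
        self.name = name
        self._parent = None
        self.children = {}

    @property
    def parent(self):
        return self._parent

    @parent.setter
    def parent(self, new_parent):
        # Walk from the intended parent up to its root. Meeting `self` on the
        # way means the intended parent sits below this node already.
        ancestor = new_parent
        while ancestor is not None:
            if ancestor is self:
                raise ValueError(
                    "intended parent is already a descendant of this node"
                )
            ancestor = ancestor.parent
        # Detach from the old parent (if any) before attaching to the new one.
        if self._parent is not None:
            del self._parent.children[self.name]
        self._parent = new_parent
        if new_parent is not None:
            new_parent.children[self.name] = self


root = CheckedNode("root")
b = CheckedNode("b")
b.parent = root

try:
    root.parent = b  # would make root a child of its own child
except ValueError as err:
    message = str(err)
    print(message)
```

The failed assignment leaves the toy tree untouched, because the check runs before any link is changed.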

Alternatively you can also create a DataTree object from

DataTree Contents#

Like xarray.Dataset, DataTree implements the Python mapping interface, but with values given by either xarray.DataArray objects or other DataTree objects.

In [12]: dt["a"]
Out[12]: 
DataTree('a', parent="root")
    Dimensions:  (y: 3)
    Coordinates:
      * y        (y) int64 0 1 2
    Data variables:
        bar      int64 0

In [13]: dt["foo"]
Out[13]: 
<xarray.DataArray 'foo' ()>
array('orange', dtype='<U6')

Iterating over keys will iterate over both the names of variables and child nodes.
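
To make the combined key space concrete, here is a minimal sketch of a mapping that exposes variables and child nodes through one interface. `MappingNode` is a hypothetical toy class; the iteration order (variables first, then children) and the assumption that variable and child names never collide are choices made for this sketch, not statements about DataTree.

```python
from collections.abc import Mapping


class MappingNode(Mapping):
    """Hypothetical toy node exposing variables and children as one mapping."""

    def __init__(self, variables=None, children=None):
        self.variables = dict(variables or {})
        self.children = dict(children or {})

    def __getitem__(self, key):
        # Look among the variables first, then fall back to the child nodes.
        if key in self.variables:
            return self.variables[key]
        return self.children[key]  # raises KeyError for unknown keys

    def __iter__(self):
        # Keys cover both kinds of entry (order is an assumption of this toy).
        yield from self.variables
        yield from self.children

    def __len__(self):
        return len(self.variables) + len(self.children)


leaf = MappingNode(variables={"bar": 0})
tree = MappingNode(variables={"foo": "orange"}, children={"a": leaf})

print(list(tree))         # ['foo', 'a']
print(tree["foo"])        # orange
print(tree["a"] is leaf)  # True
```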

We can also access all the data in a single node through a dataset-like view:

In [14]: dt["a"].ds
Out[14]: 
<xarray.DatasetView>
Dimensions:  (y: 3)
Coordinates:
  * y        (y) int64 0 1 2
Data variables:
    bar      int64 0

This demonstrates the fact that the data in any one node is equivalent to the contents of a single xarray.Dataset object. The DataTree.ds property returns an immutable view, but we can instead extract the node’s data contents as a new (and mutable) xarray.Dataset object via DataTree.to_dataset():

In [15]: dt["a"].to_dataset()
Out[15]: 
<xarray.Dataset>
Dimensions:  (y: 3)
Coordinates:
  * y        (y) int64 0 1 2
Data variables:
    bar      int64 0
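
The difference between a read-only view and an extracted mutable copy can be illustrated with plain dictionaries from the standard library. This is only an analogy for the behaviour described above; it says nothing about how the dataset view is actually implemented, and the names `node_data`, `view` and `extracted` are made up for the example.

```python
from types import MappingProxyType

node_data = {"bar": 0}                # stands in for the data held by a node
view = MappingProxyType(node_data)    # read-only window onto the same data
extracted = dict(node_data)           # new, independent, mutable copy

try:
    view["bar"] = 1                   # the view rejects item assignment
except TypeError:
    print("the view is read-only")

extracted["bar"] = 1                  # the copy can be changed freely ...
print(node_data["bar"])               # 0 (... without affecting the node)
```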

As with Dataset, you can access the data and coordinate variables of a node separately via the data_vars and coords attributes:

In [16]: dt["a"].data_vars
Out[16]: 
Data variables:
    bar      int64 0

In [17]: dt["a"].coords
Out[17]: 
Coordinates:
  * y        (y) int64 0 1 2

Dictionary-like methods#

We can update a datatree in-place using Python’s standard dictionary syntax, similar to how we can for Dataset objects. For example, to create this example datatree from scratch, we could have written:

In [18]: dt = DataTree(name="root")

In [19]: dt["foo"] = "orange"

In [20]: dt["a"] = DataTree(data=xr.Dataset({"bar": 0}, coords={"y": ("y", [0, 1, 2])}))

In [21]: dt["a/b/zed"] = np.nan

In [22]: dt
Out[22]: 
DataTree('root', parent=None)
│   Dimensions:  ()
│   Data variables:
│       foo      <U6 'orange'
└── DataTree('a')
    │   Dimensions:  (y: 3)
    │   Coordinates:
    │     * y        (y) int64 0 1 2
    │   Data variables:
    │       bar      int64 0
    └── DataTree('b')
            Dimensions:  ()
            Data variables:
                zed      float64 nan
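
Note that assigning to the path "a/b/zed" created the intermediate node "b" along the way. A rough sketch of how path-style assignment could be resolved is given below; `PathNode` and `set_path` are hypothetical names, and the toy stores plain Python values rather than DataArray objects.

```python
class PathNode:
    """Hypothetical toy node supporting assignment via '/'-separated paths."""

    def __init__(self):
        self.variables = {}
        self.children = {}

    def set_path(self, path, value):
        # Everything before the last '/' names groups; the last part names
        # the variable to set inside the final group.
        *groups, var_name = path.split("/")
        node = self
        for group in groups:
            # Create intermediate nodes on demand.
            if group not in node.children:
                node.children[group] = PathNode()
            node = node.children[group]
        node.variables[var_name] = value


tree = PathNode()
tree.set_path("foo", "orange")
tree.set_path("a/b/zed", float("nan"))

print(tree.variables)                      # {'foo': 'orange'}
print(list(tree.children))                 # ['a']
print(list(tree.children["a"].children))   # ['b']
```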

To change the variables in a node of a DataTree, you can use all the standard dictionary methods, including values, items, __delitem__, get and DataTree.update(). Note that assigning a DataArray object to a DataTree variable using __setitem__ or update will automatically align the array(s) to the original node’s indexes.

If you copy a DataTree using the copy.copy() function or the DataTree.copy() method, the subtree is copied: that node and all the children below it, but none of the parents above it. As with Dataset, this copy is shallow by default, but you can copy all the underlying data arrays by calling dt.copy(deep=True).
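
The subtree semantics can be sketched with a toy node whose "data" is just a Python list. `CopyNode` is hypothetical, and the sketch assumes that "shallow" means new node objects that share the underlying data, while "deep" duplicates the data as well.

```python
import copy


class CopyNode:
    """Hypothetical toy node used to illustrate subtree copying."""

    def __init__(self, name, data, parent=None):
        self.name = name
        self.data = data
        self.parent = parent
        self.children = {}
        if parent is not None:
            parent.children[name] = self

    def copy(self, deep=False):
        # The copy starts at this node, so the new node has no parent.
        return self._copy_subtree(deep, parent=None)

    def _copy_subtree(self, deep, parent):
        data = copy.deepcopy(self.data) if deep else self.data
        new = CopyNode(self.name, data, parent=parent)
        for child in self.children.values():
            child._copy_subtree(deep, parent=new)
        return new


root = CopyNode("root", data=[1])
a = CopyNode("a", data=[2, 3], parent=root)
b = CopyNode("b", data=[4], parent=a)

shallow = a.copy()
deep = a.copy(deep=True)

print(shallow.parent is None)    # True: parents are not copied
print(list(shallow.children))    # ['b']: children are copied
print(shallow.data is a.data)    # True: shallow copy shares the data
print(deep.data is a.data)       # False: deep copy duplicates the data
```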