Data Structures#

Note

This page builds on the information given in xarray’s main page on data structures, so we suggest becoming familiar with those structures first.

DataTree#

DataTree is xarray’s highest-level data structure, able to organise heterogeneous data which could not be stored inside a single Dataset object. This includes representing the recursive structure of multiple groups within a netCDF file or Zarr Store.

Each DataTree object (or “node”) contains the same data that a single xarray.Dataset would (i.e. DataArray objects stored under hashable keys), and so has the same key properties:

  • dims: a dictionary mapping of dimension names to lengths, for the variables in this node,

  • data_vars: a dict-like container of DataArrays corresponding to variables in this node,

  • coords: another dict-like container of DataArrays, corresponding to coordinate variables in this node,

  • attrs: dict to hold arbitrary metadata relevant to data in this node.

A single DataTree object acts much like a single Dataset object, and has a similar set of dict-like methods defined upon it. However, DataTree objects can also contain other DataTree objects, so they can be thought of as nested dict-like containers of both xarray.DataArray and DataTree objects.

A single datatree object is known as a “node”, and its position relative to other nodes is defined by two more key properties:

  • children: An ordered dictionary mapping from names to other DataTree objects, known as its “child nodes”.

  • parent: The single DataTree object whose children this datatree is a member of, known as its “parent node”.

Each child automatically knows about its parent node, and a node without a parent is known as a “root” node (represented by the parent attribute pointing to None). Nodes can have multiple children, but as each child node has at most one parent, there can only ever be one root node in a given tree.

The overall structure is technically a connected acyclic undirected rooted graph, otherwise known as a “Tree”.
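The parent and children bookkeeping described above can be illustrated with a minimal sketch in plain Python. The `Node` class and its `root` property are hypothetical stand-ins written for this illustration, not the DataTree implementation; they only show how each child tracks a single parent, and how the root is the one node whose parent is None.

```python
from typing import Dict, Optional


class Node:
    """Hypothetical stand-in for a tree node (not the DataTree class)."""

    def __init__(self, name: str, parent: Optional["Node"] = None):
        self.name = name
        self.parent: Optional["Node"] = None
        # Plain dicts preserve insertion order, so the children are ordered.
        self.children: Dict[str, "Node"] = {}
        if parent is not None:
            parent.children[name] = self
            self.parent = parent

    @property
    def root(self) -> "Node":
        # Walk upwards until we reach the node whose parent is None.
        node = self
        while node.parent is not None:
            node = node.parent
        return node


oak = Node("Oak")
bonsai = Node("Bonsai", parent=oak)
maple = Node("Maple", parent=oak)

print(bonsai.parent.name)  # Oak
print(list(oak.children))  # ['Bonsai', 'Maple']
print(maple.root is oak)   # True
```

Because every node has at most one parent, following `parent` links from any node always ends at the same root.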

Note

Technically a DataTree with more than one child node forms an “Ordered Tree”, because the children are stored in an Ordered Dictionary. However, this distinction only really matters for a few edge cases involving operations on multiple trees simultaneously, and can safely be ignored by most users.

DataTree objects can also optionally have a name as well as attrs, just like a DataArray. Again these are not normally used unless explicitly accessed by the user.

Creating a DataTree#

There are two ways to create a DataTree from scratch. The first is to create each node individually, specifying the nodes’ relationship to one another as you create each one.

The DataTree constructor takes:

  • data: The data that will be stored in this node, represented by a single xarray.Dataset, or a named xarray.DataArray.

  • parent: The parent node (if there is one), given as a DataTree object.

  • children: The various child nodes (if there are any), given as a mapping from string keys to DataTree objects.

  • name: A string to use as the name of this node.

Let’s make a datatree node without anything in it:

In [1]: import numpy as np
   ...: import xarray as xr
   ...: from datatree import DataTree

# create root node
In [2]: node1 = DataTree(name="Oak")

In [3]: node1
Out[3]: DataTree('Oak', parent=None)

At this point our node is also the root node, as every tree has a root node.

We can add a second node to this tree either by referring to the first node in the constructor of the second:

# add a child by referring to the parent node
In [4]: node2 = DataTree(name="Bonsai", parent=node1)

or by dynamically updating the attributes of one node to refer to another:

# add a grandparent by updating the .parent property of an existing node
In [5]: node0 = DataTree(name="General Sherman")

In [6]: node1.parent = node0

Our tree now has three nodes within it, and one of the two new nodes has become the new root:

In [7]: node0
Out[7]: 
DataTree('General Sherman', parent=None)
└── DataTree('Oak')
    └── DataTree('Bonsai')

It is at tree construction time that consistency checks are enforced. For instance, if we try to create a cycle, a TreeError is raised:

In [8]: node0.parent = node2
---------------------------------------------------------------------------
TreeError                                 Traceback (most recent call last)
Cell In [8], line 1
----> 1 node0.parent = node2

File ~/checkouts/readthedocs.org/user_builds/xarray-datatree/conda/latest/lib/python3.10/site-packages/datatree/datatree.py:355, in DataTree.parent(self, new_parent)
    353 if new_parent and self.name is None:
    354     raise ValueError("Cannot set an unnamed node as a child of another node")
--> 355 self._set_parent(new_parent, self.name)

File ~/checkouts/readthedocs.org/user_builds/xarray-datatree/conda/latest/lib/python3.10/site-packages/datatree/treenode.py:102, in TreeNode._set_parent(self, new_parent, child_name)
    100 old_parent = self._parent
    101 if new_parent is not old_parent:
--> 102     self._check_loop(new_parent)
    103     self._detach(old_parent)
    104     self._attach(new_parent, child_name)

File ~/checkouts/readthedocs.org/user_builds/xarray-datatree/conda/latest/lib/python3.10/site-packages/datatree/treenode.py:115, in TreeNode._check_loop(self, new_parent)
    110     raise TreeError(
    111         f"Cannot set parent, as node {self} cannot be a parent of itself."
    112     )
    114 if self._is_descendant_of(new_parent):
--> 115     raise TreeError(
    116         "Cannot set parent, as intended parent is already a descendant of this node."
    117     )

TreeError: Cannot set parent, as intended parent is already a descendant of this node.
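The kind of check that produces this error can be sketched in plain Python. The `SketchNode` class, its `set_parent` method, and the local `TreeError` class below are hypothetical simplifications loosely modelled on the names visible in the traceback; they are not the library’s code.

```python
class TreeError(Exception):
    """Raised when an operation would stop the structure being a tree."""


class SketchNode:
    """Hypothetical node used only to illustrate the loop check."""

    def __init__(self, name):
        self.name = name
        self.parent = None
        self.children = {}

    def ancestors(self):
        # Yield the parent, grandparent, ... up to the root.
        node = self.parent
        while node is not None:
            yield node
            node = node.parent

    def set_parent(self, new_parent):
        if new_parent is self:
            raise TreeError("a node cannot be its own parent")
        # If this node is already an ancestor of new_parent,
        # attaching it underneath new_parent would close a loop.
        if new_parent is not None and any(a is self for a in new_parent.ancestors()):
            raise TreeError("intended parent is already a descendant of this node")
        if self.parent is not None:
            del self.parent.children[self.name]
        self.parent = new_parent
        if new_parent is not None:
            new_parent.children[self.name] = self


sherman = SketchNode("General Sherman")
oak = SketchNode("Oak")
bonsai = SketchNode("Bonsai")
oak.set_parent(sherman)
bonsai.set_parent(oak)

try:
    sherman.set_parent(bonsai)  # would create a cycle
except TreeError as err:
    print(f"rejected: {err}")
```

The check runs before anything is detached or attached, so a rejected assignment leaves the existing tree unchanged.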

The second way is to build the tree from a dictionary of filesystem-like paths and corresponding xarray.Dataset objects.

This relies on a syntax inspired by unix-like filesystems, where the “path” to a node is specified by the keys of each intermediate node in sequence, separated by forward slashes. The root node is referred to by "/", so the path from our current root node to its grand-child would be "/Oak/Bonsai". A path specified from the root (as opposed to being specified relative to an arbitrary node in the tree) is sometimes also referred to as a “fully qualified name”.
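As a rough illustration of this path syntax, the sketch below resolves a slash-separated path against a tree of nested plain dicts. The helper names `split_path` and `resolve` and the dict layout are assumptions made for this example only; the sketch treats every path as starting from the node passed in and does not model relative paths such as "..".

```python
def split_path(path):
    """Split a unix-like path into node names, ignoring empty segments."""
    return [part for part in path.split("/") if part]


def resolve(root, path):
    """Follow each name in ``path`` downwards, starting from ``root``."""
    node = root
    for name in split_path(path):
        node = node["children"][name]
    return node


# Hypothetical nested-dict stand-in for the three-node tree built earlier.
tree = {
    "name": "General Sherman",
    "children": {
        "Oak": {
            "name": "Oak",
            "children": {"Bonsai": {"name": "Bonsai", "children": {}}},
        },
    },
}

print(split_path("/Oak/Bonsai"))             # ['Oak', 'Bonsai']
print(resolve(tree, "/Oak/Bonsai")["name"])  # Bonsai
print(resolve(tree, "/")["name"])            # General Sherman
```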

If we have a dictionary where each key is a valid path, and each value is either valid data or None, we can construct a complex tree quickly using the alternative constructor DataTree.from_dict:

In [9]: d = {
   ...:     "/": xr.Dataset({"foo": "orange"}),
   ...:     "/a": xr.Dataset({"bar": 0}, coords={"y": ("y", [0, 1, 2])}),
   ...:     "/a/b": xr.Dataset({"zed": np.nan}),
   ...:     "a/c/d": None,
   ...: }
   ...: 

In [10]: dt = DataTree.from_dict(d)

In [11]: dt
Out[11]: 
DataTree('None', parent=None)
│   Dimensions:  ()
│   Data variables:
│       foo      <U6 'orange'
└── DataTree('a')
    │   Dimensions:  (y: 3)
    │   Coordinates:
    │     * y        (y) int64 0 1 2
    │   Data variables:
    │       bar      int64 0
    ├── DataTree('b')
    │       Dimensions:  ()
    │       Data variables:
    │           zed      float64 nan
    └── DataTree('c')
        └── DataTree('d')

Notice that this method will also create any intermediate empty nodes necessary to reach the end of the specified path (i.e. the node labelled “c” in this case).
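The creation of intermediate nodes can be sketched with nested plain dicts. The `build_tree` helper and the `{"data": ..., "children": ...}` layout are hypothetical and chosen only for illustration; this is not how DataTree.from_dict is implemented.

```python
def build_tree(path_to_data):
    """Build nested nodes from a {path: data} mapping (illustrative only)."""
    root = {"data": None, "children": {}}
    for path, data in path_to_data.items():
        node = root
        for name in [part for part in path.split("/") if part]:
            # setdefault creates an empty intermediate node only if it is missing.
            node = node["children"].setdefault(name, {"data": None, "children": {}})
        node["data"] = data
    return root


tree = build_tree(
    {"/": "root data", "/a": "a data", "/a/b": "b data", "/a/c/d": None}
)

c = tree["children"]["a"]["children"]["c"]
print(c["data"])            # None
print(list(c["children"]))  # ['d']
```

The node "c" never appears as a key in the input mapping, yet it exists in the result because it lies on the path to "d".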

Finally, if you have a file containing data on disk (such as a netCDF file or a Zarr Store), you can also create a datatree by opening the file using open_datatree.

DataTree Contents#

Like xarray.Dataset, DataTree implements the python mapping interface, but with values given by either xarray.DataArray objects or other DataTree objects.

In [12]: dt["a"]
Out[12]: 
DataTree('a', parent="None")
│   Dimensions:  (y: 3)
│   Coordinates:
│     * y        (y) int64 0 1 2
│   Data variables:
│       bar      int64 0
├── DataTree('b')
│       Dimensions:  ()
│       Data variables:
│           zed      float64 nan
└── DataTree('c')
    └── DataTree('d')

In [13]: dt["foo"]
Out[13]: 
<xarray.DataArray 'foo' ()>
array('orange', dtype='<U6')

Iterating over keys will iterate over both the names of variables and child nodes.
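A minimal sketch of a mapping whose keys cover both variables and child nodes is given here in plain Python. The `MappingNode` class is hypothetical, and yielding variables before children is an assumption of this sketch rather than a documented guarantee of DataTree.

```python
from collections.abc import Mapping


class MappingNode(Mapping):
    """Hypothetical node exposing variables and children through one mapping."""

    def __init__(self, variables=None, children=None):
        self._variables = dict(variables or {})
        self._children = dict(children or {})

    def __getitem__(self, key):
        if key in self._variables:
            return self._variables[key]
        return self._children[key]  # raises KeyError if absent from both

    def __iter__(self):
        # Assumed order for this sketch: variables first, then child names.
        yield from self._variables
        yield from self._children

    def __len__(self):
        return len(self._variables) + len(self._children)


node = MappingNode(
    variables={"foo": "orange"},
    children={"a": MappingNode(variables={"bar": 0})},
)

print(list(node))        # ['foo', 'a']
print(node["a"]["bar"])  # 0
print("foo" in node)     # True
```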

We can also access all the data in a single node through a dataset-like view:

In [14]: dt["a"].ds
Out[14]: 
<xarray.DatasetView>
Dimensions:  (y: 3)
Coordinates:
  * y        (y) int64 0 1 2
Data variables:
    bar      int64 0

This demonstrates the fact that the data in any one node is equivalent to the contents of a single xarray.Dataset object. The DataTree.ds property returns an immutable view, but we can instead extract the node’s data contents as a new (and mutable) xarray.Dataset object via .to_dataset():

In [15]: dt["a"].to_dataset()
Out[15]: 
<xarray.Dataset>
Dimensions:  (y: 3)
Coordinates:
  * y        (y) int64 0 1 2
Data variables:
    bar      int64 0
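The difference between a read-only view and an independent mutable copy can be illustrated by analogy with the standard library. This sketch uses `types.MappingProxyType` on a plain dict; it says nothing about how DatasetView itself is implemented, and the variable names are made up for the example.

```python
from types import MappingProxyType

node_data = {"bar": 0}

view = MappingProxyType(node_data)  # read-only view onto the same data
extracted = dict(node_data)         # independent, mutable copy

try:
    view["bar"] = 1
except TypeError as err:
    print(f"view is read-only: {err}")

extracted["bar"] = 1
print(node_data["bar"])  # 0 (editing the copy leaves the original untouched)
```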

As with Dataset, you can access the data and coordinate variables of a node separately via the data_vars and coords attributes:

In [16]: dt["a"].data_vars
Out[16]: 
Data variables:
    bar      int64 0

In [17]: dt["a"].coords
Out[17]: 
Coordinates:
  * y        (y) int64 0 1 2

Dictionary-like methods#

We can update a datatree in-place using Python’s standard dictionary syntax, similar to how we can for Dataset objects. For example, to create this example datatree from scratch, we could have written:

# TODO update this example using .coords and .data_vars as setters,

In [18]: dt = DataTree()

In [19]: dt["foo"] = "orange"

In [20]: dt["a"] = DataTree(data=xr.Dataset({"bar": 0}, coords={"y": ("y", [0, 1, 2])}))

In [21]: dt["a/b/zed"] = np.nan

In [22]: dt["a/c/d"] = DataTree()

In [23]: dt
Out[23]: 
DataTree('None', parent=None)
│   Dimensions:  ()
│   Data variables:
│       foo      <U6 'orange'
└── DataTree('a')
    │   Dimensions:  (y: 3)
    │   Coordinates:
    │     * y        (y) int64 0 1 2
    │   Data variables:
    │       bar      int64 0
    ├── DataTree('b')
    │       Dimensions:  ()
    │       Data variables:
    │           zed      float64 nan
    └── DataTree('c')
        └── DataTree('d')

To change the variables in a node of a DataTree, you can use all the standard dictionary methods, including values, items, __delitem__, get and update. Note that assigning a DataArray object to a DataTree variable using __setitem__ or update will automatically align the array(s) to the original node’s indexes.

If you copy a DataTree using the copy function or the copy() method, it will copy the entire tree, including all parents and children. As for Dataset, this copy is shallow by default, but you can copy all the data by calling dt.copy(deep=True).
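The general distinction between shallow and deep copies can be seen with Python’s standard `copy` module on a nested container. This is a sketch of the generic semantics only; exactly which parts of a DataTree are shared by a shallow copy is not shown here.

```python
import copy

# Hypothetical nested container standing in for a node holding array data.
tree = {"a": {"values": [0, 1, 2]}}

shallow = copy.copy(tree)    # new outer dict, nested objects are shared
deep = copy.deepcopy(tree)   # everything is duplicated recursively

tree["a"]["values"].append(3)

print(shallow["a"]["values"])  # [0, 1, 2, 3] (shares the nested list)
print(deep["a"]["values"])     # [0, 1, 2] (fully independent)
```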