Attention
This repository has been archived. Please use xarray.DataTree instead.
Data Structures
Note
This page builds on the information given in xarray’s main page on data structures, so it is suggested that you are familiar with those first.
DataTree
DataTree is xarray’s highest-level data structure, able to organise heterogeneous data which could not be stored inside a single Dataset object. This includes representing the recursive structure of multiple groups within a netCDF file or Zarr store.
Each DataTree object (or “node”) contains the same data that a single xarray.Dataset would (i.e. DataArray objects stored under hashable keys), and so has the same key properties:
dims: a dictionary mapping dimension names to lengths, for the variables in this node,
data_vars: a dict-like container of DataArrays corresponding to variables in this node,
coords: another dict-like container of DataArrays, corresponding to coordinate variables in this node,
attrs: a dict to hold arbitrary metadata relevant to data in this node.
A single DataTree object acts much like a single Dataset object, and has a similar set of dict-like methods defined upon it. However, DataTree objects can also contain other DataTree objects, so they can be thought of as nested dict-like containers of both xarray.DataArray and DataTree objects.
A single DataTree object is known as a “node”, and its position relative to other nodes is defined by two more key properties:
children: an ordered dictionary mapping from names to other DataTree objects, known as its “child nodes”.
parent: the single DataTree object whose children this node is a member of, known as its “parent node”.
Each child automatically knows about its parent node, and a node without a parent is known as a “root” node (represented by the parent attribute pointing to None).
Nodes can have multiple children, but as each child node has at most one parent, there can only ever be one root node in a given tree.
The overall structure is technically a connected acyclic undirected rooted graph, otherwise known as a “Tree”.
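The parent/children relationship described above can be illustrated without xarray at all. The following is a minimal pure-Python sketch, not datatree's implementation; the `Node` class and its `root` property are hypothetical names introduced here for illustration only.

```python
from __future__ import annotations


class Node:
    """Hypothetical stand-in for a tree node; this is not the DataTree class."""

    def __init__(self, name: str, parent: Node | None = None) -> None:
        self.name = name
        self.parent = parent
        self.children: dict[str, Node] = {}  # dicts preserve insertion order
        if parent is not None:
            # the child registers itself with its parent
            parent.children[name] = self

    @property
    def root(self) -> Node:
        """Follow .parent links upwards until a node without a parent is reached."""
        node = self
        while node.parent is not None:
            node = node.parent
        return node


root = Node("root")
a = Node("a", parent=root)
b = Node("b", parent=a)

print(root.parent)          # None
print(b.root.name)          # root
print(list(root.children))  # ['a']
```

Because every node has at most one parent, walking upwards from any node always ends at the same single root.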
Note
Technically a DataTree with more than one child node forms an “Ordered Tree”, because the children are stored in an Ordered Dictionary. However, this distinction only really matters for a few edge cases involving operations on multiple trees simultaneously, and can safely be ignored by most users.
DataTree objects can also optionally have a name as well as attrs, just like a DataArray. Again these are not normally used unless explicitly accessed by the user.
Creating a DataTree
One way to create a DataTree from scratch is to create each node individually, specifying the nodes’ relationship to one another as you create each one.
The DataTree constructor takes:
data: the data that will be stored in this node, represented by a single xarray.Dataset or a named xarray.DataArray.
parent: the parent node (if there is one), given as a DataTree object.
children: the various child nodes (if there are any), given as a mapping from string keys to DataTree objects.
name: a string to use as the name of this node.
Let’s make a single datatree node with some example data in it:
In [1]: import numpy as np
   ...: import xarray as xr
   ...: from datatree import DataTree
In [2]: ds1 = xr.Dataset({"foo": "orange"})
In [3]: dt = DataTree(name="root", data=ds1) # create root node
In [4]: dt
Out[4]:
DataTree('root', parent=None)
    Dimensions:  ()
    Data variables:
        foo      <U6 24B 'orange'
At this point our node is also the root node, as every tree has a root node.
We can add a second node to this tree either by referring to the first node in the constructor of the second:
In [5]: ds2 = xr.Dataset({"bar": 0}, coords={"y": ("y", [0, 1, 2])})
# add a child by referring to the parent node
In [6]: node2 = DataTree(name="a", parent=dt, data=ds2)
or by dynamically updating the attributes of one node to refer to another:
# add a second child by first creating a new node ...
In [7]: ds3 = xr.Dataset({"zed": np.nan})
In [8]: node3 = DataTree(name="b", data=ds3)
# ... then updating its .parent property
In [9]: node3.parent = dt
Our tree now has three nodes within it:
In [10]: dt
Out[10]:
DataTree('root', parent=None)
│   Dimensions:  ()
│   Data variables:
│       foo      <U6 24B 'orange'
├── DataTree('a')
│       Dimensions:  (y: 3)
│       Coordinates:
│         * y        (y) int64 24B 0 1 2
│       Data variables:
│           bar      int64 8B 0
└── DataTree('b')
        Dimensions:  ()
        Data variables:
            zed      float64 8B nan
Consistency checks are enforced as the tree is constructed. For instance, if we try to create a cycle, the assignment raises an error:
In [11]: dt.parent = node3  # raises an error, because node3 is already a child of dt
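To make the cycle check concrete, here is a minimal pure-Python sketch of how a parent setter could reject an assignment that would make a node its own ancestor. It is an illustration under assumptions, not datatree's actual code: the `Node` class and the `TreeCycleError` exception are hypothetical names.

```python
from __future__ import annotations


class TreeCycleError(Exception):
    """Hypothetical error type used only in this sketch."""


class Node:
    """Hypothetical node with a validated .parent property."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._parent: Node | None = None

    @property
    def parent(self) -> Node | None:
        return self._parent

    @parent.setter
    def parent(self, new_parent: Node | None) -> None:
        # Walk upwards from the proposed parent; meeting `self` on the way
        # means `self` would become its own ancestor, i.e. a cycle.
        ancestor = new_parent
        while ancestor is not None:
            if ancestor is self:
                raise TreeCycleError(
                    f"cannot make {new_parent.name!r} the parent of {self.name!r}"
                )
            ancestor = ancestor.parent
        self._parent = new_parent


root = Node("root")
child = Node("b")
child.parent = root  # fine: root is not a descendant of child

try:
    root.parent = child  # would create the cycle root -> b -> root
except TreeCycleError as err:
    print("rejected:", err)
```

Validating inside the setter means a rejected assignment leaves the existing tree untouched.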
Alternatively you can also create a DataTree object from:
An xarray.Dataset using Dataset.to_node() (not yet implemented),
A dictionary mapping directory-like paths to either DataTree nodes or data, using DataTree.from_dict(),
A netCDF or Zarr file on disk with open_datatree(). See reading and writing files.
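The idea of a dictionary keyed by directory-like paths can be sketched in plain Python. The helper below is hypothetical (`nest_by_path` is not part of datatree or xarray) and uses plain dicts and strings in place of nodes and datasets; it only illustrates how each path component selects one level of nesting.

```python
from __future__ import annotations


def nest_by_path(flat: dict[str, object]) -> dict:
    """Turn {"/a/b": value} into nested dicts, one level per path component."""
    tree: dict = {"data": None, "children": {}}
    for path, data in flat.items():
        node = tree
        # "/" alone has no components and therefore addresses the root
        for part in [p for p in path.split("/") if p]:
            # create intermediate levels on demand
            node = node["children"].setdefault(part, {"data": None, "children": {}})
        node["data"] = data
    return tree


tree = nest_by_path({"/": "root data", "/a": "data for a", "/a/b": "data for b"})
print(tree["children"]["a"]["children"]["b"]["data"])  # data for b
```

In this sketch, intermediate levels that are never given data of their own simply keep `None`.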
DataTree Contents
Like xarray.Dataset, DataTree implements the Python mapping interface, but with values given by either xarray.DataArray objects or other DataTree objects.
In [12]: dt["a"]
Out[12]:
DataTree('a', parent="root")
    Dimensions:  (y: 3)
    Coordinates:
      * y        (y) int64 24B 0 1 2
    Data variables:
        bar      int64 8B 0
In [13]: dt["foo"]
Out[13]:
<xarray.DataArray 'foo' ()> Size: 24B
array('orange', dtype='<U6')
Iterating over keys will iterate over both the names of variables and child nodes.
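As a rough illustration of a mapping whose keys cover both variables and child nodes, the sketch below builds a small read-only mapping with `collections.abc.Mapping`. `NodeSketch` is a hypothetical class, not the DataTree implementation, and the iteration order shown (variables first, then children) is a choice made for this sketch.

```python
from collections.abc import Mapping


class NodeSketch(Mapping):
    """Hypothetical read-only mapping over a node's variables and children."""

    def __init__(self, variables, children):
        # variable names and child names are assumed to be distinct
        self._variables = dict(variables)
        self._children = dict(children)

    def __getitem__(self, key):
        if key in self._variables:
            return self._variables[key]
        return self._children[key]  # raises KeyError for unknown keys

    def __iter__(self):
        # variables first, then child names (order chosen for this sketch only)
        yield from self._variables
        yield from self._children

    def __len__(self):
        return len(self._variables) + len(self._children)


leaf = NodeSketch({"bar": 0}, {})
root = NodeSketch({"foo": "orange"}, {"a": leaf})

print(list(root))        # ['foo', 'a']
print("a" in root)       # True
print(root["a"]["bar"])  # 0
```

Implementing only `__getitem__`, `__iter__` and `__len__` is enough for `Mapping` to supply `keys`, `values`, `items`, `get` and `in`.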
We can also access all the data in a single node through a dataset-like view:
In [14]: dt["a"].ds
Out[14]:
<xarray.DatasetView> Size: 32B
Dimensions:  (y: 3)
Coordinates:
  * y        (y) int64 24B 0 1 2
Data variables:
    bar      int64 8B 0
This demonstrates the fact that the data in any one node is equivalent to the contents of a single xarray.Dataset object.
The DataTree.ds property returns an immutable view, but we can instead extract the node’s data contents as a new (and mutable) xarray.Dataset object via DataTree.to_dataset():
In [15]: dt["a"].to_dataset()
Out[15]:
<xarray.Dataset> Size: 32B
Dimensions:  (y: 3)
Coordinates:
  * y        (y) int64 24B 0 1 2
Data variables:
    bar      int64 8B 0
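The difference between an immutable view and a new mutable object has a close analogue in the Python standard library, sketched below. This is an analogy only: `MappingProxyType` is not what DataTree.ds returns, and `node_data` is a hypothetical stand-in for a node's contents.

```python
from types import MappingProxyType

node_data = {"bar": 0}

view = MappingProxyType(node_data)  # read-only window onto node_data
snapshot = dict(node_data)          # independent, mutable copy

try:
    view["bar"] = 1                 # item assignment is not supported on the view
except TypeError as err:
    print("view is read-only:", err)

snapshot["bar"] = 1                 # allowed, and does not touch node_data
print(node_data["bar"])             # 0
```

The same trade-off applies here: a view protects the node from accidental edits, while an extracted copy can be modified freely without affecting the tree.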
As with Dataset, you can access the data and coordinate variables of a node separately via the data_vars and coords attributes:
In [16]: dt["a"].data_vars
Out[16]:
Data variables:
    bar      int64 8B 0
In [17]: dt["a"].coords
Out[17]:
Coordinates:
  * y        (y) int64 24B 0 1 2
Dictionary-like methods
We can update a datatree in-place using Python’s standard dictionary syntax, similar to how we can for Dataset objects. For example, to create this example datatree from scratch, we could have written:
# TODO update this example using .coords and .data_vars as setters
In [18]: dt = DataTree(name="root")
In [19]: dt["foo"] = "orange"
In [20]: dt["a"] = DataTree(data=xr.Dataset({"bar": 0}, coords={"y": ("y", [0, 1, 2])}))
In [21]: dt["a/b/zed"] = np.nan
In [22]: dt
Out[22]:
DataTree('root', parent=None)
│   Dimensions:  ()
│   Data variables:
│       foo      <U6 24B 'orange'
└── DataTree('a')
    │   Dimensions:  (y: 3)
    │   Coordinates:
    │     * y        (y) int64 24B 0 1 2
    │   Data variables:
    │       bar      int64 8B 0
    └── DataTree('b')
            Dimensions:  ()
            Data variables:
                zed      float64 8B nan
To change the variables in a node of a DataTree, you can use all the standard dictionary methods, including values, items, __delitem__, get and DataTree.update().
Note that assigning a DataArray object to a DataTree variable using __setitem__ or update will automatically align the array(s) to the original node’s indexes.
If you copy a DataTree using the copy() function or the DataTree.copy() method, it will copy the subtree, meaning that node and the children below it, but no parents above it.
As with Dataset, this copy is shallow by default, but you can copy all the underlying data arrays by calling dt.copy(deep=True).
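The practical difference between a shallow and a deep copy can be shown with the standard library's `copy` module. The nested dict below is a hypothetical stand-in for a node holding an array and one child; it is an analogy for shared versus duplicated underlying data, not a reproduction of how DataTree.copy() is implemented.

```python
import copy

# A nested stand-in for "a node holding an array, with one child node".
subtree = {"data": [1, 2, 3], "children": {"a": {"data": [4, 5], "children": {}}}}

shallow = copy.copy(subtree)    # new outer container, same inner objects
deep = copy.deepcopy(subtree)   # everything duplicated recursively

subtree["data"].append(99)      # mutate the original's underlying data

print(shallow["data"])  # [1, 2, 3, 99]  (same list object as the original)
print(deep["data"])     # [1, 2, 3]      (independent copy)
```

A shallow copy is cheap but shares its underlying data with the original, so in-place changes to that data are visible through both; a deep copy costs more memory in exchange for full independence.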