Working With Hierarchical Data#

Why Hierarchical Data?#

Many real-world datasets are composed of multiple differing components, and it can often be useful to think of these in terms of a hierarchy of related groups of data. Examples of data which one might want to organise in a grouped or hierarchical manner include:

  • Simulation data at multiple resolutions,

  • Observational data about the same system but from multiple different types of sensors,

  • Mixed experimental and theoretical data,

  • A systematic study recording the same experiment but with different parameters,

  • Heterogeneous data, such as demographic and meteorological data,

or even any combination of the above.

Often datasets like this cannot easily fit into a single xarray.Dataset object, or are more usefully thought of as groups of related xarray.Dataset objects. For this purpose we provide the DataTree class.

This page explains in detail how to understand and use the different features of the DataTree class for your own hierarchical data needs.

Node Relationships#

Creating a Family Tree#

The three main ways of creating a DataTree object are described briefly in Creating a DataTree. Here we go into more detail about how to create a tree node-by-node, using a famous family tree from the Simpsons cartoon as an example.

Let’s start by defining nodes representing the two siblings, Bart and Lisa Simpson:

In [1]: bart = DataTree(name="Bart")

In [2]: lisa = DataTree(name="Lisa")

Each of these node objects knows its own name, but they currently have no relationship to one another. We can connect them by creating another node representing a common parent, Homer Simpson:

In [3]: homer = DataTree(name="Homer", children={"Bart": bart, "Lisa": lisa})

Here we set the children of Homer in the node’s constructor. We now have a small family tree

In [4]: homer
Out[4]: 
DataTree('Homer', parent=None)
├── DataTree('Bart')
└── DataTree('Lisa')

where we can see how these individual Simpson family members are related to one another. The nodes representing Bart and Lisa are now connected - we can confirm their sibling rivalry by examining the siblings property:

In [5]: list(bart.siblings)
Out[5]: ['Lisa']

But oops, we forgot Homer’s third child, Maggie! Let’s add her by updating Homer’s children property to include her:

In [6]: maggie = DataTree(name="Maggie")

In [7]: homer.children = {"Bart": bart, "Lisa": lisa, "Maggie": maggie}

In [8]: homer
Out[8]: 
DataTree('Homer', parent=None)
├── DataTree('Bart')
├── DataTree('Lisa')
└── DataTree('Maggie')

Let’s check that Maggie knows who her Dad is:

In [9]: maggie.parent.name
Out[9]: 'Homer'

That’s good - updating the properties of our nodes does not break the internal consistency of our tree, as changes of parentage are automatically reflected on both nodes.
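To see what "automatically reflected on both nodes" can mean in practice, here is a minimal sketch in plain Python. The `Node` class and its attributes are hypothetical and written only for illustration; this is not how DataTree is implemented, just one way a parent setter can keep both sides of the relationship in sync.

```python
class Node:
    """Hypothetical minimal tree node, for illustration only."""

    def __init__(self, name):
        self.name = name
        self._parent = None
        self._children = {}

    @property
    def parent(self):
        return self._parent

    @parent.setter
    def parent(self, new_parent):
        # Detach from the old parent first so no stale link remains.
        if self._parent is not None:
            del self._parent._children[self.name]
        self._parent = new_parent
        # Register with the new parent so both sides agree.
        if new_parent is not None:
            new_parent._children[self.name] = self

    @property
    def children(self):
        # Hand out a copy so the links can only change through the setter.
        return dict(self._children)


homer = Node("Homer")
maggie = Node("Maggie")

maggie.parent = homer
print(list(homer.children))  # ['Maggie']

maggie.parent = None
print(list(homer.children))  # []
```

Because every change goes through a single setter, there is no way to update one side of the link without the other.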

These children obviously have another parent, Marge Simpson, but DataTree nodes can only have a maximum of one parent. Genealogical family trees are not even technically trees in the mathematical sense - the fact that distant relatives can mate makes it a directed acyclic graph. Trees of DataTree objects cannot represent this.

Homer is currently listed as having no parent (the so-called “root node” of this tree), but we can update his parent property:

In [10]: abe = DataTree(name="Abe")

In [11]: homer.parent = abe

Abe is now the “root” of this tree, which we can see by examining the root property of any node in the tree:

In [12]: maggie.root.name
Out[12]: 'Abe'

We can see the whole tree by printing Abe’s node or just part of the tree by printing Homer’s node:

In [13]: abe
Out[13]: 
DataTree('Abe', parent=None)
└── DataTree('Homer')
    ├── DataTree('Bart')
    ├── DataTree('Lisa')
    └── DataTree('Maggie')

In [14]: homer
Out[14]: 
DataTree('Homer', parent="Abe")
├── DataTree('Bart')
├── DataTree('Lisa')
└── DataTree('Maggie')

We can see that Homer is aware of his parentage, and we say that Homer and his children form a “subtree” of the larger Simpson family tree.

In episode 28, Abe Simpson reveals that he had another son, Herbert “Herb” Simpson. We can add Herbert to the family tree without displacing Homer by assign()-ing another child to Abe:

In [15]: herbert = DataTree(name="Herb")

In [16]: abe.assign({"Herbert": herbert})
Out[16]: 
DataTree('Abe', parent=None)
├── DataTree('Homer')
│   ├── DataTree('Bart')
│   ├── DataTree('Lisa')
│   └── DataTree('Maggie')
└── DataTree('Herbert')

Note

This example shows a minor subtlety - the returned tree has Homer’s brother listed as "Herbert", but the original node was named "Herb". Not only are names overridden when stored as keys like this, but the new node is a copy, so that the original node that was referenced is unchanged (i.e. herbert.name == "Herb" still). In other words, nodes are copied into trees, not inserted into them. This is intentional, and mirrors the behaviour when storing named xarray.DataArray objects inside datasets.

Certain manipulations of our tree are forbidden, if they would create an inconsistent result. In episode 51 of the show Futurama, Philip J. Fry travels back in time and accidentally becomes his own Grandfather. If we try similar time-travelling hijinks with Homer, we get an InvalidTreeError raised:

In [17]: abe.parent = homer
InvalidTreeError: Cannot set parent, as intended parent is already a descendant of this node.
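A check of this kind can be sketched by walking up from the intended parent and looking for the node being re-parented. The `Node` class, the `set_parent` helper and the use of `ValueError` below are assumptions made for this illustration; they are not DataTree's own code or error type.

```python
class Node:
    """Hypothetical minimal node; only what the loop check needs."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent

    def ancestors(self):
        """Yield this node's parent, grandparent, ... up to the root."""
        node = self.parent
        while node is not None:
            yield node
            node = node.parent


def set_parent(child, new_parent):
    """Attach child to new_parent unless that would create a loop."""
    # A loop appears if new_parent is the child itself, or if the child
    # is found among new_parent's ancestors (new_parent sits below child).
    if new_parent is child or any(a is child for a in new_parent.ancestors()):
        raise ValueError(
            f"{new_parent.name!r} is already a descendant of {child.name!r}"
        )
    child.parent = new_parent


abe = Node("Abe")
homer = Node("Homer", parent=abe)
bart = Node("Bart", parent=homer)

message = None
try:
    set_parent(abe, bart)  # Abe cannot become his own great-grandson
except ValueError as err:
    message = str(err)

print(message)  # 'Bart' is already a descendant of 'Abe'
```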

Ancestry in an Evolutionary Tree#

Let’s use a different example of a tree to discuss more complex relationships between nodes - the phylogenetic tree, or tree of life.

In [18]: vertebrates = DataTree.from_dict(
   ....:     name="Vertebrae",
   ....:     d={
   ....:         "/Sharks": None,
   ....:         "/Bony Skeleton/Ray-finned Fish": None,
   ....:         "/Bony Skeleton/Four Limbs/Amphibians": None,
   ....:         "/Bony Skeleton/Four Limbs/Amniotic Egg/Hair/Primates": None,
   ....:         "/Bony Skeleton/Four Limbs/Amniotic Egg/Hair/Rodents & Rabbits": None,
   ....:         "/Bony Skeleton/Four Limbs/Amniotic Egg/Two Fenestrae/Dinosaurs": None,
   ....:         "/Bony Skeleton/Four Limbs/Amniotic Egg/Two Fenestrae/Birds": None,
   ....:     },
   ....: )
   ....: 

In [19]: primates = vertebrates["/Bony Skeleton/Four Limbs/Amniotic Egg/Hair/Primates"]

In [20]: dinosaurs = vertebrates[
   ....:     "/Bony Skeleton/Four Limbs/Amniotic Egg/Two Fenestrae/Dinosaurs"
   ....: ]
   ....: 

We have used the from_dict() constructor method as an alternate way to quickly create a whole tree, and Filesystem-like Paths (to be explained shortly) to select two nodes of interest.

In [21]: vertebrates
Out[21]: 
DataTree('Vertebrae', parent=None)
├── DataTree('Sharks')
└── DataTree('Bony Skeleton')
    ├── DataTree('Ray-finned Fish')
    └── DataTree('Four Limbs')
        ├── DataTree('Amphibians')
        └── DataTree('Amniotic Egg')
            ├── DataTree('Hair')
            │   ├── DataTree('Primates')
            │   └── DataTree('Rodents & Rabbits')
            └── DataTree('Two Fenestrae')
                ├── DataTree('Dinosaurs')
                └── DataTree('Birds')

This tree shows various families of species, grouped by their common features (making it technically a “Cladogram”, rather than an evolutionary tree).

Here both the species and the features used to group them are represented by DataTree node objects - there is no distinction in types of node. We can however get a list of only the nodes we used to represent species by using the fact that all those nodes have no children - they are “leaf nodes”. We can check if a node is a leaf with the is_leaf property, and get a list of all leaves with the leaves property:

In [22]: primates.is_leaf
Out[22]: True

In [23]: [node.name for node in vertebrates.leaves]
Out[23]: 
['Sharks',
 'Ray-finned Fish',
 'Amphibians',
 'Primates',
 'Rodents & Rabbits',
 'Dinosaurs',
 'Birds']

Pretending that this is a true evolutionary tree for a moment, we can find the features of the evolutionary ancestors (so-called “ancestor” nodes), the distinguishing feature of the common ancestor of all vertebrate life (the root node), and even the distinguishing feature of the common ancestor of any two species (the common ancestor of two nodes):

In [24]: [node.name for node in primates.ancestors]
Out[24]: 
['Vertebrae',
 'Bony Skeleton',
 'Four Limbs',
 'Amniotic Egg',
 'Hair',
 'Primates']

In [25]: primates.root.name
Out[25]: 'Vertebrae'

In [26]: primates.find_common_ancestor(dinosaurs).name
Out[26]: 'Amniotic Egg'

We can only find a common ancestor between two nodes that lie in the same tree. If we try to find the common evolutionary ancestor between primates and an Alien species that has no relationship to Earth’s evolutionary tree, an error will be raised.

In [27]: alien = DataTree(name="Xenomorph")

In [28]: primates.find_common_ancestor(alien)
NotFoundInTreeError: Cannot find common ancestor because nodes do not lie within the same tree
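One way to picture the common ancestor of two nodes in the same tree is as the deepest prefix shared by their absolute paths. The snippet below illustrates that idea with the standard library's `posixpath` module on plain path strings; it is a sketch of the concept and makes no claim about how find_common_ancestor() is implemented.

```python
import posixpath

primates = "/Bony Skeleton/Four Limbs/Amniotic Egg/Hair/Primates"
dinosaurs = "/Bony Skeleton/Four Limbs/Amniotic Egg/Two Fenestrae/Dinosaurs"

# The deepest shared path prefix is the path of the common ancestor.
common = posixpath.commonpath([primates, dinosaurs])

print(common)                      # /Bony Skeleton/Four Limbs/Amniotic Egg
print(posixpath.basename(common))  # Amniotic Egg
```

Two nodes from unrelated trees have no meaningful shared prefix, which is why the library raises an error in that case instead.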

Manipulating Trees#

Subsetting Tree Nodes#

We can subset our tree to select only nodes of interest in various ways.

As on a real filesystem, matching nodes by common patterns in their paths is often useful. We can use DataTree.match() for this:

In [48]: dt = DataTree.from_dict(
   ....:     {
   ....:         "/a/A": None,
   ....:         "/a/B": None,
   ....:         "/b/A": None,
   ....:         "/b/B": None,
   ....:     }
   ....: )
   ....: 

In [49]: result = dt.match("*/B")

In [50]: result
Out[50]: 
DataTree('None', parent=None)
├── DataTree('a')
│   └── DataTree('B')
└── DataTree('b')
    └── DataTree('B')
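The pattern syntax is the familiar glob style. As a rough sketch, assuming semantics similar to the standard library's `PurePosixPath.match`, the same selection can be reproduced on a plain list of node paths:

```python
from pathlib import PurePosixPath

paths = ["/a", "/a/A", "/a/B", "/b", "/b/A", "/b/B"]

# A relative pattern is compared against the end of each path, so "*/B"
# keeps the paths whose last two components match "*" and "B".
matched = [p for p in paths if PurePosixPath(p).match("*/B")]

print(matched)  # ['/a/B', '/b/B']
```

In the DataTree result above, the intermediate nodes a and b still appear because they are needed to hold the matched nodes in place.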

We can also subset trees by the contents of the nodes. DataTree.filter() retains only the nodes of a tree that meet a certain condition. For example, we could recreate the Simpsons family tree with the ages of each individual, then filter for only the adults. First, let’s recreate the tree but with an age data variable in every node:

In [51]: simpsons = DataTree.from_dict(
   ....:     d={
   ....:         "/": xr.Dataset({"age": 83}),
   ....:         "/Herbert": xr.Dataset({"age": 40}),
   ....:         "/Homer": xr.Dataset({"age": 39}),
   ....:         "/Homer/Bart": xr.Dataset({"age": 10}),
   ....:         "/Homer/Lisa": xr.Dataset({"age": 8}),
   ....:         "/Homer/Maggie": xr.Dataset({"age": 1}),
   ....:     },
   ....:     name="Abe",
   ....: )
   ....: 

In [52]: simpsons
Out[52]: 
DataTree('Abe', parent=None)
│   Dimensions:  ()
│   Data variables:
│       age      int64 83
├── DataTree('Herbert')
│       Dimensions:  ()
│       Data variables:
│           age      int64 40
└── DataTree('Homer')
    │   Dimensions:  ()
    │   Data variables:
    │       age      int64 39
    ├── DataTree('Bart')
    │       Dimensions:  ()
    │       Data variables:
    │           age      int64 10
    ├── DataTree('Lisa')
    │       Dimensions:  ()
    │       Data variables:
    │           age      int64 8
    └── DataTree('Maggie')
            Dimensions:  ()
            Data variables:
                age      int64 1

Now let’s filter out the minors:

In [53]: simpsons.filter(lambda node: node["age"] > 18)
Out[53]: 
DataTree('Abe', parent=None)
│   Dimensions:  ()
│   Data variables:
│       age      int64 83
├── DataTree('Herbert')
│       Dimensions:  ()
│       Data variables:
│           age      int64 40
└── DataTree('Homer')
        Dimensions:  ()
        Data variables:
            age      int64 39

The result is a new tree, containing only the nodes matching the condition.

(Yes, under the hood filter() is just syntactic sugar for the pattern we showed you in Iterating over trees!)
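The idea behind filtering can be sketched with a plain dict that maps node paths to ages. The `filter_nodes` helper and the dict layout are hypothetical stand-ins chosen for this illustration, not DataTree objects:

```python
ages = {
    "/": 83,
    "/Herbert": 40,
    "/Homer": 39,
    "/Homer/Bart": 10,
    "/Homer/Lisa": 8,
    "/Homer/Maggie": 1,
}


def filter_nodes(tree, predicate):
    """Return a new mapping with only the nodes that satisfy predicate."""
    return {path: data for path, data in tree.items() if predicate(data)}


adults = filter_nodes(ages, lambda age: age > 18)

print(sorted(adults))  # ['/', '/Herbert', '/Homer']
```

As with filter(), the input is left untouched and a new collection is returned.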

Tree Contents#

Hollow Trees#

A concept that can sometimes be useful is that of a “Hollow Tree”, which means a tree with data stored only at the leaf nodes. The concept matters because certain tree manipulation operations only make sense for hollow trees.

You can check if a tree is a hollow tree by using the is_hollow property. We can see that the Simpsons family tree is not hollow because the data variable "age" is present at some nodes which have children (i.e. Abe and Homer).

In [54]: simpsons.is_hollow
Out[54]: False
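The definition of hollowness can be spelled out in a few lines. The sketch below uses a plain dict of node paths, with `None` standing for a node without data; both the representation and the helper names are assumptions made for illustration.

```python
def has_children(path, all_paths):
    """True if any other path lies strictly below `path`."""
    prefix = path.rstrip("/") + "/"
    return any(p != path and p.startswith(prefix) for p in all_paths)


def is_hollow(tree):
    """A tree is hollow if no node that has children also holds data."""
    return not any(
        data is not None and has_children(path, tree)
        for path, data in tree.items()
    )


simpsons = {"/": 83, "/Herbert": 40, "/Homer": 39, "/Homer/Bart": 10}
leaves_only = {"/": None, "/Herbert": 40, "/Homer": None, "/Homer/Bart": 10}

print(is_hollow(simpsons))     # False
print(is_hollow(leaves_only))  # True
```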

Computation#

DataTree objects are also useful for performing computations, not just for organizing data.

Operations and Methods on Trees#

To show how applying operations across a whole tree at once can be useful, let’s first create an example scientific dataset.

In [55]: def time_stamps(n_samples, T):
   ....:     """Create an array of evenly-spaced time stamps"""
   ....:     return xr.DataArray(
   ....:         data=np.linspace(0, 2 * np.pi * T, n_samples), dims=["time"]
   ....:     )
   ....: 

In [56]: def signal_generator(t, f, A, phase):
   ....:     """Generate an example electrical-like waveform"""
   ....:     return A * np.sin(f * t.data + phase)
   ....: 

In [57]: time_stamps1 = time_stamps(n_samples=15, T=1.5)

In [58]: time_stamps2 = time_stamps(n_samples=10, T=1.0)

In [59]: voltages = DataTree.from_dict(
   ....:     {
   ....:         "/oscilloscope1": xr.Dataset(
   ....:             {
   ....:                 "potential": (
   ....:                     "time",
   ....:                     signal_generator(time_stamps1, f=2, A=1.2, phase=0.5),
   ....:                 ),
   ....:                 "current": (
   ....:                     "time",
   ....:                     signal_generator(time_stamps1, f=2, A=1.2, phase=1),
   ....:                 ),
   ....:             },
   ....:             coords={"time": time_stamps1},
   ....:         ),
   ....:         "/oscilloscope2": xr.Dataset(
   ....:             {
   ....:                 "potential": (
   ....:                     "time",
   ....:                     signal_generator(time_stamps2, f=1.6, A=1.6, phase=0.2),
   ....:                 ),
   ....:                 "current": (
   ....:                     "time",
   ....:                     signal_generator(time_stamps2, f=1.6, A=1.6, phase=0.7),
   ....:                 ),
   ....:             },
   ....:             coords={"time": time_stamps2},
   ....:         ),
   ....:     }
   ....: )
   ....: 

In [60]: voltages
Out[60]: 
DataTree('None', parent=None)
├── DataTree('oscilloscope1')
│       Dimensions:    (time: 15)
│       Coordinates:
│         * time       (time) float64 0.0 0.6732 1.346 2.02 ... 7.405 8.078 8.752 9.425
│       Data variables:
│           potential  (time) float64 0.5753 1.155 -0.06141 ... -0.9753 -0.8987 0.5753
│           current    (time) float64 1.01 0.8568 -0.6285 -1.136 ... -1.191 -0.4074 1.01
└── DataTree('oscilloscope2')
        Dimensions:    (time: 10)
        Coordinates:
          * time       (time) float64 0.0 0.6981 1.396 2.094 ... 4.189 4.887 5.585 6.283
        Data variables:
            potential  (time) float64 0.3179 1.549 1.04 -0.637 ... 1.578 0.4555 -1.179
            current    (time) float64 1.031 1.552 0.3297 -1.263 ... 1.259 -0.3356 -1.553

Most xarray computation methods also exist as methods on DataTree objects, so you can, for example, take the mean value of these two timeseries at once:

In [61]: voltages.mean(dim="time")
Out[61]: 
DataTree('None', parent=None)
├── DataTree('oscilloscope1')
│       Dimensions:    ()
│       Data variables:
│           potential  float64 0.03835
│           current    float64 0.06732
└── DataTree('oscilloscope2')
        Dimensions:    ()
        Data variables:
            potential  float64 0.169
            current    float64 0.1025

This works by mapping the standard xarray.Dataset.mean() method over the dataset stored in each node of the tree one-by-one.

The arguments passed to the method are used for every node, so the values of the arguments you pass might be valid for one node and invalid for another:

In [62]: voltages.isel(time=12)
IndexError: index 12 is out of bounds for axis 0 with size 10
Raised whilst mapping function over node with path /oscilloscope2

Notice that the error raised helpfully indicates which node of the tree the operation failed on.
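The node-by-node mapping, including the annotated error, can be sketched as follows. Here the "tree" is a plain dict of node paths to lists of samples, and `map_over_nodes` is a hypothetical helper written for this illustration rather than part of the library.

```python
def map_over_nodes(func, tree):
    """Apply func to every node's data, reporting which node failed."""
    result = {}
    for path, data in tree.items():
        try:
            result[path] = func(data)
        except Exception as err:
            # Re-raise the same exception type with the node path appended.
            # This works for exceptions that accept a single message argument.
            raise type(err)(
                f"{err}\nRaised whilst mapping function over node with path {path}"
            ) from err
    return result


samples = {
    "/oscilloscope1": list(range(15)),
    "/oscilloscope2": list(range(10)),
}

message = None
try:
    # Index 12 exists in the first node (15 samples) but not in the second.
    map_over_nodes(lambda data: data[12], samples)
except IndexError as err:
    message = str(err)

print(message)
```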

Arithmetic Methods on Trees#

Arithmetic methods are also implemented, so you can e.g. add a scalar to every dataset in the tree at once. For example, we can advance the timeline of the Simpsons by a decade simply by adding 10 to the tree:

In [63]: simpsons + 10
Out[63]: 
DataTree('Abe', parent=None)
│   Dimensions:  ()
│   Data variables:
│       age      int64 93
├── DataTree('Herbert')
│       Dimensions:  ()
│       Data variables:
│           age      int64 50
└── DataTree('Homer')
    │   Dimensions:  ()
    │   Data variables:
    │       age      int64 49
    ├── DataTree('Bart')
    │       Dimensions:  ()
    │       Data variables:
    │           age      int64 20
    ├── DataTree('Lisa')
    │       Dimensions:  ()
    │       Data variables:
    │           age      int64 18
    └── DataTree('Maggie')
            Dimensions:  ()
            Data variables:
                age      int64 11

See that the same change (fast-forwarding by adding 10 years to the age of each character) has been applied to every node.

Mapping Custom Functions Over Trees#

You can map custom computation over each node in a tree using DataTree.map_over_subtree(). You can map any function, so long as it takes xarray.Dataset objects as one (or more) of the input arguments, and returns one (or more) xarray datasets.

Note

Functions passed to map_over_subtree() cannot alter nodes in-place. Instead they must return new xarray.Dataset objects.

For example, we can define a function to calculate the Root Mean Square of a timeseries:

In [64]: def rms(signal):
   ....:     return np.sqrt(np.mean(signal**2))
   ....: 

Then calculate the RMS value of these signals:

In [65]: voltages.map_over_subtree(rms)
Out[65]: 
DataTree('None', parent=None)
├── DataTree('oscilloscope1')
│       Dimensions:    ()
│       Data variables:
│           potential  float64 0.8331
│           current    float64 0.8602
└── DataTree('oscilloscope2')
        Dimensions:    ()
        Data variables:
            potential  float64 1.099
            current    float64 1.158

We can also use the map_over_subtree() decorator to promote a function which accepts datasets into one which accepts datatrees.
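To illustrate what "promoting" a function means, here is a small stand-in decorator written in plain Python. It is a sketch of the general pattern only: the name `map_over_nodes`, the dict-of-lists "tree" and the pure-Python `rms` are assumptions for this example and do not reproduce the library's decorator.

```python
import functools
import math


def map_over_nodes(func):
    """Promote a function of one dataset to a function of a whole tree.

    Hypothetical stand-in: the "tree" is a plain dict mapping node paths
    to lists of samples.
    """

    @functools.wraps(func)
    def wrapper(tree, *args, **kwargs):
        # Call the wrapped function once per node and collect the results.
        return {path: func(data, *args, **kwargs) for path, data in tree.items()}

    return wrapper


@map_over_nodes
def rms(signal):
    """Root mean square of a sequence of numbers."""
    return math.sqrt(sum(x**2 for x in signal) / len(signal))


signals = {"/oscilloscope1": [3.0, 4.0], "/oscilloscope2": [1.0, -1.0]}
result = rms(signals)

print(result["/oscilloscope2"])  # 1.0
```

The decorated `rms` is written as if it handled a single signal, yet it is called with the whole collection.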

Operating on Multiple Trees#

The examples so far have involved mapping functions or methods over the nodes of a single tree, but we can generalize this to mapping functions over multiple trees at once.

Comparing Trees for Isomorphism#

For it to make sense to map a single non-unary function over the nodes of multiple trees at once, each tree needs to have the same structure. Specifically, two trees can only be considered similar, or “isomorphic”, if they have the same number of nodes, and each corresponding node has the same number of children. We can check if any two trees are isomorphic using the DataTree.isomorphic() method.

In [66]: dt1 = DataTree.from_dict({"a": None, "a/b": None})

In [67]: dt2 = DataTree.from_dict({"a": None})

In [68]: dt1.isomorphic(dt2)
Out[68]: False

In [69]: dt3 = DataTree.from_dict({"a": None, "b": None})

In [70]: dt1.isomorphic(dt3)
Out[70]: False

In [71]: dt4 = DataTree.from_dict({"A": None, "A/B": xr.Dataset({"foo": 1})})

In [72]: dt1.isomorphic(dt4)
Out[72]: True

If you try to map a function over trees that are not isomorphic, a TreeIsomorphismError will be raised. Notice that corresponding tree nodes do not need to have the same name or contain the same data in order to be considered isomorphic.
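The structural comparison can be sketched recursively. In the snippet below each tree is a nested dict in which every value is the dict of that node's children; this representation, and pairing children in insertion order, are assumptions of the sketch rather than a description of DataTree.isomorphic().

```python
def isomorphic(a, b):
    """Structural comparison of two nested-dict trees, ignoring node names."""
    # Corresponding nodes must have the same number of children.
    if len(a) != len(b):
        return False
    # Pair up children in insertion order and compare each pair in turn.
    return all(isomorphic(x, y) for x, y in zip(a.values(), b.values()))


dt1 = {"a": {"b": {}}}
dt2 = {"a": {}}
dt3 = {"a": {}, "b": {}}
dt4 = {"A": {"B": {}}}

print(isomorphic(dt1, dt2))  # False
print(isomorphic(dt1, dt3))  # False
print(isomorphic(dt1, dt4))  # True
```

Node names never enter the comparison, which is why dt1 and dt4 count as isomorphic.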

Arithmetic Between Multiple Trees#

Arithmetic operations like multiplication are binary operations, so as long as we have two isomorphic trees, we can do arithmetic between them.

In [73]: currents = DataTree.from_dict(
   ....:     {
   ....:         "/oscilloscope1": xr.Dataset(
   ....:             {
   ....:                 "current": (
   ....:                     "time",
   ....:                     signal_generator(time_stamps1, f=2, A=1.2, phase=1),
   ....:                 ),
   ....:             },
   ....:             coords={"time": time_stamps1},
   ....:         ),
   ....:         "/oscilloscope2": xr.Dataset(
   ....:             {
   ....:                 "current": (
   ....:                     "time",
   ....:                     signal_generator(time_stamps2, f=1.6, A=1.6, phase=0.7),
   ....:                 ),
   ....:             },
   ....:             coords={"time": time_stamps2},
   ....:         ),
   ....:     }
   ....: )
   ....: 

In [74]: currents
Out[74]: 
DataTree('None', parent=None)
├── DataTree('oscilloscope1')
│       Dimensions:  (time: 15)
│       Coordinates:
│         * time     (time) float64 0.0 0.6732 1.346 2.02 ... 7.405 8.078 8.752 9.425
│       Data variables:
│           current  (time) float64 1.01 0.8568 -0.6285 -1.136 ... -1.191 -0.4074 1.01
└── DataTree('oscilloscope2')
        Dimensions:  (time: 10)
        Coordinates:
          * time     (time) float64 0.0 0.6981 1.396 2.094 ... 4.189 4.887 5.585 6.283
        Data variables:
            current  (time) float64 1.031 1.552 0.3297 -1.263 ... 1.259 -0.3356 -1.553

In [75]: currents.isomorphic(voltages)
Out[75]: True

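We could use this feature to quickly calculate the electrical power in our signal, P=IV. Note that, as with arithmetic between xarray.Dataset objects, variables are paired by name, so only the variable common to both trees (current) appears in the product below.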

In [76]: power = currents * voltages

In [77]: power
Out[77]: 
DataTree('None', parent=None)
├── DataTree('oscilloscope1')
│       Dimensions:  (time: 15)
│       Coordinates:
│         * time     (time) float64 0.0 0.6732 1.346 2.02 ... 7.405 8.078 8.752 9.425
│       Data variables:
│           current  (time) float64 1.02 0.7341 0.395 1.292 ... 0.01505 1.419 0.166 1.02
└── DataTree('oscilloscope2')
        Dimensions:  (time: 10)
        Coordinates:
          * time     (time) float64 0.0 0.6981 1.396 2.094 ... 4.189 4.887 5.585 6.283
        Data variables:
            current  (time) float64 1.062 2.408 0.1087 1.594 ... 1.585 0.1126 2.412
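Arithmetic between two trees can be pictured as a loop over corresponding nodes, with the operator applied to each pair of datasets. The sketch below uses plain nested dicts and a hypothetical `binary_op` helper. For simplicity it pairs nodes by identical path, which is stricter than the isomorphism requirement described earlier.

```python
import operator


def binary_op(op, tree_a, tree_b):
    """Apply op node by node, pairing variables that share a name."""
    # Simplification of this sketch: node paths must be identical.
    if tree_a.keys() != tree_b.keys():
        raise ValueError("trees do not have the same node paths")
    result = {}
    for path in tree_a:
        a, b = tree_a[path], tree_b[path]
        # Only variables present in both nodes take part in the operation.
        result[path] = {name: op(a[name], b[name]) for name in a if name in b}
    return result


currents = {"/oscilloscope1": {"current": 2.0}}
voltages = {"/oscilloscope1": {"potential": 3.0, "current": 2.0}}

product = binary_op(operator.mul, currents, voltages)

print(product)  # {'/oscilloscope1': {'current': 4.0}}
```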