library(data.tree)
Working with data.tree
data.tree
is used to organize and manipulate hierarchical data structures in RMINC
. I’m still not completely comfortable using it, so I’m documenting my process for figuring out how data.tree
works.
This blog uses toy example that comes with data.tree
, but figuring out how things work with the toy example should help me understand how the hierarchical anatomy tree is manipulated in RMINC
.
Examples from https://cran.r-project.org/web/packages/data.tree/vignettes/data.tree.html
Definitions
The building block of a data tree is a Node
object. All operations are based around the Node
. Each node can store multiple attributes. An attribute can be 1) a field/active associated with the node; 2) a method, or default function, that comes with all nodes made via data.tree
; 3) a function that you have made and want to tack onto the node.
For example:
$Get("attribute") node
In this example, Get
is a function attribute associated with a node. Because it’s a function, it can take different arguments. In this case, the argument is “attribute”. With Get
, I’m asking to grab the values under attribute
for node
and its children.
If you’re interested in another attribute, such as a variable associated with the node,
$position # grab the position node
This syntax will only grab the position
field associated with the specific node. If one is interested in grabbing the position
node of the specific field and its children, then the Get
function comes into play.
$Get("position") node
In Summary 1. Attributes: can be 1) variables associated with a node (like a cell in a column of name x of a data frame); 2) default methods of all nodes; 3) custom methods tacked onto a node. 2. To get the attributes, the syntax is node$name_of_attribute
.
Tree creation
Create a tree programmatically
Making a tree programmatically using the example data provided by data.tree
.
# initialize a new node named "Acme Inc."
<- Node$new("Acme Inc.")
acme acme
levelName
1 Acme Inc.
# adding a child of the acme node, or a branch from it, called "Accounting"
<- acme$AddChild("Accounting")
accounting # accounting is now its own node entity
accounting
levelName
1 Accounting
# and acme now has an accounting child node appended to it
acme
levelName
1 Acme Inc.
2 °--Accounting
# so on and so forth, we add more child nodes. AddChild is one of the default functions/methods
# that come with each node.
<- accounting$AddChild("New Software")
software <- accounting$AddChild("New Accounting Standards")
standards
<- acme$AddChild("Research")
research <- research$AddChild("New Product Line")
newProductLine <- research$AddChild("New Labs")
newLabs
<- acme$AddChild("IT")
it <- it$AddChild("Outsource")
outsource <- it$AddChild("Go agile")
agile <- it$AddChild("Switch to R")
goToR
print(acme)
levelName
1 Acme Inc.
2 ¦--Accounting
3 ¦ ¦--New Software
4 ¦ °--New Accounting Standards
5 ¦--Research
6 ¦ ¦--New Product Line
7 ¦ °--New Labs
8 °--IT
9 ¦--Outsource
10 ¦--Go agile
11 °--Switch to R
The whole tree, acme, is a node. It started as a node. Subsequent nodes added are tacked onto the initial node, acme, to build a tree.
Notice their classes are the same.
class(acme)
[1] "Node" "R6"
class(it) # randomly chosen node that is not acme
[1] "Node" "R6"
This is different from a data table or data frame, where objects can exist as independent variables, but once they are put into a data frame, the object that stores them, or the data frame, is an entirely different class than the independent variables.
The same node can be referred to in multiple ways. For example, it
and acme$IT
refer to the same thing.
The tree root is always that first node where everything else is built upon. In this case, the root is acme
. Leaves are the final branches in a tree. We can see which node is a root or a leaf by using the function attributes isLeaf
or isRoot
.
$isRoot acme
[1] TRUE
$Accounting$isRoot acme
[1] FALSE
$Accounting$isLeaf acme
[1] FALSE
$Accounting$`New Software`$isLeaf acme
[1] TRUE
Currently, the only attributes associated with the nodes are their names and some default functions that are associated with all nodes. Let’s take a look at the syntax.
$Get("name") # grab it's attribute function Get and use it to retrieve the name associated with node it and its children. it
IT Outsource Go agile Switch to R
"IT" "Outsource" "Go agile" "Switch to R"
$name # grab it node's attribute name, which will show only its name on the node it
[1] "IT"
"name"]] # this also grabs just the node's attribute, name, at that specific node. it[[
[1] "IT"
Adding custom attributes to a tree
Think of these like adding columns of data to associate with each node i.e. if you’re working with brain regions, volumes
is an attribute that you can add to the tree to describe the volumes associated with each subject, for different brain regions, in your data.
Here, I’m continuing the acme
example given by Christoph Glur’s vignette.
Reminder of current tree structure.
acme
levelName
1 Acme Inc.
2 ¦--Accounting
3 ¦ ¦--New Software
4 ¦ °--New Accounting Standards
5 ¦--Research
6 ¦ ¦--New Product Line
7 ¦ °--New Labs
8 °--IT
9 ¦--Outsource
10 ¦--Go agile
11 °--Switch to R
Simple addition of a value under cost
to a node.
$Accounting$`New Software`$cost <- 1000
acme
# to display appended values, need to use the print function and data column of interest. showing two different ways to get this information
print(acme, "cost")
levelName cost
1 Acme Inc. NA
2 ¦--Accounting NA
3 ¦ ¦--New Software 1000
4 ¦ °--New Accounting Standards NA
5 ¦--Research NA
6 ¦ ¦--New Product Line NA
7 ¦ °--New Labs NA
8 °--IT NA
9 ¦--Outsource NA
10 ¦--Go agile NA
11 °--Switch to R NA
$Get("cost") acme
Acme Inc. Accounting New Software
NA NA 1000
New Accounting Standards Research New Product Line
NA NA NA
New Labs IT Outsource
NA NA NA
Go agile Switch to R
NA NA
Even though I only assigned cost
value to one node, it seems that all the other nodes can also have cost
printed alongside them, but the values of cost
are just assumed to be missing for those nodes.
Let’s what happens when I assign more values to the same node.
$Accounting$`New Software`$cost <- c(2000, 3000)
acmeprint(acme, "cost")
levelName cost
1 Acme Inc.
2 ¦--Accounting
3 ¦ ¦--New Software 2000, 3000
4 ¦ °--New Accounting Standards
5 ¦--Research
6 ¦ ¦--New Product Line
7 ¦ °--New Labs
8 °--IT
9 ¦--Outsource
10 ¦--Go agile
11 °--Switch to R
That seems to replace the previous value on the tree, but shows us that multiple values can be added to the same node as a list.
How do I append a value onto the node?
# create a function called addValue, which takes in a node, a set of values, and append those values to the existing variable "cost" associated with the node. append these values to the beginning of the vector
<- function(node, values) {
addValue return(c(values,node$cost))
}
# using Get to call the function addValue at the specific node
$Accounting$`New Software`$Get(addValue, values = c(1000, 1500)) acme
New Software
[1,] 1000
[2,] 1500
[3,] 2000
[4,] 3000
Okay, this seems to have added values onto the specific node (acme > Accounting > New Software)
, but did this reflect on the tree?
$Get("cost") acme
$`Acme Inc.`
[1] NA
$Accounting
[1] NA
$`New Software`
[1] 2000 3000
$`New Accounting Standards`
[1] NA
$Research
[1] NA
$`New Product Line`
[1] NA
$`New Labs`
[1] NA
$IT
[1] NA
$Outsource
[1] NA
$`Go agile`
[1] NA
$`Switch to R`
[1] NA
Doesn’t seem to be reflected on the tree. How do I get it to reflect on the tree?
# to the acme tree, find the the value cost at any node whose name equal "New Software" and add 1000 and 1500 to the existing values in cost
$Do(function(node) node$cost <- addValue(node, c(1000,1500)),
acmefilterFun = function(x) x$name == "New Software")
print(acme, "cost")
levelName cost
1 Acme Inc.
2 ¦--Accounting
3 ¦ ¦--New Software 1000, 1500, 2000, 3000
4 ¦ °--New Accounting Standards
5 ¦--Research
6 ¦ ¦--New Product Line
7 ¦ °--New Labs
8 °--IT
9 ¦--Outsource
10 ¦--Go agile
11 °--Switch to R
Ah! Seems like the Get
attribute only applies a function to the node temporarily, while the Do
attribute applies a function to the node permanently and allows it to reflect on the tree.
Okay, just for fun, I’m going to add in some values to this example to reflect scenarios that I might see when working with MRI data.
# at every node that is a leaf not "New Software", add a vector with NA and 3 random numbers between 1000 and 3000 to cost.
$Do(function(node) node$cost <- c(NA, sample(1000:3000, size = 3, replace = TRUE)),
acmefilterFun = function(x) isLeaf(x))
# create another vector, office, that identifies the ID of different acme offices.
$Do(function(node) node$office <- c(LETTERS[1:4]))
acme
print(acme, "cost", "office")
levelName cost office
1 Acme Inc. A, B, C, D
2 ¦--Accounting A, B, C, D
3 ¦ ¦--New Software NA, 1669, 2872, 1593 A, B, C, D
4 ¦ °--New Accounting Standards NA, 2129, 2124, 1498 A, B, C, D
5 ¦--Research A, B, C, D
6 ¦ ¦--New Product Line NA, 2673, 2390, 2783 A, B, C, D
7 ¦ °--New Labs NA, 1821, 2466, 1119 A, B, C, D
8 °--IT A, B, C, D
9 ¦--Outsource NA, 1422, 1392, 2479 A, B, C, D
10 ¦--Go agile NA, 2535, 1610, 2490 A, B, C, D
11 °--Switch to R NA, 2803, 2663, 2723 A, B, C, D
Aha! Seems like I can make a new attribute and apply a function to it at the same time (in the case of our office
variable attribute).
I’m assigning cost
values to only the leaves because this setup mimics what I usually get with mouse MRI data, where the volume values are only given for the most detailed levels of segmentations (leaves) and added together to create volumes at the highest levels of segmentations.
Especially with younger mice, many times we can get no good segmentations for regions that are on the mouse atlas, but do not exist in mice at younger ages. Hence why setting NA
as one of the elements in the vectors for cost
will mimic the structure of missing volumes as well.
Summing values in a data tree
In RMINC
, the function addVolumesToHierarchy
“propagates volumes up the tree by summing.” This means that the children node of a tree are added together to create volumes for the parent nodes of a tree. I want to understand how this works. Below is my attempt at recreating a similar function with the toy data.
First, I’m using the function Aggregate
to do the summation. I’m still applying this to our acme
tree example.
At this point, I am curious how the vectors of cost
will be added together across different leaves nodes. Perhaps they’ll be added together just like how R
normally adds two vectors?
Adding just two simple vectors for demonstration purposes.
<- sample(1:10, replace = TRUE)
a <- sample(1:10, replace = TRUE)
b
a
[1] 6 6 7 6 6 9 4 7 9 7
b
[1] 10 2 5 6 9 2 9 7 8 5
+ b a
[1] 16 8 12 12 15 11 13 14 17 12
Now to aggregate the at each node. Here, I want the costs in the children’s nodes to be added together and appended to the parent’s nodes, until we reach the root node, acme
.
# before Aggregate
print(acme, "cost")
levelName cost
1 Acme Inc.
2 ¦--Accounting
3 ¦ ¦--New Software NA, 1669, 2872, 1593
4 ¦ °--New Accounting Standards NA, 2129, 2124, 1498
5 ¦--Research
6 ¦ ¦--New Product Line NA, 2673, 2390, 2783
7 ¦ °--New Labs NA, 1821, 2466, 1119
8 °--IT
9 ¦--Outsource NA, 1422, 1392, 2479
10 ¦--Go agile NA, 2535, 1610, 2490
11 °--Switch to R NA, 2803, 2663, 2723
# for every node in the tree, sum together the values of the children's nodes and tack them onto the parent's node (start at the children's nodes, or post-order). assign these values to cost
$Do(function(node) node$cost <- Aggregate(node, attribute = "cost", aggFun = sum), traversal = "post-order") acme
That doesn’t seem to work. Why? What if I just have one value for each cost?
$Do(function(node) node$cost_single <- 1, filterFun = isLeaf)
acmeprint(acme, "cost_single")
levelName cost_single
1 Acme Inc. NA
2 ¦--Accounting NA
3 ¦ ¦--New Software 1
4 ¦ °--New Accounting Standards 1
5 ¦--Research NA
6 ¦ ¦--New Product Line 1
7 ¦ °--New Labs 1
8 °--IT NA
9 ¦--Outsource 1
10 ¦--Go agile 1
11 °--Switch to R 1
$Do(function(node) node$cost_sum <- Aggregate(node, attribute = "cost_single", aggFun = sum), traversal = "post-order")
acmeprint(acme, "cost_single", "cost_sum")
levelName cost_single cost_sum
1 Acme Inc. NA 7
2 ¦--Accounting NA 2
3 ¦ ¦--New Software 1 1
4 ¦ °--New Accounting Standards 1 1
5 ¦--Research NA 2
6 ¦ ¦--New Product Line 1 1
7 ¦ °--New Labs 1 1
8 °--IT NA 3
9 ¦--Outsource 1 1
10 ¦--Go agile 1 1
11 °--Switch to R 1 1
Now that works. Seems like Aggregate
doesn’t take vectors. We need to make custom functions for adding the nodes from children to parent. The below function is modified from the function behind Aggregate
in the data.tree
package.
<- function(node) {
addVectors <- node$cost # obtain the values associated with cost at each node
result if(all(is.na(result))) # if all values are NA (you get a vector of NAs, which occur at non-leaf nodes)
<- Reduce(`+`, lapply(node$children, addVectors)) # then find the cost vectors associated with the children of that node and add them together
result return (result) # this is now the new result of cost summation at the parent node
}
lapply
is used because cost from node$children
is a list of vectors, and I want to retain that vector structure when using Reduce
to add them together.
Just to see this clearly. This will print a list of vectors associated with cost for the children of node IT
.
lapply(FindNode(acme, "IT")$children, addVectors)
$Outsource
[1] NA 1422 1392 2479
$`Go agile`
[1] NA 2535 1610 2490
$`Switch to R`
[1] NA 2803 2663 2723
And if I use Reduce
to add the vectors together.
Reduce(`+`, lapply(FindNode(acme, "IT")$children, addVectors))
[1] NA 6760 5665 7692
Now I’m applying the new function I just made. Does it work now?
$Do(function(node) node$cost_sum <- addVectors(node), traversal = "post-order")
acmeprint(acme, "cost", "cost_sum")
levelName cost
1 Acme Inc.
2 ¦--Accounting
3 ¦ ¦--New Software NA, 1669, 2872, 1593
4 ¦ °--New Accounting Standards NA, 2129, 2124, 1498
5 ¦--Research
6 ¦ ¦--New Product Line NA, 2673, 2390, 2783
7 ¦ °--New Labs NA, 1821, 2466, 1119
8 °--IT
9 ¦--Outsource NA, 1422, 1392, 2479
10 ¦--Go agile NA, 2535, 1610, 2490
11 °--Switch to R NA, 2803, 2663, 2723
cost_sum
1 NA, 15052, 15517, 14685
2 NA, 3798, 4996, 3091
3 NA, 1669, 2872, 1593
4 NA, 2129, 2124, 1498
5 NA, 4494, 4856, 3902
6 NA, 2673, 2390, 2783
7 NA, 1821, 2466, 1119
8 NA, 6760, 5665, 7692
9 NA, 1422, 1392, 2479
10 NA, 2535, 1610, 2490
11 NA, 2803, 2663, 2723
Nice!
Replacing NA
values with zero
Sometimes, segmentation is not perfect and will result in NA
in volumes for certain regions, especially with younger animals. One strategy to deal with this is to replace the NA
with zero (otherwise the total volumes will be NA
). I’ll try to demonstrate this with the toy example.
Making a function to do so.
<- function(node) {
replaceWithZero <- node$cost_sum # find the cost vector at the node
result if(any(is.na(result))) # is there any element on the vector that is NA?
which(is.na(result))] <- 0 # if that's true, find which element is NA and replace it with zero
result[return(result) # return the result
}
Applying the function.
$Do(function(node) node$cost_sum <- replaceWithZero(node))
acmeprint(acme,"cost", "cost_sum", "office")
levelName cost cost_sum
1 Acme Inc. 0, 15052, 15517, 14685
2 ¦--Accounting 0, 3798, 4996, 3091
3 ¦ ¦--New Software NA, 1669, 2872, 1593 0, 1669, 2872, 1593
4 ¦ °--New Accounting Standards NA, 2129, 2124, 1498 0, 2129, 2124, 1498
5 ¦--Research 0, 4494, 4856, 3902
6 ¦ ¦--New Product Line NA, 2673, 2390, 2783 0, 2673, 2390, 2783
7 ¦ °--New Labs NA, 1821, 2466, 1119 0, 1821, 2466, 1119
8 °--IT 0, 6760, 5665, 7692
9 ¦--Outsource NA, 1422, 1392, 2479 0, 1422, 1392, 2479
10 ¦--Go agile NA, 2535, 1610, 2490 0, 2535, 1610, 2490
11 °--Switch to R NA, 2803, 2663, 2723 0, 2803, 2663, 2723
office
1 A, B, C, D
2 A, B, C, D
3 A, B, C, D
4 A, B, C, D
5 A, B, C, D
6 A, B, C, D
7 A, B, C, D
8 A, B, C, D
9 A, B, C, D
10 A, B, C, D
11 A, B, C, D
That works!
Finding mean cost associated with each office
Making a function to find the mean cost
<- function(node) {
getMean <- node$cost_sum
result return(mean(result))
}
Applying the function.
$Do(function(node) node$meanCost <- getMean(node))
acmeprint(acme, "meanCost")
levelName meanCost
1 Acme Inc. 11313.50
2 ¦--Accounting 2971.25
3 ¦ ¦--New Software 1533.50
4 ¦ °--New Accounting Standards 1437.75
5 ¦--Research 3313.00
6 ¦ ¦--New Product Line 1961.50
7 ¦ °--New Labs 1351.50
8 °--IT 5029.25
9 ¦--Outsource 1323.25
10 ¦--Go agile 1658.75
11 °--Switch to R 2047.25
Building models
RMINC
has a few built in modeling functions, like mincLm
and mincLmer
. How might these functions work? I’ll try to build something similar with the toy example here.
First, I want to see if linear modeling with a data tree requires that I account for the location of a node on a tree (i.e. is it a child, parent, dependent on any other nodes?). Given that RMINC
can create linear models and append results directly on the tree, I used the functions for doing so in RMINC
as my reference.
I’m reading through the source code of the relevant functions in RMINC
. To keep this report short, I will not print out the results, but the functions to read the source codes are as followed.
library(RMINC)
getAnywhere(hanatLm)
getAnywhere(anatLm)
Based on the source codes of these functions, it doesn’t seem like I need to define hierarchical relationships between the nodes in the linear models. Proceed as usual to make a function for doing linear models at every node.
I want to model the relationship between the total cost (cost_sum
) and each acme office
. The results will be very wonky. I didn’t put in any underlying trends for the fake data.
<- function(node, attribute_y, attribute_x) {
makeLm
<- node[[attribute_y]] # grab attribute_y associated with each node
y <- node[[attribute_x]] # grab attribute_x associated with each node
x
<- lm(y ~ x) # model the two variables
model
<- summary(model) # print summary of the model
model_summary <- coef(model_summary)[2, "Pr(>|t|)"] # grab p-value associated with office
p_val
return(p_val)
}
Now apply the function. First, running it with Get
so that the tree won’t be permanently changed.
$Get(function(node) node$pval <- makeLm(node, attribute_y = "cost_sum", attribute_x = "office")) acme
Acme Inc. Accounting New Software
NaN NaN NaN
New Accounting Standards Research New Product Line
NaN NaN NaN
New Labs IT Outsource
NaN NaN NaN
Go agile Switch to R
NaN NaN
Of course the results would look weird. I only have one value per office after all. Can’t do any comparisons there. But seems like the function ran correctly at least.
Just for fun, I’ll make a new numeric, random variable, to run in a model.
$Do(function(node) node$randomNum <- sample(1:1000, 4))
acmeprint(acme, "cost_sum", "randomNum")
levelName cost_sum randomNum
1 Acme Inc. 0, 15052, 15517, 14685 294, 228, 454, 279
2 ¦--Accounting 0, 3798, 4996, 3091 532, 591, 694, 699
3 ¦ ¦--New Software 0, 1669, 2872, 1593 550, 464, 416, 481
4 ¦ °--New Accounting Standards 0, 2129, 2124, 1498 306, 767, 601, 64
5 ¦--Research 0, 4494, 4856, 3902 342, 84, 598, 586
6 ¦ ¦--New Product Line 0, 2673, 2390, 2783 956, 948, 428, 717
7 ¦ °--New Labs 0, 1821, 2466, 1119 927, 437, 586, 13
8 °--IT 0, 6760, 5665, 7692 510, 701, 991, 206
9 ¦--Outsource 0, 1422, 1392, 2479 272, 952, 35, 152
10 ¦--Go agile 0, 2535, 1610, 2490 695, 576, 926, 132
11 °--Switch to R 0, 2803, 2663, 2723 912, 253, 895, 787
Running another model on the data tree.
$Get(function(node) node$pval_randomNum <- makeLm(node, attribute_y = "cost_sum", attribute_x = "randomNum")) acme
Acme Inc. Accounting New Software
0.83038361 0.23949966 0.00501231
New Accounting Standards Research New Product Line
0.46061489 0.85684164 0.56481047
New Labs IT Outsource
0.66859012 0.94934781 0.92994499
Go agile Switch to R
0.50779430 0.53444781
Yay works.
Takeaways
In this toy example, I tried to emulate some of the more common data tree structures/operations that one can do in the hierarchical anatomy of RMINC
to get a better understanding of how to manipulate the hierarchical data tree in RMINC
.
Some other operations that one might do, which I did not demonstrate here, include pruning a tree to retain only the parent nodes (i.e. you might want just cerebellar volume, but not subregions of the cerebellum). To do this, one could use the Prune
function, or using the examples for how to make functions for data trees nodes, one could make up a function to remove nodes with names equal to certain children nodes.
I also learned that whether the data is in tree or data table formats, running linear models on the nodes work the same way.
Perhaps the main advantage of using a data tree then, in the context of mouse MRI, is that the volume of small subregions can be quickly calculated and summed to make the volumes of larger regions. Otherwise, analyses remain largely the same regardless of whether it’s a tree or data table.
Lastly, when using RMINC
, the volumes are appended onto the data tree, but subject meta information are located in a separate CSV. It’s important that the order of subjects in the CSV matches up with the order of subject values listed in the tree (i.e. the vector of values for cerebellum might be c(volume_1, volume_2, volume_3)
. That means the meta information CSV has to also list subjects as c(subject_1, subject_2, subject_3)
).
To check the order for how subjects volumes are appended onto a tree, one could do the following:
# find the output segmentation file for each subject, labeled with suffixes _voted
<- system("find /well/lerch/users/xrs336/all-cohorts-test-run-maget-070325/ -type f -name '*_voted.mnc'", intern = TRUE))
segmentation_files
# grabbing volumes of the segmentation files. the data is a matrix of volume columns and an input filename column.
<- anatGetAll(filenames = seg_files, defs = defs, method = "labels")
allvols
# grabbing just the vector of filenames from the input data matrix
<- data.frame(filename = attributes(allvols)$input)
tree_order
# this can now be used to ensure that the CSV containing the subject meta information has a filename column that follows the exact filename order as the matrix of volumes that will be appended to the data tree
left_join(tree_order, meta_info_csv, by = "filename")
Reference
To understand what attributes are associated with each node, simply find the node of interest, follow it with $
, and a pop up of attributes options will follow.
Use the [data.tree documentation] (https://cran.r-project.org/web/packages/data.tree/data.tree.pdf) to understand the function of each attribute.