Before we get into how we calculate phylogenies, we are going to first go over the tree datatype a little more in depth. This includes what the basic structure of a tree is, how we move around tree space, and how to read in a tree from a file.
Basic tree structure
A phylogenetic tree includes nodes and edges (also referred to as branches). The nodes include the tips and the putative ancestors (internal nodes). Branches represent an evolutionary lineage. The tips of the tree (external nodes) are more technically called OTUs or operational taxonomic units. In the standard model they are bifurcating like this
+ species1 + + + species2 : : + + species3 : : + species4 + + species5
This tree shows species a sister to b (most closely related to) with c being more closely related to a and b than to d.
There can be multifurcations or polytomies that often are the result of not having enough data to determine the relationship. This might look like this
+ species1 : ++ species2 : : + + species3 : : + species4 + + species5
In addition to drawing the trees in this way we can also draw trees in a circle.
This tree shows the same relationships as the first tree but in just a different way.
All of the trees presented above have some direction (from the root to the tips you have ancestral to derived). If we don’t have that information we would say that the tree is unrooted. It may appear that the tree is still rooted, but the rooting in this case would be arbitrary. An unrooted tree would look like this
+ species1 : ++ species2 : : : + species3 + :+ species4 : + species5
The unrooted tree can also be viewed as something like this
species1 +\ /+species4 \_____________/ /  \ species2 +/  \+species5 + species5
In order to root the tree, we need to identify ingroups and outgroups within our dataset. For example, if we were examining a set of species that were eukaryotes, we might include some bacteria in the analysis because we know that they won’t be in the eukaryotes. We know they are outside. We would consider the bacteria to be the outgroup and we could place a root between the bacteria and the eukaryotes.
+ Eukaryote1 + : + Eukaryote2 + : + Bacteria1 + + Bacteria2
Another aspect of trees is that the nodes can have meaningful lengths, branch lengths. These can either the genetic distance (in most cases, substitutions per site), character distance, or time. Here is an example tree with branches that are in substitutions per site
+ e + : + f + : + g + + h
Basic node class
In addition to the class that we wrote in order to manage sequences, we are going to use a node class to help manage trees. You will see in the node class, that because nodes are connected, we don’t actually need a tree class. We can just connect all the nodes and refer to the root. If we have the root node, then we can always get the other nodes from the root. Here is the basic one that is in the hackingtrees repository. There are a few functions that have programming concepts that may seem a little new like recursion (where the function is calling itself). This occurs in the leaves() function, the iternodes() function, and the get_newick_repr() function.
class Node:
def __init__(self):
self.label = ""
self.length = 0.0
self.parent = None
self.children = []
self.data = {}
self.istip = False
def add_child(self,child):
#make sure that the child is not already in there
assert child not in self.children
self.children.append(child)
child.parent = self
def remove_child(self,child):
#make sure that the child is in there
assert child in self.children
self.children.remove(child)
child.parent = None
def leaves(self,v=None):
if v == None:
v = []
if len(self.children) == 0:
v.append(self)
else:
for child in self.children:
child.leaves(v)
return v
def leaves_fancy(self):
return [n for n in self.iternodes() if n.istip ]
def iternodes(self,order="preorder"):
if order.lower() == "preorder":
yield self
for child in self.children:
for d in child.iternodes(order):
yield d
if order.lower() == "postorder":
yield self
def prune(self):
p = self.parent
if p != None:
p.remove_child(self)
return p
def get_newick_repr(self,showbl=False):
ret = ""
for i in range(len(self.children)):
if i == 0:
ret += "("
ret += self.children[i].get_newick_repr(showbl)
if i == len(self.children)1:
ret += ")"
else:
ret += ","
if self.label != None:
ret += self.label
if showbl == True:
ret += ":" + str(self.length)
return ret
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 
class Node:
def __init__(self):
self.label = ""
self.length = 0.0
self.parent = None
self.children = []
self.data = {}
self.istip = False
def add_child(self,child):
#make sure that the child is not already in there
assert child not in self.children
self.children.append(child)
child.parent = self
def remove_child(self,child):
#make sure that the child is in there
assert child in self.children
self.children.remove(child)
child.parent = None
def leaves(self,v=None):
if v == None:
v = []
if len(self.children) == 0:
v.append(self)
else:
for child in self.children:
child.leaves(v)
return v
def leaves_fancy(self):
return [n for n in self.iternodes() if n.istip ]
def iternodes(self,order="preorder"):
if order.lower() == "preorder":
yield self
for child in self.children:
for d in child.iternodes(order):
yield d
if order.lower() == "postorder":
yield self
def prune(self):
p = self.parent
if p != None:
p.remove_child(self)
return p
def get_newick_repr(self,showbl=False):
ret = ""
for i in range(len(self.children)):
if i == 0:
ret += "("
ret += self.children[i].get_newick_repr(showbl)
if i == len(self.children)1:
ret += ")"
else:
ret += ","
if self.label != None:
ret += self.label
if showbl == True:
ret += ":" + str(self.length)
return ret

Reading a newick string
Something that we will need to do in order to conduct analyses in the exercises that will follow is communicating trees. The most common way that we do this in phylogenetics is with a newick string. Let’s just look at what the newick string would be for this tree
+ species1 + + + species2 : : + + species3 : : + species4 + + species5 (((species1,species2),species3),(species4,species5));
That is the newick string for the whole tree but it can make a little more sense if we notice that we can make a newick string for each of the nodes in the tree. For example, here is the newick string for one of the clades
+ species1 + + species2 (species1,species2);
if we add species3
+ species1 + + + species2 : + species3 ((species1,species2),species3);
If we wanted to add branch lengths, we add them like node:length. So for the above tree
+ species1 + + + species2 : + species3 ((species1:0.1,species2:0.2):0.15,species3:0.2);