Every field is experiencing enormous growth in data collection and therefore in the need to process data.
In addition to the general subject of data processing below, this section contains the chapters on
For the rest of this chapter, we are going to talk about the more general issue of data processing.
First, I would say that you should check out the python tutorial if you haven’t done that yet.
To start, we are going to go over the common steps that one might go through when processing any data. Often the task is to get some data that is in a text file to be better formatted for our standard input readers (e.g., tree readers, alignment readers, maybe something in R). Specifically, we are going to go over basic processing of delimited files and some regex.
We can’t go over everything about data processing here, but if you have a specific request, please leave it in a comment and either I or someone else here will address it.
Reading and writing files
We will need to read and write files all the time in python. It is extremely simple to read and write files. Here is a simple example of opening a file, reading each line and printing it to the screen. This is in file reading_file.py in the dataprocessing directory:
import sys
infile = open(sys.argv[1],"r")
for i in infile:
    print i
infile.close()
The new bits here are open, which is the command to open a file. It takes the name of the file (here sys.argv[1], so a filename given on the command line) and “r”, which says to open the file for reading. Other options are “w” for write and “a” for append. Then there is infile.close(), which you need to call every time you are done with a file you opened. If any other parts of the code look foreign (like the “for”), check out the python tutorial. Here is the output if I run python reading_file.py reading_file.py
import sys

infile = open(sys.argv[1],"r")

for i in infile:

    print i

infile.close()

You will notice that the output has an extra blank line after every line. That is because each line read from the file ends with an invisible ‘newline’ character, and print adds a second one. These newline characters can differ between Mac (especially OS 9 programs), Linux, and Windows, and sometimes this will cause problems in the programs that you use. Anyway, we can get rid of them as we read in files if we do
import sys
infile = open(sys.argv[1],"r")
for i in infile:
    print i.strip()
infile.close()
The i.strip() returns a string that has all the whitespace on the front and the back removed. It doesn’t actually change the string, i.strip() just returns a string with no whitespace on the front or back. If you used i again after that it would still have the whitespace on the end. You could say j = i.strip() and then i would have the whitespace and j would not.
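To see that strip() leaves the original string untouched, here is a minimal sketch. It is written in Python 3 syntax (print is a function), and the sample string is made up for illustration. It also shows rstrip("\n"), which removes only the trailing newline and keeps leading whitespace such as indentation.

```python
# A made-up line, as it might come from a file: leading spaces and a trailing newline
line = "  Lonicera japonica\n"

stripped = line.strip()          # new string, whitespace removed from both ends
no_newline = line.rstrip("\n")   # new string, only the trailing newline removed

# repr() makes the invisible characters visible
print(repr(line))        # the original still has its whitespace
print(repr(stripped))
print(repr(no_newline))
```

If you only want to drop the line ending and keep everything else exactly as it was in the file, rstrip("\n") is the more conservative choice.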
How do we write things to an outfile? We can do that just by opening a file with the “w” or “a” option, for writing or appending respectively. Here is an example in the file writing_file.py
outfile = open("test","w")
outfile.write("line 1")
outfile.write("line 2")
outfile.write("line 3")
outfile.close()
This will make a file called test that has the following contents
line 1line 2line 3
Unfortunately, everything is on the same line. As with the line endings that we discussed above, we have to add the invisible ‘newline’ character which for linux is just “\n”. That changes the code to
outfile = open("test","w")
outfile.write("line 1\n")
outfile.write("line 2\n")
outfile.write("line 3\n")
outfile.close()
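A common alternative, sketched here in Python 3 syntax, is to open the file inside a with block so that it is closed automatically, even if an error occurs partway through. The file name test_with.txt and the temporary directory are assumptions made only for this illustration.

```python
import os
import tempfile

# Write into a temporary directory so the sketch does not clobber real files
outpath = os.path.join(tempfile.mkdtemp(), "test_with.txt")

lines = ["line 1", "line 2", "line 3"]

# The with block closes the file for us when the block ends
with open(outpath, "w") as outfile:
    for line in lines:
        outfile.write(line + "\n")  # write() adds nothing, so add the newline ourselves

# Read the file back to check its contents
with open(outpath, "r") as infile:
    contents = infile.read()

print(contents)
```

With this form there is no close() call to forget.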
The last really useful thing that we will go over here is a command that you can use on strings in python called split. You would use it if you have a text file with values separated by commas and you want to do something to each value. There is a file in the repository called lonicera_japonica.csv. This file includes country data and latitude and longitude values for Lonicera japonica from GBIF. We are going to split each line by commas and store the countries, latitudes, and longitudes. This is all done in the file split_file.py
import sys
infile = open(sys.argv[1],"r")
countries = []
lats = []
longs = []
for i in infile:
    i = i.strip() #take off whitespace
    spls = i.split(",") #split with comma
    countries.append(spls[0])
    lats.append(spls[1])
    longs.append(spls[2])
infile.close()
print "number of records: "+str(len(countries))
You run it like this: python split_file.py lonicera_japonica.csv. The output is
number of records: 3467
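One caveat with splitting on commas: a field that itself contains a comma, wrapped in quotes, gets cut in two. Whether the GBIF file has such fields is not stated here, but if yours does, Python's csv module handles the quoting. Here is a minimal sketch in Python 3 syntax; the rows are stand-ins and the Korea row is invented for illustration.

```python
import csv
import io

# Stand-in rows; the second one has a quoted field containing a comma
text = (
    'Country,Latitude,Longitude\n'
    '"Korea, Republic of",37.5,127.0\n'
    'Japan,135.283,34.7667\n'
)

# Naive approach: split every line on commas
naive = [line.split(",") for line in text.splitlines()]

# csv.reader understands the quotes and keeps the field together
rows = list(csv.reader(io.StringIO(text)))

print(naive[1])  # the country is broken into two pieces
print(rows[1])   # the country stays in one piece
```

For simple files with no quoted fields, split is perfectly adequate and easier to read.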
There are a bunch of cool things that we can do with this file like calculate maximums and minimums and unique values. With the file split_file2.py I will demonstrate calculating and printing unique country information
import sys
infile = open(sys.argv[1],"r")
countries = []
lats = []
longs = []
for i in infile:
    i = i.strip() #take off whitespace
    spls = i.split(",") #split with comma
    countries.append(spls[0])
    lats.append(spls[1])
    longs.append(spls[2])
infile.close()
uniq_countries = set(countries) # a set only holds unique records
print "number of records: "+str(len(countries))
print "number of unique records: "+str(len(uniq_countries))
print "here are the first ten:"
count = 0
for i in uniq_countries:
    if count < 10:
        print i
    else:
        break
    count += 1
This produces the output
number of records: 3467
number of unique records: 61
here are the first ten:

Brazil
Madagascar
Caribbean
BR
USA
SOUTH AFRICA
Panama
JP
Costa Rica
That blank first entry is there because some records don’t have a country, so the empty string counts as a country. (A set has no fixed order, so which ten entries you see may differ.) The examples presented here give a taste of some basic ways to process simple text files. We will go into specific ways to process text files like sequence files and tree files later.
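The maximums and minimums mentioned above can be sketched along the same lines. This example uses Python 3 syntax and a handful of rows in the same Country,Latitude,Longitude layout; the Japan rows come from the file, the other two are made up. The two things to notice are skipping the header line and converting the text to numbers with float before comparing.

```python
import io
from collections import Counter

# Stand-in for the real file: a header plus a few rows (last two rows are made up)
sample = io.StringIO(
    "Country,Latitude,Longitude\n"
    "Japan,135.283,34.7667\n"
    "Japan,134.533,34.6667\n"
    ",10.5,20.25\n"
    "GB,-1.5,52.0\n"
)

countries = []
lats = []
next(sample)  # skip the header line so it is not counted as a record
for line in sample:
    spls = line.strip().split(",")
    countries.append(spls[0])
    lats.append(float(spls[1]))  # convert text to a number before comparing

counts = Counter(countries)  # records per country (the blank country included)

print("minimum:", min(lats))
print("maximum:", max(lats))
for country, n in counts.most_common():
    print(repr(country), n)
```

Without the float conversion, min and max would compare the values as text, which gives alphabetical rather than numerical order.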
Regular expressions
Have you ever wanted to take a file and replace all the tabs with spaces, or do something more complicated like reverse the genus and species names in a file? These kinds of things (and much more) can be done with regular expressions. The text editors that I suggested here all have the capability to use regular expressions. If you go to search and replace in the respective programs you will see that as an option. I highly recommend checking out how to use those in your program of choice. Because these programs all handle regular expressions a little differently, I will let you find out how to do it in your favorite program. If you have specific questions, list them here.
You can also use regular expressions in python (check here), but I find that I can get most things done with split (see above).
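For the two tasks mentioned above, replacing tabs with spaces and reversing genus and species names, here is a minimal sketch using Python's re module, in Python 3 syntax. The sample names and the exact patterns are assumptions chosen for illustration; real files will usually need a more careful pattern.

```python
import re

# Made-up lines with a genus and species separated by a tab
text = "Lonicera\tjaponica\nQuercus\talba\n"

# Replace every tab with a single space
spaced = re.sub(r"\t", " ", text)

# Capture two words per line and swap them.
# \1 and \2 in the replacement refer to the first and second group;
# re.MULTILINE makes ^ and $ match at the start and end of every line.
swapped = re.sub(r"^(\w+) (\w+)$", r"\2 \1", spaced, flags=re.MULTILINE)

print(spaced)
print(swapped)
```

The same search and replace patterns carry over, with small syntax differences, to most text editors that support regular expressions.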
Command line tools
There are more than a few command line tools that can help you process data. A few that I use often are awk, head, tail, grep, and cat (also, check out sed). I will give a few examples for each of these and the general uses for each.
awk
awk is a great and simple language that can be used at the command line. As a simple first example, here is how to print just the country column from that lonicera_japonica.csv file using awk at the command line, without python; further down we will reduce this to the unique countries. At the command line we would type
awk -F ',' '{print $1}' lonicera_japonica.csv
This is saying that ‘,’ will be the delimiter (you could change that to ‘\t’ to separate by tabs), and then in the second set of quotes we have the command, which here is just to print the first column. If you used $2 instead, it would print the latitude. You can do more complicated things, like print the latitude and longitude if the country equals GB.
awk -F ',' '{if($1=="GB"){print $2","$3}}' lonicera_japonica.csv
What if you want to save the results to a file? You can redirect the output with >
awk -F ',' '{if($1=="GB"){print $2","$3}}' lonicera_japonica.csv > outfile
Now outfile has all the results. Here are three other tools: the pipe |, the sort command, and the uniq command. The | takes the output that would go to the screen and redirects it as input to another command line tool. So let’s take the output from the first example with the countries, sort it, and then only print the unique values. We would do that like this
awk -F ',' '{print $1}' lonicera_japonica.csv | sort | uniq
The first few lines of that output are

1
Argentina
Australia
AUSTRALIA
BE
Bolivia
BR
BRA
You can see a bunch of mess in there including a blank (which we saw in the python case), a 1 for who knows what reason, and some abbreviations. These are things we would have to fix, but you can see that we found these in just a few seconds after analyzing the file.
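For comparison, the awk commands above can be sketched in python with split and a set. This is Python 3 syntax, and the rows are made-up stand-ins for lonicera_japonica.csv, not real records. Note that python sorts strings by character code, which may order mixed-case names differently from sort on your system.

```python
import io

# Made-up rows standing in for lonicera_japonica.csv
sample = io.StringIO(
    "Country,Latitude,Longitude\n"
    "GB,-1.5,52.0\n"
    "Japan,135.283,34.7667\n"
    "GB,-2.0,53.5\n"
    ",10.5,20.25\n"
)

first_column = []
gb_coords = []
for line in sample:
    spls = line.strip().split(",")
    first_column.append(spls[0])                   # like: awk -F ',' '{print $1}'
    if spls[0] == "GB":                            # like: if($1=="GB")
        gb_coords.append(spls[1] + "," + spls[2])  # like: print $2","$3

unique_sorted = sorted(set(first_column))          # like: | sort | uniq

print(unique_sorted)
print(gb_coords)
```

The awk one-liners are quicker for a first look at a file; the python version is easier to extend once you need to clean up the mess you find.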
head and tail
Head is a very simple command that shows the first few lines of a file (ten by default). So if you do
head lonicera_japonica.csv
you will get
Country,Latitude,Longitude
Japan,135.283,34.7667
Japan,135.317,34.8333
Japan,135.267,34.8667
Japan,134.533,34.6667
Japan,134.533,34.6667
Japan,135.15,34.9333
Japan,134.817,34.2333
Japan,134.7,34.2333
Japan,134.817,34.1667
You can give options like head -100 lonicera_japonica.csv which will give you the first 100 lines of the file. You can also use it with the pipe as above if you want to only look at the first few lines of some output from a command.
Tail is the opposite of head. You use it exactly the same way but it looks at the end of the file.
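If you need the same behavior inside a python script, one way to sketch it (Python 3 syntax) is with itertools.islice for the first lines and collections.deque for the last lines. The function names head and tail below are hypothetical helpers written for this example, not part of any library, and the sample rows are made up.

```python
import io
from collections import deque
from itertools import islice

def head(lines, n=10):
    """Return the first n lines without reading the rest of the input."""
    return list(islice(lines, n))

def tail(lines, n=10):
    """Return the last n lines; the deque never holds more than n lines."""
    return list(deque(lines, maxlen=n))

# 25 made-up lines standing in for a file
sample_text = "".join("row %d\n" % i for i in range(1, 26))

first_three = head(io.StringIO(sample_text), 3)
last_two = tail(io.StringIO(sample_text), 2)

print(first_three)
print(last_two)
```

Both helpers accept any iterable of lines, so you can pass an open file object in place of the StringIO used here.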
cat
Cat is short for concatenate. This simply concatenates files. You can use it like cat file1 file2 and that will print to the screen. So if you redirect you can put it to a file like cat file1 file2 > outfile . You can also use wildcards like cat * for all the files in a directory.
grep
Grep is a tool for searching text with regular expressions. I would recommend seeking out more information on regular expressions. But for a simple example, let’s print only the lines from lonicera_japonica.csv that contain GB
grep GB lonicera_japonica.csv
The matching lines will be printed, with the match highlighted if your grep is set up to use color. You could imagine piping these lines to awk or other tools. There are special codes you can use to do more sophisticated searches and I would encourage you to seek out more information.
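A rough python counterpart of this kind of search, in Python 3 syntax, uses re.search on each line. The rows are made up, and the GBR row is included on purpose: a plain search for GB matches anywhere in the line, so it also picks up GBR. Anchoring the pattern with ^ and including the comma restricts the match to lines whose first column is exactly GB.

```python
import re

# Made-up rows; note that GBR also contains the letters GB
text = (
    "Country,Latitude,Longitude\n"
    "GB,-1.5,52.0\n"
    "Japan,135.283,34.7667\n"
    "GBR,-2.0,53.5\n"
)
lines = text.splitlines()

# Matches GB anywhere in the line
matches = [l for l in lines if re.search(r"GB", l)]

# ^ anchors the match to the start of the line; the comma ends the first column
anchored = [l for l in lines if re.search(r"^GB,", l)]

print(matches)
print(anchored)
```

The anchored pattern is an ordinary regular expression, so it should behave the same way when given to grep.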
We have now gone over some basic information on data processing. We will be revisiting these tools, especially reading files, later, and you can always leave a comment or ask questions about things on which you want more information. From here, you can continue to the next section on Sequence manipulation.