Data processing

Every field is experiencing enormous growth in data collection and, therefore, in the need to process data.

In addition to the general subject of data processing covered below, this section contains chapters on more specific topics, such as sequence manipulation.

For the rest of this chapter, we are going to talk about the more general issue of data processing.

First, I would say that you should check out the python tutorial if you haven’t done that yet.

To start, we are going to go over the common steps that one might go through when processing any data. Often the task is to get some data that is in a text file to be better formatted for our standard input readers (e.g., tree readers, alignment readers, maybe something in R). Specifically, we are going to go over basic processing of delimited files and some regex.

We can’t go over everything about data processing here, but if you have a specific request, please leave it in a comment and either I or someone else here will address it.

Reading and writing files

We will need to read and write files all the time in python, and luckily it is extremely simple. Here is a simple example of opening a file, reading each line, and printing it to the screen. This is in the file reading_file.py in the dataprocessing directory:


import sys

infile = open(sys.argv[1], "r")
for i in infile:
    print(i)
infile.close()

The new bits here are open, which is the command to open a file. It takes the name of the file (which here is sys.argv[1], so a filename from the command line) and "r", which says to open the file for reading. Other options are "w" for writing and "a" for appending. Then there is infile.close(), which you need to call every time you open a file. If any other parts of the code look foreign (like the for loop), check out the python tutorial. Here is the output if I ran python reading_file.py reading_file.py

import sys



infile = open(sys.argv[1],"r")

for i in infile:

	print i

infile.close()

You will notice that the output has all these extra blank lines. That is because each line has an invisible character at the end that says ‘newline’ (and print adds one of its own). These newline characters can be different between Mac (especially OS9 programs), Linux, and Windows, and sometimes this will cause problems in the programs that you use. Anyway, we can get rid of these as we read in files if we do


import sys

infile = open(sys.argv[1], "r")
for i in infile:
    print(i.strip())
infile.close()

The i.strip() returns a string that has all the whitespace on the front and the back removed. It doesn’t actually change the string itself: if you used i again after that, it would still have the whitespace on the end. You could say j = i.strip(), and then i would have the whitespace and j would not.
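To see that strip() gives back a new string rather than changing the old one, here is a small standalone sketch (the sample line is made up, not from the real file):

```python
line = "  Japan,135.283,34.7667\n"  # made-up line with leading spaces and a trailing newline

stripped = line.strip()  # returns a new string; line itself is unchanged
print(repr(line))      # the original still has the whitespace
print(repr(stripped))  # the copy has it removed from both ends
```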

How do we write things to an outfile? We can do that just by opening a file with the "w" or "a" option, for writing or appending respectively. Here is an example in the file writing_file.py


outfile = open("test","w")
outfile.write("line 1")
outfile.write("line 2")
outfile.write("line 3")
outfile.close()

This will make a file called test that has the following contents

line 1line 2line 3

Unfortunately, everything is on the same line. As with the line endings that we discussed above, we have to add the invisible ‘newline’ character which for linux is just “\n”. That changes the code to


outfile = open("test","w")
outfile.write("line 1\n")
outfile.write("line 2\n")
outfile.write("line 3\n")
outfile.close()
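Another way to get the same file, if you already have the lines in a list, is to join them with the newline character before writing. This is just a sketch; the filename test2 is made up for illustration:

```python
lines = ["line 1", "line 2", "line 3"]

outfile = open("test2", "w")
outfile.write("\n".join(lines) + "\n")  # one newline between lines, plus one at the end
outfile.close()
```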

The last thing that is really useful that we will go over here is a command that you can use on strings in python called split. You would use it if you have a text file that has values separated by commas and you want to do something with each value. There is a file in the repository called lonicera_japonica.csv. This file includes country data and latitude and longitude values for Lonicera japonica from GBIF. We are going to split each line by commas and store the countries, latitudes, and longitudes. This is all done in the file split_file.py


import sys

infile = open(sys.argv[1], "r")
countries = []
lats = []
longs = []
for i in infile:
    i = i.strip()  # take off whitespace
    spls = i.split(",")  # split with comma
    countries.append(spls[0])
    lats.append(spls[1])
    longs.append(spls[2])
infile.close()
print("number of records: " + str(len(countries)))

You run this like this: python split_file.py lonicera_japonica.csv. The output of this is

number of records: 3467
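One thing to keep in mind: the values we stored with split are strings, so before calculating any maximums or minimums they need to be converted to numbers (and with the real file you would also want to skip the Country,Latitude,Longitude header line). Here is a minimal sketch with a few made-up latitude values:

```python
lats = ["34.7667", "34.8333", "-12.5", "40.1"]  # made-up latitude strings for illustration

lat_values = [float(x) for x in lats]  # convert the strings to numbers
print("min latitude: " + str(min(lat_values)))
print("max latitude: " + str(max(lat_values)))
```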

There are a bunch of cool things that we can do with this file, like calculating maximums, minimums, and unique values. With the file split_file2.py, I will demonstrate calculating and printing the unique country information


import sys

infile = open(sys.argv[1], "r")
countries = []
lats = []
longs = []
for i in infile:
    i = i.strip()  # take off whitespace
    spls = i.split(",")  # split with comma
    countries.append(spls[0])
    lats.append(spls[1])
    longs.append(spls[2])
infile.close()
uniq_countries = set(countries)  # a set only holds unique records
print("number of records: " + str(len(countries)))
print("number of unique records: " + str(len(uniq_countries)))
print("here are the first ten:")
count = 0
for i in uniq_countries:
    if count < 10:
        print(i)
    else:
        break
    count += 1

This produces the output

number of records: 3467
number of unique records: 61
here are the first ten:

Brazil
Madagascar
Caribbean
BR
USA
SOUTH AFRICA
Panama
JP
Costa Rica

That first blank is there because, obviously, some records don’t have a country, and so blank becomes a country. The examples presented here give a taste of some basic ways to process simple text files. We will go into specific ways to process text files like sequence files and tree files later.
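If you wanted to go a step further and see how many records each country has, a dictionary works well for counting. This sketch uses a short made-up list of countries instead of the real file (note the "" standing in for a record with no country):

```python
countries = ["Japan", "Japan", "USA", "", "Japan", "USA"]  # made-up example data

counts = {}
for c in countries:
    if c not in counts:
        counts[c] = 0  # first time we see this country
    counts[c] += 1
print(counts)
```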

Regular expressions

Have you ever wanted to take a file and replace all the tabs with spaces, or do something more complicated like reverse the genus and species names in a file? These kinds of things (and much more) can be done with regular expressions. The text editors that I suggested here all have the capability to use regular expressions; if you go to search and replace in the respective programs, you will see it as an option. I highly recommend checking out how to use them in your program of choice. Because these programs all handle regular expressions a little differently, I will let you find out how to do it in your favorite program. If you have specific questions, list them here.

You can also use regular expressions in python (check here), but I find that I can get most things done with split (see above).
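Just as a taste, here is the genus-and-species example from above done with python’s re module. The substitution uses capture groups, and the pattern assumes simple one-word names:

```python
import re

name = "Lonicera japonica"
# (\w+) captures a run of word characters; \2 \1 writes the two groups back in reverse order
reversed_name = re.sub(r"(\w+) (\w+)", r"\2 \1", name)
print(reversed_name)  # japonica Lonicera
```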

Command line tools

There are more than a few command line tools that can help you process data. A few that I use often are awk, head, tail, grep, and cat (also, check out sed). I will give a few examples for each of these and the general uses for each.

awk

awk is a great and simple language that can be used at the command line. Just to give you a simple example, here is a way to pull all the countries out of that lonicera_japonica.csv file using awk at the command line, without python (we will get to the unique ones in a moment). At the command line we would type


awk -F ',' '{print $1}' lonicera_japonica.csv

This is saying that ‘,’ will be the delimiter (you could change that to ‘\t’ to separate by tabs), and then in the second set of quotes we have the command, which here is just to print the first column. If you did $2 instead, it would print the latitude. You can do more complicated things, like print the latitude and longitude only if the country equals GB.


awk -F ',' '{if($1=="GB"){print $2","$3}}' lonicera_japonica.csv

What if you want to save the results to a file? Just redirect the output


awk -F ',' '{if($1=="GB"){print $2","$3}}' lonicera_japonica.csv > outfile

Now outfile has all the results. Here are three other tools: the pipe |, the sort command, and the uniq command. The | takes the output that would go to the screen and redirects it as input to another command-line tool. So let’s take the output from the first example with the countries, sort it, and then only print the unique values. We would do that like this


awk -F ',' '{print $1}' lonicera_japonica.csv | sort | uniq

The first few lines of that output are


1
Argentina
Australia
AUSTRALIA
BE
Bolivia
BR
BRA

You can see a bunch of mess in there, including a blank (which we saw in the python case), a 1 for who knows what reason, and some abbreviations. These are things we would have to fix, but you can see that we found them within just a few seconds of analyzing the file.

head and tail

Head is a very simple command that looks at the first few lines of a file. So if you do


head lonicera_japonica.csv 

you will get

Country,Latitude,Longitude
Japan,135.283,34.7667
Japan,135.317,34.8333
Japan,135.267,34.8667
Japan,134.533,34.6667
Japan,134.533,34.6667
Japan,135.15,34.9333
Japan,134.817,34.2333
Japan,134.7,34.2333
Japan,134.817,34.1667

You can give options like head -100 lonicera_japonica.csv which will give you the first 100 lines of the file. You can also use it with the pipe as above if you want to only look at the first few lines of some output from a command.

Tail is the opposite of head. You use it exactly the same way but it looks at the end of the file.

cat

Cat is short for concatenate. It simply concatenates files. You can use it like cat file1 file2, which will print to the screen, so if you redirect you can put the result in a file, like cat file1 file2 > outfile. You can also use wildcards, like cat * for all the files in a directory.

grep

Grep is a regular expression search tool. I would recommend seeking out more information on regular expressions, but for a simple example, let’s only print the lines from lonicera_japonica.csv that contain GB


grep GB lonicera_japonica.csv

The matching lines will be printed and highlighted. You could imagine piping these lines to awk or other tools. There are special codes you can use to do more sophisticated searches, and I would encourage you to seek out more information.

At this point, we have gone over some basic information on data processing. We will be revisiting these tools, especially reading files, later, and you can always leave a comment or ask questions about things on which you want more information. From here, you can continue to the next section on Sequence manipulation.
