Sequence manipulation

Next-generation sequencing has revolutionized the biological sciences. Where once the graphical manipulation of individual sequences was possible, now with the generation of millions of reads that cover vast amounts of the genome or transcriptome automated and algorithmic manipulation and processing is required. Because we will be considering molecular sequences for most of the phylogenetic analyses, we will use sequence manipulation in python as a way to introduce basic interactions with sequences.

Often sequence files are in the fasta or the fastq format. I should mention that the biopython package has many sequence readers and manipulation tools. However, using that may be convenient but it doesn’t help us learn how to do it. So we are going to look at our own sequence manipulation methods and implementations so we can solve our problems or do custom manipulations.

Sequences

If we had a file that was formatted like this:

>sequence1
ACGTGTGGGGTGTGTGTGACCACGTGTACACA
>sequence2
AAAACTTTCTCTATACTACATTGTGGAG

We understand that there are two sequences in that file. Each sequence has a title (or label or name or whatever you want to call it) and each has the actual sequence string. What we want to do is be able to read this file and manipulate the sequences (like reverse complement, remove gaps). So, in python it would be good to have a class that can keep the label and the sequence string for each sequence so that we can access them. If you need a little revision of this stuff, check out the python section.


class Sequence:
    def __init__(self,name="",seq=""):
        self.name = name
        self.seq = seq
1
2
3
4
class Sequence:
    def __init__(self,name="",seq=""):
        self.name = name
        self.seq = seq

This is an extremely small and stripped down sequence class that does exactly what we stated: allows for the storage of names and sequence strings. It works and you can save it in a file called sequence.py.

Reading sequences

It is a little hard to see how the above bit is used without actually reading the sequences and showing how we use it. So let’s write a stripped down but functional reader. This will be in a file called seq_reader.py.


import sys,os
from sequence import Sequence

def read_fasta_file(infilename):
    infile = open(infilename,"r")
    seqlist = []
    curseqname = ""
    curseq = ""
    first = True
    for i in infile:
        i = i.strip()
        if len(i) == 0:
            continue
        if i[0] == ">":
            if first == True:
                first = False
            else:# if not the first, store the last seq
                tseq = Sequence()
                tseq.name = curseqname 
                tseq.seq = curseq
                seqlist.append(tseq)
            curseqname = i[1:]
            curseq = ""
        else: #seq can be on multiple lines
            curseq += i
    #need to get the last sequence stored
    tseq = Sequence()
    tseq.name = curseqname 
    tseq.seq = curseq
    seqlist.append(tseq)
    infile.close()
    return seqlist
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
import sys,os
from sequence import Sequence

def read_fasta_file(infilename):
    infile = open(infilename,"r")
    seqlist = []
    curseqname = ""
    curseq = ""
    first = True
    for i in infile:
        i = i.strip()
        if len(i) == 0:
            continue
        if i[0] == ">":
            if first == True:
                first = False
            else:# if not the first, store the last seq
                tseq = Sequence()
                tseq.name = curseqname 
                tseq.seq = curseq
                seqlist.append(tseq)
            curseqname = i[1:]
            curseq = ""
        else: #seq can be on multiple lines
            curseq += i
    #need to get the last sequence stored
    tseq = Sequence()
    tseq.name = curseqname 
    tseq.seq = curseq
    seqlist.append(tseq)
    infile.close()
    return seqlist

There are many ways to implement this. This is just one and is a simple one. Ideally, we would probably organize this a bit more to be object oriented (see the section about this in both python and scientific programming). If you saved the sequences from above into a file called test.fa, you can add something to the bottom of the seq_reader.py that looks like this


if __name__ == "__main__":
    seqs = read_fasta_file("test.fa")
    for i in seqs:
        print i.name,i.seq
1
2
3
4
if __name__ == "__main__":
    seqs = read_fasta_file("test.fa")
    for i in seqs:
        print i.name,i.seq

then you can just run python seq_reader.py and the output should be

sequence1 ACGTGTGGGGTGTGTGTGACCACGTGTACACA
sequence2 AAAACTTTCTCTATACTACATTGTGGAG

Next, let’s look at reading in fastq format.

Fastq is a compact format that includes quality in the file. It is a common format for next generation sequences. In order to store that in our sequence object, we need to add something about quality to our sequence class. Here is our updated sequence.py.


class Sequence:
	def __init__(self,name="",seq=""):
		self.name = name
		self.seq = seq
		self.qualstr = ""
		self.qualarr = []
1
2
3
4
5
6
class Sequence:
	def __init__(self,name="",seq=""):
		self.name = name
		self.seq = seq
		self.qualstr = ""
		self.qualarr = []

You can see that we added a couple things: self.qualstr and self.qualarr. The string is because quality is often stored as a string of ASCII characters but they stand for values. To see what I mean, take a look at some fastq sequences


@sequence1_q
NACAACAAAAACTTTATTGAAATAGTTGCATGAAATTTCAGACGTCAAAATAGTCATCTGAAAATTTACTCTTGATCCATTTAAAGCTGTTCCTTATAGAA
+
#1=DFFFFHHHHHJJJJJJHHIIIJIJJJJJJJJJJJJHHIIJJHIJIJJIJJGIIIJJJHIIJJJJJJJJJJJIEEHHGHHFEEFFFDEEEEEEDDDDDD
@sequence2_q
ACTCATTCAACGGTCCGAGCGGCAAGGTTGATGAGTTTGGATCATCAGGTAATCCAGAGATGAAGAGGAAGAAGAGAGTTGCTTCTTATAACATGTATGCT
+
@CCFFFFFHHHHHJJJJIJJJJIIJJJDGIDHHDHGHIIJ@GEHIGGHIEHHHHHHFFFBDDCEEECD=?A=ACCBC?55>::A>CCAC(::>::@A@A##
1
2
3
4
5
6
7
8
@sequence1_q
NACAACAAAAACTTTATTGAAATAGTTGCATGAAATTTCAGACGTCAAAATAGTCATCTGAAAATTTACTCTTGATCCATTTAAAGCTGTTCCTTATAGAA
+
#1=DFFFFHHHHHJJJJJJHHIIIJIJJJJJJJJJJJJHHIIJJHIJIJJIJJGIIIJJJHIIJJJJJJJJJJJIEEHHGHHFEEFFFDEEEEEEDDDDDD
@sequence2_q
ACTCATTCAACGGTCCGAGCGGCAAGGTTGATGAGTTTGGATCATCAGGTAATCCAGAGATGAAGAGGAAGAAGAGAGTTGCTTCTTATAACATGTATGCT
+
@CCFFFFFHHHHHJJJJIJJJJIIJJJDGIDHHDHGHIIJ@GEHIGGHIEHHHHHHFFFBDDCEEECD=?A=ACCBC?55>::A>CCAC(::>::@A@A##

The string after the + is the quality string. Each ASCII character represents a quality score. There are different quality offsets (see wikipedia). We are going to assume that we are looking at the illumina format and so the offset would be 33 and includes !"#$%&'()*+,-./0123456789:;?@ABCDEFGHIJ to represent quality of 0-41. So we might want to store both the array of quality scores as integers and we might want to store the string itself for easy printing and writing to files. We can write a couple small procedures so that depending on what we give the Sequence object, we can calculate either the respective string or array. Here is what our sequence.py looks like now:


import sys,os

class Sequence:
	def __init__(self,name="",seq=""):
		self.name = name
		self.seq = seq
		self.qualstr = ""
		self.qualarr = []
	
	def set_qualstr(self,qual):
		self.qualstr = qual
		if len(self.qualarr) == 0:
			for j in self.qualstr:
				self.qualarr.append(ord(j))
	
	def set_qualarr(self,qual):
		self.qualarr = qual
		if len(self.qualstr) == 0:
			for j in self.qualarr:
				self.qualstr += chr(j)
	
	def get_fasta(self):
		retstr = ">"
		retstr += self.name
		retstr += "\n"
		retstr += self.seq
		return retstr
	
	def get_fastq(self):
		retstr = "@"
		retstr += self.name
		retstr += "\n"
		retstr += self.seq
		retstr += "\n+\n"
		retstr += self.qualstr
		return retstr
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
import sys,os

class Sequence:
	def __init__(self,name="",seq=""):
		self.name = name
		self.seq = seq
		self.qualstr = ""
		self.qualarr = []
	
	def set_qualstr(self,qual):
		self.qualstr = qual
		if len(self.qualarr) == 0:
			for j in self.qualstr:
				self.qualarr.append(ord(j))
	
	def set_qualarr(self,qual):
		self.qualarr = qual
		if len(self.qualstr) == 0:
			for j in self.qualarr:
				self.qualstr += chr(j)
	
	def get_fasta(self):
		retstr = ">"
		retstr += self.name
		retstr += "\n"
		retstr += self.seq
		return retstr
	
	def get_fastq(self):
		retstr = "@"
		retstr += self.name
		retstr += "\n"
		retstr += self.seq
		retstr += "\n+\n"
		retstr += self.qualstr
		return retstr

Now that we have expanded our sequence class, we probably want to write another procedure like our read_fasta for reading fastq sequences. This one can actually be a little shorter because fastq files have a more strict format where each eight lines has all the sequence information. There isn’t any white space that we have to worry about or anything like that. Here is what this procedure would look like (and it is in the seq_reader.py file).


import sys,os
from sequence import Sequence

def read_fastq_file(infilename):
	infile = open(infilename,"r")
	seqlist = []
	i = infile.readline()
	while len(i) > 0: #no EOF in readline
		if i[0] == "@":
			name = i[1:].strip()
			seq = infile.readline().strip()
			i = infile.readline().strip()
			qual = infile.readline().strip()
			tseq = Sequence(name=name,seq=seq)
			tseq.set_qualstr(qual)
			seqlist.append(tseq)
		i = infile.readline()
	infile.close()
	return seqlist
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
import sys,os
from sequence import Sequence

def read_fastq_file(infilename):
	infile = open(infilename,"r")
	seqlist = []
	i = infile.readline()
	while len(i) > 0: #no EOF in readline
		if i[0] == "@":
			name = i[1:].strip()
			seq = infile.readline().strip()
			i = infile.readline().strip()
			qual = infile.readline().strip()
			tseq = Sequence(name=name,seq=seq)
			tseq.set_qualstr(qual)
			seqlist.append(tseq)
		i = infile.readline()
	infile.close()
	return seqlist

If you compare this one to the read_fasta_file procedure you will see that we read the file a little differently and basically just read four lines, make our sequence and then if the next line still has some length (meaning that there is still something in the file) we do it again.

To demonstrate how this all works we can add a if __name__ == "__main__": to our seq_reader.py file and read in the two test files in the directory. That would go to the bottom of the file and look like this


if __name__ == "__main__":
	seqs = read_fasta_file("test.fasta")
	for i in seqs:
		print i.get_fasta()
	seqs = read_fastq_file("test.fastq")
	for i in seqs:
		print i.get_fastq()
1
2
3
4
5
6
7
if __name__ == "__main__":
	seqs = read_fasta_file("test.fasta")
	for i in seqs:
		print i.get_fasta()
	seqs = read_fastq_file("test.fastq")
	for i in seqs:
		print i.get_fastq()

Now we have some good ways to read in two common sequence formats. It is time to do something with the sequences that we can now get easily stored. There are any number of procedures that we could write to help manipulate these sequences (reverse complement, trim). If you have some specific interests, try out some on your own or mention it in the comments. We could write procedures to combine or compare multiple sequences as well. And that is what we are going to do next in sequence alignment.

Leave a Reply