All by Ourselves: Mapping Short Reads to a Reference Genome


A principal goal of our mycobacterium project is to identify the polymorphisms in the sample genomes by comparing them to the well-studied reference strain. The first step in this analysis is mapping the individual reads to the reference genome. As we'll see, this is a task demanding of both CPU power and memory, and the simplest solution is intractable on our puny little laptops. However, we can think about ways to store the data more intelligently, process only the necessary pieces of the data, and thereby increase our computational leverage.

Ultimately, though, the sequences of greatest interest are not the reads we'll map to the genome this morning, but rather the ones that do not map (i.e., the reads that carry the polymorphisms in the mycobacterial genomes). This afternoon, we will use an optimized algorithm implemented in a program called Bowtie, which not only maps matching reads more efficiently, but also extends the approach to identify reads representing polymorphisms.

First Attempt


As with all programs, we first need to define our task and lay out the battle plan. Our input data are 21 mycobacterial genomes worth of sequence reads, which come in the FASTQ file format, and the reference Mycobacterium tuberculosis genome, which comes in FASTA format. The output we wish to generate by the end of the morning is a collection of all the reads that match to the reference genome, as well as some counting statistics on how successful we are in placing the reads.

Now, if we were to really map all 21 genome samples, we'd be talking about more than 20GB of data, which is a very realistic amount of data to be confronted with, but not tractable for our pedagogical purposes. So, let's start by trying to map one of the samples, namely the one in this file: SRR006916.fastq.gz.

And our reference genome is contained in the file: mtbReferenceGenome.FASTA.

Getting Started

As usual, we'll start by assembling our toolbox of functions:


first_attempt.py
#!/usr/bin/python
from os import chdir
from sys import exit
from file_tools import open_file_by_mimetype
from sequencetools import read_fasta, rev_comp
from map_reads import map_to_reference_slow     # note that this function doesn't exist yet
 
And then we'll want to set up our variables for managing data and files.

But first, let's consider what each file's contents should look like when we get them open. Recall that a FASTA file should have the names of sequences, followed in each case by the actual sequence.


$ head -n3 mtbReferenceGenome.FASTA
>supercontnull.1 of Mycobacterium tuberculosis H37Rv
TTGACCGATGACCCCGGTTCAGGCTTCACCACAGTGTGGAACGCGGTCGTCTCCGAACTT
AACGGCGACCCTAAGGTTGACGACGGACCCAGCAGTGATGCTAATCTCAGCGCTCCGCTG



About the FASTQ format, q-scores, and how we will use them


According to the wikipedia page for the FASTQ format, we can expect it to look like this:

@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65

The first line starts with an '@' and then identifies the sequence on the second line.
The second line is the sequence read in 5' -> 3' orientation.
The third line begins with a '+' and is an optional description of the sequence, in this case omitted.
The fourth line is an encoded line representing the quality score for each base in the read.

FASTQ quality scores reflect the confidence the Illumina GA II Sequencer has that the base reported at a given position is the correct base. Generally, the q-scores decrease toward the end of each read as the dynamics of the sequencing reaction become less stable. Each q-score is encoded as a single symbol on the fourth line in the example above. As we discussed in the first week, characters are stored as ASCII values on a scale from 0-255, so one character can stand in for each score. Thus these symbols are a convenient way of storing the q-scores (imagine a tab-separated list of integers: ick!).
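To make the encoding concrete, here's a quick round trip between a quality string and its numerical scores using the built-in ord() and chr() functions (the string here is made up, not from our data):

qualstring = 'IIIIIIIHHGF@@!'                    # an invented quality string, one character per base
scores = [ord(c) for c in qualstring]            # decode each character to its integer score
print scores
# [73, 73, 73, 73, 73, 73, 73, 72, 72, 71, 70, 64, 64, 33]
print ''.join([chr(s) for s in scores])          # and re-encode the integers as characters
# IIIIIIIHHGF@@!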

The q-scores are used to discard the reads in the dataset that do not meet certain criteria. In our case, we're going to eliminate reads with any of these traits:

1) If the q-score is effectively zero, the base at that position is simply labeled 'N.' If the read contains any 'N's, we're tossing it.

2) If the q-score at any point is below a given threshold (this will be variable), we toss the read.

3) If the average q-score for the entire read is below a given threshold, we toss the read.

Typical Illumina GA II data are 36-bp reads, yet many of our reads are longer. Since we have a genome with roughly 4.5M bases, and a random 11-mer matches any given position with probability (1/4)^11, we expect a particular 11-mer to occur by chance only about once every 4.2M positions (4^11 ≈ 4.2 million). So, maybe we could get away with using only the first 11bp of the read (also usually the highest-quality 11bp)? But then again, genomes do not contain random distributions of nucleotides, nor of 11-mers, so it's probably safer to use something longer.

How long? Some of the data are 51bp reads, which should be fantastically unique, barring conserved duplications. We can set this as a variable, but 20bp is probably a good starting guess.
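As a rough sanity check on that reasoning (treating the genome as a random string of nucleotides, which it certainly isn't), we can estimate how many purely chance matches we'd expect for an 11-mer seed versus a 20-mer seed:

genome_len = 4.5e6                       # approximate size of the M. tuberculosis genome

for k in (11, 20):
    p = 0.25 ** k                        # chance a random k-mer matches one particular position
    expected = 2 * genome_len * p        # expected chance matches over both strands
    print k, p, expected
# for k = 11 we expect about 2 chance matches genome-wide;
# for k = 20, about 8e-06 -- effectively none.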


Handling the Data


For the purposes of testing our code, we'll use a truncated version of a FASTQ file, with the first 100,000 lines (25,000 reads), which you can all download here:



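(If you'd rather build the truncated file yourself from the full SRR006916.fastq.gz, a sketch along these lines should do it; the filenames here are just placeholders for wherever you keep your copies.)

import gzip

# copy the first 100,000 lines (25,000 four-line records) into a smaller gzipped file
infh = gzip.open('SRR006916.fastq.gz', 'rb')
outfh = gzip.open('SRR006916_sample.fastq.gz', 'wb')
for i in range(100000):
    line = infh.readline()
    if not line:          # stop early if the source file is shorter than expected
        break
    outfh.write(line)
infh.close()
outfh.close()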
So, let's add the code to open these files:
# open the FASTQ file for processing
fastqfn = '/Users/matt/PythonCourse/S7.1/SRR006916_sample.fastq.gz'
fastqfh = open_file_by_mimetype(fastqfn, 'r')
 
# open the reference FASTA file for processing
refgenomefn = '/Users/matt/PythonCourse/data/mtbReferenceGenome.FASTA'
refgenome = read_fasta(refgenomefn)
 
# now, recall that the read_fasta function returns a dictionary of chromosomes,
# but in this case, we've only got one to concern ourselves with (thanks bacteria!)
# so let's go ahead and make things easier by saving each strand to a string
 
plusstrand = refgenome[refgenome.keys()[0]]
 
# and we can use our rev_comp function to get the minus strand
minusstrand = rev_comp(plusstrand)
 
In the case of the genome file we just opened, we are accessing about 4.5MB of data, which is trivial for us to load into memory as two strands of DNA. However, the FASTQ file in question is a roaring 3.2GB of data once it's opened, and even with high-memory servers (which our laptops are not), one must be careful about loading 3GB of data at a time.

Thus, it behooves us to implement a strategy that handles less-than-all of the data in the FASTQ file at one time. Since the fundamental unit of information in the FASTQ file is the sequence read, let's approach the problem by reading one read worth of data at a time. And since we're hoping to count all of the reads and how many times they fall on each location of the genome, we'll need a data structure to keep track of things.

# for each read worth of data, we'll keep track of each of the four lines
# of data associated with the read as a list
seq = [ [], [], [], [] ]
 
# to store the location of each read, we'll have a dictionary keyed by
# read with the values of each index it maps to.
readmap = {}
 
# A rare opportunity for a while loop:  we want to read not one line at a time, which fh.readline() would give
# us.  We also can't afford the memory to read all the lines at once as fh.readlines() would do.  We want
# 4 lines at a time, which we will then check for quality, and map to the reference sequence.
 
# we start with a while loop that never evaluates to false
while (2 + 2) != 5:
    # then we add a for loop to take four lines at a time
    for i in range(4):
        seq[i] = fastqfh.readline().strip()
    if '' == seq[0]:
        # if we run out of lines in the filehandle, it will keep returning us blank lines;
        # we can use this to exit the loop we're otherwise stuck in.
        print "EXITING READFILE AT LINE NUMBER:", fastqfh.tell()
        break
    # now that we have the data for a read, let's call the function to check the read by qscores;
    # we'll code this to return False if the read is discarded, allowing the if statement to pass
    # the read on to be mapped if it's not discarded
    checkedseq = check_seq_by_qscore(seq)
    if checkedseq:
        # when we code the next function, we'll want it to consider the sequence of the read,
        # both strands of reference and we'll want the function to add the read mapping to our
        # dictionary of readmaps
        map_to_reference_slow(checkedseq[1], plusstrand, minusstrand, readmap)
 
# and we're now done with our 3GB data file so let's close the filehandle
fastqfh.close()
 
At this point, we've got a basic control structure setup to handle our data; we haven't yet thought about how we'd like the data output, and more importantly, we haven't figured out how to actually check the reads nor map them. So, let's take care of the two processing functions first.

Processing Data: Checking and Mapping the Reads


Our strategy to conserve memory usage is to check and map the reads one at a time, so let's begin with the checking (no need to map it if it's trash!).

We'll need a function that uses the read data to decide whether it is of reasonable quality or not, and then return either "yes, this is a good read," or, "no, toss it."
def check_seq_by_qscore(seq, meancutoff = 40, mincutoff = 25, minseqlength = 20):
    '''Examine a sequence from a FASTQ file and trim away poor quality bases, or discard the entire read.
 
    Accepts a sequence, which is a list of four lines from a FASTQ file:
    @SEQ_ID
    READ SEQUENCE
    +SEQUENCE DESCRIPTION
    QSCORES
 
    see:  http://en.wikipedia.org/wiki/FASTQ_format for formatting details.
 
    meancutoff is a threshold for the average q-score in the read.
    mincutoff is the minimum q-score any given NT is allowed to have.
    minseqlength is the shortest the read is allowed to become during trimming.
 
    Returns the checked (and possibly trimmed) sequence, or False for a rejected read.
    '''
 
    seqID = seq[0]
    read = seq[1]
    seqDesc = seq[2]
    qscores = seq[3]
 
    # if there's an "N", toss the read
    if 'N' in read:
        return False
 
    # evaluate the qscore values; start with converting the qscores to a list of their ASCII values for the read
    # the built-in function ord() is used for this conversion, which we'll do with list comprehension:
    qscores = [ord(qscore) for qscore in qscores]
 
    # if the mean qscore is below the cutoff threshold, then toss the read
    ### okay, so, there's no function called calculate_mean.  We'll see tomorrow that there are vast numerical
    ### libraries for Python, but for today, we'll write this call in, knowing that in a moment, we'll have to
    ### write this function too.
    if calculate_mean(qscores) < meancutoff:
        return False
 
    # if there are any NTs less than the mincutoff, make note of their positions in the read sequence
    indices = []
    for idx in enumerate(qscores):
        if idx[1] < mincutoff:
            indices.append(idx[0])
    # if any bad position falls before minseqlength, then toss the read
    # otherwise, return the sequence with the first bad position and all subsequent positions trimmed
    # or, if there are no bad positions, return the full sequence
 
    # we'll use a list comprehension here to ask whether any bad index falls
    # before the minseqlength; if none does, we keep going.
    if True in ([ i < minseqlength for i in indices ]):
        return False
    elif indices == []:
        # this is the case where none of the positions had a bad q-score.
        # we have to convert the qscores back from integers to characters before we send them back
        qscores = ''.join([ chr(i) for i in qscores ])
        return [ seqID, read, seqDesc, qscores ]
    else:
        # and here's what happens if we trimmed the read down to keep only the good positions
        end = min(indices)
        qscores = ''.join([ chr(i) for i in qscores ])
        return [ seqID, read[:end], seqDesc, qscores[:end] ]
So, let us not forget to make good on that promise to write a calculate_mean() function:
def calculate_mean(data):
    '''Accepts a list of numerical data.
 
    Returns the mean of the numbers in data, as a float.'''
 
    # It turns out that this is a performance optimization as well, owing to the large overhead
    # when you import the mean function from NumPy just for one measly function.
    # But, we're using it because we haven't learned how to use NumPy yet.
 
    # use list comprehension to convert all the datapoints to a floating point; then divide their sum
    # by their length, and return the value.
    return sum( [float(i) for i in data] ) / len(data)
Okay, we can now ensure that every read will be checked, and then if it's not False, we'll pass it on to be mapped.
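As a quick sanity check, here's roughly how the function behaves on a couple of invented records (toy examples, not reads from our data):

# a clean 20-bp record passes every check and comes back unchanged
good = ['@toy_read_1', 'GATCGATCGATCGATCGATC', '+', 'IIIIIIIIIIIIIIIIIIII']
print check_seq_by_qscore(good)
# ['@toy_read_1', 'GATCGATCGATCGATCGATC', '+', 'IIIIIIIIIIIIIIIIIIII']

# a record containing an 'N' anywhere is rejected outright
bad = ['@toy_read_2', 'GATCGATCGATNGATCGATC', '+', 'IIIIIIIIIIIIIIIIIIII']
print check_seq_by_qscore(bad)
# False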

Deep breath, and let's consider how we're going to approach this task:

The reads should be able to map to either strand. So, in addition to accepting the read, this function needs access to the plusstrand and minusstrand strings we created earlier. In addition, since we're doing this one read at a time, it's a good idea to just pass along the readmap dictionary so that the function can modify the map as it goes.

While we're at it, we need to consider that though the read itself is only approximately 36bp, it comes from a longer DNA fragment. So, once we know what strand we're mapping the read to, we should extend it by the length of the average insert (there's no way to know exactly how long each fragment is, unless we have paired-end data, which in this case, we don't).
def map_to_reference_slow(read, plus, minus, readmap):
    '''Accepts a read from single-end Illumina data, as well as two strings representing the plus and minus
    strands of the reference DNA, and a dictionary to store the readmap data.  This dictionary should have
    the following form after the first call of this function:
 
 
    readmap = { read: [ {}, {} ] }
 
 
    The first-level keys are the strings of each read.  The value for each read is a list of two
    dictionaries, one for the plus strand and one for the minus strand.  Each dictionary is keyed by
    genome index, with each value counting the number of times a read "falls" across that index.
 
 
    Returns nothing if successful; the readmap is modified in place.
    '''
 
    from random import choice
    # choice() picks an item randomly from a list
 
 
    # we'll assume that our insert size is about 100nt, even though in reality it's probably longer.
    # 100nt will be a faster assumption than 250, which is probably more accurate.
    insertlen = 100
 
 
    # for accounting purposes later, we'll want to know the length of the strands
    strandlen = len(plus)
 
 
    # first dictionary is plus, second is minus; only create the entry the first time we see this read
    if read not in readmap:
        readmap[read] = [ {}, {} ]
 
 
    # to find the map position, we're going to have to scan down the length of the genome, recording
    # not just the first, but every incidence of a match.  Once we figure out where they all
    # could be, we'll pick one at random and extend it.
 
 
    # The following code is quite sloppy, as we consider each strand separately, then combine them
    # before we choose a single index to map to.  This could be done in a nested loop where we
    # iterate over the list of the strands (plus, minus), but since we've used a list of dictionaries
    # above, it's easier to hard-code the [0] for plus and [1] for minus instead of using conditions
    # inside the hypothetical loop.
 
 
    # idx will start at the first nt of the strand, and we'll use lastidx to keep track of where the
    # last match was each time through the while loop.  The plusidxlist is to keep track of all matches.
 
 
    # So, here's the plus strand:
 
    idx = 0
    lastidx = 0
    plusidxlist = []
 
 
    # the loop will not break until we're done finding matches
    while 2 != 1:
        idx = plus[lastidx:].find(read)
        # recall that find returns -1 if there's no match
        if idx == -1:
            idx = 0
            break
        # gotta add the lastidx each time
        plusidxlist.append(lastidx + idx)
        # and now take the position we just appended and increment by one more
        lastidx = lastidx + idx + 1
 
 
    # Now look at the minus strand:
    idx = 0
    lastidx = 0
    minusidxlist = []
    while 2 != 1:
        idx = minus[lastidx:].find(read)
        if idx == -1:
            idx = 0
            break
        # except, we're going to store the minus strand as a negative value, so
        # that we can compare to the plus strand
        minusidxlist.append( -( strandlen - (lastidx + idx) ) )
        lastidx = lastidx + idx + 1
 
 
 
 
    # Now we have two lists of matches for each read, and we need to choose one index
    # at random for the actual mapping.  The mappings could be on either, neither,
    # or both strands.  We must consider each case:
 
 
    # check for the 'neither' case; if there's at least one match, pick one of the indices
    # from all the possibilities
    if plusidxlist + minusidxlist != []:
        idx = choice(plusidxlist + minusidxlist)
    else:
        return
 
 
    if idx > 0:
        for i in range(insertlen):
            if (idx + i) not in readmap[read][0].keys():
                readmap[read][0][(idx + i)] = 1
            else:
                readmap[read][0][(idx + i)] += 1
    elif idx < 0:
        for i in range(insertlen):
            if (idx + i) not in readmap[read][1].keys():
                readmap[read][1][(idx + i)] = 1
            else:
                readmap[read][1][(idx + i)] += 1
 
 
    # note there's a special case here, where idx = 0: since the sign of idx encodes the strand, a match
    # at the very first plus-strand position is ambiguous.  I'm happy just tossing this one too, instead
    # of worrying about a single position of data.
    else:
        return
 
 

Updated first_attempt.py


#!/usr/bin/python
from os import chdir
from sys import exit
from file_tools import open_file_by_mimetype
from sequencetools import read_fasta, rev_comp
 
def check_seq_by_qscore(seq, meancutoff = 40, mincutoff = 25, minseqlength = 20):
    '''Examine a sequence from a FASTQ file and trim away poor quality bases, or discard the entire read.
 
    Accepts a sequence, which is a list of four lines from a FASTQ file:
    @SEQ_ID
    READ SEQUENCE
    +SEQUENCE DESCRIPTION
    QSCORES
 
    see:  http://en.wikipedia.org/wiki/FASTQ_format for formatting details.
 
    meancutoff is a threshold for the average q-score in the read.
    mincutoff is the minimum q-score any given NT is allowed to have.
    minseqlength is the shortest the read is allowed to become during trimming.
 
    Returns the checked (and possibly trimmed) sequence, or False for a rejected read.
    '''
 
    seqID = seq[0]
    read = seq[1]
    seqDesc = seq[2]
    qscores = seq[3]
 
    # if there's an "N", toss the read
    if 'N' in read:
        return False
 
    # evaluate the qscore values; start with converting the qscores to a list of their ASCII values for the read
    # the built-in function ord() is used for this conversion, which we'll do with list comprehension:
    qscores = [ord(qscore) for qscore in qscores]
 
    # if the mean qscore is below the cutoff threshold, then toss the read
    ### okay, so, there's no function called calculate_mean.  We'll see tomorrow that there are vast numerical
    ### libraries for Python, but for today, we'll write this call in, knowing that in a moment, we'll have to
    ### write this function too.
    if calculate_mean(qscores) < meancutoff:
        return False
 
    # if there are any NTs less than the mincutoff, make note of their positions in the read sequence
    indices = []
    for idx in enumerate(qscores):
        if idx[1] < mincutoff:
            indices.append(idx[0])
    # if any bad position falls before minseqlength, then toss the read
    # otherwise, return the sequence with the first bad position and all subsequent positions trimmed
    # or, if there are no bad positions, return the full sequence
 
    # we'll use a list comprehension here to ask whether any bad index falls
    # before the minseqlength; if none does, we keep going.
    if True in ([ i < minseqlength for i in indices ]):
        return False
    elif indices == []:
        # this is the case where none of the positions had a bad q-score.
        # we have to convert the qscores back from integers to characters before we send them back
        qscores = ''.join([ chr(i) for i in qscores ])
        return [ seqID, read, seqDesc, qscores ]
    else:
        # and here's what happens if we trimmed the read down to keep only the good positions
        end = min(indices)
        qscores = ''.join([ chr(i) for i in qscores ])
        return [ seqID, read[:end], seqDesc, qscores[:end] ]
 
def calculate_mean(data):
    '''Accepts a list of numerical data.
 
    Returns the mean of the numbers in data, as a float.'''
 
    # It turns out that this is a performance optimization as well, owing to the large overhead
    # when you import the mean function from NumPy just for one measly function.
    # But, we're using it because we haven't learned how to use NumPy yet.
 
    # use list comprehension to convert all the datapoints to a floating point; then divide their sum
    # by their length, and return the value.
    return sum( [float(i) for i in data] ) / len(data)
 
def map_to_reference_slow(read, plus, minus, readmap):
    '''Accepts a read from single-end Illumina data, as well as two strings representing the plus and minus
    strands of the reference DNA, and a dictionary to store the readmap data.  This dictionary should have
    the following form after the first call of this function:
 
 
    readmap = { read: [ {}, {} ] }
 
 
    The first-level keys are the strings of each read.  The value for each read is a list of two
    dictionaries, one for the plus strand and one for the minus strand.  Each dictionary is keyed by
    genome index, with each value counting the number of times a read "falls" across that index.
 
 
    Returns nothing if successful; the readmap is modified in place.
    '''
 
    from random import choice
    # choice() picks an item randomly from a list
 
 
    # we'll assume that our insert size is about 100nt, even though in reality it's probably longer.
    # 100nt will be a faster assumption than 250, which is probably more accurate.
    insertlen = 100
 
 
    # for accounting purposes later, we'll want to know the length of the strands
    strandlen = len(plus)
 
 
    # first is plus, second is minus.
    if read not in readmap:
        readmap[read] = [ {}, {} ]
 
 
    # to find the map position, we're going to have to scan down the length of the genome, recording
    # not just the first, but every incidence of a match.  Once we figure out where they all
    # could be, we'll pick one at random and extend it.
 
 
    # The following code is quite sloppy, as we consider each strand separately, then combine them
    # before we choose a single index to map to.  This could be done in a nested loop where we
    # iterate over the list of the strands (plus, minus), but since we've used a list of dictionaries
    # above, it's easier to hard-code the [0] for plus and [1] for minus instead of using conditions
    # inside the hypothetical loop.
 
 
    # idx will start at the first nt of the strand, and we'll use lastidx to keep track of where the
    # last match was each time through the while loop.  The plusidxlist is to keep track of all matches.
 
 
    # So, here's the plus strand:
 
    idx = 0
    lastidx = 0
    plusidxlist = []
 
 
    # the loop will not break until we're done finding matches
    while 2 != 1:
        idx = plus[lastidx:].find(read)
        # recall that find returns -1 if there's no match
        if idx == -1:
            idx = 0
            break
        # gotta add the lastidx each time
        plusidxlist.append(lastidx + idx)
        # and now take the position we just appended and increment by one more
        lastidx = lastidx + idx + 1
 
 
    # Now look at the minus strand:
    idx = 0
    lastidx = 0
    minusidxlist = []
    while 2 != 1:
        idx = minus[lastidx:].find(read)
        if idx == -1:
            idx = 0
            break
        # except, we're going to store the minus strand as a negative value, so
        # that we can compare to the plus strand
        minusidxlist.append( -( strandlen - (lastidx + idx) ) )
        lastidx = lastidx + idx + 1
 
 
 
 
    # Now we have two lists of matches for each read, and we need to choose one index
    # at random for the actual mapping.  The mappings could be on either, neither,
    # or both strands.  We must consider each case:
 
 
    # check for the 'neither' case; if there's at least one match, pick one of the indices
    # from all the possibilities
    if (plusidxlist + minusidxlist) != []:
        idx = choice(plusidxlist + minusidxlist)
    else:
        return
 
 
    if idx > 0:
        for i in range(insertlen):
            if (idx + i) not in readmap[read][0].keys():
                readmap[read][0][(idx + i)] = 1
            else:
                readmap[read][0][(idx + i)] += 1
    elif idx < 0:
        for i in range(insertlen):
            if (idx + i) not in readmap[read][1].keys():
                readmap[read][1][(idx + i)] = 1
            else:
                readmap[read][1][(idx + i)] += 1
 
 
    # note there's a special case here, where idx = 0: since the sign of idx encodes the strand, a match
    # at the very first plus-strand position is ambiguous.  I'm happy just tossing this one too, instead
    # of worrying about a single position of data.
    else:
        return
 
 
 
 
## Begin the main program
 
# open the FASTQ file for processing
fastqfn = '/Users/matt/PythonCourse/S7.1/SRR006916_sample.fastq.gz'
fastqfh = open_file_by_mimetype(fastqfn, 'r')
 
# open the reference FASTA file for processing
refgenomefn = '/Users/matt/PythonCourse/data/mtbReferenceGenome.FASTA'
refgenome = read_fasta(refgenomefn)
 
# now, recall that the read_fasta function returns a dictionary of chromosomes,
# but in this case, we've only got one to concern ourselves with (thanks bacteria!)
# so let's go ahead and make things easier by saving each strand to a string
 
plusstrand = refgenome[refgenome.keys()[0]]
 
# and we can use our rev_comp function to get the minus strand
minusstrand = rev_comp(plusstrand)
 
# for each read worth of data, we'll keep track of each of the four lines
# of data associated with the read as a list
seq = [ '', '', '', '' ]
 
# to store the location of each read, we'll have a dictionary keyed by
# read with the values of each index it maps to.
readmap = {}
 
# we start with a while loop that never evaluates to false
while True:
 
    # then we add a for loop to take four lines at a time
    for i in range(4):
        seq[i] = fastqfh.readline().strip()
    if '' == seq[0]:
        # if we run out of lines in the filehandle, it will keep returning us blank lines;
        # we can use this to exit the loop we're otherwise stuck in.
        print "EXITING READFILE AT LINE NUMBER:", fastqfh.tell()
        break
    # now that we have the data for a read, let's call the function to check the read by qscores;
    # we'll code this to return False if the read is discarded, allowing the if statement to pass
    # the read on to be mapped if it's not discarded
    checkedseq = check_seq_by_qscore(seq)
    if checkedseq:
        # when we code the next function, we'll want it to consider the sequence of the read,
        # both strands of reference and we'll want the function to add the read mapping to our
        # dictionary of readmaps
        map_to_reference_slow(checkedseq[1][:20], plusstrand, minusstrand, readmap)
 
# and we're now done with our 3GB data file so let's close the filehandle
fastqfh.close()
 
# Output the data
 
indices = [ {}, {} ]
for strand in [0,1]:
    for read in readmap:
        for idx in readmap[read][strand]:
            if idx in indices[strand].keys():
                indices[strand][idx] += readmap[read][strand][idx]
            else:
                indices[strand][idx] = readmap[read][strand][idx]
 
 
outfh = open('outfile.map', 'w')
 
 
for idx in sorted(indices[0].keys()):
    outfh.write('+' + str(idx) + '\t' + str(indices[0][idx]) + '\n')
for idx in sorted(indices[1].keys()):
    outfh.write('-' + str(idx) + '\t' + str(indices[1][idx]) + '\n')
 
outfh.close()
 
 




Now, as we run this on a small subset of the data, we get an idea that it's going to take entirely too long. Using our shortened file of 25,000 reads, this takes tens of minutes. I've taken the liberty of timing it on a comically small version of the data, holding just 2,500 reads, and with time we can see how it performs:

$ time python first_attempt.py
EXITING READFILE AT LINE NUMBER: 501632


real 1m38.435s
user 1m36.360s
sys 0m0.625s



So, at 1 min 38 seconds for 2,500 reads (about 0.04 seconds per read, or on the order of nine days for 20,000,000 reads), it could take us quite some time to finish. Any ideas where we went wrong (okay, maybe the function name was a hint)? If not, fear not: there's an easy way to find out: the Python profiler. Profilers exist for most languages, and in short, they help you discover where the majority of the work in your code is taking place. We can invoke the profiler like so:

$ python -m cProfile first_attempt.py

$ python -m cProfile first_attempt.py
EXITING READFILE AT LINE NUMBER: 501632
657518 function calls (657511 primitive calls) in 98.278 CPU seconds


Ordered by: standard name


ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 0.000 0.000 <string>:1(<module>)
1 0.000 0.000 0.000 0.000 <string>:1(ParseResult)
1 0.000 0.000 0.000 0.000 <string>:1(SplitResult)
1 0.000 0.000 0.000 0.000 __future__.py:48(<module>)
1 0.000 0.000 0.000 0.000 __future__.py:74(_Feature)
7 0.000 0.000 0.000 0.000 __future__.py:75(__init__)
1 0.001 0.001 0.001 0.001 base64.py:3(<module>)
1 0.002 0.002 0.002 0.002 collections.py:1(<module>)
2 0.001 0.001 0.001 0.001 collections.py:13(namedtuple)
99 0.000 0.000 0.000 0.000 collections.py:43(<genexpr>)
13 0.000 0.000 0.000 0.000 collections.py:60(<genexpr>)
13 0.000 0.000 0.000 0.000 collections.py:61(<genexpr>)
1 0.000 0.000 0.000 0.000 file_tools.py:1(<module>)
2 0.001 0.001 0.023 0.012 file_tools.py:1(open_file_by_mimetype)
2500 0.156 0.000 0.257 0.000 first_attempt.py:10(check_seq_by_qscore)
1 0.063 0.063 98.277 98.277 first_attempt.py:2(<module>)
2486 0.054 0.000 0.061 0.000 first_attempt.py:72(calculate_mean)
2486 1.214 0.000 96.482 0.039 first_attempt.py:85(map_to_reference_slow)
9 0.000 0.000 0.000 0.000 genericpath.py:26(isfile)
4 0.000 0.000 0.000 0.000 genericpath.py:85(_splitext)
1 0.000 0.000 0.000 0.000 gzip.py:149(_init_read)
1 0.000 0.000 0.000 0.000 gzip.py:153(_read_gzip_header)
10004 0.038 0.000 0.056 0.000 gzip.py:200(read)
2 0.000 0.000 0.000 0.000 gzip.py:23(read32)
10004 0.021 0.000 0.025 0.000 gzip.py:232(_unread)
164 0.002 0.000 0.018 0.000 gzip.py:237(_read)
1 0.000 0.000 0.000 0.000 gzip.py:26(open)
159 0.002 0.000 0.003 0.000 gzip.py:287(_add_read_data)
1 0.000 0.000 0.000 0.000 gzip.py:293(_read_eof)
1 0.000 0.000 0.000 0.000 gzip.py:308(close)
1 0.000 0.000 0.000 0.000 gzip.py:349(tell)
1 0.000 0.000 0.000 0.000 gzip.py:35(GzipFile)
10004 0.053 0.000 0.145 0.000 gzip.py:385(readline)
1 0.001 0.001 0.001 0.001 gzip.py:4(<module>)
1 0.000 0.000 0.000 0.000 gzip.py:44(__init__)
1 0.000 0.000 0.000 0.000 keyword.py:11(<module>)
1 0.000 0.000 0.000 0.000 map_reads.py:3(<module>)
1 0.003 0.003 0.006 0.006 mimetypes.py:200(readfp)
3 0.000 0.000 0.007 0.002 mimetypes.py:223(guess_type)
1 0.001 0.001 0.014 0.014 mimetypes.py:23(<module>)
1 0.000 0.000 0.007 0.007 mimetypes.py:296(init)
1 0.000 0.000 0.000 0.000 mimetypes.py:324(_default_mime_types)
1 0.000 0.000 0.000 0.000 mimetypes.py:50(MimeTypes)
1 0.000 0.000 0.000 0.000 mimetypes.py:58(__init__)
811 0.001 0.000 0.002 0.000 mimetypes.py:72(add_type)
3 0.000 0.000 0.000 0.000 mimetypes.py:89(guess_type)
1 0.000 0.000 0.000 0.000 os.py:35(_get_exports_list)
1 0.000 0.000 0.000 0.000 os.py:747(urandom)
4 0.000 0.000 0.000 0.000 posixpath.py:94(splitext)
542 0.006 0.000 0.007 0.000 random.py:259(choice)
1 0.001 0.001 0.001 0.001 random.py:40(<module>)
1 0.000 0.000 0.000 0.000 random.py:643(WichmannHill)
1 0.000 0.000 0.000 0.000 random.py:71(Random)
1 0.000 0.000 0.000 0.000 random.py:793(SystemRandom)
1 0.000 0.000 0.000 0.000 random.py:90(__init__)
1 0.000 0.000 0.000 0.000 random.py:99(seed)
1 0.000 0.000 0.000 0.000 re.py:188(compile)
1 0.000 0.000 0.000 0.000 re.py:229(_compile)
1 0.000 0.000 0.000 0.000 sequencetools.py:1(<module>)
1 1.166 1.166 1.166 1.166 sequencetools.py:1(rev_comp)
1 0.085 0.085 0.125 0.125 sequencetools.py:27(read_fasta)
1 0.000 0.000 0.000 0.000 socket.py:162(_closedsocket)
1 0.001 0.001 0.001 0.001 socket.py:174(_socketobject)
1 0.000 0.000 0.000 0.000 socket.py:224(_fileobject)
1 0.003 0.003 0.004 0.004 socket.py:44(<module>)
1 0.000 0.000 0.000 0.000 sre_compile.py:184(_compile_charset)
1 0.000 0.000 0.000 0.000 sre_compile.py:213(_optimize_charset)
4 0.000 0.000 0.000 0.000 sre_compile.py:24(_identityfunction)
1 0.000 0.000 0.000 0.000 sre_compile.py:360(_simple)
1 0.000 0.000 0.000 0.000 sre_compile.py:367(_compile_info)
3/1 0.000 0.000 0.000 0.000 sre_compile.py:38(_compile)
2 0.000 0.000 0.000 0.000 sre_compile.py:480(isstring)
1 0.000 0.000 0.000 0.000 sre_compile.py:486(_code)
1 0.000 0.000 0.000 0.000 sre_compile.py:501(compile)
4 0.000 0.000 0.000 0.000 sre_parse.py:132(__len__)
8 0.000 0.000 0.000 0.000 sre_parse.py:136(__getitem__)
1 0.000 0.000 0.000 0.000 sre_parse.py:140(__setitem__)
4 0.000 0.000 0.000 0.000 sre_parse.py:144(append)
4/2 0.000 0.000 0.000 0.000 sre_parse.py:146(getwidth)
1 0.000 0.000 0.000 0.000 sre_parse.py:184(__init__)
13 0.000 0.000 0.000 0.000 sre_parse.py:188(next)
9 0.000 0.000 0.000 0.000 sre_parse.py:201(match)
10 0.000 0.000 0.000 0.000 sre_parse.py:207(get)
2/1 0.000 0.000 0.000 0.000 sre_parse.py:307(_parse_sub)
2/1 0.000 0.000 0.000 0.000 sre_parse.py:385(_parse)
1 0.000 0.000 0.000 0.000 sre_parse.py:669(parse)
1 0.000 0.000 0.000 0.000 sre_parse.py:73(__init__)
1 0.000 0.000 0.000 0.000 sre_parse.py:78(opengroup)
1 0.000 0.000 0.000 0.000 sre_parse.py:89(closegroup)
3 0.000 0.000 0.000 0.000 sre_parse.py:96(__init__)
1 0.000 0.000 0.001 0.001 ssl.py:56(<module>)
1 0.000 0.000 0.000 0.000 ssl.py:81(SSLSocket)
1 0.000 0.000 0.000 0.000 stat.py:24(S_IFMT)
1 0.000 0.000 0.000 0.000 stat.py:49(S_ISREG)
3 0.000 0.000 0.000 0.000 urllib.py:1040(splittype)
1 0.000 0.000 0.000 0.000 urllib.py:108(ContentTooShortError)
1 0.000 0.000 0.000 0.000 urllib.py:114(URLopener)
257 0.000 0.000 0.000 0.000 urllib.py:1166(<genexpr>)
257 0.000 0.000 0.000 0.000 urllib.py:1167(<genexpr>)
1 0.002 0.002 0.013 0.013 urllib.py:23(<module>)
1 0.000 0.000 0.000 0.000 urllib.py:614(FancyURLopener)
1 0.000 0.000 0.000 0.000 urllib.py:844(ftpwrapper)
1 0.000 0.000 0.000 0.000 urllib.py:922(addbase)
1 0.000 0.000 0.000 0.000 urllib.py:951(addclosehook)
1 0.000 0.000 0.000 0.000 urllib.py:966(addinfo)
1 0.000 0.000 0.000 0.000 urllib.py:976(addinfourl)
257 0.000 0.000 0.000 0.000 urlparse.py:274(<genexpr>)
257 0.000 0.000 0.000 0.000 urlparse.py:275(<genexpr>)
1 0.000 0.000 0.000 0.000 urlparse.py:43(ResultMixin)
1 0.000 0.000 0.005 0.005 urlparse.py:5(<module>)
1 0.000 0.000 0.000 0.000 urlparse.py:86(SplitResult)
1 0.000 0.000 0.000 0.000 urlparse.py:94(ParseResult)
1 0.000 0.000 0.000 0.000 {_sre.compile}
2 0.000 0.000 0.000 0.000 {_struct.unpack}
13 0.000 0.000 0.000 0.000 {all}
1 0.000 0.000 0.000 0.000 {binascii.hexlify}
159 0.011 0.000 0.011 0.000 {built-in method decompress}
3 0.000 0.000 0.000 0.000 {built-in method match}
128066 0.023 0.000 0.023 0.000 {chr}
1 0.000 0.000 0.000 0.000 {dir}
1 0.002 0.002 98.278 98.278 {execfile}
1 0.000 0.000 0.000 0.000 {function seed at 0x489eb0}
2 0.000 0.000 0.000 0.000 {hasattr}
13 0.000 0.000 0.000 0.000 {isinstance}
26962/26961 0.005 0.000 0.005 0.000 {len}
2 0.000 0.000 0.000 0.000 {locals}
2 0.000 0.000 0.000 0.000 {map}
1 0.000 0.000 0.000 0.000 {math.exp}
2 0.000 0.000 0.000 0.000 {math.log}
1 0.000 0.000 0.000 0.000 {math.sqrt}
13 0.000 0.000 0.000 0.000 {method '__contains__' of 'frozenset' objects}
11 0.000 0.000 0.000 0.000 {method 'add' of 'set' objects}
84864 0.012 0.000 0.012 0.000 {method 'append' of 'list' objects}
1 0.000 0.000 0.000 0.000 {method 'close' of 'file' objects}
2 0.000 0.000 0.000 0.000 {method 'copy' of 'dict' objects}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
1 0.000 0.000 0.000 0.000 {method 'extend' of 'list' objects}
15540 95.227 0.006 95.227 0.006 {method 'find' of 'str' objects}
3 0.000 0.000 0.000 0.000 {method 'get' of 'dict' objects}
86 0.000 0.000 0.000 0.000 {method 'isalnum' of 'str' objects}
13 0.000 0.000 0.000 0.000 {method 'isdigit' of 'str' objects}
5 0.000 0.000 0.000 0.000 {method 'items' of 'dict' objects}
12495 0.015 0.000 0.015 0.000 {method 'join' of 'str' objects}
54201 0.034 0.000 0.034 0.000 {method 'keys' of 'dict' objects}
4 0.000 0.000 0.000 0.000 {method 'lower' of 'str' objects}
542 0.001 0.000 0.001 0.000 {method 'random' of '_random.Random' objects}
210 0.001 0.000 0.001 0.000 {method 'read' of 'file' objects}
1071 0.001 0.000 0.001 0.000 {method 'readline' of 'file' objects}
1 0.000 0.000 0.000 0.000 {method 'remove' of 'list' objects}
4 0.000 0.000 0.000 0.000 {method 'replace' of 'str' objects}
8 0.000 0.000 0.000 0.000 {method 'rfind' of 'str' objects}
9 0.000 0.000 0.000 0.000 {method 'seek' of 'file' objects}
811 0.000 0.000 0.000 0.000 {method 'setdefault' of 'dict' objects}
1 0.000 0.000 0.000 0.000 {method 'sort' of 'list' objects}
1072 0.001 0.000 0.001 0.000 {method 'split' of 'str' objects}
73539 0.013 0.000 0.013 0.000 {method 'startswith' of 'str' objects}
83531 0.015 0.000 0.015 0.000 {method 'strip' of 'str' objects}
12 0.000 0.000 0.000 0.000 {method 'tell' of 'file' objects}
2 0.000 0.000 0.001 0.000 {method 'update' of 'dict' objects}
165 0.000 0.000 0.000 0.000 {min}
3 0.000 0.000 0.000 0.000 {open}
126791 0.014 0.000 0.014 0.000 {ord}
1 0.000 0.000 0.000 0.000 {posix.close}
1 0.000 0.000 0.000 0.000 {posix.open}
1 0.000 0.000 0.000 0.000 {posix.read}
9 0.000 0.000 0.000 0.000 {posix.stat}
4118 0.013 0.000 0.013 0.000 {range}
2 0.000 0.000 0.000 0.000 {repr}
2486 0.006 0.000 0.006 0.000 {sum}
2 0.000 0.000 0.000 0.000 {sys._getframe}
160 0.002 0.000 0.002 0.000 {zlib.crc32}
1 0.000 0.000 0.000 0.000 {zlib.decompressobj}


If we look at all this mess, we can see that we're principally consuming time in our map_to_reference_slow() function, owing to the serious amount of time spent in the find() calls it makes (over 95 of our 98 seconds). So, let's reconsider our design, as this approach isn't going to get us to the proverbial promised land.

Second Attempt


In our first attempt, we spent a lot of time finding a read on each strand of the genome. As an alternative to searching the entire genome so many times as a string, let's store the data as a dictionary. What do I mean exactly? Well, it's gonna sound crazy, I admit. But, let's make a dictionary of every, say, 20-nucleotide segment of the genome. Granted, our reads are about 50 bp long, but since a random 20-mer should appear by chance only once in every 4^20, or about 1.1e12, nucleotides, I think we'll be safe for the most part using only the first 20nt. Besides, these should be very high quality base calls, once we throw out the reads with N's in them. In short, we'll see that while we use more memory, cutting out all those find() calls on a 4,500,000-character string will really save us time. We'll just make a dictionary with every 20-mer observed in the genome, then we'll count how many times a read falls on it.

Let's replace our map_to_reference_slow() function with a faster (and much shorter!) one. We'll also need a quick function to generate the dictionary to look up all the read positions with.
def make_kmer_dict(strand, readlen = 20):
    '''Record the positions of every subsequence of length readlen in the given strand'''
 
    kmerdict = {}
 
    for i in range(len(strand) - readlen):
        seq = strand[i: i + readlen]
        if seq in kmerdict:
            kmerdict[seq].append(i)
        else:
            kmerdict[seq] = [i]
    return kmerdict
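On a toy string, this behaves as you'd expect (one quirk to note: because the loop runs over range(len(strand) - readlen), the very last k-mer is never recorded, which is harmless at genome scale):

print make_kmer_dict('ATGCGATGCA', 4)
# {'ATGC': [0, 5], 'TGCG': [1], 'GCGA': [2], 'CGAT': [3], 'GATG': [4]}
# (dictionary print order may vary; note the final 4-mer 'TGCA' at position 6 is skipped)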
And now for our updated map_to_reference()

def map_to_reference(read, kmerdict, indices, nomaps, readlen = 20, insertlen = 100):
    from random import choice
 
 
    read = read[0:readlen]
 
 
    if read in kmerdict:
        idx = choice(kmerdict[read])
        for i in range(insertlen):
            if (idx + i) in indices:
                indices[idx + i] +=1
            else:
                indices[idx + i] = 1
    else:
        nomaps.append(read)
    return
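Here's that same toy example pushed through the new function, just to show how the indices and nomaps containers get filled in (invented data again, with readlen and insertlen shrunk to keep the output readable):

kmerdict = make_kmer_dict('ATGCGATGCA', 4)
indices = {}
nomaps = []

# 'TGCGAT' truncates to the 4-mer seed 'TGCG', which occurs once, at position 1;
# with insertlen=4, coverage counts are added at positions 1 through 4
map_to_reference('TGCGAT', kmerdict, indices, nomaps, readlen = 4, insertlen = 4)
print indices
# {1: 1, 2: 1, 3: 1, 4: 1}

# a read whose seed isn't in the dictionary lands in nomaps instead
map_to_reference('TTTTTT', kmerdict, indices, nomaps, readlen = 4, insertlen = 4)
print nomaps
# ['TTTT']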
 
 
And our new main body:

second_attempt.py
fastqfn = '/Users/matt/PythonCourse/S7.1/SRR006916_sample.fastq.gz'
fastqfh = open_file_by_mimetype(fastqfn, 'r')
 
refgenomefn = '/Users/matt/PythonCourse/data/mtbReferenceGenome.FASTA'
refgenome = read_fasta(refgenomefn)
 
plusstrand = refgenome[refgenome.keys()[0]]
minusstrand = rev_comp(plusstrand)
readlen = 20
 
plusdict = make_kmer_dict(plusstrand, readlen)
minusdict = make_kmer_dict(minusstrand, readlen)
 
seq = [ [], [], [], [] ]
 
plusindices = {}
minusindices = {}
nomaps = []
badreads = []
 
# To conserve memory, we're going to read each sequence read individually,
# instead of all reads at once.
 
while (2 + 2) != 5:
    for i in range(4):
        seq[i] = fastqfh.readline().strip()
    if '' == seq[0]:
        print "EXITING READFILE AT LINE NUMBER:", fastqfh.tell()
        break
    # now, check the quality of the read.  If it's good, map it; if not, append it to badreads.
    checkedseq = check_seq_by_qscore(seq)
    if checkedseq:
        map_to_reference(checkedseq[1], plusdict, plusindices, nomaps)
        map_to_reference(checkedseq[1], minusdict, minusindices, nomaps)
    else:
        badreads.append(seq[1])
 
fastqfh.close()
 
print "###########################'\nMapping Summary:"
print len(badreads), "reads were rejected by the quality check.\n"
print badreads[0:10], "are the first ten bad reads.\n"
 
 
print len(nomaps), "reads passed quality check, but failed to map.\n"
print nomaps[0:10], "are the first ten no-map seeds.\n"
 
 
print len(plusindices), "reads successfully mapped - plus.\n"
print len(minusindices), "reads successfully mapped - minus.\n"
 
 
min( [ i for i in plusindices.keys()] )
# 12
max( [ i for i in plusindices.keys()] )
# 4411486
min( [ i for i in minusindices.keys()] )
# 12
max( [ i for i in minusindices.keys()] )
# 4411486
 
 
 
outfh = open('outfile_fast', 'w')
 
 
for i in sorted(plusindices.keys()):
    outfh.write('+' + str(i) + '\t' + str(plusindices[i]) + '\t' + plusstrand[i:i+20] + '\n')
 
for i in sorted(minusindices.keys()):
    outfh.write('-' + str(i) + '\t' + str(minusindices[i]) + '\t' + minusstrand[i:i+20] + '\n')
outfh.close()
$ time python second_attempt.py
EXITING READFILE AT LINE NUMBER: 501632
###########################'
Mapping Summary:
14 reads were rejected by the quality check.

['GACGCTGGCACCGGCGACCGCTCCGCGCTCCACCCAGCCCTGCNTTCCGGT', 'GGNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN', 'GTTCCNNNNNNNNACACCCTTTCCGGCTCATCTCACGGTCCAGTCCTTTAC', 'GGAAGCCCGCAATGCGGTCAATGGTGCTGTCAGCATCGGGGTTGGNGACGC', 'GANNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN', 'GGCCCNNNNNNNNGGAAGGCGAGGGAGCTGGTGGAGGCCATTGTTGCGGTC', 'ACTGNNNNNNNNNGGTGTCGCGCGCTATATCCAAGAGGTCGGTGCCGAGCA', 'GATCGGAAGAGCTCGTATGCCGTCTTCTGCTTGAACTNAAAAGGCGGCTTG', 'ANNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN', 'ANNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN'] are the first ten bad reads.

3964 reads passed quality check, but failed to map.

['GCGGTGAATGCTATTTCTTG', 'GTTGCCGCTGCCCAGGTTGA', 'GAGCATATCCCCCCGCGCTA', 'GAGCATATCCCCCCGCGCTA', 'AGATCAGAAACTGGAGGCAC', 'AGATCAGAAACTGGAGGCAC', 'GTTTTACTGGCAGAAGGTCC', 'GAATCCTTGCTCCGGCGTGC', 'GAATCCTTGCTCCGGCGTGC', 'GTCCCTCACGAGCAGCCAGC'] are the first ten no-map seeds.

48587 reads successfully mapped - plus.
51455 reads successfully mapped - minus.

real 3m23.526s
user 3m10.848s
sys 0m2.758s



So, this one takes longer to run on the small dataset, though trust me, it actually has a chance of finishing an entire genome's worth of reads. I can back that up by comparing the profiler output of the two versions.

And there we have it, something that takes quite a bit more memory (about 1GB at the 20-mer length), but saves us oodles of CPU time.

Updated second_attempt.py

#!/usr/bin/python
from os import chdir
from sys import exit
from file_tools import open_file_by_mimetype
from sequencetools import read_fasta, rev_comp
 
def check_seq_by_qscore(seq, meancutoff = 40, mincutoff = 25, minseqlength = 20):
    '''Examine a sequence from a FASTQ file and trim away poor quality bases, or discard the entire read.
 
    Accepts a sequence, which is a list of four lines from a FASTQ file:
    @SEQ_ID
    READ SEQUENCE
    +SEQUENCE DESCRIPTION
    QSCORES
 
    see:  http://en.wikipedia.org/wiki/FASTQ_format for formatting details.
 
    meancutoff is a threshold for the average q-score in the read.
    mincutoff is the minimum q-score any given NT is allowed to have.
    minseqlength is the shortest the read is allowed to become during trimming.
 
    Returns the checked (and possibly trimmed) sequence, or False for a rejected read.
    '''
 
    seqID = seq[0]
    read = seq[1]
    seqDesc = seq[2]
    qscores = seq[3]
 
    # if there's an "N", toss the read
    if 'N' in read:
        return False
 
    # evaluate the qscore values; start with converting the qscores to a list of their ASCII values for the read
    # the built-in function ord() is used for this conversion, which we'll do with list comprehension:
    qscores = [ord(qscore) for qscore in qscores]
 
    # if the mean qscore is below the cutoff threshold, then toss the read
    ### okay, so, there's no function called calculate_mean.  We'll see tomorrow that there are vast numerical
    ### libraries for Python, but for today, we'll write this call in, knowing that in a moment, we'll have to
    ### write this function too.
    if calculate_mean(qscores) < meancutoff:
        return False
 
    # if there are any NTs less than the mincutoff, make note of their positions in the read sequence
    indices = []
    for idx in enumerate(qscores):
        if idx[1] < mincutoff:
            indices.append(idx[0])
    # if any bad position falls before minseqlength, then toss the read
    # otherwise, return the sequence with the first bad position and all subsequent positions trimmed
    # or, if there are no bad positions, return the full sequence
 
    # we'll use a list comprehension here to ask whether any bad index falls
    # before the minseqlength; if none does, we keep going.
    if True in ([ i < minseqlength for i in indices ]):
        return False
    elif indices == []:
        # this is the case where none of the positions had a bad q-score.
        # we have to convert the qscores back from integers to characters before we send them back
        qscores = ''.join([ chr(i) for i in qscores ])
        return [ seqID, read, seqDesc, qscores ]
    else:
        # and here's what happens if we trimmed the read down to keep only the good positions
        end = min(indices)
        qscores = ''.join([ chr(i) for i in qscores ])
        return [ seqID, read[:end], seqDesc, qscores[:end] ]
 
def calculate_mean(data):
    '''Accepts a list of numerical data.
 
    Returns the mean of the numbers in data, as a float.'''
 
    # It turns out that this is a performance optimization as well, owing to the large overhead
    # when you import the mean function from NumPy just for one measly function.
    # But, we're using it because we haven't learned how to use NumPy yet.
 
    # use list comprehension to convert all the datapoints to a floating point; then divide their sum
    # by their length, and return the value.
    return sum( [float(i) for i in data] ) / len(data)
 
 
 
def make_kmer_dict(strand, readlen = 20):
    '''Record the positions of every subsequence of length readlen in the given strand'''
 
    kmerdict = {}
 
    for i in range(len(strand) - readlen):
        seq = strand[i: i + readlen]
        if seq in kmerdict:
            kmerdict[seq].append(i)
        else:
            kmerdict[seq] = [i]
    return kmerdict
 
def map_to_reference(read, kmerdict, indices, nomaps, readlen = 20, insertlen = 1):
    from random import choice
 
    read = read[0:readlen]
 
    if read in kmerdict:
        idx = choice(kmerdict[read])
        for i in range(insertlen):
            if (idx + i) in indices:
                indices[idx + i] +=1
            else:
                indices[idx + i] = 1
    else:
        nomaps.append(read)
    return
 
 
fastqfn = '/Users/matt/PythonCourse/S7.1/SRR006916_sample.fastq.gz'
fastqfh = open_file_by_mimetype(fastqfn, 'r')
 
refgenomefn = '/Users/matt/PythonCourse/data/mtbReferenceGenome.FASTA'
refgenome = read_fasta(refgenomefn)
 
plusstrand = refgenome[refgenome.keys()[0]]
minusstrand = rev_comp(plusstrand)
readlen = 20
 
plusdict = make_kmer_dict(plusstrand, readlen)
minusdict = make_kmer_dict(minusstrand, readlen)
 
seq = [ [], [], [], [] ]
 
plusindices = {}
minusindices = {}
nomaps = []
badreads = []
 
# To conserve memory, we're going to read each sequence read individually,
# instead of all reads at once.
 
while (2 + 2) != 5:
    for i in range(4):
        seq[i] = fastqfh.readline().strip()
    if '' == seq[0]:
        print "EXITING READFILE AT LINE NUMBER:", fastqfh.tell()
        break
    # now, check the quality of the read.  If it's good, map it; if not, append it to badreads.
    checkedseq = check_seq_by_qscore(seq)
    if checkedseq:
        map_to_reference(checkedseq[1], plusdict, plusindices, nomaps)
        map_to_reference(checkedseq[1], minusdict, minusindices, nomaps)
    else:
        badreads.append(seq[1])
 
fastqfh.close()
 
print "###########################'\nMapping Summary:"
print len(badreads), "reads were rejected by the quality check.\n"
print badreads[0:10], "are the first ten bad reads.\n"
 
 
print len(nomaps), "reads passed quality check, but failed to map.\n"
print nomaps[0:10], "are the first ten no-map seeds.\n"
 
 
print len(plusindices), "reads successfully mapped - plus.\n"
print len(minusindices), "reads successfully mapped - minus.\n"
 
 
print min( [ i for i in plusindices.keys()] )
print max( [ i for i in plusindices.keys()] )
print min( [ i for i in minusindices.keys()] )
print max( [ i for i in minusindices.keys()] )
 
 
 
outfh = open('outfile_fast', 'w')
 
 
for i in sorted(plusindices.keys()):
    outfh.write('+' + str(i) + '\t' + str(plusindices[i]) + '\t' + plusstrand[i:i+20] + '\n')
 
for i in sorted(minusindices.keys()):
    outfh.write('-' + str(i) + '\t' + str(minusindices[i]) + '\t' + minusstrand[i:i+20] + '\n')
outfh.close()
 


Mapping Polymorphisms


Okay, so, keep in mind, we've learned a lot about managing our data and accomplishing a non-trivial task here, but exact matching isn't the biological goal of this project. We want to see the reads that don't map precisely, that is, the ones with polymorphisms compared to the reference genome. Finding those near-matches is a much harder computation than the exact lookups we've been doing.

Let's stop for a minute and talk about how we might accomplish this more difficult task.
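One brute-force idea to kick off that discussion (this is emphatically not how Bowtie does it, just a sketch that reuses the k-mer dictionary we already have): for a seed that fails the exact lookup, generate every possible single-base substitution of the seed and check whether any of those variants is in the dictionary. Hits point us at candidate SNP positions, at the cost of 3 x 20 = 60 extra lookups per failed 20-mer seed, and this ignores insertions and deletions entirely.

def find_one_mismatch(seed, kmerdict):
    '''Return (position-in-seed, variant, genome-indices) tuples for every single-base
    substitution of seed that exists in kmerdict.  A sketch only, for discussion.'''
    hits = []
    for i in range(len(seed)):
        for base in 'ACGT':
            if base == seed[i]:
                continue
            variant = seed[:i] + base + seed[i + 1:]
            if variant in kmerdict:
                hits.append((i, variant, kmerdict[variant]))
    return hits

Every read in nomaps could be run through something like this; deciding where to draw the line between sequencing error and genuine polymorphism is the harder question.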

Exercises


1. Parse the M.tb genome file.

Tomorrow morning, we're going to have a fair amount of new work to do, and one of the tasks will be to compare our read data to the reference genome annotations. In order to do that, we're going to need to parse the genome annotation file, posted below. Write a function that can read this file into a list of lists that contains the start, stop, and strand coordinates for each annotation. Also, output a file that contains a two-column list of the start and stop sites of all the regions annotated as a gene.



Solutions

import sys
 
fh = open(sys.argv[1])
fh.readline() # get rid of the header information
 
datalist = []
loci_seen = []
 
for line in fh:
    line = line.strip().split('\t')
    # line[0] = Locus
    # line[1] = Symbol
    # line[2] = Synonym
    # line[3] = Length
    # line[4] = Start
    # line[5] = Stop
    # line[6] = Strand
    # line[7] = Name
    # line[8] = Chromosome
    # line[9] = Annotation
    # line[10]= Annotation Name
    if line and (line[0] not in loci_seen):
        datalist.append([int(line[4]), int(line[5]), line[6]])
        loci_seen.append(line[0])
 
    if len(line) >= 10 and line[9] == "OPERON":
        print  line[4], "\t", line[5]
 
# Do stuff with the parsed data
# ...
# ...