Beyond learning programming: how to write programs


After last week, we can all take some pride in our newly acquired ability to use the fundamentals of the Python language to leverage our computers and logic against large amounts of data and processing. But it would be a shame to leave things at the basics. We are all here not because we're interested in rudimentary computer science skills, but because we're biologists living in an era of big problems and their accompanying big datasets. In the second week of the class, we'll turn our efforts to actual biological data, and the analysis thereof.

To suit the constituency of the QB3 Institute, we'll be looking at both genomic and biophysical data. For the second year running, the genomic component is represented by the no-longer-next-generation short-read sequencing method: Illumina sequencing, which has all but replaced traditional sequencing, qPCR, microarrays, and DNA-binding assays. This year, we'll be looking at a dataset that yields insights into the pathogenicity of a common infectious bacterial species and the population genetics and evolution of those bacteria. In the second part of the project, we'll look at a polymorphism that exists among these bacteria and see how we can understand its role in drug resistance by way of structural modeling.

By now you've read the relevant papers describing the 21 genomes of Mycobacterium tuberculosis, and the mechanism of the KatG protein in drug resistance. My lecture this morning will be in two parts: first, I want to get everyone on the same page with respect to the motivation, terms, and technology behind the papers, and second, I want to step through a broad overview of what we're going to need to do to analyze the study's raw data. This will include both a general strategy for approaching large programming projects and a specific example of how we're going to approach this one.

So first: why?

Why is this interesting?

The interests present in this classroom's worth of scientists span neuroscience, biochemistry, biophysics, molecular evolution, genetics: just about everything in modern biology. And while we hope this project will have some interest for everyone, let's consider the merits of sequencing 21 genomes.

In the olden days of genetics, if you wanted to find a drug resistance phenotype, you'd set up plates and plates and plates of cultures, mutagenize them, and screen them for a resistance phenotype. Once you found a phenotype, if you had a meiotic microbe, you might do complementation tests amongst your isolates and map the phenotype to a chromosome. If you had a bacterium, you might turn directly to biochemistry (which you may have started with in the first place), trying binding assays and purification protocols until you saw something change places in a gel.

These days, however, you can just prep the DNA, make random libraries from it, and send them to the sequencing facility. The results will likely reflect many mutations, and then you have the task of characterizing the differences. There are algorithms for predicting which parts of the genome represent genes, promoters, regulatory regions, and, in higher organisms, things like centromeres, telomeres, transposons, etc. In the case of M.tb, we can look at many strains' genomes and see which parts of the genome are conserved and which are not. We can examine the polymorphisms that are in common between resistant strains, and see if they direct us toward a gene of interest. In the case of the Nature Genetics paper we've read, the conclusion is that the epitope-generating regions of the genome are very conserved, reflecting the bacterial interest in being detected and endocytosed by the host immune system and cells, which is critical for the infection to succeed.

What did they do?

To fully understand how they organized their experiments and then analyzed their data, it's best to first understand how the sequencing platform upon which their strategy is based actually works.

An Illumina sequencing run starts, predictably enough, with DNA that you want to sequence. This DNA is reduced, either through fragmentation or some other means, to short segments. After ligating primers to these fragments, they are input to the sequencing machine, which fixes them to a glass slide. The fragments are amplified in place, creating millions of dense clusters of identical sequence on the slide. Sequencing is performed by synthesis; each nucleotide incorporation is associated with a different color, and this light is captured with a high-resolution camera. Efficiency and accuracy decrease with length, ultimately limiting the length of the reaction. Each resulting short sequence is called a sequencing 'read.'
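The end result of that process can be cartooned in a few lines of Python. This is a purely illustrative sketch with a made-up toy genome: real reads also carry errors and quality scores, and fragment lengths vary.

```python
import random

random.seed(0)  # make the toy example reproducible

genome = "ACGTTTGACCGTAGGCATTGCAATCGGA"  # stand-in for a real genome
read_length = 6
n_reads = 5

# sample random fixed-length fragments, roughly as the sequencer
# would report them after fragmentation and sequencing-by-synthesis
reads = []
for _ in range(n_reads):
    start = random.randint(0, len(genome) - read_length)
    reads.append(genome[start:start + read_length])

print(reads)
```

Every read is a short, exact substring of the toy genome here; the interesting part of the real problem is that we receive only the reads, and must figure out where they came from.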

This all sums up to the result that Illumina sequencing is known to produce very very many, very short reads (although they're getting longer). In contrast, 454 sequencing produces very many, somewhat short reads, and old-school Sanger sequencing produces very, very, very, very, very few, somewhat long reads. However, the total amount of sequence, number of reads times length of each, is much greater in Illumina sequencing than any other currently available sequencing technology.
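To put "number of reads times length of each" in perspective, here is a back-of-the-envelope comparison. The per-run figures are illustrative round numbers for the sake of the arithmetic, not vendor specifications.

```python
# (reads per run, read length in bp) -- illustrative round numbers only
platforms = {
    'Illumina': (100000000, 50),  # very many, very short
    '454':      (1000000, 400),   # many, somewhat short
    'Sanger':   (96, 800),        # very few, somewhat long
}

# total sequence per run = number of reads * length of each read
for name in platforms:
    reads, length = platforms[name]
    print(name, reads * length, "bp per run")
```

Even with short reads, the sheer read count puts Illumina's total output orders of magnitude above the alternatives.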

In this case, many genomes are sequenced, and we're interested in where the sequences are different. First, we have to map all of these short sequences back to a (thankfully already available) reference genome. We'll see in depth how this works on Wednesday when we code this process ourselves. There will be tricks, as this is a very memory-intensive process, and the data from the Illumina files will need to be screened for erroneous sequencing reactions. What we attempt will be ambitious, but we'll stop short, for a number of reasons, of finding the actual differences between the genomes. Instead we'll use Python to leverage a faster, more sophisticated algorithm implemented by the program Bowtie to help us with our comparison. Finally, we'll take a look at the results and see where in the genome different sequences vary, which will give us insight into the evolutionary history of these strains and the resistance mutation that Terry will then explore in atomic detail with available structural data.
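As a preview of Wednesday's exercise, here is a brute-force sketch of the mapping idea on toy data. It places each read at every position where it matches the reference exactly and tallies coverage; this is for intuition only, and is not how Bowtie's (index-based, mismatch-tolerant) algorithm works.

```python
# Toy read mapping: find every exact match of each read in the
# reference, and count how many reads cover each genomic position.
reference = "ACGTACGTTAGCCGATACGT"   # stand-in for a reference genome
reads = ["ACGT", "AGCC", "TTTT"]     # stand-in for Illumina reads

coverage = [0] * len(reference)      # reads overlapping each position
for read in reads:
    start = reference.find(read)
    while start != -1:               # walk through every match site
        for pos in range(start, start + len(read)):
            coverage[pos] += 1
        start = reference.find(read, start + 1)

print(coverage)
```

Note that "TTTT" maps nowhere, as a read full of errors might, and that coverage drops to zero in unmatched stretches; deciding what to do with such reads and gaps is exactly the kind of judgment call the real analysis requires.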

General Strategy

It's going to be very important that we graduate quickly from writing individual scripts to writing well-integrated programs composed of sets of functions and high-level logic. Thus we'll need to keep track of the tools we created last week, such as file parsers and sequence-manipulating functions, because we're going to be weaving them in and out of our analysis programs. For those old enough to appreciate the reference, or those who think Will Smith's kid belongs in a movie, this is the Karate Kid part of the class (i.e. wax-on/wax-off).

Stubbing and the 'pass' statement

When we write complicated code, we need to decompose it into simpler parts. This is an intuitive concept, and one that we've touched on before. Let's say that we want to make a program that gambles online and makes money for you so that you are free to pursue the standard academic career path of postdoctoral positions ad infinitum.

Stubbing is writing what your program should be doing, without actually getting around to filling in the details. It's like writing an outline of a paper. In this case:
#!/usr/bin/env python
import sys
import internet
import gambling

accountID = sys.argv[1]
password = sys.argv[2]

[balance, sessionInfo] = internet.loginToIllegalGamblingServer(accountID, password)
while balance > 0:
    balance = gambling.playGame('poker', sessionInfo, balance)
    if balance > 1000000000:
        print 'Congratulations: you are a billionaire.'
        break
print 'Darn!'
def loginToIllegalGamblingServer(accountID, password):
    pass

def logoffFromIllegalGamblingServer():
    pass

def playGame(gameType, sessionInfo, balance):
    if gameType == 'poker':
        hand = requestHand(sessionInfo)
        if handIsGood(hand):
            amountWon = goAllIn(hand, balance, sessionInfo)
            if amountWon:
                return amountWon + balance
            return 0
        return balance
    print "I don't know how to play that game yet"
    return balance

def requestHand(sessionInfo):
    pass

def handIsGood(hand):
    pass

def goAllIn(hand, balance, sessionInfo):
    pass
We're free to write the easy parts first, saving the hard parts for later. We've already created the logical flow of the program, and by doing this early, we can keep it organized.

The only new thing we've covered here is the statement pass. It's pretty simple: it does nothing. Although this sounds somewhat pointless, in this case it allows you to write little function stubs without Python (or your text editor) complaining. However, it pops up in other places as well, usually as a shortcut where you mean to write more code later: in an if or else branch, for instance, or in an except clause where you deliberately ignore an exception. We won't cover those applications here, but keep them in mind while you're programming: we encourage you to try them out.
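A few contrived examples of pass in those other spots:

```python
def futureFeature():
    pass  # a stub: we'll fill in the body later

value = 7
if value > 5:
    pass  # nothing to do in this branch yet
else:
    print("small value")

try:
    int("not a number")  # raises ValueError
except ValueError:
    pass  # deliberately ignore the bad input and carry on

print("still running")
```

In each case pass is a syntactically required placeholder where we have (for now) nothing to say.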


1. Stub out the basics of part one of the project. For the purposes of this exercise, this means:

1) reading and parsing our data from the read files;
2) evaluating good and bad reads, discarding the bad ones;
3) loading the reference genome from a FASTA file;
4) placing the individual reads into the genome;
5) counting how many reads fall onto each position in the genome;
6) writing the output of our analysis into an easily-readable output file.

2. Prepare a FASTQ file parser

Unlike the FASTA files, the Illumina data comes in a new format, the FASTQ file. Each record corresponds to one sequence, i.e. a read, but in addition to the name of the sequence and the sequence itself, each record contains two additional lines: one for comments, and one for the quality scores, or q-scores, of the read. We will use these q-scores when we filter out the bad reads. For now, the task is to write a program that will read the FASTQ files and write out a new file with just the read sequences. Because they are very large files, we'll prefer to work with the gzipped versions, so you'll need to use your knowledge of the gzip module to open them. Also, we'll have a hard time fitting the entire file into memory, so instead of using readlines() to read the entire file at once, we'll have to devise a way to read four lines at a time, discarding three lines and saving the sequence line.
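As a quick refresher on the gzip idiom, here is a self-contained sketch (the filename is made up; modes are shown Python 3 style, where 'wt'/'rt' request text mode):

```python
import gzip

# create a small gzipped text file so the sketch is self-contained
with gzip.open('toy.txt.gz', 'wt') as fh:
    fh.write("line one\nline two\n")

# gzip.open returns a file-like handle, so the usual line-by-line
# reading idioms work on it just as they do on a plain file
with gzip.open('toy.txt.gz', 'rt') as fh:
    lines = [line.strip() for line in fh]

print(lines)
```

The important point is that once the handle is open, nothing downstream needs to know the file was compressed.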

Use this shortened sample of one of the data files we'll see tomorrow:

You may find it helpful to review the FASTQ format, which looks like this for each read:
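Since the sample file isn't reproduced here, a generic made-up record illustrates the four-line layout:

```python
# one made-up FASTQ record: identifier, sequence, comment, qualities
record = "@read_001\nGATTACA\n+\nIIIIHHG\n"

name, seq, comment, quals = record.strip().split("\n")

print(name)     # line 1: identifier, starts with '@'
print(seq)      # line 2: the base calls
print(comment)  # line 3: comment line, starts with '+'
print(quals)    # line 4: one quality character per base

assert len(seq) == len(quals)
```

For this exercise, line 2 is the only one we keep; next time the q-scores on line 4 will matter too.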

3. Download modules for tomorrow

Tomorrow, we're going to need to make sure we're all on the same page, so to facilitate sym-pageness, we'd like you to download these modules and save them in your pylib directory:


1. Stub out the basics of part one of the project.

#!/usr/bin/env python
import sys
# Must input Illumina Data filename as 1st argument in Terminal
# Parse Illumina data file to obtain a list of all unique reads and a
# dictionary of unique reads whose values are the number of counts for
# each unique read
from illuminaDataToolbox import *
illuminaDataFilename = sys.argv[1]
illuminaReadList, setUniqReads, dictUniqReads = procReads(illuminaDataFilename)
# procReads will call a function that will open the file by mimetype and
# return a file handle, then procReads will process the data
# Parse the list of unique reads to remove "bad" reads (whatever that means)
listCleanReads = cleanUpReads(list(setUniqReads))
# Load the reference genome from a FASTA file
# Must input filename as 2nd argument in Terminal
from sequencetools import read_fasta
refGenomeFilename = sys.argv[2]
parsedRefGenome = read_fasta(refGenomeFilename) # returns a dictionary
# Place the individual reads into the genome
# Write the output to an easily-readable output file
# Must input processed data filename as 3rd argument in Terminal
resultsFilename = sys.argv[3]
illuminaResults(listCleanReads, parsedRefGenome, resultsFilename)
# illuminaDataToolbox.py -- the module imported by the script above
def procReads(illuminaDataFilename):
    # This function should process the Illumina data file, returning a list
    # of all the unique reads and a dictionary of unique reads whose values
    # are the number of counts for each unique read.
    from filetools import open_file_by_mimetype as ofbm
    illuminaDataFH = ofbm(illuminaDataFilename)
    from sequencetools import read_fastq
    illuminaReadList = read_fastq(illuminaDataFH)
    setUniqReads = set(illuminaReadList)
    dictUniqReads = { }
    for uniqRead in setUniqReads:
        dictUniqReads[uniqRead] = 0
    for illuminaRead in illuminaReadList:
        dictUniqReads[illuminaRead] += 1
    # Use illuminaReadList to do stuff with the reads
    return illuminaReadList, setUniqReads, dictUniqReads
def cleanUpReads(listUniqReads):
    # This function should parse the list of unique reads, evaluating good
    # or bad reads, discarding bad ones, and returning a list of all the
    # "good" unique reads.
    pass # return listCleanReads
def mapIlluminaReads(listCleanedReads, parsedRefGenome):
    # Map illumina reads to reference genome and return a dictionary containing
    # all the cleaned reads as keys and their mapped locations in the genome as
    # values.
    # Also return a list of all coordinates in the genome and a dictionary
    # containing each coordinate in the genome as a key and the numbers of reads
    # at each coordinate as a value.
    pass # return dictMappedReads ({coordinates: seqString, counts}), listAllGenomicCoordinates
def illuminaResults(listCleanReads, parsedRefGenome, resultsFilename):
    # This function should output a text file containing three columns: a column
    # of all genome coordinates, a column of reads mapped to each coordinate,
    # and a column of the number of counts for each read.'
    dictMappedReads, listAllGenomicCoordinates = mapIlluminaReads(listCleanReads, parsedRefGenome)
    resultsFH = open(resultsFilename, 'w')
    for itemInd, item in enumerate(listAllGenomicCoordinates):
        resultsFH.write('%s\t%s\t%s\n' % (item, dictMappedReads[item][0], dictMappedReads[item][1]))
    resultsFH.close()
    pass # return dictMappedReads, listAllGenomicCoordinates?

2. Prepare a FASTQ file parser (sorta)

# assume fastqfh (input) and outfh (output) are already-open file handles
# initialize a list to store four lines at a time
seq = ['', '', '', '']
# our loop will always be True, so we'll have to have a break condition somewhere inside
while True:
    # a loop for 4 lines at a time
    for i in range(4):
        seq[i] = fastqfh.readline().strip()
    # test to see if we're at the end of the file,
    # which is the only place we'll find an empty string
    if '' == seq[0]:
        # the .tell() method reports the byte offset reached in the file
        print "EXITING READFILE AT BYTE:", fastqfh.tell()
        break
    # the second line of each record is the sequence we want to keep
    outfh.write(seq[1] + '\n')
or, using enumerate:
from file_tools import open_file_by_mimetype

def parse_fastq(fname_in, fname_out):
    in_fh = open_file_by_mimetype(fname_in, 'r')
    out_fh = open_file_by_mimetype(fname_out, 'w')
    for idx, line in enumerate(in_fh):
        if idx % 4 == 0:    # identifier line
            pass            # don't do anything
        elif idx % 4 == 1:  # sequence line
            out_fh.write(line)
    out_fh.close()
    # return the total number of lines read -- not part of the
    # specification, but a nice sanity check
    return idx + 1