Graphing


Up to this point in the project, we have been mostly doing mathematical manipulations of data. This afternoone, we are going to make some pretty pictures with graphs. In addition to interpretting complex data visually, graphs are often key figures in papers. Python has some AMAZING tools that can be used to create paper-worthy graphs. By the end of this lecture, I hope to convince you to abandon Excel forever!

First, a little bit about documentation. I'm going to go over a couple of commonly used (or at least cool) functions from a module called MatPlotLib. There is some decent (not amazing) documentation on-line http://matplotlib.sourceforge.net/index.html. In particular, you want to check out the gallery as each image has the code used to generate it.

We'll start by making a coordinate map of the coding and noncoding regions you analyzed this morning. Just to try something different, we are going to be using scripts instead of iPython. In case you didn't save your code from the lecture this morning, here is the code I'm interested in working with:

from pylab import *
 
genomelen = 4411532
 
# setup one SNP file
snps6916fh = open('SRR006916.pileup', 'r')
snps6916 = zeros(genomelen, dtype='bool')
for line in snps6916fh:
    line = line.strip().split('\t')
    snps6916[int(line[1])-1] = True
print "SRR006916 contains", sum(snps6916), "SNPs"
 
# setup another SNP file
snps6917fh = open('SRR006917.pileup', 'r')
snps6917 = zeros(genomelen, dtype='bool')
for line in snps6917fh:
    line = line.strip().split('\t')
    snps6917[int(line[1])-1] = True
print "SRR006917 contains", sum(snps6917), "SNPs"
 
genomesumfile = 'mycobacterium_tuberculosis_h37rv_genome_summary.txt'
genomesumfh = open(genomesumfile, 'r')
genomesumfh.readline()
genes = [ [], [], [], [] ]
# [0] = locus name
# [1] = start
# [2] = stop
# [3] = strand
for line in genomesumfh:
    line = line.strip().split('\t')
    genes[0].append(line[0])
    genes[1].append(int(line[4]))
    genes[2].append(int(line[5]))
    genes[3].append(line[6])
 
coding = zeros(genomelen, dtype='bool')
# coding will be an array of all the coding nts set to = 1
for gene in zip(genes[1], genes[2]):
    coding[gene[0]:gene[1]] = True
 
# noncoding will be the remaining nts opol
noncoding = coding == False
 
print sum(coding), "are coding nucleotides"
print sum(noncoding), "are non-coding nucleotides"
 
print "SRR006916 has", len(intersect1d(where(noncoding)[0],where(snps6916)[0])),
print "SNPs in noncoding regions"
print "SRR006917 has", len(intersect1d(where(noncoding)[0],where(snps6917)[0])),
print "SNPs in noncoding regions"
 
print "SRR006916 has", len(intersect1d(where(coding)[0], where(snps6916)[0])),
print "SNPs in ncoding regions"
print "SRR006917 has", len(intersect1d(where(coding)[0], where(snps6917)[0])),
print "SNPs in coding regions"

Let's start by plotting the coding region:
# graph the coordinates of the genome
import matplotlib.pyplot as plot
plot.plot(coding)
plot.show()

Yikes! What happened?

Ok. Let's think about what we need a bit more:

1) We want to plot where the data is True
2) We want to not crash our computer or, since we're in class, have things take a super long time

Let's try a scatter plot instead of a line plot and slice into the data:
import numpy as np
#create data for the x-axis
x = np.array(range(0,10000,1))
#only plot the first 10000 genome positions
y = coding[:10000]
plot.scatter(x,y)
plot.show()

Nice...er. How might we make things less black?
plot.scatter(x,y,edgecolor='b')

Let's add the noncoding data now:
y = noncoding[:10000]
plot.scatter(x,y)

Getting closer! How do we make the markers for each data set look more different? And maybe zoom into just the True line?
plot.scatter(x,y,c='c',s=1,edgecolor='c')
plot.scatter(x,y,c='r',s=20,edgecolor='r')
plot.ylim((0.5,1.5))
Cool! Ok. Let's add the SNP data now. We can think of several ways of displaying the data. I have chosen to display it at the 0.95 y-value. For now, we'll focus on just the 6916 SNPs.
x = intersect1d(where(coding)[0],where(snps6916)[0])
y = len(x)*[0.95]
plot.scatter(x,y,c='k',marker='s')
plot.xlim((0,10000))
Let's add the noncoding data as well. Here, because we are just plotting the data basically as a marker, we can concatenate the coding and noncoding data instead of plotting it separately.
x1 = intersect1d(where(coding)[0],where(snps6916)[0])
x2 = intersect1d(where(noncoding)[0],where(snps6916)[0])
x = concatenate((x1,x2))
Yeah!!! Let's make it pretty for group meeting!
plot.scatter(x,y,c='c',s=1,edgecolor='c',label='Coding')
 
plot.scatter(x,y,c='r',s=20,edgecolor='r',label='Non-coding')
 
plot.scatter(x,y,c='k',marker='s',label='SNP')
 
plot.xlabel('Genome Position')
plot.title('Gene Map')
plot.yticks([])
plot.legend()

To go through some other graphing examples, we need a different type of data. For the project, we don't really have data in a format that is particularly amenable to different types of plotting. Therefore, I generated some other data so we have something to work with. Please download the file from here:


First, as always, we need to read in the data:
#read in data
results = []
fh = open('periodicData.txt', 'r')
for line in fh:
    line = line.strip().split()
    results.append([int(line[0]), float(line[1])])
 
#convert data into array
import numpy as np
a = np.array(results)
 
This data, as indicated by the file name, is periodic. A scatter plot will not be super helpful in exploring that property (plus we want to learn something different!). Instead, let's use a bar graph:

import matplotlib.pyplot as plot
plot.bar(a[:,0],a[:,1])
plot.show()
 
Once again, we have a LOT of data. Let's cut it down a bit so we can make it prettier:
plot.bar(a[75:125,0],a[75:125,1],width=1,color='r')

Now the periodicity (~every three bars) is a bit more apparent. Let's emphasize it even more by adding some vertical lines. To do this, we need to have more control over the layout of the plot itself using the subplot function:
p1 = plot.subplot(111)
p1.bar(a[75:125,0],a[75:125,1],width=1,color='r')
plot.show()
Visually, this looks exactly the same as before. But now we have a lot more control over the figure that is containing the plots. While we are here, let's play around a bit with the new subplot functionality. First, we can use it to add multiple graphs to the same figure:
p1 = plot.subplot(211)
p1.bar(a[75:125,0],a[75:125,1],width=1,color='r')
import pylab as pL
#add a uniform distribution to the data
c = a[:,1] + pL.uniform(0,1000, len(a[:,1]))
p2 = plot.subplot(212)
p2.scatter(a[:,1],c,c='g')
We can also do a plot-in-plot. However, for this example, we need to declare a figure so that python know we are adding multiple things to the same figure and not overwriting old subplots with new subplots:
fig = plot.figure()
p1 = fig.add_subplot(111)
p1.bar(a[75:125,0],a[75:125,1],width=1,color='r')
import pylab as pL
#add a uniform distribution to the data
c = a[:,1] + pL.uniform(0,1000, len(a[:,1]))
p2 = fig.add_subplot(222)
p2.scatter(a[:,1],c,c='g')
plot.show()

Fun! What if we want to make a circle??? For this, we need to get rid of subplot and declare a special axis:
fig = plot.figure()
p1 = fig.add_axes([0.1,0.1,0.8,0.8], polar=True)
import numpy as np
width = 2*np.pi/len(a[:,1])
p1.bar(a[:,0],a[:,1],width=width,color='r')
Alright, enough fooling around. Let's get back to the original goal--putting vertical lines on the periodic bar graph:
p1 = fig.add_subplot(111)
p1.bar(a[75:125,0],a[75:125,1],width=1,color='r')
p1.xaxis.grid()

And fix up the x-axis to make the grid lines more frequent, get rid of the y axis tick labels, and add axis labels:
p1.bar(a[75:125,0],a[75:125,1],width=1,color='r')
from matplotlib.ticker import MultipleLocator
p1.xaxis.set_minor_locator(MultipleLocator(3))
p1.xaxis.grid(which='minor')
p1.set_yticklabels([])
p1.set_xlabel('I am the Xaxis!')
p1.set_ylabel('I am the Yaxis!')
One more example. Let's use the same data to have fun with histograms. Remind me of the difference between a histogram and a bar graph?
import matplotlib.pyplot as plot
p1 = plot.subplot(111)
p1.hist(a[:,1])
plot.show()
And now, we'll increase the number of bins to better reflect the complexity of the data:
p1.hist(a[:,1],75)
For funsies, let's try making the x-axis into log base-2.
p1.set_xscale('log', basex=2)

Ack! What happened? How do we fix it?
a = np.log2(a)
Phew! Finally, after all this work, we might want to save the figure! But first, we have to tell matplotlib that it is a figure to be saved.
import matplotlib.pyplot as plot
fig = plot.figure()
p1 = fig.add_subplot(111)
p1.hist(a[:,1],75)
#p1.set_xscale('log', basex=2)
fig.savefig('histogram.pdf',dpi=500,format='pdf')
plot.show()

Questions?

Exercises:


1) Make the coordinate map we created at the beginning of this exercise more complete.
a) Add code to plot the SNPs for SRR006917.
b) Modify plotting such that there are two separate plots on the same figure. Have the top plot include the coding genes, non-coding genes, and the SNPs for SRR006916. Have the bottom plot include the coding genes, non-coding genes, and the SNPs for SRR006917. Change the axis labels and titles to match.
c) Save your new plot in eps format.

2) Getting to plots in the paper--An approximation of Figure 3. The goal of this exercise is to create something along the lines of Figure 3 from the source paper. The authors did not publish the subcategorizations of the coding regions. Also, we only have the data for two genomes instead of the 21 the authors analyzed. However, we can plot the same y-axis (ratio of the number of nonsynonomous mutations to synonomous mutations) for the coding regions of the two genomes we have analyzed. If you are feeling ambitious, try stubbing out your own method for generating this plot. If you are feeling overwhelmed, I have listed the broad strokes here:
a) Take the code you wrote in the exercises (or copied from the solutions) this morning. Rewrite the code so that it translates each of the genes annotated in the genome.
b) Compare the mutations between SRR006916 and SRR006917 and the reference genome. Calculate how many of the mutations are synonomous (base polymorphism does not alter the translated amino acid) and how many are nonsynonomous (base polymorphism that results in different translated amino acid). Track the counts based on whether they are in proteins with a given annotation. Sum up over both genomes.
c) Create a bar graph that plots the sums. Label the axises to match the graph in the paper. Go to matplotlib website and figure out how to add labels to the top of each bar on the graph (the numbers are the raw ratios).
d) Save the graph.


You can parse and use the calls of essential genes taken from the supplement to the paper.


Precipitates

1) Plotting in two subplots

from pylab import *
 
genomelen = 4411532
 
# setup one SNP file
snps6916fh = open('SRR006916.pileup', 'r')
snps6916 = zeros(genomelen, dtype='bool')
lsnps6916 = []
for line in snps6916fh:
    line = line.strip().split('\t')
    snps6916[int(line[1])-1] = True
    lsnps6916.append(int(line[1]) -1 )
print "SRR006916 contains", sum(snps6916), "SNPs"
 
# setup another SNP file
snps6917fh = open('SRR006917.pileup', 'r')
snps6917 = zeros(genomelen, dtype='bool')
lsnps6917 = []
for line in snps6917fh:
    line = line.strip().split('\t')
    snps6917[int(line[1])-1] = True
    lsnps6917.append(int(line[1])-1)
print "SRR006917 contains", sum(snps6917), "SNPs"
 
genomesumfile = 'mycobacterium_tuberculosis_h37rv_genome_summary.txt'
genomesumfh = open(genomesumfile, 'r')
genomesumfh.readline()
genes = [ [], [], [], [] ]
# [0] = locus name
# [1] = start
# [2] = stop
# [3] = strand
for line in genomesumfh:
    line = line.strip().split('\t')
    genes[0].append(line[0])
    genes[1].append(int(line[4]))
    genes[2].append(int(line[5]))
    genes[3].append(line[6])
 
coding = zeros(genomelen, dtype='bool')
# coding will be an array of all the coding nts set to = 1
for gene in zip(genes[1], genes[2]):
    coding[gene[0]:gene[1]] = True
 
# noncoding will be the remaining nts opol
noncoding = coding == False
 
 
from matplotlib import pyplot as mpl
 
plot1 = mpl.subplot(2,1,1)
 
x = range(10000)
y = coding[:10000]
plot1.scatter(x,y,c='c',s=1,edgecolor='c',label='Coding')
y = noncoding[:10000]
plot1.scatter(x,y,c='r',s=20,edgecolor='r',label='Non-coding')
y = 0.95 * snps6916[:10000]
plot1.scatter(x,y,c='k',marker='s',label='SNP')
 
plot1.set_xlabel('Genome Position')
plot1.set_title('Gene Map 6916')
plot1.set_yticks([])
plot1.set_ylim((.5,1.5))
plot1.legend()
 
plot2 = mpl.subplot(2,1,2)
x = range(10000)
y = coding[:10000]
plot2.scatter(x,y,c='c',s=1,edgecolor='c',label='Coding')
y = noncoding[:10000]
plot2.scatter(x,y,c='r',s=20,edgecolor='r',label='Non-coding')
y = 0.95 * snps6917[:10000]
plot2.scatter(x,y,c='k',marker='s',label='SNP')
 
plot2.set_xlabel('Genome Position')
plot2.set_title('Gene Map 6917')
plot2.set_yticks([])
plot2.set_ylim((.5,1.5))
plot2.legend()
 
mpl.savefig('Whoo.eps')

2) Extending from the morning

Heres an incomplete solution. At the moment, I'm getting weird results, so maybe you guys can spot what I'm doing wrong.

"Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it. "
-- Brian Kernighan, co-inventor of Unix

from __future__ import division
from pylab import *
from sequencetools import read_fasta, rev_comp
 
def parse_nuctable(filename):
    fh = open(filename)
 
    codon_table = {}
    for line in fh:
        line = line.strip()
        if line:  # skips blank lines
            data = line.split()
 
            aa = data[0]   # something like 'Ala/A'
            oneletter = aa.split('/')[1]
            codons = data[1:]
            for codon in codons:
                codon = codon.strip().replace(',', '')
                # turns 'GCU, ' into 'GCU'
                codon_table[codon] = oneletter
 
    return codon_table
 
def seq_to_codons(sequence):
    codons = []
    i = 0
    while i < len(sequence):
        codons.append(sequence[i: i+3])
        i += 3
    return codons
 
 
def translate_sequence(nt_sequence):
    aa_sequence = ""
    codon_table = parse_nuctable('nuctable.txt')
 
    nt_sequence = nt_sequence.replace('T', 'U')
    # nuctable uses the RNA sequence, not the DNA
    for codon in seq_to_codons(nt_sequence):
        aa_sequence += codon_table[codon]
 
    return aa_sequence
 
 
 
print "Reading Genome"
refgenomefh = '../data/mtbReferenceGenome.FASTA'
refgenome = read_fasta(refgenomefh)
 
refseq = refgenome.values()[0]
 
 
print "Reading Genome Features"
genomesumfile = 'mycobacterium_tuberculosis_h37rv_genome_summary.txt'
genomesumfh = open(genomesumfile, 'r')
 
genomesumfh.readline() # remove the header line
 
# Set up a list of lists, where each inner list is a gene start, i.e.:
#   genes = [[gene1_start, gene1_end], [gene2_start, gene2_end] ...]
# Also set up a dictionary keyed on a tuple of the starts and ends, which gives
# you the direction of that particular gene
### NEW ###
# Also, a dictionary where keys are the position tuple, and values are the locus
# names
directions = {}
###
locusnames = {}
###
genes = []
 
for line in genomesumfh:
    line = line.strip().split('\t')
 
    start = int(line[4]) - 1
    stop = int(line[5])
    strand = line[6]
 
    directions[(start, stop)] = strand
 
    ###
    locusnames[(start, stop)] = line[0]
    ###
 
    if [start, stop] not in genes:
        genes.append([start, stop])
 
print "Reading SNPs"
 
 
###
exclusions = {}
for line in open('exclusions.txt'):
    if line.strip():
        exclusions[line.strip().split()[0]] = True 
 
essentials = {}
for line in open('Essentials.txt'):
    line = line.strip().split()
    if line:
        essentials[line[0]] = line[1]
        exclusions[line[0]] = False
 
###
 
# Set up SNP dictionaries where the keys are the index of the SNP, and the
# values are the substitution made
snps6916fh = open('SRR006916.pileup', 'r')
snps6916 = {}
for line in snps6916fh:
    line = line.strip().split('\t')
    snps6916[int(line[1])-1] = (line[2], line[3])
 
 
 
# setup another SNP file
 
snps6917fh = open('SRR006917.pileup', 'r')
snps6917 = {}
for line in snps6917fh:
    line = line.strip().split('\t')
    snps6917[int(line[1])-1] = (line[2], line[3])
 
 
 
allSNPs = {}
allSNPs.update(snps6916)
#allSNPs.update(snps6917)
for allSNPs, name in zip([snps6916, snps6917], ['6916', '6917']):
    print '-'*32, name, '-'*32
    ###
    mutations = {'essential':{True: 0, False:0}, 
                 'nonessential': {True:0, False:0}}
 
    ###
 
    outfile = open('SNPS_nonsyn.txt', 'w')
    print "Finding SNPs in genes!"
    percent_dones = []
    wobbles = 0
    # Iterate over the keys of allSNPs, which are the indices where the SNPs fall
    for idx, snp in enumerate(allSNPs):
 
        # I get impatient easily, and like to know how much it's done. Don't worry
        # too much about what's going on here
        percent_done = int(100*idx/len(allSNPs))
        if percent_done % 10 == 0 and percent_done not in percent_dones:
            percent_dones.append(percent_done)
            print '%d%% done!' % (100*idx/len(allSNPs))
 
 
        if allSNPs[snp][1] not in ['A','T','C','G']:
            # If it's one of the IUPAC symbols for vague nts, skip it
            wobbles += 1
            continue
 
        # Check through all known genes to see if the SNP is in it
        for start, stop in genes:
            if start <= snp <= stop:
                # The new sequence is the old sequence with the SNP replaced
                newseq = refseq[start:snp] + allSNPs[snp][1] + refseq[snp+1:stop]
                oldseq = refseq[start:stop]
 
                if directions[(start, stop)] == "-":
                    newseq = rev_comp(newseq)
                    oldseq = rev_comp(oldseq)
 
                if len(newseq) % 3 != 0 or len(oldseq) %3 != 0:
                    # Somehow we got a gene that's not an integer multiple of 3
                    # nucleotides!
                    continue
 
                ###
                newaa = translate_sequence(newseq)
                oldaa = translate_sequence(oldseq)
                lname = locusnames[(start, stop)]
                if exclusions.has_key(lname) and (not exclusions[lname])\
                   and essentials.has_key(lname):
                    mutations[essentials[lname]][newaa == oldaa] += 1
                elif not exclusions.has_key(lname):
                    print "ARG!!", lname
                    continue
                else:
                    continue
                ###
                # only write out if it's a non-synonymous substitution
                if newaa != oldaa:
                    outfile.write('%s\t%s-->%s\t%d\t%s\t%s\n'%( locusnames[(start,
                                                                            stop)],
                                                        allSNPs[snp][0],
                                                       allSNPs[snp][1],
                                                      snp, oldseq, newseq))
 
 
    print wobbles, mutations
 
    for mtype in mutations:
        print mtype, "dn/ds=", mutations[mtype][False]/mutations[mtype][True]