Numerical Python (NumPy)


Topics

  • array data types
  • Conversion between standard Python data types and numpy data types
  • Introduction to numpy and pylab packages
  • Basic statistical analysis with numpy tools
  • Compare performance using vector math and for loops

Introduction

As python matured into the multifunctional language it is today, more and more featured were added. We've seen most of the original functionality of the language, which was originally intended for scripting of UNIX and web processes, and writing user interfaces. But as python became popular as a scripting language, it found its way into the arms of scientists, who generally speaking, care more about efficient data manipulation than web programming. The big piece that was missing in python was a numerical library, and after a few efforts, an integrated one was developed. Numerical Python, or NumPy provided a large collection of mathematical functions similar to those found in a language like MATLAB or R, and also similarly to those languages, provides datatypes that are manipulatable as vectors and matrices. We'll introduce these data types, then their associated functions.

Before we get started, let's have everyone download these two files:


NumPy Basics


Numerical Python is a powerful library of functions, methods, and data types we can used to analyze our data. Unforunately for those of us whose heads continue to spin in a crash-course of syntax, it also uses a different set of rules. I hope you'll understand why when you see the power and speed NumPy's data types afford us. Let's start off creating some empty arrays, which look sorta like lists, and are in fact vectors.

They differ in a few fundamental ways from lists:

1) Arrays cannot be of mixed types. They can be all integers, floats, strings, logical (or boolean) values, or other immutable values. But they cannot be some characters, some numbers, or any other olio of data types. They also cannot contain mutable types such as lists. So, we can have a list of lists, but not an array of lists. We can, however, have an array of arrays. Which brings us to:

2) Arrays can be multidimensional, but they are not sparse structures.

3) We can perform vector operations on them, which can be algebraic functions (like a dot product), or simple replacements of values in slice of the array.

Arrays


Here's one way: start with a list and change it:

import numpy as np
 
a = [0] * 40
a = np.array(a)

Or this can be shortened:

a = np.array([0] * 40)
 

But there's a better way to get a vector of zeros:

a = np.zeros(40)
 

Notice that the default type when declaring an array is float64:

type(a[0])
 

<type 'numpy.float64'>

And here's the simplest way to change that:

a = np.zeros(40, int)
type(a[0])
 

<type 'numpy.int32'>

And here's how to declare something that's not all zeros:

a = np.arange(40)
 

Notice the int type:

type(a[0])
 

How can we change that?

a = np.arange(40, dtype='float64')
 
Or, we could take the range of a float:
a = np.arange(40.0)

Some additional functionality of arange(): note that when it gets told
a float increment, it defaults to a float type

a = np.arange(40, 50, .25)
 


Okay, now for a vector with more than one dimension:

a = np.array( (np.zeros(100), np.zeros(100), np.zeros(100) ))
# note the middle-inner set of parentheses
a
array([[ 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,0., 0., 0., 0., 0., 0., 0., 0., 0.],[ 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,0., 0., 0., 0., 0., 0., 0., 0., 0.],[ 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,0., 0., 0., 0., 0., 0., 0., 0., 0.]])


a[1,:] = 1
a
array([[ 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,0., 0., 0., 0., 0., 0., 0., 0., 0.],[ 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,1., 1., 1., 1., 1., 1., 1., 1., 1.],[ 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,0., 0., 0., 0., 0., 0., 0., 0., 0.]])



a[:,0] = 7
a
 
array([[ 7., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,0., 0., 0., 0., 0., 0., 0., 0., 0.],[ 7., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,1., 1., 1., 1., 1., 1., 1., 1., 1.],[ 7., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,0., 0., 0., 0., 0., 0., 0., 0., 0.]])

Let us pause for a moment and think about how we would do this with a for loop in lists:

LoL = [ [], [], [] ]
for i in LoL[1]:
    LoL[1][i] = 1
 
for L in LoL:
    LoL[0] = 7
 
 
Maybe you can see the advantage of the array syntax. If not, we'll look at vector math in a moment.

a[2,0] = 8
a
array([[ 7., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,0., 0., 0., 0., 0., 0., 0., 0., 0.],[ 7., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,1., 1., 1., 1., 1., 1., 1., 1., 1.],[ 8., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,0., 0., 0., 0., 0., 0., 0., 0., 0.]])

We can also index individual arrays as if they were lists:

a = np.arange(10)
a[-1]
a[::-1]



9


array([9, 8, 7, 6, 5, 4, 3, 2, 1, 0])

Vector Math with Arrays


We can multiply many values at once with arrays, no for loop required.

# pairwise multiplication, consider the nested for loop alternative
a = np.arange(0,100,2)
b = np.arange(50)
 
a * b
array([ 0, 2, 8, 18, 32, 50, 72, 98, 128, 162, 200,
242, 288, 338, 392, 450, 512, 578, 648, 722, 800, 882,
968, 1058, 1152, 1250, 1352, 1458, 1568, 1682, 1800, 1922, 2048,
2178, 2312, 2450, 2592, 2738, 2888, 3042, 3200, 3362, 3528, 3698,
3872, 4050, 4232, 4418, 4608, 4802])


# dot product (area) of the two vectors
a = np.arange(0,100,2)
b = np.arange(50)
 
sum(a * b)  # alternatively, the function np.dot(a,b)
 
80850

Boolean (logical) Values with Arrays


# dot product (area) of the two vectors
a = np.zeros(10, dtype='bool')
 
array([False, False, False, False, False, False, False, False, False, False], dtype=bool)

# dot product (area) of the two vectors
a[2:5] = 1    # We can also assign True instead of 1
 
 
array([False, False, True, True, True, False, False, False, False, False], dtype=bool)

# comparison and assignment
b = a == False
b
array([ True, True, False, False, False, True, True, True, True, True], dtype=bool)

a == b
array([False, False, False, False, False, False, False, False, False, False], dtype=bool)

Basic Statistics with Numpy


We can do Fourier transforms and abstract algebra with NumPy if we want, but more commonly we use some basic statistics to get a feel for our data. So let's make sure we hit some of those functions:

summary stats

a = np.arange(99)
np.mean(a)
 
# standard deve
np.std(a)
 
# variance
np.var(a)
 


random distributions


a = np.random.uniform(0,100,10)
print a
 
a = np.random.uniform(0,100,1000000)
np.mean(a)
 
# exponential sample with mean = e
np.mean(np.random.exponential(np.exp(1),1000000))
 

correlation
np.corrcoef(a, b)



NumPy is Gosh-Awful Big


So, we're not going to root through it all. But in the course of nosing through our SNP data in a moment, we're going to heavily enlist a few of these tools:

where()

np.where(a == True)[0]
array([2, 3, 4])

where() returns a tuple, for reasons that we're not going to go into, but the value of the [0] of that tuple is a list of all the indices where the array evaluated to True. We can also just write: np.where(a) to the same effect.


When we use sum() on a boolean array, we see the total number of True values added up.

sum(a)

3

But the len() of the array still tells us how long it is no matter what the contents of the array positions.

len(a)

10





Looking at Our SNPs

from pylab import *
from sequencetools import read_fasta
 
# We'll need the refgenome for at least the genome length,
# and also if we want to look at any seqeunces
refgenomefh = '../data/mtbReferenceGenome.FASTA'
refgenome = read_fasta(refgenomefh)
 
# We'll want to make some arrays to keep track of where we have SNPs,
# and they'll need to be as long as the refgenome
genomelen = 4411532
 
# setup one SNP file
 
# for each SNP in the pileup file, we want to record a True at the
# genomic position the SNP occurs at
# note that we also need to shift our index (-1) because arrays
# start at 0, and our genome files start at 1
 
snps6916fh = open('SRR006916.pileup', 'r')
snps6916 = zeros(genomelen, dtype='bool')
for line in snps6916fh:
    line = line.strip().split('\t')
    snps6916[int(line[1])-1] = True
 
print "SRR006916 contains", sum(snps6916), "SNPs"
 
# setup another SNP file
 
snps6917fh = open('SRR006917.pileup', 'r')
snps6917 = zeros(genomelen, dtype='bool')
for line in snps6917fh:
    line = line.strip().split('\t')
    snps6917[int(line[1])-1] = True
 
print "SRR006917 contains", sum(snps6917), "SNPs"
 
 
# Here we'll open our genome summary file, and look at the relationship
# between the SNPs and the other genomic features
genomesumfile = 'mycobacterium_tuberculosis_h37rv_genome_summary.txt'
genomesumfh = open(genomesumfile, 'r')
 
genomesumfh.readline()
 
# We'll read in the following 4 lists worth of data;
# nota bene:  a
genes = [ [], [], [], [] ]
# [0] = locus name
# [1] = start
# [2] = stop
# [3] = strand
 
for line in genomesumfh:
    line = line.strip().split('\t')
    genes[0].append(line[0])
    genes[1].append(int(line[4]))
    genes[2].append(int(line[5]))
    genes[3].append(line[6])
 
# a boolean array for the genomic positions annotated as coding positions
coding = zeros(genomelen, dtype='bool')
 
# coding will be an array of all the coding nts set to = True
for gene in zip(genes[1], genes[2]):
    coding[gene[0]:gene[1]] = True
 
# By inference, noncoding will be the complementary nts
noncoding = coding == False
 
print sum(coding), "are coding nucleotides"
print sum(noncoding), "are non-coding nucleotides"
 
 
# Use set operations for Venn-diagramish things
noncoding6916 = intersect1d(where(noncoding)[0], where(snps6916)[0])
noncoding6917 = intersect1d(where(noncoding)[0], where(snps6917)[0])
print "SRR006916 has", len(intersect1d(where(noncoding)[0], where(snps6916)[0])), "SNPs in noncoding regions"
print "SRR006916 has", len(intersect1d(where(noncoding)[0], where(snps6917)[0])), "SNPs in noncoding regions"
 
# The coding SNPs from each strain, in one line.
coding6916 = intersect1d(where(coding)[0], where(snps6916)[0])
coding6917 = intersect1d(where(coding)[0], where(snps6917)[0])
print "SRR006916 has", len(coding6916), "SNPs in noncoding regions"
print "SRR006916 has", len(coding6917), "SNPs in noncoding regions"
 
#the conserved (common, at least) and unique SNPs;
# and the union thereof
commonSNPs = intersect1d(where(snps6916)[0], where(snps6917)[0])
uniqueSNPs6916 = setdiff1d(where(snps6916)[0], where(snps6917)[0])
uniqueSNPs6917 = setdiff1d(where(snps6917)[0], where(snps6916)[0])
allSNPs = union1d(where(snps6916)[0], where(snps6917)[0])
 
 
# How many commonSNPs are in coding vs. non-coding?
print sum(coding[commonSNPs]), "conserved SNPs in a coding region"
print sum(noncoding[commonSNPs]), "conserved SNPs in a non-coding region"
 
# what about percentages of SNP in each type of region?
print float(sum(noncoding[allSNPs])) / float(sum(noncoding)) * 100, "percent of all noncoding nts have a SNP"
print float(sum(coding[allSNPs])) / float(sum(coding)) * 100, "percent of all coding nts have a SNP"
 
# What if we wanted to find a gene containing the SNP?
 
print where(genes[1] < coding6916[0])[-1]
print where(genes[1] > coding6916[0])[0]
 
coord = (genes[1][15], genes[2][15])
print coord
 
seq = refgenome[refgenome.keys()[0]][coord[0]:coord[1]]
print seq
 


<span style="background-color: #eeeeee; border-bottom-color: #cccccc; border-bottom-style: solid; border-bottom-width: 1px; border-left-color: #cccccc; border-left-style: solid; border-left-width: 1px; border-right-color: #cccccc; border-right-style: solid; border-right-width: 1px; border-top-color: #cccccc; border-top-style: solid; border-top-width: 1px; font-size: 13px; overflow-x: auto; padding-bottom: 10px; padding-left: 10px; padding-right: 10px; padding-top: 10px;">from pylab import *
from sequencetools import read_fasta
 
# We'll need the refgenome for at least the genome length,
# and also if we want to look at any seqeunces
refgenomefh = '../data/mtbReferenceGenome.FASTA'
refgenome = read_fasta(refgenomefh)
 
# We'll want to make some arrays to keep track of where we have SNPs,
# and they'll need to be as long as the refgenome
genomelen = 4411532
 
# setup one SNP file
 
# for each SNP in the pileup file, we want to record a True at the
# genomic position the SNP occurs at
# note that we also need to shift our index (-1) because arrays
# start at 0, and our genome files start at 1
snps6916fh = open('SRR006916.pileup', 'r')
snps6916 = zeros(genomelen, dtype='bool')
 
for line in snps6916fh:
    line = line.strip().split('\t')
    snps6916[int(line[1]) -1 ] = True
 
print "SRR006916 contains", sum(snps6916), "SNPs"
 
# setup another SNP file
 
snps6917fh = open('SRR006917.pileup', 'r')
snps6917 = zeros(genomelen, dtype='bool')
for line in snps6917fh:
    line = line.strip().split('\t')
    snps6917[int(line[1])-1] = True
 
print "SRR006917 contains", sum(snps6917), "SNPs"
 
# Here we'll open our genome summary file, and look at the relationship
# between the SNPs and the other genomic features
genomesumfile = 'mycobacterium_tuberculosis_h37rv_genome_summary.txt'
genomesumfh = open(genomesumfile, 'r')
 
genomesumfh.readline()
 
# We'll read in the following 4 lists worth of data;
# nota bene:  a
genes = [ [], [], [], [] ]
# [0] = locus name
# [1] = start
# [2] = stop
# [3] = strand
 
for line in genomesumfh:
    line = line.strip().split('\t')
    genes[0].append(line[0])
    genes[1].append(int(line[4]) -1 )
    genes[2].append(int(line[5]))
    genes[3].append(line[6])
 
# a boolean array for the genomic positions annotated as coding positions
coding = zeros(genomelen, dtype='bool')
 
# coding will be an array of all the coding nts set to = True
for gene in zip(genes[1], genes[2]):
    coding[gene[0]:gene[1]] = True
 
 
noncoding = coding == False
 
print sum(coding), "are coding nucleotides"
print sum(noncoding), "are non-coding nucleotides"
 
 
# Use set operations for Venn-diagramish things
noncoding6916 = intersect1d(where(noncoding)[0], where(snps6916)[0])
noncoding6917 = intersect1d(where(noncoding)[0], where(snps6917)[0])
print "SRR006916 has", len(intersect1d(where(noncoding)[0], where(snps6916)[0])), "SNPs in noncoding regions"
print "SRR006917 has", len(intersect1d(where(noncoding)[0], where(snps6917)[0])), "SNPs in noncoding regions"
 
# The coding SNPs from each strain, in one line.
coding6916 = intersect1d(where(coding)[0], where(snps6916)[0])
coding6917 = intersect1d(where(coding)[0], where(snps6917)[0])
print "SRR006916 has", len(coding6916), "SNPs in coding regions"
print "SRR006917 has", len(coding6917), "SNPs in coding regions"
 
#the conserved (common, at least) and unique SNPs;
# and the union thereof
commonSNPs = intersect1d(where(snps6916)[0], where(snps6917)[0])
uniqueSNPs6916 = setdiff1d(where(snps6916)[0], where(snps6917)[0])
uniqueSNPs6917 = setdiff1d(where(snps6917)[0], where(snps6916)[0])
allSNPs = union1d(where(snps6916)[0], where(snps6917)[0])
 
print len(commonSNPs), "are conserved SNPs."
print len(uniqueSNPs6916), "are unique SNPs strain 6916."
print len(uniqueSNPs6917), "are unique SNPs strain 6917."
print len(allSNPs), "is the total number of SNPs observed."
 
# what about percentages of SNP in each type of region?
print float(sum(noncoding[allSNPs])) / float(sum(noncoding)) * 100, "percent of all noncoding nts have a SNP"
print float(sum(coding[allSNPs])) / float(sum(coding)) * 100, "percent of all coding nts have a SNP"
 
 
# How many commonSNPs are in coding vs. non-coding?
print sum(coding[commonSNPs]), "conserved SNPs in a coding region"
print sum(noncoding[commonSNPs]), "conserved SNPs in a non-coding region"
 
 
# list of starts = genes[1]
# list of all coding SNP positions = coding6916
 
idx = where(genes[1] < coding6916[2])[0][-1]
print where(genes[1] > coding6916[2])[0][0]
 
coord = (genes[1][idx], genes[2][idx])
 
seq = refgenome[refgenome.keys()[0]][coord[0]:coord[1]]
print seq
'''
for pair in zip(genes[1], genes[2]):
    if snp > pair[0] and snp < pair[1]:
 
 
 
snp = 1500
start, stop = 1000, 2000
 
if snp in range(1000,2000):
 
'''
 
 
 
 
 
</span>


Exercises:


1) What proportion of the genome is noncoding?

2) Writing Mathematical Functions
a) Write a function that accepts an array of floats. Return an array where every value of the input array is divided by 1.5.
b) Use the random functions to generate an array of floats. Write a function that accepts this array, and returns a list of values that are more than one standard deviation from the mean of the array.
c) Write a function that evaluate the function y = e^x for values from 0 to 10 and stores the values in an array. The function should also generate a random exponential sample of the same length. Sort the random sample, then return the sum of the distances at each point from the random sample to the exponential model.

3) Genome Lookup for Our SNPs
Write a program that finds the sequence that each SNP falls into. Make sure to account for the strand, and rev_comp() the sequence if need be. Use the pileup files to write out the sequence from the polymorphic strand, followed by the reference sequence, to a new output file.

4) Consider Conservation
Modify the program from #3 to only write the sequences to the output file when they're conserved in both strains.

5) Assess the impact of the coding polymorphisms
Using the program from exercises 3 & 4, and the sequence translator from last week, identify at least one SNP that appears to have made a coding change in the strain sequence. What is the amino acid substitution?


Solutions?


1)
from pylab import *
from sequencetools import read_fasta
 
refgenomefh = '../data/mtbReferenceGenome.FASTA'
refgenome = read_fasta(refgenomefh)
 
genome_name = refgenome.keys()[0]
genome_len = len(refgenome[genome_name])
 
is_coding = zeros(genome_len, dtype="bool")
# Set up a list of zeros
 
genomesumfile = 'mycobacterium_tuberculosis_h37rv_genome_summary.txt'
genomesumfh = open(genomesumfile, 'r')
 
genomesumfh.readline() # remove the header line
 
for line in genomesumfh:
    line = line.strip().split('\t')
 
    start = int(line[4]) - 1
    stop = int(line[5])
 
    is_coding[start:stop] = True
 
genomesumfh.close()
 
 
print "There are %d coding nts in the genome of %d nts, which is: " % \
    (sum(is_coding), len(is_coding))
print "%f%% coding" % (100.0 * sum(is_coding) / len(is_coding))

2) Functions

# a)
def two_thirds(an_array):
    return an_array/1.5
 
# b)
from numpy import mean, std
from pylab import normal
def reject_outliers(an_array):
    middle = mean(an_array)
    spread = std(an_array)
    return an_array[((middle - spread) > an_array) | (an_array  > (middle + spread))]
    # the | combines the two arrays
 
print reject_outliers(normal(0, 1, 1000))
 
# c)
 
from pylab import arange, exp, uniform
 
x = arange(0,10,.01)
y = exp(x)
 
z = uniform(0,10, len(y))
z = exp(z)
z.sort()  # note that z.sort() returns None
 
print sum(y - z) / len(y)
 

3) Genome lookup for coding SNPs

from pylab import *
from sequencetools import read_fasta, rev_comp
 
print "Reading Genome"
refgenomefh = '../data/mtbReferenceGenome.FASTA'
refgenome = read_fasta(refgenomefh)
 
refseq = refgenome.values()[0]
 
 
print "Reading Genome Features"
genomesumfile = 'mycobacterium_tuberculosis_h37rv_genome_summary.txt'
genomesumfh = open(genomesumfile, 'r')
 
genomesumfh.readline() # remove the header line
 
# Set up a list of lists, where each inner list is a gene start, i.e.:
#   genes = [[gene1_start, gene1_end], [gene2_start, gene2_end] ...]
# Also set up a dictionary keyed on a tuple of the starts and ends, which gives
# you the direction of that particular gene
 
directions = {}
genes = []
 
for line in genomesumfh:
    line = line.strip().split('\t')
 
    start = int(line[4]) - 1
    stop = int(line[5])
    strand = line[6]
 
    directions[(start, stop)] = strand
    if [start, stop] not in genes:
        genes.append([start, stop])
 
print "Reading SNPs"
 
 
# Set up SNP dictionaries where the keys are the index of the SNP, and the
# values are the substitution made
snps6916fh = open('SRR006916.pileup', 'r')
snps6916 = {}
for line in snps6916fh:
    line = line.strip().split('\t')
    snps6916[int(line[1])-1] = (line[2], line[3])
 
 
 
# setup another SNP file
 
snps6917fh = open('SRR006917.pileup', 'r')
snps6917 = {}
for line in snps6917fh:
    line = line.strip().split('\t')
    snps6917[int(line[1])-1] = (line[2], line[3])
 
 
 
allSNPs = {}
allSNPs.update(snps6916)
allSNPs.update(snps6917)
 
 
outfile = open('SNPs.txt', 'w')
print "Finding SNPs in genes!"
percent_dones = []
outfile.write("GeneStart \tGeneStop \tSNP_from-->SNP_to \tSNP_pos \tOldSeq \tNewSeq \n")
# Iterate over the keys of allSNPs, which are the indices where the SNPs fall
for idx, snp in enumerate(allSNPs):
 
    # I get impatient easily, and like to know how much it's done. Don't worry
    # too much about what's going on here
    percent_done = 100*idx/len(allSNPs)
    if percent_done % 10 == 0 and percent_done not in percent_dones:
        percent_dones.append(percent_done)
        print '%d%% done!' % (100*idx/len(allSNPs))
 
 
    if allSNPs[snp][1] not in ['A','T','C','G']:
        # If it's one of the IUPAC symbols for vague nts, skip it
        continue
 
    # Check through all known genes to see if the SNP is in it
    for start, stop in genes:
        if start <= snp <= stop:
            # The new sequence is the old sequence with the SNP replaced
            newseq = refseq[start:snp] + allSNPs[snp][1] + refseq[snp+1:stop]
            oldseq = refseq[start:stop]
 
            # Reverse complement the sequence if we need to
            if directions[(start, stop)] == "-":
                newseq = rev_comp(newseq)
                oldseq = rev_comp(oldseq)
 
            if len(newseq) % 3 != 0 or len(oldseq) %3 != 0:
                # Somehow we got a gene that's not an integer multiple of 3
                # nucleotides!
                continue
 
            outfile.write('%s-->%s\t%d\t%s\t%s\n'%( allSNPs[snp][0],
                                                   allSNPs[snp][1],
                                                  snp, oldseq, newseq))

4) Only doing SNPs common to both
Showing only the changes made:
#...
#...
allSNPs = {}
 
for snp in snps6916:
    if (snp in snps6917) and (snps6916[snp] == snps6917[snp]):
        allSNPs[snp] = snps6916[snp]
        # it doesn't actually matter which one we pick from,
        # since they are the same
 
#...

5) Only doing Non-synonymous substitutions

Most of this code is the same. I've put in comments noting where I've elided code for clarity

from pylab import *
from sequencetools import read_fasta, rev_comp
 
def parse_nuctable(filename):
    fh = open(filename)
 
    codon_table = {}
    for line in fh:
        line = line.strip()
        if line:  # skips blank lines
            data = line.split()
 
            aa = data[0]   # something like 'Ala/A'
            oneletter = aa.split('/')[1]
            codons = data[1:]
            for codon in codons:
                codon = codon.strip().replace(',', '')
                # turns 'GCU, ' into 'GCU'
                codon_table[codon] = oneletter
 
    return codon_table
 
def seq_to_codons(sequence):
    codons = []
    i = 0
    while i < len(sequence):
        codons.append(sequence[i: i+3])
        i += 3
    return codons
 
 
def translate_sequence(nt_sequence):
    aa_sequence = ""
    codon_table = parse_nuctable('nuctable.txt')
 
    nt_sequence = nt_sequence.replace('T', 'U')
    # nuctable uses the RNA sequence, not the DNA
    for codon in seq_to_codons(nt_sequence):
        aa_sequence += codon_table[codon]
 
    return aa_sequence
 
 
# ... Do setup and reading of files
 
 
# Iterate over the keys of allSNPs, which are the indices where the SNPs fall
for idx, snp in enumerate(allSNPs):
 
    # I get impatient easily, and like to know how much it's done. Don't worry
    # too much about what's going on here
    percent_done = 100*idx/len(allSNPs)
    if percent_done % 10 == 0 and percent_done not in percent_dones:
        percent_dones.append(percent_done)
        print '%d%% done!' % (100*idx/len(allSNPs))
 
 
    if allSNPs[snp][1] not in ['A','T','C','G']:
        # If it's one of the IUPAC symbols for vague nts, skip it
        continue
 
    # Check through all known genes to see if the SNP is in it
    for start, stop in genes:
        if start <= snp <= stop:
            # The new sequence is the old sequence with the SNP replaced
            newseq = refseq[start:snp] + allSNPs[snp][1] + refseq[snp+1:stop]
            oldseq = refseq[start:stop]
 
            if directions[(start, stop)] == "-":
                newseq = rev_comp(newseq)
                oldseq = rev_comp(oldseq)
 
            if len(newseq) % 3 != 0 or len(oldseq) %3 != 0:
                # Somehow we got a gene that's not an integer multiple of 3
                # nucleotides!
                continue
 
#############  BIG CHANGE HERE ######################
            # only write out if it's a non-synonymous substitution
            if translate_sequence(newseq) != translate_sequence(oldseq):
                outfile.write('%d\t%d\t%s-->%s\t%d\t%s\t%s\n'%
                               (start, stop,
                                allSNPs[snp][0], allSNPs[snp][1],
                                snp, oldseq, newseq))