Fitting Data


This morning we collected a bunch of data about the distance of each residue from a particular location. In the paper, the authors state that there is little correlation between being close to the active site and the effect on catalysis. This afternoon, we are going to mine the kinetic data from the paper as well as the structure data from this morning to evaulate whether the statement is true.

The quantitative data in the paper is almost exclusively kinetics. So, just to remind you in case it's been a while...

external image 800px-Michaelis-Menten_saturation_curve_of_an_enzyme_reaction.svg.png
external image 485px-Mechanism_plus_rates.svg.png

eq1.JPG
eq2.JPG


In other words, KM is the binding affinity and kcat is the rate of the reaction.

First, we need data :O) If you were not able to finish the exercises from this morning, here is a copy of the distance data in the structures:


And here is a table with the kinetic catalase data:


PAUSE for thought....

We are planning to graph the distances against the kinetic data and see if it is correlated by fitting a straight line. What are the steps we need for this goal (stub it out ;O)?













STEPS:
1) Read in and parse files into a usable data structure
2) Map the distance data onto the kinetic data
3) Graph the data
4) Use math function to fit a straight line

Ok. Let's go through the list, starting with reading in the two data files:
distances = {}
f1 = open('structure_distances.txt', 'r')
distTitle = f1.readline().strip().split('\t')
distTitle.pop(0)
for next in f1:
    tmp = next.strip().split('\t')
    distances[tmp[0][:-2]] = tmp[1:]
 
kinetics = {}
f1 = open('catalase.txt', 'r')
kinTitle = f1.readline().strip().split('\t')
kinTitle.pop(0)
for next in f1:
    if 'n/d' in next:
        continue
    else:
        tmp = next.strip().split('\t')
    kinetics[tmp[0]] = [tmp[1:3], tmp[3:5], tmp[5:]]
 

Next, we need to map the kinetic data onto the distance data. EXCEPT, we have a problem. The structure data is stored by the three letter code and the kinetics data is by single letter code. Fortunately, your brilliant instructor has anticipated this! Here is a parameter file that can be used to translate between the two:


Let's read in this file as well and then map the distances data onto the kinetic data:
translate = {}
f1 = open('3to1.txt', 'r')
for next in f1:
    tmp = next.strip().split()
    translate[tmp[1].upper()] = tmp[0].upper()
 
for k in kinetics.keys():
    if k == 'WT':
        continue
    kinetics[k].append(distances[translate[k[0]]+'_'+k[1:-1]])
 

Alright! With these pieces, we can now construct a data structure that can be analyzed with a graph:
keys = kinetics.keys()
x = []
for d in range(len(kinetics[keys[0]][-1])):
    tmpX = []
    for k in keys:
        if k == 'WT':
            continue
        tmpX.append(float(kinetics[k][-1][d]))
    x.append(tmpX)
 
y = []
for d in range(len(kinetics[keys[0]])-1):
    tmpY = []
    for k in keys:
        if k == 'WT':
            continue
        tmpY.append(float(kinetics[k][d][0]))
    y.append(tmpY)
 
import numpy as np
x = np.array(x)
y = np.array(y)
#normalize values by wild type so it's relative kinetics
#instead of absolute
for nInd, n in enumerate(y):
    y[nInd] /= float(kinetics['WT'][0][0])

Ok. Now, let's take a look at the data. For simplicity, let's start by focusing on kcat:
import matplotlib.pyplot as plot
fig = plot.figure()
plotLoc = 221
for dInd, d in enumerate(x):
    a = fig.add_subplot(plotLoc)
    a.scatter(d, y[0])
    a.set_title(distTitle[dInd])
    plotLoc += 1
plot.show()
Kinda neat!

Moving on to fitting...but first, some math:

Linear Regression

Linear regression is used when you know the values of both X and Y and you want to find the best straight line through the data. It does this fit by finding a line that minimizes the sum of the squares of the vertical distances of the points from the line.

To measure the goodness of the fit, or correlation of the two variables, we use a value called r. It is a fraction between 0.0 and 1.0 with no units. An r value of 0.0 means that the best fit is a horizontal line going through the mean of all Y values; knowing X tells you nothing about Y. When r is 1.0, all points lie exactly on a straight line with no scatter; knowing X lets you predict Y perfectly.

Good news is this isn't a stats class; I'm just going to give you the equation and you don't really have to understand where it comes from :O) If you have a data set of (x,y) with n points in it, here are the formulas for finding the slope (m), intercept (b), and correlation (r):
eq3.JPG
Ok, let's translate the math into actual code. First, we can pre-compute certain subsets of the equation, which simplifies the final equation code. (NOTE: You might want to consider commenting out the graphing code for the time being).
xy = []
xSq = []
for d in x:
    xy.append(d*y[0])
    xSq.append(pow(d,2))
ySq = pow(y[0],2)
n = len(x[0])
Using these values plus the data structures generated above, let's compute the slope:
for xInd, X in enumerate(x):
    mNum = (n * sum(xy[xInd])) - (sum(X) * sum(y[0]))
    mDenom = (n * sum(xSq[xInd])) - pow(sum(x[xInd]),2)
    m = mNum / mDenom
 
Now add code to compute the y-intercept (b) and the correlation coefficient (r):
    b = (sum(y[0]) - (m*sum(x[xInd]))) / n
    rNum = (n * sum(xy[xInd])) - (sum(x[xInd]) * sum(y[0]))
    rDenom1 = (n * sum(xSq[xInd])) - pow(sum(x[xInd]),2)
    rDenom2 = (n * sum(ySq)) - pow(sum(y[0]),2)
    import math
    rDenom = math.sqrt(rDenom1 * rDenom2)
    r = rNum / rDenom
 

If we print out the m, b, and r, you should get the following:

0.00825839907758 0.0497564033074 0.668874303331
0.000456183924553 0.141761169567 0.0174648747341
0.00528375109936 -0.345428331302 0.497075546029
0.00668150105155 -0.413395254385 0.501210562009

Yeah :O)))) BUT, we have learned in previous lessons that complicated data analysis often has been pre-coded in a library. Linear regression is one of these functions. It is included in the stat module of scipy and is SUPER short and easy to use (and this time, we'll save the data so we can use it later):
statsLib = []
from scipy import stats
for X in x:
    m,b,r,p,err = stats.linregress(X,y[0])
    statsLib.append([m,b,r])
If we print m,b, and r again, you get the following:

0.00825839907758 0.0497564033074 0.668874303331
0.000456183924553 0.141761169567 0.0174648747341
0.00528375109936 -0.345428331302 0.497075546029
0.00668150105155 -0.413395254385 0.501210562009

It's the same!

First, we now can see that the catalytic rate for catalase activity is most directly correlated with distance from the heme group. Second, while we *can* translate math into programming, often it's worth it to google the function of interest to see if someone else did it first.

Finally, let's go back to the graphs and add the line fit as well as label with the r value:
fig = plot.figure()
plotLoc = 221
for dInd, d in enumerate(x):
    a = fig.add_subplot(plotLoc)
    a.scatter(d, y[0])
    fitX = np.array(range(min(a.get_xticks()),max(a.get_xticks())+10,10))
    fitY = statsLib[dInd][0]*fitX + statsLib[dInd][1]
    a.plot(fitX, fitY)
    a.annotate('R=%.2f' %statsLib[dInd][2], xy=(5,10),
                xycoords='axes points')
    a.set_title(distTitle[dInd])
    plotLoc += 1
plot.show()

Questions?

EXERCISES

1) Making the code more dynamic
a) For simplicity, we only looked at kcat above. Rewrite the code to above to analyze the data for Km and kcat/Km as well. (HINT: You don't have to display all three sets of plots at the same time but you should save them!)
b) Rewrite the file parser as a function. Repeat the analysis on the peroxidase kinetic activity:

2) More advanced fitting functions. While not necessary in this case, you will likely have data that requires more sophisticated fitting than a straight line. Look in the documentation for scipy and switch your code to run a polynomial regression.

SOLUTIONS


1) Making the code more dynamic

a) For simplicity, we only looked at kcat above. Rewrite the code to above to analyze the data for Km and kcat/Km as well.

#!/usr/bin/env python
 
# Parse the distances data from the file
distances = {}
f1 = open('structure_distances.txt', 'r') # verify correct pathname
distTitle = f1.readline().strip().split('\t')
distTitle.pop(0)
for next in f1:
    tmp = next.strip().split('\t')
    distances[tmp[0][:-2]] = tmp[1:]
 
# Parse the kinetics data from the file
kinetics = {}
f1 = open('catalase.txt', 'r') # verify correct pathname
kinTitle = f1.readline().strip().split('\t')
kinTitle.pop(0)
for thing in f1:
    if 'n/d' in thing:
        continue
    else:
        tmp = thing.strip().split('\t')
    kinetics[tmp[0]] = [tmp[1:3], tmp[3:5], tmp[5:]]
 
# Parse the 3-to-1 file
threeToOne = {}
f1 = open('3to1.txt', 'r') # verify correct pathname
for next in f1:
    tmp = next.strip().split()
    threeToOne[tmp[1].upper()] = tmp[0].upper()
 
# Append distances data to kinetics data dictionary
for k in kinetics.keys():
    if k == 'WT':
        continue
    kinetics[k].append(distances[threeToOne[k[0]]+'_'+k[1:-1]])
 
# Make list of keys for mutated residues
keys = kinetics.keys()
 
# Make a list of 4 lists containing the distances data
x = []
for d in range(4):
    tmpX = []
    for k in keys:
        if k == 'WT':
            continue
        tmpX.append(float(kinetics[k][-1][d]))
    x.append(tmpX)
 
# Make a list of 3 lists containing the kinetics data (Kcat, Km, & Kcat/Km, but not std. devs.)
y = []
for d in range(3):
    tmpY = []
    for k in keys:
        if k == 'WT':
            continue
        tmpY.append(float(kinetics[k][d][0]))
    y.append(tmpY)
 
import numpy as np
x = np.array(x)
y = np.array(y)
 
# Normalize values to wild type (to deal with relative kinetics instead of absolute)
for nInd, n in enumerate(y):
    y[nInd] /= float(kinetics['WT'][nInd][0])
 
# Fit linear regression to distance data vs kinetics data
statsLib = []
from scipy import stats
for Yind, Y in enumerate(y):
    tmpLib = []
    for X in x:
        m,b,r,p,err = stats.linregress(X,y[Yind])
        tmpLib.append([m,b,r])
    statsLib.append(tmpLib)
 
import matplotlib.pyplot as plot
for Yind, Y in enumerate(y):
    plotLoc = 221 # initiate a 2x2 graph
    # keep track of the kinetics data type and units as we loop
    # for documentation on mathematical symbols check out the mathtext documentation
    # (http://doc.astro-wise.org/matplotlib.mathtext.html#SymbolElement)
    if Yind == 0:
        k_name, k_datatype, k_units = 'Kcat', r'K$_{cat}$', r"$s^{-1}$"
    if Yind == 1:
        k_name, k_datatype, k_units = 'Km', r'K$_m$', 'M'
    if Yind == 2:
        k_name, k_datatype, k_units = 'KcatOverKm', r'K$_{cat}$/K$_m$', r'M$^{-1}\bullet$$s^{-1}$'
    fig = plot.figure() # initiate a new figure each time we loop through the kinetics data types
    for dInd, d in enumerate(x):
        a = fig.add_subplot(plotLoc) # add a subplot each time we loop through the distance data
        a.scatter(d, y[Yind], marker='x')
        fitX = np.array(range(min(a.get_xticks()),max(a.get_xticks())+10,10))
        fitY = statsLib[Yind][dInd][0]*fitX + statsLib[Yind][dInd][1]
        a.plot(fitX, fitY)
        a.annotate('R=%.2f' %statsLib[Yind][dInd][2], xy=(5,10), xycoords='axes points')
        a.set_yticklabels([])
        a.set_xlabel(r'%s ($\AA$)' % (distTitle[dInd]))
        a.set_ylabel('%s (%s)' % (k_datatype, k_units))
        plotLoc += 1
    fig.suptitle('%s' % (k_datatype))
    # save 4 graphs together in one pdf file for each kinetics data type
    fig.savefig('%s.pdf' % (k_name), dpi=500,format='pdf')

Kcat.jpg


Km.jpg


KcatOverKm.jpg

b) Rewrite the file parser as a function. Repeat the analysis on the peroxidase kinetic activity:

#!/usr/bin/env python
 
# REMOVED THE FOLLOWING TEXT FROM CODE IN PART A:
## # Parse the kinetics data from the file
## kinetics = {}
## f1 = open('catalase.txt', 'r') # verify correct pathname
## kinTitle = f1.readline().strip().split('\t')
## kinTitle.pop(0)
## for thing in f1:
##     if 'n/d' in thing:
##         continue
##     else:
##         tmp = thing.strip().split('\t')
##     kinetics[tmp[0]] = [tmp[1:3], tmp[3:5], tmp[5:]]
 
# REPLACED THE ABOVE TEXT WITH THE FOLLOWING CODE:
def kineticsdataparser(filename):
    kinetics = {}
    f1 = open(filename, 'r')
    kinTitle = f1.readline().strip().split('\t')
    kinTitle.pop(0)
    for thing in f1:
        if 'n/d' in thing:
            continue
        else:
            tmp = thing.strip().split('\t')
        kinetics[tmp[0]] = [tmp[1:3], tmp[3:5], tmp[5:]]
    return kinetics
 
filename = 'peroxidase.txt' # verify correct pathname
kinetics = kineticsdataparser(filename)

KcatPer.jpg

KmPer.jpg

KcatOverKmPer.jpg


2) More advanced fitting functions.

#!/usr/bin/env python
 
# To be added...  Pardon our dust!