Functions, modules, and pickles

Topics:

  • Function definitions
  • Function documentation
  • Functions within functions
  • Modules
  • Making your own libraries
  • Pickles and quick data storage

Introduction


This afternoon we'll concentrate on our last fundamental programming concept for the course. To date, we've been writing all of our program logic in the main body of our scripts, and we've seen how built-in Python functions like len() and range() are used to operate on variables and their values. In this session, we'll learn how to write functions of our own, how to document them properly for ourselves and other users, and how to collect them into modules to make our own local repositories, or libraries.

If you properly leverage a well-designed function, writing the main logic of your programs becomes almost-too-easy. Instead of writing out meticulous logical statements and loops for every task, you just call forth your previously-crafted logic, which you've vested in well-made functions.

Functions

Functions are the basic means to manage complexity in your programs, allowing you to avoid nesting and repeating large chunks of code that could otherwise make your tasks unmanageable. They allow you to bundle code with a known input and a known output into single lines, and you should use them frequently from now on. We will start with the syntax:
#!/usr/bin/env python
 
# define the function
def hello(name):
    greeting = "Hello %s!" % (name)
    return greeting
 
# use the function
functionInput = 'Zaphod Beeblebrox'
functionOutput = hello(functionInput)
print functionOutput
 

To define a function, you use the keyword def. Then comes the function name, in this case hello, with parentheses containing any input arguments the function might need. In this case, it needs a name to form a proper greeting, so we're giving it a variable argument called name. After that, the function does its thing, executing the indented block of code immediately below. In this case, it creates a greeting "Hello <name>!". The last thing that it does is return that greeting to the rest of the program.

Note that the variable names are different on the inside and the outside of the function: I give it functionInput, although it takes name, and it returns greeting, although that return value is fed into functionOutput. I did this on purpose, as I want to emphasize that the function only knows to expect something, which it then internally refers to as name, and then to give something else back. In fact, there is some insulation against the outside world, as you can see in this example:
#!/usr/bin/env python
 
def hello(name):
    greeting = "Hello %s!" % (name)
    testVariable = 3
    return greeting
 
testVariable = 4
grt = hello("Zaphod Beeblebrox")
print testVariable

While 3 was assigned to a variable called testVariable inside the function, nothing happened to that variable outside the function. Variables created inside a function occupy their own namespace in memory distinct from variables outside of the function, and so sharing names between the two can be done without you having to keep track of it. This way, you can use functions written by other people without having to know what variables those functions are using internally. Just like a sleazy town in Nevada, what happens in the function stays in the function (an important exception lies with lists and dictionaries, which you will examine in the exercises).

Let's have another example with a more pressing subject:
#!/usr/bin/env python
 
def whichFood(balance):
    if balance < 10:
        return 'ramen'
    elif balance < 100:
        return 'good ramen'
    elif balance < 200:
        return 'better ramen'
    else:
        return 'ramen that is truly profound in its goodness'
 
print whichFood(14)
 

Here we've made a slightly more complicated function: it has some control statements in there, and there is more than one way for it to return. We also never explicitly create an input variable, like functionInput in the first example, and we don't create an output variable either. However, it works just like the block of code that we saw earlier. Finally, functions don't necessarily need to take anything as input, and certainly not just one thing, and they don't need to return anything back to the program (or just one thing). They can even have other functions nested inside them! For a few more examples of the syntax:
# functions can do their thing without taking input or returning output
def countToTen():
    for i in range(10):
        print i
 
# functions can take multiple items in and return multiple items out
def doLaundry(amtDetergent, dirtyLoads):
    cleanLoads = []
    for load in dirtyLoads:
        amtDetergent -= 1
        cleanLoads.append(load)
    return (amtDetergent,cleanLoads)
 
amtTide = 5
dirtyLaundry = ['socks','shirts','pants']
(amtTide, cleanLoads) = doLaundry(amtTide, dirtyLaundry)
print amtTide
print cleanLoads
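
The text above mentions that functions can be nested inside other functions. Here is a minimal sketch of what that might look like. The names formatReport and formatLine are made up for illustration, and print is written with parentheses so the sketch runs under both Python 2 and Python 3.

```python
# A sketch of a function defined inside another function.
# formatReport and formatLine are hypothetical names.
def formatReport(loads):
    # formatLine only exists while formatReport is running;
    # the rest of the program cannot call it directly
    def formatLine(number, load):
        return "%d. %s" % (number, load)

    lines = []
    for i in range(len(loads)):
        lines.append(formatLine(i + 1, loads[i]))
    return lines

report = formatReport(['socks', 'shirts', 'pants'])
for line in report:
    print(line)
```

A nested function like this is handy when a small helper is only ever needed by one function, since it keeps the helper out of everyone else's way.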
 

We should go into a little more detail on returning values. Above, in doLaundry, I returned a tuple of the two variables enclosed in parentheses. You could also return a list, which works much the same way. The real subtlety is in how you store the returned value:
def returnStuff():
    a = 3
    b = 4
    return [a,b]
 
(x,y) = returnStuff()
print x,y
both = returnStuff()
print both
# works both ways!
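
The introduction promised a word on documenting functions. One common way to do this in Python is a docstring: a string placed on the first line of the function body. The sketch below reuses the hello function from earlier; the wording of the docstring is my own.

```python
def hello(name):
    """Return a greeting string for the person called name."""
    greeting = "Hello %s!" % name
    return greeting

# the docstring is stored on the function itself
# (print with parentheses works in Python 2 and 3 for a single value)
print(hello.__doc__)

# in the interactive interpreter, help(hello) displays the same text
```

Unlike a plain # comment, a docstring travels with the function, so anyone who imports it can read the documentation without opening your source file.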

So, how do we tend to use them? We tend to use functions to break difficult tasks into a number of easier tasks, and then these easier tasks into ones easier still, and so on. In a well-organized program, even the largest 'raw' code blocks (those with few function calls) are only tens of lines long, and many functions are only a handful of lines. This allows us to program in large, structural sweeps, rather than getting lost in the details. This makes programs both easier to write and easier to read:
def publishAPaper(authors,topic,journal):
    data = doWork(topic)
    analysis = analyze(data)
    paper = writePaper(data,analysis)
    submit(authors,paper,journal)
 

And, a big part of that ease comes with the use of:

Modules


In all of the examples above, we defined our functions right above the code that we hoped to execute. If you have many functions, you can see how this would get messy in a hurry. Furthermore, part of the benefit of functions is that you can call them multiple times within a program to execute the same operations without tiresomely writing them all out again. But wouldn't it be nice to share functions across programs, too? For example, working with genomic data means lots of time getting sequence out of FASTA files, and shuttling that sequence from program to program. Many of these programs overlap to a significant degree: they need to parse FASTA files and calculate evolutionary rates, and most need to interface with our lab servers. All of this means that many of them share functions. And if the same function exists in two or more different programs, we hit the same problems that we hit before: complex debugging, decreased readability, and, of course, too much typing.

Modules solve these problems. In short, they're collections of functions and variables (and often objects, which we'll get to towards the end of the class) that are kept together in a single file that can be read and imported by any number of programs.

Using a module: the basics


To illustrate the basics, we'll go through the use of two modules, sys and math, one of which we effectively use all the time. In fact, it's a very, very rare program indeed that doesn't use the sys module. sys contains a lot of really esoteric functions, but it also contains a simple, everyday thing: what you typed on the command line. To illustrate, if you run:

$ ./testprogram.py argument1 argument2 argument3

then the sys module contains a list holding './testprogram.py', 'argument1', 'argument2', and 'argument3'. This list is called argv.

#!/usr/bin/env python
import sys            # gaining access to the module
 
# you can access variables stored in the module by using a dot
# to get at the variable 'argv' which is stored in 'sys', type:
commandLine = sys.argv
 
print commandLine

We can also use functions stored inside other modules. To demonstrate this, I'll use the module math.
#!/usr/bin/env python
import sys
import math
 
# sys.argv contains only strings, even if you type integers.
# And, remember, the first element is the command itself-- usually
# not very useful.
 
x = float(sys.argv[1])
logX = math.log(x)
 
print logX
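
The script above assumes that you always type exactly one sensible number on the command line. A slightly more defensive sketch might look like the following. The helper name logOfArgument is made up for illustration, and the try/except statement it uses has not been covered yet, so treat it as a preview rather than the expected way to write this.

```python
import sys
import math

# logOfArgument is a hypothetical helper name used for illustration
def logOfArgument(argv):
    # argv[0] is the program name, so we need at least two items
    if len(argv) < 2:
        return None
    try:
        x = float(argv[1])    # command-line arguments arrive as strings
        return math.log(x)    # fails for zero and negative numbers
    except ValueError:
        # raised by float() for non-numbers and by math.log() for x <= 0
        return None

result = logOfArgument(sys.argv)
if result is None:
    print("usage: give one positive number on the command line")
else:
    print(result)
```

Passing the argument list into the function, rather than reading sys.argv inside it, makes the function easy to try out with a made-up list such as ['prog', '10'].
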
Great! Not so hard. It turns out that they're easy to write, too:

Making a module


greeting_module.py
def hello(name):
    greeting = "Hello %s!" % name
    return greeting
 
def ahoy_hoy(name):
    greeting = "Ahoy-hoy %s!" % name
    return greeting
 
test.py
#!/usr/bin/python
import greeting_module
 
hi = greeting_module.hello('class')
print hi
 
And that's it! See-- no more messy function declarations at the beginning of your script. And if you need another program to say hi to you, then all you need to do is import the greeting module.

Using modules: slightly more than just 'import'


Although creating a basic module is easy, sometimes you want more than just the basics. And although using a module in the most basic manner is easy, it's best to get a more thorough picture of how modules behave.

First, what if you only want one function from a given module? Let's say, as an Alexander Graham Bell loyalist, you really only dealt in ahoys rather than hellos. We have a modified syntax for retrieving only the ahoy function from the module, so that the newfangled hello function preferred by the T.A. Edison posse never clutters your script's namespace. (Python still reads the whole module file behind the scenes; the from statement only controls which names your script can see.)

test.py
#!/usr/bin/python
from greeting_module import ahoy_hoy
 
hi = ahoy_hoy('everybody')
# if grabbed with a 'from' statement, you don't need to use the <module>.<function> syntax
print hi
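
One detail worth knowing: a from-import does not load less of the module, it only gives your script fewer names. Here is a small sketch of how you might check this yourself. It uses the standard math module rather than greeting_module, so that it runs without any extra files.

```python
import sys
from math import log   # gives this script only the name 'log'

# the whole math module was still loaded; Python keeps every
# loaded module in the dictionary sys.modules
mathWasLoaded = 'math' in sys.modules
print(mathWasLoaded)

# the imported function works without the math. prefix
print(log(1))
```

So the choice between 'import module' and 'from module import name' is about how you want to refer to things in your script, not about saving memory.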
 

We see that we can now write ahoy_hoy('everybody') directly, instead of having to write greeting_module.ahoy_hoy('everybody'). And if we wanted to access both functions this way, we could import them both in one statement:

test.py
#!/usr/bin/python
from greeting_module import ahoy_hoy, hello

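Or, what if we wanted to do this with every function in the module? Rather than writing out function names to import individually (there could be a lot of them), we can use the asterisk wildcard (*) to refer to all of them. Be aware that this makes it harder to see where a name came from, and an imported name will silently replace anything in your script that already had the same name.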

test.py
#!/usr/bin/python
from greeting_module import *
# equivalent to: from greeting_module import hello, ahoy_hoy
 
hi = ahoy_hoy('everybody')
hi2 = hello('everybody')
 

Where to Store Your Modules: using PYTHONPATH

Over time, you'll end up accumulating lots of these modules, and they'll tend to fall together in meaningful collections. For example, I have a module for all my functions related to reading and parsing files, which I call files_tools.py. I have another for common sequence-related tasks, called sequence_tools.py. Python keeps its own modules installed in a system directory that you may or may not have write access to on a remote server. Therefore, it's easier to create your own Python modules directory and let your operating system environment know about it. On macOS, I accomplish this by placing my modules in /Users/matt/Library/Python/2.6/ and then adding these lines to the .bash_profile file in my home directory:

PYTHONPATH=/Users/matt/Library/Python/2.6/
export PYTHONPATH

And with that, any .py file that ends up in this directory can be imported as a module by any of your Python programs. And though this is a good final resting place for your polished modules, you can also prototype them by simply saving them in your current working directory, and moving them over when you're happy with them.
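
If you are curious how Python uses PYTHONPATH, the directories it searches are visible as the list sys.path, and PYTHONPATH entries are added to that list when Python starts. The sketch below imitates the effect from inside a script by adding a throwaway directory to sys.path by hand. The module name demo_greetings is made up, and the temporary directory merely stands in for your own modules directory.

```python
import os
import sys
import tempfile

# make a throwaway directory to stand in for your modules directory
moduleDir = tempfile.mkdtemp()

# write a tiny module file into it (demo_greetings is a hypothetical name)
modulePath = os.path.join(moduleDir, 'demo_greetings.py')
fh = open(modulePath, 'w')
fh.write("def hello(name):\n    return 'Hello %s!' % name\n")
fh.close()

# sys.path is the list of directories Python searches on import
sys.path.append(moduleDir)

# now the import succeeds, even though the file is not in the
# current working directory
import demo_greetings
print(demo_greetings.hello('class'))
```

Setting PYTHONPATH in your .bash_profile is the tidier long-term solution, since it applies to every script without any extra lines of code.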

So, with this under our belts, why don't we start using an example module? This one here is handy:

Pickling

There are many modules that come with a default installation of Python, and one of the more useful ones is pickle. It lets you store data from a Python script in a file very easily, and when you want that data again, you can unpickle the very same stuff! This is handy if you have large amounts of processed data that you need to dump onto your disk temporarily, to free up memory while you look at other data. In effect, pickling saves you the time of writing functions to write and read temporary data. It's nice.

Like many built-in pieces of Python, pickle has a lot to it, but here we'll cover just the basic functionality, which accounts for most of its use anyway.

programOne.py
#!/usr/bin/env python
import pickle
 
brands = ['vlasic','heinz','claussen','kruger']
brandFileHandle = open('thePickleFile','wb')   # 'b': pickles are binary data
 
pickle.dump(brands,brandFileHandle)
brandFileHandle.close()
 
programTwo.py
#!/usr/bin/env python
import pickle
 
pickleFileHandle = open('thePickleFile','rb')
revivedBrands = pickle.load(pickleFileHandle)
pickleFileHandle.close()
print revivedBrands

And there you have it! Pickles! Delicious! You can also store more complicated data structures:
#!/usr/bin/env python
import pickle
 
brands = {}
brands['west'] = ['kruger',"klein's"]
brands['midwest'] = ['claussen','vlasic','gedney']
brands['east'] = ['mt. olive','b&g']
brands['south'] = ['best maid','goldin']
 
brandFileHandle = open('thePickleFile','wb')
pickle.dump(brands,brandFileHandle)
brandFileHandle.close()
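
To convince yourself that what comes out of a pickle is the same as what went in, you might try a round trip in a single script. This is a minimal sketch: the temporary directory is my own addition so the example does not overwrite a real file, and the dictionary is a shortened version of the one above.

```python
import os
import pickle
import tempfile

brands = {'west': ['kruger', "klein's"], 'east': ['mt. olive', 'b&g']}

# a temporary location, so the sketch does not clobber real data
picklePath = os.path.join(tempfile.mkdtemp(), 'thePickleFile')

# 'wb' and 'rb' open the file in binary mode, which pickle expects
fh = open(picklePath, 'wb')
pickle.dump(brands, fh)
fh.close()

fh = open(picklePath, 'rb')
revivedBrands = pickle.load(fh)
fh.close()

# the revived dictionary has the same contents as the original,
# but it is a separate copy
print(revivedBrands == brands)
```

Note that the revived data is a new object: changing revivedBrands afterwards does not touch brands.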

Exercises


1. Practice with functions

Write a function for each of the following:

a) Takes an integer x as input, prints x * 2.

b) Takes integers x and y as input, prints x * y

c) Takes a list xs as input, prints xs[0] * xs[1]

d) Modify the above programs so that the function returns the result instead of printing it. This result is then printed by the program that called the function.

2. What happens in functions doesn't always stay in functions.

As promised, most things that happen in functions stay in the functions, but there are important exceptions. Make the following functions, which should illustrate this property:

a) The function takes an integer as input, and it increments that integer by one using the '+=' operator. Print the value of the integer before and after the function is called.

b) The function takes a list as input, and it changes the first element of the list to the string 'x'. Print the value of the list before and after the function is called.

c) The function takes a dictionary as input, and it adds the key 'x' with value 'y' to this dictionary. Print the dictionary before and after the function is called.

3. Reverse Complement

a) Write a function that takes a DNA sequence as an argument, ensures that it is all in capital letters, and then returns the reverse complement of the sequence.

b) Modify the function to ensure that only the characters A, T, G, C and N (for unknown nucleotide) are in the input sequence.

4. Making a module.

Create a directory in your PythonCourse directory called pylib, then add it to your PYTHONPATH. Create a module in this directory called exercises.py. Put your functions from Exercise 1 into this module. Now write two programs that import and call all of the functions in the module both of these ways:

a) A program that uses the 'import exercises' line.

b) A program that uses the 'from exercises import *' line

c) Add your reverse complement function from Exercise 3 to this module.

5. Make a FASTA parser

Starting with your script from this morning, make a function that takes a FASTA file as input, reads through the file using open(), distinguishes between ID-containing lines and sequence-containing lines, and returns a dictionary with gene IDs as keys and sequences as values. Put this function in your exercises.py module.

Copy and paste the following lines into a file called testFasta.fa. Create a program that imports the exercises.py module and prints the sequence corresponding to the gene ID 'gene3'.
>gene1
ATGAGACGTAGTGCCAGTAGCGCGATGTAGCG
ATGACGCATGACGCGCGACGCGCGAGTGAGCC
ATACGCACGCATTGGCA
>gene2
ATGTTCGACGCATACGACGCGCAGTACCAGCA
ATGACGCACCGGGATACACGACGCGGATTTTT
ACGCACCGAGATAGCATAAAAGACCATTAG
>gene3
TTATGGCACCCACTAGAGCCAGATTATTTTAAA
AGATGGGGG

#!/usr/bin/env python
from exercises import fastaParser
 
geneDict = fastaParser('testFasta.fa')
print geneDict['gene3']
 

6. Pickle Practice

Modify your program from (5) such that instead of printing the data, it pickles it. Now write another program that unpickles that pickle file and prints the sequence of gene3.

7. (bonus) Create an ORF finder

For our purposes, we will define an open reading frame (ORF) as a start codon followed at some distance by a stop codon in the same frame. This program should take a pickled FASTA dictionary as in (6) as input and output a pickled dictionary of gene name->ORF sequence key-value pairs. If the sequence does not contain an ORF, then the gene name should not be in the dictionary.

Solutions


1. Practice with functions

#!/usr/bin/env python
 
# a) Takes an integer x as input, prints x*2 (x multiplied by 2)
 
def timestwo(x):
    print '%.0f multiplied by 2 is %.0f' % (x, x * 2)
 
print
num = float(raw_input('Input number to multiply by 2: '))
x = timestwo(num)
print
 
# b) Takes integers x and y as input, prints x * y
 
# Below I'll generate the list using command arguments, since
# we learned that today, but you could write them into the
# script instead
import sys
commandLine = sys.argv
print 'You entered the numbers', commandLine[1:], 'into the commandline.'
 
def product(x,y):
    print "The product of the first two numbers is %.0f." % (x*y)
 
numToMultiply1 = float(sys.argv[1])
numToMultiply2 = float(sys.argv[2])
multiplied = product(numToMultiply1, numToMultiply2)
print
 
# c) Takes a list xs as input, prints xs[0] * xs[1]
 
listOfNumbers = [2,3,3,4]
 
def product(xs):
    result = xs[0] * xs[1]
    print 'You supplied the list: %s' % (xs)
    print 'The product of the first two numbers in the list is %.0f.' % (result)
 
multipliedNumbers = product(listOfNumbers)
print multipliedNumbers # prints None, because product() has no return statement
print
 
# d) Modify the above programs so that the function returns
# the result instead of printing it. This result is then
# printed by the program that called the function.
 
listOfNumbers = [2,3,3,4]
 
def product(xs):
    result = xs[0] * xs[1]
    print 'You supplied the list: %s' % (xs)
    return result
 
multipliedNumbers = product(listOfNumbers)
print 'The product of the first two numbers in the list is %.0f, but this time we returned the result from the function.' % (multipliedNumbers)
print

2. What happens in functions doesn't always stay in functions.

#!/usr/bin/env python
 
# a) The function takes an integer as input, and it increments that integer by one using the '+=' operator. Print the value of the integer before and after the function is called.
 
def increment(numberToIncrement):
    numberToIncrement += 1
 
numberToIncrement = 5
print 'The number to increment was', numberToIncrement
increment(numberToIncrement)
print 'The number is still', numberToIncrement
print
 
# b) The function takes a list as input, and it changes the first element of the list to the string 'x'. Print the value of the list before and after the function is called.
 
def modifyList(x):
    x[0] = 'x'
    return x
 
stringlist = ['1', '33', '5', 'dog'] # could have used list of integers, or any type of list
print 'The list was', stringlist
modifyList(stringlist)
print 'Now the list is', stringlist
print
 
# c) The function takes a dictionary as input, and it adds the key 'x' with value 'y' to this dictionary. Print the dictionary before and after the function is called.
 
def appendToDict(Dict_with_a_new_name):
    Dict_with_a_new_name['x'] = 'y'
 
Dict = {}
Dict['0'] = 'zero'
Dict['1'] = 'one'
Dict['2'] = 'two'
print 'Before:', Dict
 
appendToDict(Dict)
print 'After:', Dict
print

3. Reverse Complement

#!/usr/bin/python
 
def revComp(seq):
    seq=seq.upper()            # Makes seq uppercase
    seq=seq[::-1]              # Reverses seq
    seq=seq.replace('A','t')   # Replace ACGT with lowercase complement
    seq=seq.replace('C','g')
    seq=seq.replace('G','c')
    seq=seq.replace('T','a')
    seq=seq.upper()            # Make seq uppercase again
 
    isitempty=seq
    isitempty=isitempty.replace('A',"")
    isitempty=isitempty.replace('C',"")
    isitempty=isitempty.replace('G',"")
    isitempty=isitempty.replace('T',"")
    isitempty=isitempty.replace('N',"")
    if isitempty != "":
        print "Careful, improper characters!"
 
    return seq
 
#####################################
#  Iterative method
#####################################
 
def revCompIterative(watson):
    complements = {'A':'T', 'T':'A', 'C':'G', 'G':'C', 'N':'N'}
    watson = watson.upper()
    watsonrev = watson[::-1]
    crick = ""
    for nt in watsonrev:
        crick += complements[nt]
 
    return crick
 
 
 
print revComp("aTNrg")
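
Both versions above only warn about (or crash on) improper characters after the fact. One possible alternative for part (b) is to check the input first and refuse to continue if it is bad. This sketch uses raise, which we have not covered yet, and the name revCompChecked is made up for illustration.

```python
# revCompChecked is a hypothetical name; this is one possible approach
def revCompChecked(seq):
    complements = {'A': 'T', 'T': 'A', 'C': 'G', 'G': 'C', 'N': 'N'}
    seq = seq.upper()

    # check every character before doing any work
    for nt in seq:
        if nt not in complements:
            raise ValueError("improper character in sequence: %s" % nt)

    # walk the reversed sequence, complementing as we go
    crick = ""
    for nt in seq[::-1]:
        crick += complements[nt]
    return crick

print(revCompChecked("aTNcg"))
```

Stopping with an error is stricter than printing a warning, but it guarantees that a bad sequence never quietly makes its way into later steps of your analysis.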
 
 
 
 

4. Making a module.

#!/usr/bin/env python
 
# Make a directory in /Users/[username]/PythonCourse/pylib
# Open a new terminal window and type the following, substituting your username:
#      echo "PYTHONPATH=/Users/[username]/PythonCourse/pylib" >> ~/.bash_profile
#      echo "export PYTHONPATH" >> ~/.bash_profile
#      source ~/.bash_profile
# Create a file called exercises.py in the pylib folder, copy in your timestwo() function
# To verify it worked, try part a
 
#Part a
import exercises
exercises.timestwo(4) # or whatever your function was called
 
 
#Part b --note, this should be run separately from part a
from exercises import *
timestwo(6)
 
#Part c
#Copy the reverse complement function from problem 3 to PythonCourse/pylib/exercises.py
 
 

5. Make a FASTA parser

Below is the module called exercises.py where we have stored our functions.

#!/usr/bin/env python
 
def fastaParser(filename):
    current_gene = ""
    genes = {}
    fh = open(filename, 'r')
 
    for line in fh:
        line = line.strip()
        if line.startswith('>'):
            # ID line: start a new entry, dropping the '>'
            current_gene = line[1:]
            genes[current_gene] = ''
        elif line:
            # sequence line: append it to the current gene
            genes[current_gene] += line
 
    fh.close()
    return genes
 
 
def timestwo(x):
    """Takes 1 integer x as input, prints x*2"""
    print '%.0f multiplied by 2 is %.0f' % (x, x*2)
 
def product1(x,y):
    """Takes 2 integers x and y as input, prints x * y"""
    print "The product of the first two numbers is %.0f." % (x*y)
 
def product2(xs):
    """Takes a list as input, prints xs[0] * xs[1]"""
    result = xs[0] * xs[1]
    print 'You supplied the list: %s' % (xs)
    print 'The product of the first two numbers in the list is %.0f.' % (result)
 
def product3(xs):
    """Same as product2() except this function returns the
    result instead of printing it. This result can then
    be printed by the program that called the function."""
    result = xs[0] * xs[1]
    print 'You supplied the list: %s' % (xs)
    return result

Below is the script called Exercise5.py that will import functions from the module exercises.py.
#!/usr/bin/env python
 
# Calling a module function using the 'import exercises' line.
 
import exercises
exercises.timestwo(12)
 
# Calling a module function imported by name.
 
from exercises import product1
product1(2,3)
 
# Parsing the FASTA file and printing the sequence of gene3.
 
from exercises import fastaParser
 
geneDict = fastaParser('testFasta.fa')
print geneDict['gene3']

6. Pickle Practice

parsedFastaDataPickler.py
#!/usr/bin/env python
 
from exercises import fastaParser
import pickle
 
x = fastaParser('testFasta.fa')
parsedFastaDataHandle = open('parsedFastaData','wb')
pickle.dump(x, parsedFastaDataHandle)
parsedFastaDataHandle.close()

parsedFastaDataUnpickler.py
#!/usr/bin/env python
 
import pickle
 
pickleFileHandle = open('parsedFastaData','rb')
parsedFastaData = pickle.load(pickleFileHandle)
pickleFileHandle.close()
print parsedFastaData['gene3']

7. (bonus) Create an ORF finder

#!/usr/bin/env python
 
import sys
import pickle
 
def find_orfs(sequence):
        """ Finds all valid open reading frames in the string 'sequence', and
            returns them as a list"""
 
        starts = find_all(sequence, 'ATG')
        stop_amber = find_all(sequence, 'TAG')
        stop_ochre = find_all(sequence, 'TAA')
        stop_umber = find_all(sequence, 'TGA')
        stops = stop_amber + stop_ochre + stop_umber
        stops.sort()
 
        orfs = []
 
        for start in starts:
                for stop in stops:
                        if start < stop \
                           and (start - stop) % 3 == 0:  # Stop is in-frame
                                orfs.append(sequence[start:stop+3])
                                # the +3 includes the stop codon
                                break
                                # break out of the inner for loop
                                # when we hit the first stop codon
        return orfs
 
 
def find_all(sequence, subsequence):
        ''' Returns a list of indexes within sequence that are the start of subsequence'''
        start = 0
        idxs = []
        next_idx = sequence.find(subsequence, start)
 
        while next_idx != -1:
                idxs.append(next_idx)
                start = next_idx + 1     # Move past this on the next time around
                next_idx = sequence.find(subsequence, start)
 
 
        return idxs
 
 
fname = sys.argv[1]   # Read in from the first command-line argument
 
fh = open(fname, 'rb')   # open for reading; 'w' would erase the pickle
 
genedict = pickle.load(fh)
 
fh.close()
 
orfdict = {}
 
for gene in genedict:
    gene_seq = genedict[gene]
    orfs = find_orfs(gene_seq)
    if len(orfs) > 0:
        orfdict[gene] = orfs
 
print orfdict
 
fh = open('orfs_out', 'wb')
pickle.dump(orfdict, fh)
fh.close()