Bowtie: Fast Read Mapping, lotsa times


Topics:

  • Using Python to map sequence with Bowtie
  • Parsing Bowtie output
  • SAM Tools

Introduction:

So, by now we've given up. Well, that is, we've admitted that we're powerless regarding our programming task, and that our code has become unmanageable, so we're going to ask for help from a higher power. That is, we're going to use a faster, smarter program, written in a faster language, to do the heavy (and complicated!) lifting.

The Bowtie program uses the Burroughs-Wheeler Algorithm (BWA) to *very* efficiently store large dictionaries of data in memory, and rapidly access the contents of those dictionaries. We're not going into the details of the algorithm, but I mention it in case you want to go wikipedia-ing on your own. The upshot is that in addition to storing the genome data much more efficiently, Bowtie can also find mismatches in our reads, and will write out a systematic format to be parsed later.

Install Bowtie

First, let's install the program that we're going to manipulate with Python (Windows users should have this installed already).

$ cd ~/PythonCourse/programFiles/
$ unzip bowtie-0.12.5-macos-10.5-x86_64.zip
$ mv bowtie-0.12.5 ~/bowtie-0.12.5

We now want to amend the line that we added the blast folder last week:

PATH=$PATH:~/ncbi-blast-2.2.23+/bin

to:

PATH=$PATH:~/ncbi-blast-2.2.23+/bin:~/bowtie-0.12.5

Using Bowtie


Just as with BLAST, when we had to format a new database to search against the yeast genome, we'll have to format a reference for Bowtie as well. We do this with the bowtie-build command. The first argument is the reference genome file, the second is the name we want to use for the genome file, which I've chosen to by mtbH37Rc, after the genera, species, and strain of the reference genome.

$ bowtie-build ~/PythonCourse/data/mtbReferenceGenome.FASTA mtbH37Rc

Now that we have a reference, to get a feel for how Bowtie works, let's look at the command line help:

$ bowtie --help

We'll omit the output here, as holy smokes, there's a lot of it.

In short, Bowtie has merciful default syntax, and lots of options thereafter. We can get started just using a single sequence on the command line (that's what the -c argument is for):

$ bowtie -c mtbH37Rc GGTTCGGCCAAGTCCA

0 - supercontnull.1 143439 TGGACTTGGCCGAACC IIIIIIIIIIIIIIII 0 2:T>A

  1. reads processed: 1
  2. reads with at least one reported alignment: 1 (100.00%)
  3. reads that failed to align: 0 (0.00%)
Reported 1 alignments to 1 output stream(s)

First off, let's lookup the format of our output:

Default Bowtie Output

The output here tells us that the sequence matched to position 143439 of the minus strand of the reference genome, and there is a change in the sequence at position 3 (index 2!), relative to the reference sequence. The change is supposed to be a T>A, and to investigate, I'll just show you the sequence at the genome in this location:

TGGACTTGGCCGATCC

Okay, so that works, but we don't want to just look at one read, but all of our reads, which we'll gunzip from our test file (okay, these aren't even close to all of our reads).

$ cp ../S7.1/SRR006916_sample.fastq.gz .
$ gunzip SRR006916_sample.fastq.gz .
$ bowtie mtbH37Rc SRR006916_sample.fastq --refout

The last argument tells Bowtie to write the output to a file, ref00000.map. The .map filename will iteratively increase if you supply more than one file of reads to map.

Scripting Bowtie


Let's start off by writing a script to run Bowtie once, with parameters, and then we'll move on to running it for each of the 21 genomes.

bowtie_once.py
import subprocess as sp
 
 
program = 'bowtie'
refgenomename = 'mtbH37Rc'
readfn = 'SRR006916_sample.fastq'
 
proc = sp.Popen( [program, refgenomename, readfn, '--refout'] )
 
 

At this point, let's two sample files, and use Python to run on both of them:



I'm going to make a new directory in our data directory called samples, and we'll copy them there:

$ mkdir ~/PythonCourse/data/samples/
[ copy the files to this directory, from wherever they were downloaded to]

and back to the session directory:

$ cd ~/PythonCourse/S7.2

bowtie_twice.py
import subprocess as sp
import os, sys
 
 
# setup our program variables
program = 'bowtie'
refgenomename = 'mtbH37Rc'
 
 
# the data directory, and read the filenames
datadir = '../data/samples/'
readfilenames = os.listdir(datadir)
 
 
# make a new directory for our output data
outdir = '../data/sample_maps/'
 
 
try:
    os.mkdir(outdir)
except Exception as error:
    print error
    print "WARNING:  OUTPUT DIRECTORY ALREADY EXISTS!  Data may be lost!"
 
 
 
 
# use a loop to run multiple instances of the bowtie mapper
for readfn in readfilenames:
    # make the outfile name from the readfile name, add the extension .map
    readfn = datadir + readfn
    outfn = outdir + readfn.split('/')[-1].split('.')[0] + '.map'
    print readfn
    print outfn
    proc = sp.Popen( [program, refgenomename, readfn, outfn] )
 
 
 

Okay, now we've seen how we can use python to loop through multiple executions of our Bowtie program. But, we have work left to do, as we want to change the format to then feed into another program call Samtools, which will help us identify our SNPs. The Bowtie flag for formatting Samtools output is simply -S, so let's alter our code to include this argument.

The actual command line we want to run for each file looks like this:

bowtie -S [read file] [output file]

We also need to stop for a second and install Samtools:

$ cd ~/PythonCourse/programFiles/
$ tar jxvf samtools-0.1.7_i386-darwin.tar.bz2
$ mv samtools-0.1.7_i386-darwin ~/samtools-0.1.7

and to our increasingly familiar ~/.bash_profile file, we should add this to the PATH line, so that it reads:

PATH=$PATH:~/ncbi-blast-2.2.23+/bin:~/bowtie-0.12.5/:~/samtools-0.1.7

and then we need re-source our .bash_profile:

$ source ~/.bash_profile

Now, if you execute:

$ samtools
Program: samtools (Tools for alignments in the SAM format)
Version: 0.1.7 (r510)


Usage: samtools <command> [options]


Command: view SAM<->BAM conversion
sort sort alignment file
pileup generate pileup output
faidx index/extract FASTA
tview text alignment viewer
index index alignment
fixmate fix mate information
glfview print GLFv3 file
flagstat simple stats
calmd recalculate MD/NM tags and '=' bases
merge merge sorted alignments
rmdup remove PCR duplicates






Okay, we're set to modify our script to produce Bowtie output that can then be processed by Samtools.

import subprocess as sp
import os, sys
 
 
# setup our program variables
program = 'bowtie'
refgenomename = 'mtbH37Rc'
 
 
# the data directory, and read the filenames
datadir = '../data/samples/'
readfilenames = os.listdir(datadir)
 
 
# make a new directory for our output data
outdir = '../data/sample_maps/'
 
 
try:
    os.mkdir(outdir)
except Exception as error:
    print error
    print "WARNING:  OUTPUT DIRECTORY ALREADY EXISTS!  Data may be lost!"
 
 
 
 
# use a loop to run multiple instances of the bowtie mapper
for readfn in readfilenames:
    # make the outfile name from the readfile name, add the extension .map
    readfn = datadir + readfn
    # change the name extension to .sam instead of .map
    outfn = outdir + readfn.split('/')[-1].split('.')[0] + '.sam'
    print readfn
    print outfn
    proc = sp.Popen( [program, '-S', refgenomename, readfn, outfn] )
 
Now a similar set of scripts to generate our Samtools output:

samtools_bam.py
import subprocess as sp
import os, sys
 
 
 
 
# now for samtools
# the format of the command line we're trying to mimick is:
# samtools view -bS -o filename.bam filename.sam
# where the .bam file is a new file in the compressed BAM format, and the .sam file
# is the file we just made above
 
 
program = 'samtools'
 
 
samdir = '../data/sample_maps/'
samfiles = os.listdir('../data/sample_maps/')
print samfiles
 
 
bamdir = '../data/bam/'
try:
    os.mkdir(bamdir)
except Exception as error:
    print error
    print "WARNING:  BAMDIR already exists!"
 
 
 
for samfile in samfiles:
    # set the sam and bam file names
    print samfile
    bamfile = bamdir + samfile.split('/')[-1].replace('.sam', '.bam')
    samfile = samdir + samfile
    print samfile
    print bamfile
 
 
    proc=sp.Popen( [program, 'view', '-bS', '-o', bamfile, samfile] )
 
 
 
 
samtools_sort.py

import subprocess as sp
import os, sys
 
 
# now for samtools
# the format of the command line we're trying to mimick is:
# samtools view -bS -o filename.bam filename.sam
# where the .bam file is a new file in the compressed BAM format, and the .sam file
# is the file we just made above
 
program = 'samtools'
 
bamdir = '../data/bam/'
bamfiles = os.listdir('../data/bam/')
print bamfiles
 
sorteddir = '../data/sorted/'
try:
    os.mkdir(sorteddir)
except Exception as error:
    print error
    print "WARNING:  SORTEDDIR already exists!"
 
 
for bamfile in bamfiles:
    # set the sam and bam file names
    print bamfile
    sortedfile = sorteddir + bamfile.split('/')[-1].replace('.bam', '.sorted')
    bamfile = bamdir + bamfile
    print bamfile
    print sortedfile
 
    proc=sp.Popen( [program, 'sort', bamfile, sortedfile] )
 





samtools_pileup.py
import subprocess as sp
import os, sys
 
program = 'samtools'
refgenome = '../data/mtbReferenceGenome.FASTA'
 
 
 
 
sorteddir = '../data/sorted/'
sortedfiles = os.listdir('../data/sorted/')
print sortedfiles
 
 
pilesdir = '../data/piles/'
try:
    os.mkdir(pilesdir)
except Exception as error:
    print error
    print "WARNING:  PILESDIR already exists!"
 
 
 
for sortedfile in sortedfiles:
    # set the sam and bam file names
    sortedfile = sorteddir + sortedfile
    pilefile = sortedfile.split('/')[-1].replace('.sorted.bam', '.pileup')
    pilefile = pilesdir + pilefile
    print sortedfile
    print pilefile
 
 
    pilefh = open(pilefile, 'w')
 
 
#   samtools pileup -cv -f genomes/NC_008253.fna ec_snp.sorted.bam
    proc=sp.Popen( [program, 'pileup', '-cv', '-f', refgenome, sortedfile], stdout=pilefh )
 
 


1. Track Aligned and Unaligned Reads
Use the Bowtie options (look at --help !) to write additional output files for the aligned and unaligned reads.



2. Pileup Files

Write a function that reads pileup files and stores the location of each SNP.


Solutions

1) Add aligned and unaligned:

# use a loop to run multiple instances of the bowtie mapper
for readfn in readfilenames:
 # make the outfile name from the readfile name, add the extension .map
 readfn = datadir + readfn
 # change the name extension to .sam instead of .map
 outfn = outdir + readfn.split('/')[-1].split('.')[0] + '.sam'
 print readfn
 print outfn
 proc = sp.Popen( [program, '-S', refgenomename, readfn, outfn, 
 '--al', readfn+'.al', '--un', readfn+'.un'] )