First Morning, Introductions, and Lesson Plans



Your instructors are: Terry Lang and Matt Davis.

Your teaching assistants are: Nate Krefman, Andrew Glazer, Phil Cleves, Aaron Hardin, Peter Combs, and Roseanne Wincek.

Topics:

  • Expectations for the course
  • Navigating the UNIX shell
  • Viewing the content of files
  • How to get help
  • Text editors

Introduction


Welcome to the QB3 Introduction to Programming for Bioinformatics bootcamp!

Overview for today:
1) What you should expect to learn from this course.
2) What we are not going to cover.
3) How the class will be organized, including the instructors and TAs.
4) Learn a little about how to function in a UNIX environment.

What are you going to learn?

You are going to learn Python :O)

Python is a simple and powerful programming language that is used for many applications from simple tasks to large software development projects. It has become popular as both a first language for beginning students and an everyday one for advanced programmers. Python is used by a range of companies including YouTube, Google, and Industrial Light and Magic.

Our goal is to allow you to apply programming to the problems that you face in the lab. Although we will only directly cover a couple application areas of programming to biology, we expect you to leave this course with a sufficiently generalized knowledge of programming that you will be able to apply your skills to whatever you happen to be working on.

What are you not going to learn?


Issue 1: Learning to program is a lot like learning a new language. It requires adjusting the way you think about solving a problem and communicating that solution. Even more confusing, each programmer develops his/her own style (accent). In practice, this means that there is almost never a "RIGHT ANSWER," but rather that there are almost infinite ways to solve almost any problem. In the course of the class, you will be exposed to several styles and see several ways to get to more complicated solutions. The goal, though, is to give you the tools to begin to develop a style of your own.


Issue 2: Python is big. Thousands of computer scientists and programmers have used it, contributed to it, and extended it to their own sub-fields. And they keep improving, and thus changing, it! We couldn't get to all of it even if we had much longer than two weeks. The subjects we will not cover include object-oriented programming, writing parallel programs, integrating your code with faster code written in C or C++, and a host of other powerful-but-difficult methods and topics. We will teach you enough that you should be able to go learn about them if and when you want.

How are you going to learn it?


The course is broadly divided into two parts.

Week 1: Learning Python Basics

In the first week of the course we will learn the very basics of programming practice and the fundamentals of Python syntax, including:

- how to get information from files
- how to store information
- how to do interesting things with the information
- how to print information back out
- how to do really complicated things with the information

Week 2: Python Applied to Bioinformatics

In the second week, we will use real data from two published studies on high-throughput screening and biochemical analysis of drug resistance. The second week will show us:

- how to write a program (versus a script)
- how to perform complicated scientific data analysis
- how to visualize our data

Finally, the last lecture of the class is an open study hall where you can bring your problems to us, and we will help you develop a computational strategy to address them.

Our daily schedule will generally proceed like this:

Start at 8:30 am
1-2 hour lecture
2-3 hours lab for exercises

Lunch from 12:30 - 1

1-2 hour lecture
2-3 hours lab for exercises
Leave at 5 pm

You will have a number of exercises each day covering the breadth of the lectures. You will not be graded on these but it's REALLY important for you to be able to demonstrate what you've just learned. Just like learning French, one learns programming by doing. If you only finish half the problems, you've only really learned half the material.

Coding has a steep learning curve!

Learning to program is really HARD! REALLY! Don't worry if you get frustrated. Try to remember that Python is a logic-based language and that you can reason your way around most problems we will be posing in the next two weeks. Ask questions! The idea is to get you to the point of being able to solve real problems in lab and to give you some tools to learn more on your own.

You have two incredibly useful resources at your disposal during the labs: first, you have us, the TAs and instructors, who are all familiar with the language and here to help you out. Second, you have the extensive documentation about Python and programming that we will be introducing you to in the course of the class.

You can also access some resources here:
Learning Python
Python Pocket Reference
Python Website (documentation)
Linux Pocket Guide

Questions?

Using the UNIX shell


You will spend nearly all of your time in one of two places: the shell or the text editor. The shell allows you to move and copy files, run programs, and more, while the text editor is where you will write your programs. We will focus mostly on the shell this morning, although we will touch on the basic usage of two popular text editors, emacs and nedit. We will begin Python this afternoon.



Informative Interlude: Some notes on the formatting of the lessons for this course


Periodically in these lessons, we may stop with an informative interlude outlined with a horizontal line above and below (like the one two lines up!). In this case, we're taking a quick break to discuss this and other aspects of the formatting.

For this and all further examples, a $ represents your shell prompt, and boldface indicates the commands to type at the prompt. Italics will be used for output you should see when you take the described action.

Finally, when we use actual python code examples, they will be contained in the shaded boxes, such as:

This is where code will appear.

This concludes our first informative interlude.


Let's start by opening a new terminal window...

How do I move around?


pwd
[where am I?]
(Print Working Directory) Prints the directory in which you are at the current moment. If you create any files, they will appear in this spot. When you first open the terminal shell, you will be in your "home" directory.

$ pwd
/home/terry

cd
[move to a new directory]
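(Change Directory) Given a path, this command moves your "current location" to the specified directory. The path can be complete (starting from /) or relative to the directory you are in now, as in the example below.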

$ cd PythonCourse
/terry@George ~/PythonCourse

$ pwd
/home/terry/PythonCourse

To go up, use the command cd ..

$ cd ..
$ pwd
/home/terry

An aside on directories...
Directories in UNIX are set up the same way as your regular computer. Just as you would open up a window into your directories and click to open up folders, here you use cd to go through the directories. You are just typing the command instead of clicking!
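
To make the difference between relative and complete paths concrete, here is a minimal sketch you can paste into a terminal. It works inside a throwaway directory created with mktemp (a command not covered in this lesson), and the directory names PythonCourse and lessons are only placeholders.

```shell
# Work in a throwaway directory so nothing in your home directory is touched.
scratch=$(mktemp -d)
mkdir -p "$scratch/PythonCourse/lessons"

# Relative path: interpreted starting from the current directory.
cd "$scratch"
cd PythonCourse/lessons
here=$(pwd)
echo "$here"

# ".." means "the directory above this one".
cd ..
up_one=$(pwd)
echo "$up_one"

# Complete (absolute) path: starts with "/" and works from anywhere.
cd "$scratch/PythonCourse/lessons"
back=$(pwd)
echo "$back"
```

The first and last pwd print the same location: the relative and the complete path are two routes to the same directory.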

ls directory_path
[lists contents of a directory]
(LiSt) Shows the files and directories.

$ cd teach/programming
$ ls
2010 b-lactamase.pdf katG.pdf tb_notes.doc
$ ls 2010
chimera

ls has many options. Here are some of the more useful ones to know:

ls -l
[lists the long form of the directory entries: security permissions, owners of files, sizes, date last modified]

ls -lt
[shows long listing, and sorts by modification time]

ls -lr
[reverses the list]

ls ..
[list contents of the directory above]

$ ls -ltr
total 1444
-rwx------+ 1 terry None 39936 Dec 10 2007 nosehair.doc
-rwx------+ 1 terry None 32768 Dec 10 2007 bacterial_terrarium.doc
-rwx------+ 1 terry None 38400 Jan 1 2008 mucus.doc
-rwx------+ 1 terry None 39424 Jan 1 2008 tears.doc
-rwx------+ 1 terry None 68608 Jan 23 2008 Germs and Your Body aj edits.doc
-rwx------+ 1 terry None 29696 Jan 23 2008 CIC Intake Form rev 9-07.doc
-rwx------+ 1 terry None 54272 Mar 18 2008 script.doc
-rwx------+ 1 terry None 27648 Mar 18 2008 packing_list.doc
drwx------+ 2 terry None 0 Dec 12 2008 pictures
-rwx------+ 1 terry None 57856 Feb 19 2009 Germs and Your Health.doc
-rwx------+ 1 terry None 25088 May 18 2009 budget.xls
-rwx------+ 1 terry None 99328 Oct 21 2009 Germs and Your Body_Final.doc
-rwx------+ 1 terry None 20480 Dec 15 2009 contact_info.doc
drwx------+ 2 terry None 0 Apr 14 22:09 KIPP
-rwx------+ 1 terry None 921053 Jun 21 18:03 image.jpg
drwx------+ 3 terry None 0 Jun 21 18:03 programming
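
If you want to see the sorting flags at work on files whose ages you know, the following sketch may help. It assumes a scratch directory made with mktemp and uses touch -t (not covered in this lesson) to set modification times by hand; the file names are borrowed from the listing above purely for illustration.

```shell
scratch=$(mktemp -d)
cd "$scratch"

# Create three files and give them known modification times
# (touch -t takes a timestamp in the form YYYYMMDDhhmm).
touch -t 200712100900 nosehair.doc
touch -t 200801010900 mucus.doc
touch -t 200903180900 script.doc

# -t sorts by modification time, newest first.
newest_first=$(ls -t)
echo "$newest_first"

# Adding -r reverses the order, so the newest file ends up at the bottom.
oldest_first=$(ls -tr)
echo "$oldest_first"

# -l adds the long-form columns (permissions, owner, size, date).
ls -ltr
```

Putting the newest file last, as ls -ltr does, is handy in a crowded directory because the most recent entries stay visible just above your prompt.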

Making your mark...


mkdir directory_name
[Create a given directory]
(MaKe DIRectory) Exactly what it says - lets you create new directories.

$ echo 'Hello World' > tb_notes.txt

$ ls
2010 b-lactamase.pdf katG.pdf tb_notes.doc
$ mkdir linux
$ ls
2010 b-lactamase.pdf katG.pdf linux tb_notes.doc

cp original_name copy_name
[copy file or directory]
(CoPy) Create a copy of the original file

$ echo 'Hello world' >> tb_notes.doc
$ ls
2010 b-lactamase.pdf katG.pdf linux tb_notes.doc
$ cp tb_notes.doc charlyCat.doc
$ ls
2010 b-lactamase.pdf charlyCat.doc katG.pdf linux tb_notes.doc

mv source destination
[move files or directories]
(MoVe) Rename a file or directory.

$ mv charlyCat.doc tb_notes.doc
$ ls
2010 b-lactamase.pdf katG.pdf linux tb_notes.doc
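
The three commands above can be tried together in a scratch directory. This is only a sketch: the mktemp directory and the names linux_backup and renamed.doc are made up for the example.

```shell
scratch=$(mktemp -d)
cd "$scratch"

echo 'Hello world' > tb_notes.doc
mkdir linux

# cp leaves the original in place and creates a second file.
cp tb_notes.doc charlyCat.doc

# Copying a whole directory needs the -r (recursive) flag.
cp -r linux linux_backup

# mv with a new name renames; the old name disappears.
mv charlyCat.doc renamed.doc

# mv with an existing directory as the destination moves the file into it.
mv renamed.doc linux

ls
ls linux
```

Note that mv (like cp) silently replaces a destination file that already exists, which is what happened to tb_notes.doc in the example above. Adding the -i flag makes either command ask before overwriting.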

Peeking inside files

less file_name
[view contents of a file]

less shows the contents of a file, and allows you to scroll and search the contents. However, less can only be used for simple text files, so you cannot view contents of, say, MS Word documents with less.

$ less linux_text.txt
Why hello there!
How are you this morning?
Look what I just found :O)
The Pythonidae, commonly known simply as pythons, from the Greek word
python, are a family of non-venomous snakes found in Africa, Asia
and Australia. Among its members are some of the largest snakes in the
world. Eight genera and 26 species are currently recognized.[2]
Contents
1 Geographic range
2 Conservation
3 Behavior
4 Feeding
5 Reproduction
6 Captivity
7 Genera
8 Taxonomy
9 Gallery
10 See also

Some useful navigational tips for less:

- You can use the arrow keys to move up or down a line in the text.
- The spacebar will advance an entire page.
- You can search for a word by typing a slash (e.g. /) followed by the search word.
- To quit, type q.
- To see the full help screen, type h.

Optional Informative Interlude: UNIX names tend to be overly clever.

As you've seen with the basic commands thus far, the names are generally descriptive abbreviations of the program's function. For example, mkdir is for making a directory, ls is for listing the contents of a directory, etc. However, programmers, especially UNIX programmers, tend to get increasingly clever as things progress. Unaware of the fact that this practice makes things opaque, the typical programmer cries out for attention by making program names self-referentially clever. less is a good example of this. In the olden days, the most basic ways to view a text file could not divide files into individual pages, thus a multipage document would scroll off the screen before the first page could be read. As a solution, a program called more was written, which paused at the bottom of each page and prompted the user to press the spacebar for "more." The program name here is reasonably descriptive, but more had some noticeable feature deficiencies: you could neither advance the text one line at a time nor navigate backward in the document without reloading the whole file. The program written to accommodate these features is less. The cleverness of the name is revealed by the paradoxical adage "less is more." Your teachers and TAs may use the more command interchangeably with less throughout the class.


head filename
[print first 10 lines of the file]

By default, head prints the top 10 lines of the input file. To print a different number of lines, say 12:
$ head -12 filename
$ head linux_text.txt
Why hello there!

How are you this morning?
Look what I just found :O)
The Pythonidae, commonly known simply as pythons, from the Greek word
python, are a family of non-venomous snakes found in Africa, Asia
and Australia. Among its members are some of the largest snakes in the

tail filename
[print the last ten lines of the file]

$ tail linux_text.txt
Most species in this family are available in the exotic pet trade. However,
caution must be exercised with the larger species as they can be dangerous;
cases of large specimens killing their owners have been documented.[8]
Taxonomy
Pythons are more closely related to boas than to any other snake-family.
Boulenger (1890) considered this group to be a subfamily (Pythoninae) of
the family Boidae (boas).[1]
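
To check exactly which lines head and tail return, it can help to run them on a file whose contents are predictable. The sketch below builds a 20-line practice file (numbered.txt, a made-up name) in a scratch directory and uses the -n flag, which is an equivalent way of writing the head -12 form shown above.

```shell
scratch=$(mktemp -d)
cd "$scratch"

# Build a 20-line practice file: "line 1", "line 2", ...
i=1
while [ "$i" -le 20 ]; do
    echo "line $i"
    i=$((i + 1))
done > numbered.txt

# Default: the first 10 lines.
head numbered.txt

# -n N asks for N lines instead of 10.
first_three=$(head -n 3 numbered.txt)
echo "$first_three"

# tail accepts -n in the same way.
last_two=$(tail -n 2 numbered.txt)
echo "$last_two"
```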

cat file1 file2 ...
[print named files to the screen]
(conCATenate) If given just one file, cat will print the contents of the file to the screen. Given multiple files, it will print one after another.

$ cat cat1.txt
HEY EVERYONE!!!
$ cat cat2.txt
WISH I WAS OUTSIDE PLAYING :O(
$ cat cat1.txt cat2.txt
HEY EVERYONE!!!
WISH I WAS OUTSIDE PLAYING :O(

grep 'search_string' file
(Global Regular Expression Print)
Searches for the "search string" in a text file and prints out all lines where it finds the desired text.

$ grep python linux_text.txt
The Pythonidae, commonly known simply as pythons, from the Greek word
python, are a family of non-venomous snakes found in Africa, Asia
In the United States an introduced population of Burmese pythons, Python
as the Indian python, Python molurus.
down adult deer, and the African rock python, Python sebae, has been known to
python, P. reticulatus, do not crush their prey to death; in fact, prey is not

The -c argument counts the number of matching lines.

$ grep -c python linux_text.txt
6

The -v argument inverts the search (i.e. prints lines that *don't* contain your search string).
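
Here is a small sketch of these flags on a four-line practice file, so you can verify the counts by eye. The file snakes.txt and its contents are made up for the example, and the -i flag (ignore upper/lower case) is an extra that the text above does not cover.

```shell
scratch=$(mktemp -d)
cd "$scratch"

# A four-line practice file (made up for this example).
printf '%s\n' \
    'Pythons are snakes.' \
    'The Indian python is large.' \
    'Boas are snakes too.' \
    'python is also a language.' > snakes.txt

# grep is case-sensitive by default: "Pythons" on line 1 is not a match.
matches=$(grep -c python snakes.txt)
echo "$matches"

# -i ignores case, so line 1 now counts as well.
matches_any_case=$(grep -c -i python snakes.txt)
echo "$matches_any_case"

# -v prints the lines that do NOT contain the search string.
non_matches=$(grep -v python snakes.txt)
echo "$non_matches"
```

Case sensitivity is an easy thing to trip over: a search for python will skip a line that only contains Python.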

Special characters



wildcard matching with the *




The star functions as a "wild-card" character that matches any sequence of characters (including none at all).

$ ls
cat1.txt cat2.txt chimera linux_text.txt
$ ls *.txt
cat1.txt cat2.txt linux_text.txt
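
The star can sit anywhere in a name, not just at the front. The sketch below recreates a directory like the one above in a scratch location (notes.doc is an extra, made-up file) and uses echo to show what each pattern turns into, since it is the shell that replaces the pattern with the matching names before the command runs.

```shell
scratch=$(mktemp -d)
cd "$scratch"
touch cat1.txt cat2.txt linux_text.txt notes.doc
mkdir chimera

# "*" at the front: anything that ends in .txt
txt_files=$(echo *.txt)
echo "$txt_files"

# "*" at the end: anything that starts with "cat"
cat_files=$(echo cat*)
echo "$cat_files"

# "*" in the middle works too: starts with "c", ends with ".txt"
ls c*.txt
```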


pipe |

(on most US keyboards, typed with Shift plus the backslash "\" key)

Piping with | connects UNIX commands, allowing the output of one command to "flow through the pipe" to another.

$ env
HOMEPATH=\Documents and Settings\terry
MANPATH=/usr/local/man:/usr/man:/usr/share/man:/usr/autotool/devel/man::/usr/ssl/man:/usr/share/qt3/doc/man
APPDATA=C:\Documents and Settings\terry\Application Data
HOSTNAME=George
VS71COMNTOOLS=C:\Program Files\Microsoft Visual Studio .NET 2003\Common7\Tools\
MPICH_HOME=../../../programs/mpich-1.2.5/SDK.gcc
TERM=xterm
PROCESSOR_IDENTIFIER=x86 Family 6 Model 13 Stepping 8, GenuineIntel
WINDIR=C:\WINDOWS
TEXDOCVIEW_txt=cygstart %s
CVSROOT=terry@batman.berkeley.edu:/usr/local/cvs/alber
TEXDOCVIEW_dvi=cygstart %s
WINDOWID=12582949
CVS_SERVER=/usr/bin/cvs
QTDIR=/usr/lib/qt3
USERDOMAIN=GEORGE
OS=Windows_NT
ALLUSERSPROFILE=C:\Documents and Settings\All Users
$ env | grep home
HOME=/home/terry
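
Pipes can be chained, with each command working on the output of the one before it. The following is a minimal sketch in a scratch directory with made-up file names; it borrows wc -l (count lines), which you will meet properly in the exercises.

```shell
scratch=$(mktemp -d)
cd "$scratch"
touch cat1.txt cat2.txt linux_text.txt script.py

# One pipe: keep only the directory entries that contain "cat".
ls | grep cat

# Two pipes: list, filter, then count the surviving lines with wc -l.
n_txt=$(ls | grep txt | wc -l)
echo $n_txt

# grep -c gives the same count without the extra wc step.
n_txt_again=$(ls | grep -c txt)
echo "$n_txt_again"
```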

Redirection with >

In addition to redirecting output to another command, the results can be sent into a file with the > operator.

$ cat cat1.txt cat2.txt > wishes.txt
$ cat wishes.txt
HEY EVERYONE!!!
WISH I WAS OUTSIDE PLAYING :O(
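
One thing worth trying for yourself is the difference between > and the >> used in the cp example earlier. The sketch below runs in a scratch directory, and the text written to wishes.txt is just filler.

```shell
scratch=$(mktemp -d)
cd "$scratch"

# ">" creates the file, or silently replaces it if it already exists.
echo 'HEY EVERYONE!!!' > wishes.txt
echo 'second try' > wishes.txt
after_overwrite=$(cat wishes.txt)
echo "$after_overwrite"

# ">>" appends to the end of the file instead of replacing it.
echo 'WISH I WAS OUTSIDE PLAYING :O(' >> wishes.txt
after_append=$(cat wishes.txt)
echo "$after_append"
```

Because > overwrites without asking, it is worth a quick ls before redirecting into a file name you may already be using.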

Permissions

Unlike the computers you are used to, UNIX doesn't automatically know what to do with files (e.g. it won't know to use Word to open a .doc document).

The first thing that controls a file is the file's permissions. You can control who can read, write, and execute (run as a program) each of your files.

$ ls -la
total 7
drwx------+ 3 terry None 0 Jul 3 19:42 .
drwx------+ 4 terry None 0 Jun 29 22:44 ..
-rw-r--r-- 1 terry None 17 Jul 3 19:24 cat1.txt
-rw-r--r-- 1 terry None 32 Jul 3 19:25 cat2.txt
drwxr-xr-x+ 2 terry None 0 Jun 23 16:59 chimera
-rw-r--r-- 1 terry None 3937 Jun 29 22:55 linux_text.txt
-rw-r--r-- 1 terry None 49 Jul 3 19:42 wishes.txt

The first letter tells you whether it is a directory (d) or a regular file (-).

The next set of letters tell you if a file is readable (r), writable (w), or executable (x).

The 2nd-4th letters tell you what *your* permissions are, 5th-7th tell you what your group's permissions are, and the last three tell you what the rest of the world's permissions are.

chmod [flags] [filename]
Modify permissions.

$ ls -l script.py
-rw-r--r-- 1 terry None 6 Jul 3 20:10 script.py
$ chmod +x script.pyt
chmod: cannot access `script.pyt': No such file or directory
$ chmod +x script.py
$ ls -l script.py
-rwxr-xr-x 1 terry None 6 Jul 3 20:10 script.py

Matt is going to explain how to get UNIX to run your executable scripts this afternoon. However, if you try running a program and it's not working at some point in the class, double check the permissions!!!
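
As a preview, here is a minimal sketch of the whole cycle: create a file, look at its permissions, add execute permission, and run it. It assumes a scratch directory and a tiny shell script named hello.sh (both hypothetical), and it uses cut -c1-10 (not covered in this lesson) only to trim the ls -l output down to the permission letters.

```shell
scratch=$(mktemp -d)
cd "$scratch"

# A two-line shell script (hypothetical; any small program would do).
printf '#!/bin/sh\necho "it runs"\n' > hello.sh

# Start from a known state: read/write for you, read-only for everyone else.
chmod 644 hello.sh
before=$(ls -l hello.sh | cut -c1-10)
echo "$before"

# u+x adds execute permission for the owner (u) only.
chmod u+x hello.sh
after=$(ls -l hello.sh | cut -c1-10)
echo "$after"

# Now the file can be run as a program.
output=$(./hello.sh)
echo "$output"
```

Compare this with the chmod +x used above, which added x for you, your group, and everyone else; the u, g, and o letters let you be more selective.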

Help, I'm stuck!


man command_name
[what does that command do again?]

Most commands have many useful flags beyond what I've shown you. For information on a particular command, look at the manual pages with man.
$ man chmod
CHMOD(1) User Commands CHMOD(1)
 
NAME
chmod - change file mode bits
 
SYNOPSIS
chmod [OPTION]... MODE[,MODE]... FILE...
chmod [OPTION]... OCTAL-MODE FILE...
chmod [OPTION]... --reference=RFILE FILE...
 
DESCRIPTION
This manual page documents the GNU version of chmod. chmod changes the
file mode bits of each given file according to mode, which can be
either a symbolic representation of changes to make, or an octal number
representing the bit pattern for the new mode bits.
 
The format of a symbolic mode is [ugoa...][[+-=][perms...]...], where
perms is either zero or more letters from the set rwxXst, or a single
letter from the set ugo. Multiple symbolic modes can be given, sepa-
rated by commas.
 
A combination of the letters ugoa controls which users' access to the
file will be changed: the user who owns it (u), other users in the
file's group (g), other users not in the file's group (o), or all users
(a). If none of these are given, the effect is as if a were given, but
bits that are set in the umask are not affected.
 
The operator + causes the selected file mode bits to be added to the
existing file mode bits of each file; - causes them to be removed; and
= causes them to be added and causes unmentioned bits to be removed
except that a directory's unmentioned set user and group ID bits are
not affected.
 
The letters rwxXst select file mode bits for the affected users: read
(r), write (w), execute (or search for directories) (x), execute/search
only if the file is a directory or already has execute permission for
some user (X), set user or group ID on execution (s), restricted dele-
tion flag or sticky bit (t). Instead of one or more of these letters,
you can specify exactly one of the letters ugo: the permissions granted
to the user who owns the file (u), the permissions granted to other
users who are members of the file's group (g), and the permissions
granted to users that are in neither of the two preceding categories
(o).
 
A numeric mode is from one to four octal digits (0-7), derived by
adding up the bits with values 4, 2, and 1. Omitted digits are assumed
to be leading zeros. The first digit selects the set user ID (4) and
set group ID (2) and restricted deletion or sticky (1) attributes. The
second digit selects permissions for the user who owns the file: read
...





Text Editors

Lastly, now that we can see into files, it would be nice to be able to create and edit our own files. And... Our first lesson in programming accents: different programmers use different text editors. I am going to introduce three common options here. Each has pluses and minuses depending on your needs. There is no 'right' one, so play around and pick your fav. Note that each of the teachers will be using their fav, so don't worry if they are using something different than you :O).
Program 1: vi
open a file: vi [filename]
write to file: i (enters insert mode; press ESC when you are done typing)
save a file: :w
close: :q

Program 2: emacs
open a file: emacs [filename]
save a file: CTRL-X CTRL-S
close: CTRL-X CTRL-C

Emacs (including Aquamacs) has many, many short-cut keys or "accelerators." A quick Googling of "Emacs Cheat Sheet" will reveal several resources, such as this one from the Princeton CS department: Emacs Cheat Sheet

Program 3: nedit (gedit)
open a file: nedit [filename]
save a file: CTRL-S
close a file: use the pulldown menu (not CTRL-C)!!!



Questions?




Exercises


1) Cerevisiae chromosomes

a) In your top-level directory, make a new directory called "fasta_files" and change into it
b) Go to http://downloads.yeastgenome.org/sequence/genomic_sequence/chromosomes/fasta/ and individually download each of the files ending in .fsa. These are the chromosomes of the yeast, S. cerevisiae. You may have to right-click these files depending on your web browser (and be aware, some browsers will save your file with a .txt extension).
c) Make a single whole genome file called "cerevisiae_genome.fasta"
d) Count the chromosomes in the whole genome file using commands from the lecture. (HINT: Each of the original FASTA files contains a single chromosome).
e) Look up the command 'wc' and find out what it does. Get size of total genome. (HINT: The size of the genome can be determined by counting the number of bases).

2) Cerevisiae genes

a) Get the list of cerevisiae chromosome features: ftp://genome-ftp.stanford.edu/pub/yeast/chromosomal_feature/SGD_features.tab
Columns within SGD_features.tab:
 
1.   Primary Saccharomyces Genome Database ID (SGDID) (mandatory)
2.   Feature type (mandatory)
3.   Feature qualifier (optional)
4.   Feature name (optional)
5.   Standard gene name (optional)
6.   Alias (optional, multiples separated by |)
7.   Parent feature name (optional)
8.   Secondary SGDID (optional, multiples separated by |)
9.   Chromosome (optional)
10.  Start_coordinate (optional)
11.  Stop_coordinate (optional)
12.  Strand (optional)
13.  Genetic position (optional)
14.  Coordinate version (optional)
15.  Sequence version (optional)
16.  Description (optional)
 

b) Count total genes
c) Count only verified genes. Count only uncharacterized genes.
d) What other types of genes are in this file?

3) Moving beyond the lecture
a) Use google and any other references you want to find a command that tells you how much disk space you have left.
b) Use the 'man' command to see how it works.
c) How much space is left on your system? Make the command output in terms of gigabytes and megabytes-- 'human-readable' form.

4) Picking your favorite text editor
a) Play around with the three text editors I just introduced.
b) Using your favorite editor, write a short note about what you are most excited about learning in this class. Email it to us at intro.prog.bioinformatics@gmail.com.

Solutions


1) Yeast Genome


# make the directory
$ mkdir fasta_files
 
# change to the new directory
$ cd fasta_files
 
# create the file from all the chromosomes
# (the backslash at the end of a line continues the command on the next line)
$ cat chr01.fsa chr02.fsa chr03.fsa chr04.fsa chr05.fsa chr06.fsa chr07.fsa \
 chr08.fsa chr09.fsa chr10.fsa chr11.fsa chr12.fsa chr13.fsa chr14.fsa \
 chr15.fsa chr16.fsa chrmt.fsa > cerevisiae_genome.fasta
 
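# count the number of chromosomes
# (this assumes "chr" appears in every header line and nowhere else;
#  FASTA header lines always start with ">", so grep -c '>' is a more general test)
$ grep -c chr cerevisiae_genome.fasta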
 
# lookup the command wc
 
$ man wc
 
# use it to count the length of the genome
 
$ wc cerevisiae_genome.fasta
 
# the number of words
 
$ wc -w cerevisiae_genome.fasta
 
# the number of lines
 
$ wc -l cerevisiae_genome.fasta
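
The wc counts above include the header lines and the newline at the end of every line, so none of them is quite the number of bases. One possible way to get closer is sketched below on a tiny made-up FASTA file (toy_genome.fasta); it assumes the file contains only header lines starting with ">" and sequence lines, and it uses tr -d (delete characters), which is not covered in the lecture.

```shell
scratch=$(mktemp -d)
cd "$scratch"

# A tiny made-up FASTA file with two "chromosomes" (10 and 6 bases).
printf '%s\n' '>chrA' 'ACGTA' 'CGTAC' '>chrB' 'GGCCTT' > toy_genome.fasta

# Header lines start with ">", so counting them counts the sequences.
n_seqs=$(grep -c '>' toy_genome.fasta)
echo "$n_seqs"

# Drop the headers with grep -v, delete the newlines with tr,
# then count the characters that remain with wc -c.
n_bases=$(grep -v '>' toy_genome.fasta | tr -d '\n' | wc -c)
echo $n_bases
```

The same pipeline should work on cerevisiae_genome.fasta once the file name is swapped in.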
 
 


2) SGD Features


# count the number of genes (ORFs)
$ grep -c ORF SGD_features.tab
 
 
# Count the number of Verified
 
$ grep -c Verified SGD_features.tab
 
# Count the number of Uncharacterized
$ grep -c Uncharacterized SGD_features.tab
 
# All the other lines
$ grep -v Verified SGD_features.tab | grep -v Uncharacterized > all_others.tab
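
For part (d), one possible approach is to pull out the feature qualifier column and tally each distinct value. The sketch below runs on a made-up five-row file (toy_features.tab) that mimics only the first three columns; the IDs and rows are placeholders, and cut, sort, and uniq are commands beyond what the lecture covered.

```shell
scratch=$(mktemp -d)
cd "$scratch"

# Made-up, tab-separated rows that mimic the first three columns only
# (SGDID, feature type, feature qualifier); the IDs are placeholders.
printf 'S001\tORF\tVerified\n'         >  toy_features.tab
printf 'S002\tORF\tUncharacterized\n'  >> toy_features.tab
printf 'S003\tORF\tVerified\n'         >> toy_features.tab
printf 'S004\tORF\tDubious\n'          >> toy_features.tab
printf 'S005\ttRNA\t\n'                >> toy_features.tab

# Keep only ORF rows, cut out column 3, then count each distinct value.
# sort is needed because uniq only merges identical lines that are adjacent.
grep ORF toy_features.tab | cut -f 3 | sort | uniq -c

n_verified=$(grep ORF toy_features.tab | cut -f 3 | grep -c Verified)
echo "$n_verified"
n_kinds=$(grep ORF toy_features.tab | cut -f 3 | sort | uniq | wc -l)
echo $n_kinds
```

Running the first pipeline on the real SGD_features.tab should list every qualifier that appears for ORFs, along with how many times each occurs.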
 

3) Disk Space


# I tried Googling "disk space free unix command"
 
$ man df
 
# try it out
$ df
 
Filesystem    512-blocks      Used Available Capacity  Mounted on
/dev/disk0s2   976101344 393891352 581697992    41%    /
devfs                215       215         0   100%    /dev
map -hosts             0         0         0   100%    /net
map auto_home          0         0         0   100%    /home
 
# with the -m flag
$ df -m
 
Filesystem    1M-blocks   Used Available Capacity  Mounted on
/dev/disk0s2     476611 192329    284032    41%    /
devfs                 0      0         0   100%    /dev
map -hosts            0      0         0   100%    /net
map auto_home         0      0         0   100%    /home
 
# with the "human readable" flag
$ df -H
 
Filesystem      Size   Used  Avail Capacity  Mounted on
/dev/disk0s2    500G   202G   298G    41%    /
devfs           110k   110k     0B   100%    /dev
map -hosts        0B     0B     0B   100%    /net
map auto_home     0B     0B     0B   100%    /home