From BioPERL to Biopython

Biopython is good for -

Quick analysis of nucleotide and protein sequence. You can easily extract
a segment from a longer sequence, get reverse complement, do nucleotide
to protein translation.
Parsing of all kinds of files, including simple FASTA files,
BLAST output, MUSCLE output, PDB files, and so on.
Submitting requests to online databases and fetchin data from them. For example,
you can programmatically run BLAST at NCBI, instead of manually filling up the
form.
Statistical and bioinformatics analysis - clustering, motifs, phylogeny, etc.

Analyzing Nucleotide and Protein Sequences

Biopython has many functions to perform routine analysis
of nucleotide and protein sequences. The sequences themselves
are saved in the Bio.Seq class.

Extracting Subsequences

from Bio.Seq import Seq
read = Seq(“GATCGTAGATAGTGCGCGCGTAGAGGAGAGATAGAGAGAGGAGATAGAGATA”)

print read[10:20]

You will see “AGTGCGCGCG” being printed.

Here read is a Bio.Seq object that can be used to store nucleotide
and protein sequences. A
subsequence of Bio.Seq object can be obtained in the same
way we get substrings. Its coordinate system starts from 0.

Reverse Complement

from Bio.Seq import Seq
read = Seq(“GATCGTAGATAGTGCGCGCGTAGAGGAGAGATAGAGAGAGGAGATAGAGATA”)

print read.reverse_complement()

The function ‘reverse_complement’ is included in Bio.Seq class. You
will see the output “TATCTCTATCTCCTCTCTCTATCTCTCCTCTACGCGCGCACTATCTACGATC”.

Translate into Proteins

from Bio.Seq import Seq
read = Seq(“GATCGTAGATAGTGCGCGCGTAGAGGAGAGATAGAGAGAGGAGATAGAGATA”)

print read.translate()

/usr/local/lib/python2.7/dist-packages/Bio/Seq.py:2095: BiopythonWarning: Partial codon, len(sequence) not a multiple of three. Explicitly trim the sequence or add trailing N before translation. This may become an error in future.
BiopythonWarning)
DRR*CARRGEIERGDRD

If you trim the last nucleotide, the error will go away.

Compute GC Content

from Bio.SeqUtils import GC
from Bio.Seq import Seq
read = Seq(“GATCGTAGATAGTGCGCGCGTAGAGGAGAGATAGAGAGAGGAGATAGAGATA”)

print GC(read)

Output - 48.0769230769

You can experiment by changing the sequence to
all As or all Gs to see whether the GC function works correctly.

Parsing Biological Records

Parsing text files of different formats is a tedious task in
bioinformatics. Biopython makes this process
very easy. It can load data stored in many different file formats.

Functions - ‘read’ and ‘parse’

Biopython maintains uniform syntax for loading data from a file. Each
of its class parsing text files has two functions - ‘read’ and ‘parse’.

Typically, biological data files have multiple records of the
same type. For example, a FASTA file contains several different
sequences. A BLAST output file contains search results
for several different input sequence. The function ‘read’ loads only the first
record from a large file, whereas the function ‘parse’ creates an iterator
to go over all records. Examples for the functions are shown below.

Reading FASTA File

The function ‘read’ fetches only the first record from a FASTA file.

from Bio import SeqIO
record=SeqIO.read(“seq.fasta”, “fasta”)
print record.id, record.seq

The function ‘parse’ creates an iterator to go over all records in
a FASTA file. You can either loop over the records, or use next()
function to fetch one record at a time.

from Bio import SeqIO
records=SeqIO.parse(“seq.fasta”, “fasta”)

for record in records:
print record.id, record.seq

from Bio import SeqIO
records=SeqIO.parse(“seq.fasta”, “fasta”)

record = next(records)
print record.id
print record.seq
print len(record)

KEGG Example

http://www.genome.jp/dbget-bin/www_bget?ec:5.1.1.1

The function ‘read’ fetches only one record.

from Bio.KEGG import Enzyme
record = Enzyme.read(open(“ec_5.1.1.1.txt”))

The function ‘parse’ creates an iterator to go over all records.

from Bio.KEGG import Enzyme
records = Enzyme.parse(open(“ec_5.1.1.1.txt”))

record=next(records)
…
…

record=next(records)

Commands for Different Types of Data

Similar ‘parse’ and ‘read’ functions can be used to process many different
types of data files. In the following table, we list only one of the
two functions. The other one is also valid.

Data Type	Biopython Library
FASTA	from Bio import SeqIO<br>records=SeqIO.parse(“seq.fasta”, “fasta”)
Genbank	from Bio import SeqIO<br>records = SeqIO.parse(“dat.gbk”, “genbank”)
BLAST	from Bio.Blast import NCBIXML<br>records = NCBIXML.parse(open(“blast_out.xml”))
CLUSTAL	from Bio import AlignIO<br>align = AlignIO.read(“alignment.aln”, “clustal”)
MUSCLE	from Bio import AlignIO<br>align = AlignIO.read(“alignment.faa”, “fasta”)
Phylogeny	from Bio import Phylo<br>tree = Phylo.read(“tree.dnd”, “newick”)
Entrez	from Bio import Entrez<br>records = Entrez.parse(open(“Homo_sapiens.xml”))
UniGene	from Bio import UniGene<br>record = UniGene.read(open(“gene.data”))
GEO	from Bio import Geo<br>records = Geo.parse(open(“GSE273.txt”))
Medline	from Bio import Medline<br>records=Medline.parse(open(“pubmed_file.txt”))
Pubmed
SwissProt Keywords	from Bio.SwissProt import KeyWList<br>records = KeyWList.parse(open(“keywlist.txt”))
Prosite	from Bio.ExPASy import Prosite<br>records = Prosite.parse(open(“prosite.dat”))
Prosite Doc	from Bio.ExPASy import Prodoc<br>records = Prodoc.parse(open(“prosite.doc”))
EXPASy	from Bio.ExPASy import Enzyme<br>records = Enzyme.parse(open(“enzyme.dat”))
PDB PDBParser	from Bio.PDB.PDBParser import PDBParser
PDB MMCIF2Dict	from Bio.PDB.MMCIF2Dict import MMCIF2Dict
PDB MMCIFParser	from Bio.PDB.MMCIFParser import MMCIFParser
PDB MMTFParser	from Bio.PDB.mmtf import MMTFParser
KEGG	from Bio.KEGG import Enzyme<br>records = Enzyme.parse(open(“ec_5.1.1.1.txt”))

Although ‘parse’ and ‘read’ functions are used to parse different types
of data files, the records created by them are not identical. SeqIO.parse produces ‘SeqIO’ type of records, whereas ‘Medline.parse’ produces ‘Medline’ type of records. We
will see more details about those records in the following section.

Objects to Store Different Types of Data

In the last section, we learned about ‘read’ and ‘parse’
functions to read files in different formats. The output
from those calls create different kinds of records, as
appropriate for the situation. Let us see a few examples. Some
of those fields themselves can be iterators.

FASTA Record

Bio.SeqIO with fasta

id –>
seq –>

BLAST Record

Bio.Blast.NCBIXML object

from Bio.Blast import NCBIXML
records = NCBIXML.parse(open(“blast_output.xml”))

E_VALUE_THRESH = 0.00001
for record in records:
for alignment in record.alignments:
   for hsp in alignment.hsps:
     if hsp.expect < E_VALUE_THRESH:
             print(‘Next Alignment’)
             print(‘seq:’, alignment.title)
             print(‘L:’, alignment.length)
             print(‘e value:’, hsp.expect)
             print(hsp.query[0:75] + ‘…’)
             print(hsp.match[0:75] + ‘…’)
             print(hsp.sbjct[0:75] + ‘…’)

More examples are shown in the Example section.

Accessing Data from the Internet

One attractive feature of Biopython is that it can fetch different
kinds of data from the internet. For example, in case of BLAST, it
can submit BLAST request to the NCBI, and then get back the output for you.

BLAST

from Bio.Blast import NCBIWWW
from Bio import SeqIO
record = SeqIO.read(“m_cold.fasta”, format=”fasta”)
result_handle = NCBIWWW.qblast(“blastn”, “nt”, record.seq)

save_file = open(“my_blast.xml”, “w”)
save_file.write(result_handle.read())
save_file.close()
result_handle.close()

KEGG

http://www.genome.jp/dbget-bin/www_bget?ec:5.1.1.1

from Bio.KEGG import REST
from Bio.KEGG import Enzyme
req = REST.kegg_get(“ec:5.1.1.1”)
open(“ec_5.1.1.1.txt”, ‘w’).write(req.read())

Entrez

from Bio import Entrez
Entrez.email = “A.N.Other@example.com” # Always tell NCBI who you are
handle = Entrez.efetch(db=”nucleotide”, id=”186972394”, rettype=”gb”, retmode=”text”)
print handle.read()

References

https://coding4medicine.com/Materials/biopython/index.html
http://biopython.org/DIST/docs/tutorial/Tutorial.html
http://people.duke.edu/~ccc14/pcfb/biopython/BiopythonBasics.html
https://www.coursera.org/learn/python-genomics/lecture/ahlsr/lecture-8-biopython-13-32
https://www.gitbook.com/book/krother/biopython-tutorial/details

Up next

Closing Comments