From BioPERL to Biopython
Biopython is good for -
- Quick analysis of nucleotide and protein sequence. You can easily extract
a segment from a longer sequence, get reverse complement, do nucleotide
to protein translation. - Parsing of all kinds of files, including simple FASTA files,
BLAST output, MUSCLE output, PDB files, and so on. - Submitting requests to online databases and fetchin data from them. For example,
you can programmatically run BLAST at NCBI, instead of manually filling up the
form. - Statistical and bioinformatics analysis - clustering, motifs, phylogeny, etc.
Analyzing Nucleotide and Protein Sequences
Biopython has many functions to perform routine analysis
of nucleotide and protein sequences. The sequences themselves
are saved in the Bio.Seq class.
Extracting Subsequences
from Bio.Seq import Seq
read = Seq(“GATCGTAGATAGTGCGCGCGTAGAGGAGAGATAGAGAGAGGAGATAGAGATA”)
print read[10:20]
You will see “AGTGCGCGCG” being printed.
Here read is a Bio.Seq object that can be used to store nucleotide
and protein sequences. A
subsequence of Bio.Seq object can be obtained in the same
way we get substrings. Its coordinate system starts from 0.
Reverse Complement
from Bio.Seq import Seq
read = Seq(“GATCGTAGATAGTGCGCGCGTAGAGGAGAGATAGAGAGAGGAGATAGAGATA”)
print read.reverse_complement()
The function ‘reverse_complement’ is included in Bio.Seq class. You
will see the output “TATCTCTATCTCCTCTCTCTATCTCTCCTCTACGCGCGCACTATCTACGATC”.
Translate into Proteins
from Bio.Seq import Seq
read = Seq(“GATCGTAGATAGTGCGCGCGTAGAGGAGAGATAGAGAGAGGAGATAGAGATA”)
print read.translate()
/usr/local/lib/python2.7/dist-packages/Bio/Seq.py:2095: BiopythonWarning: Partial codon, len(sequence) not a multiple of three. Explicitly trim the sequence or add trailing N before translation. This may become an error in future.
BiopythonWarning)
DRR*CARRGEIERGDRD
If you trim the last nucleotide, the error will go away.
Compute GC Content
from Bio.SeqUtils import GC
from Bio.Seq import Seq
read = Seq(“GATCGTAGATAGTGCGCGCGTAGAGGAGAGATAGAGAGAGGAGATAGAGATA”)
print GC(read)
Output - 48.0769230769
You can experiment by changing the sequence to
all As or all Gs to see whether the GC function works correctly.
Parsing Biological Records
Parsing text files of different formats is a tedious task in
bioinformatics. Biopython makes this process
very easy. It can load data stored in many different file formats.
Functions - ‘read’ and ‘parse’
Biopython maintains uniform syntax for loading data from a file. Each
of its class parsing text files has two functions - ‘read’ and ‘parse’.
Typically, biological data files have multiple records of the
same type. For example, a FASTA file contains several different
sequences. A BLAST output file contains search results
for several different input sequence. The function ‘read’ loads only the first
record from a large file, whereas the function ‘parse’ creates an iterator
to go over all records. Examples for the functions are shown below.
Reading FASTA File
The function ‘read’ fetches only the first record from a FASTA file.
from Bio import SeqIO
record=SeqIO.read(“seq.fasta”, “fasta”)
print record.id, record.seq
The function ‘parse’ creates an iterator to go over all records in
a FASTA file. You can either loop over the records, or use next()
function to fetch one record at a time.
from Bio import SeqIO
records=SeqIO.parse(“seq.fasta”, “fasta”)
for record in records:
print record.id, record.seq
or
from Bio import SeqIO
records=SeqIO.parse(“seq.fasta”, “fasta”)
record = next(records)
print record.id
print record.seq
print len(record)
record = next(records)
print record.id
print record.seq
print len(record)
KEGG Example
http://www.genome.jp/dbget-bin/www_bget?ec:5.1.1.1
The function ‘read’ fetches only one record.
from Bio.KEGG import Enzyme
record = Enzyme.read(open(“ec_5.1.1.1.txt”))
The function ‘parse’ creates an iterator to go over all records.
from Bio.KEGG import Enzyme
records = Enzyme.parse(open(“ec_5.1.1.1.txt”))
record=next(records)
…
…
record=next(records)
Commands for Different Types of Data
Similar ‘parse’ and ‘read’ functions can be used to process many different
types of data files. In the following table, we list only one of the
two functions. The other one is also valid.
| Data Type | Biopython Library |
|---|---|
| FASTA | from Bio import SeqIO<br>records=SeqIO.parse(“seq.fasta”, “fasta”) |
| Genbank | from Bio import SeqIO<br>records = SeqIO.parse(“dat.gbk”, “genbank”) |
| BLAST | from Bio.Blast import NCBIXML<br>records = NCBIXML.parse(open(“blast_out.xml”)) |
| CLUSTAL | from Bio import AlignIO<br>align = AlignIO.read(“alignment.aln”, “clustal”) |
| MUSCLE | from Bio import AlignIO<br>align = AlignIO.read(“alignment.faa”, “fasta”) |
| Phylogeny | from Bio import Phylo<br>tree = Phylo.read(“tree.dnd”, “newick”) |
| Entrez | from Bio import Entrez<br>records = Entrez.parse(open(“Homo_sapiens.xml”)) |
| UniGene | from Bio import UniGene<br>record = UniGene.read(open(“gene.data”)) |
| GEO | from Bio import Geo<br>records = Geo.parse(open(“GSE273.txt”)) |
| Medline | from Bio import Medline<br>records=Medline.parse(open(“pubmed_file.txt”)) |
| Pubmed | |
| SwissProt Keywords | from Bio.SwissProt import KeyWList<br>records = KeyWList.parse(open(“keywlist.txt”)) |
| Prosite | from Bio.ExPASy import Prosite<br>records = Prosite.parse(open(“prosite.dat”)) |
| Prosite Doc | from Bio.ExPASy import Prodoc<br>records = Prodoc.parse(open(“prosite.doc”)) |
| EXPASy | from Bio.ExPASy import Enzyme<br>records = Enzyme.parse(open(“enzyme.dat”)) |
| PDB PDBParser | from Bio.PDB.PDBParser import PDBParser |
| PDB MMCIF2Dict | from Bio.PDB.MMCIF2Dict import MMCIF2Dict |
| PDB MMCIFParser | from Bio.PDB.MMCIFParser import MMCIFParser |
| PDB MMTFParser | from Bio.PDB.mmtf import MMTFParser |
| KEGG | from Bio.KEGG import Enzyme<br>records = Enzyme.parse(open(“ec_5.1.1.1.txt”)) |
Although ‘parse’ and ‘read’ functions are used to parse different types
of data files, the records created by them are not identical. SeqIO.parse produces
‘SeqIO’ type of records, whereas ‘Medline.parse’ produces ‘Medline’ type of records. We
will see more details about those records in the following section.
Objects to Store Different Types of Data
In the last section, we learned about ‘read’ and ‘parse’
functions to read files in different formats. The output
from those calls create different kinds of records, as
appropriate for the situation. Let us see a few examples. Some
of those fields themselves can be iterators.
FASTA Record
Bio.SeqIO with fasta
id –>
seq –>
BLAST Record
Bio.Blast.NCBIXML object
from Bio.Blast import NCBIXML
records = NCBIXML.parse(open(“blast_output.xml”))
E_VALUE_THRESH = 0.00001
for record in records:
for alignment in record.alignments:
for hsp in alignment.hsps:
if hsp.expect < E_VALUE_THRESH:
print(‘Next Alignment’)
print(‘seq:’, alignment.title)
print(‘L:’, alignment.length)
print(‘e value:’, hsp.expect)
print(hsp.query[0:75] + ‘…’)
print(hsp.match[0:75] + ‘…’)
print(hsp.sbjct[0:75] + ‘…’)
More examples are shown in the Example section.
Accessing Data from the Internet
One attractive feature of Biopython is that it can fetch different
kinds of data from the internet. For example, in case of BLAST, it
can submit BLAST request to the NCBI, and then get back the output for you.
BLAST
from Bio.Blast import NCBIWWW
from Bio import SeqIO
record = SeqIO.read(“m_cold.fasta”, format=”fasta”)
result_handle = NCBIWWW.qblast(“blastn”, “nt”, record.seq)
save_file = open(“my_blast.xml”, “w”)
save_file.write(result_handle.read())
save_file.close()
result_handle.close()
KEGG
http://www.genome.jp/dbget-bin/www_bget?ec:5.1.1.1
from Bio.KEGG import REST
from Bio.KEGG import Enzyme
req = REST.kegg_get(“ec:5.1.1.1”)
open(“ec_5.1.1.1.txt”, ‘w’).write(req.read())
Entrez
from Bio import Entrez
Entrez.email = “A.N.Other@example.com” # Always tell NCBI who you are
handle = Entrez.efetch(db=”nucleotide”, id=”186972394”, rettype=”gb”, retmode=”text”)
print handle.read()
References
- https://coding4medicine.com/Materials/biopython/index.html
- http://biopython.org/DIST/docs/tutorial/Tutorial.html
- http://people.duke.edu/~ccc14/pcfb/biopython/BiopythonBasics.html
- https://www.coursera.org/learn/python-genomics/lecture/ahlsr/lecture-8-biopython-13-32
- https://www.gitbook.com/book/krother/biopython-tutorial/details