Practical Computer Concepts For Data Scientists
Practical Computer Concepts For Data Scientists
Matthew Curcio
Buy on Leanpub

Chapter 2 - Genomic Data Science With Galaxy

I never dreamed that in my lifetime my own genome would be sequenced.
James D. Watson

What is Galaxy?

  • more descriptive

2.1 - First Steps: Register with Galaxy

Galaxy Front Page
Galaxy Front Page

2.1.1 - Register A New Account at (https://usegalaxy.org)

  1. Open the UseGalaxy web page.
  2. Go to the top middle section, in the red circle above.
  3. Click: Login or Register / Register / Create account
  4. Fill in all account information
  5. Submit
Galaxy Create Account Page
Galaxy Create Account Page

2.1.2 - Explore Galaxy

The main Galaxy web page has 4 sections
Galaxy top banner
Galaxy Top Banner
Galaxy Top Banner
  • Go to: Help
  • Explore: Videos, Interactive Tours, Ask a question, …
  • Watch: Introduction to Galaxy by James Taylor, PhD, Assoc. Professor at Johns Hopkins Uni.
Tools (on left)
  • Here you can Get Data, Manipulate data, Sort and Fetch Alignments, and use NGS (Next Generation Sequencing) tools, etc.
  • For example, choose a category, then choose an algorithm.
  • Search for FastQC: either in the search bar or under NGS: QC and manipulation
Center panel

The Center panel allows one to inspect data, results, generate step-wise workflows, visualize results, share and publish data.

  1. Now, Go to the Tools section
  2. Search: fastqc in the search tools oval and click FastQC
    OR
  3. Find: NGS: QC and manipulation and click FastQC
  4. Notice the help information showing what FASTQC does and how to operate it.
  5. In the Center Panel, READ the FASTQC help section, you will need this later.
Find FastQC Help Section
Find FastQC Help Section
History (on right)
  • History allows one to search and look over their past work, &
  • Operate on multiple datasets, Edit History tags, & Add/Edit annotataions, also
  • Once you start producing work files you may, View data, Edit attributes, or Delete your work.
Find FastQC Help Section
Find FastQC Help Section

2.2 - Exercise # 1: Investigating SNP on a chromosome

Q. Which coding exon has the highest number of single nucleotide polymorphisms on chromosome 22?

See paper: A map of human genome sequence variation containing 1.42 million single nucleotide polymorphisms, Nature 409, 928-933, 15 February 2001, The International SNP Map Working Group

Generalized Steps:

  1. Load human chromosome data from UCSC Table Browser
  2. Load human snp data
  3. Join the genomic datasets
  4. Identify and count which exons which have repeats
  5. Save, download, and report data

Specific Steps:

A. Load human chromosome dataset from UCSC Table Browser
  1. Log in to Galaxy
  2. Go to: Tools / Get Data
  3. In the list, Find & click: UCSC Main Table Browser
    • download data from the Genome Table Browser database

In Genome Table Browser choose/set:

  1. clade: Mammal
  2. genome: Human
  3. assembly: Feb 2009 (…/hg19)
  4. group: Genes and Gene Predictions
  5. track: UCSC Genes
  6. table: knownGene
  7. Click Radio button: position, type chr22
  8. output format: Set to BED
  9. Click Radio button: Send output to Galaxy

NOTE: Do not change other menu items.

Table Browser view
Table Browser view
  1. A second screen entitled “Output knownGene as BED” will be shown next.
  2. On the Screen/Page #2: Output knownGene as BED file format
  3. Find: Create one BED record per:
  4. Click Radio button: Coding Exons
  5. Goto bottom button: Send query to Galaxy
  6. Note: This should return you to the Galaxy web page.
UCSC Second Screen for downloading coding exon data.
UCSC Second Screen for downloading coding exon data.
  1. You may rename your current ‘job’ from Unnamed history to chr snp codons
Change History Name
Change History Name
B. Download human snp data
Load Second Dataset, Repeats data:
  1. Go to: Tools / Get Data
  2. In the list, find & click: UCSC Main Table Browser
    • download data from the Genome Table Browser database
In Genome Table Browser choose/set:
  1. clade: Mammal
  2. genome: Human
  3. assembly: Feb 2009 (…/hg19)
  4. group: Repeats
  5. track: RepeatMasker
  6. Click Radio button: position / Type: chr22
  7. output format: BED
  8. Send output to Galaxy
  9. Go to bottom button: get output
  10. On the Screen/Page #2: Output rmsk as BED
  11. Create one BED record per: Whole Gene
  12. Goto bottom button: Send query to Galaxy
Go to your History
  1. History should now show two datasets
  2. Click: the View data (EYE icon) and inspect datasets
  3. This is a BED file format file format.
  4. You may click: the Edit Attributes (PENCIL icon) to add any information or annotations to your data.
C. Join the genomic datasets
The Goal of this section is to perform an INNER JOIN on the datasets.
  1. Go to: Tools
  2. Search for: Operate on Genomic Intervals / Join
  3. In the center panel with the header: Join
  4. Join, First dataset:
    • Set - UCSC Human knownGene
  5. with, Second dataset:
    • Set - UCSC Human rmsk
  6. Go to: with min overlap
    • Set - 1 bp
  7. Go to: Return
    • Set - Only records that are joined (INNER JOIN)
  8. Execute
  9. See the aside box above: Look in the History panel …
D. Identify & count the overlaps from the joined dataset
The Goal of this section is to identify which exons have repeats.
  1. Once Inner Join is finished
  2. Go to: Tools
  3. Go to: Join, Subtract and Grouping
  4. Go to: Group
  5. The Group data tool will appear in center panel.
  6. Select data: Joined dataset
  7. Group by Column: Set - Column 4
  8. Go to bottom: Operation
    • Click: + Insert Operation
  9. Type: Count – Note: We want the count of all overlaps.
  10. On column: Column 4
  11. Click: Execute
  12. Once Count is finished
  13. Go to: History & View data (Click Eye icon)
Q. What are these two columns What do they represent?
  • This data shows the exon name and count next in the middle panel.
  • We have answered our question, but we can do better.

STOP HERE

E. Incorporate (Join) the overlap counts with the Exon information
  1. Return to: Tools
  2. Go to: Join, Subtract, and Group
  3. Go to: Join two Datasets…
  4. In the middle panel, choose the known genes from first dataset.
  5. Set - Using column: Column 4
  6. With: Inner joined dataset
  7. Set - and column: Column 1
  8. Execute
  9. Once the join is complete we can view the results,
  10. Go to: History
  11. Go to: View data
  12. Go to: Tools
  13. Go to: Text Manipulation
  14. Go to: Cut
  15. Cut columns: c1, c2, c3, c4, c8
  16. Execute
  17. Note: You may save this final file to your computer:
  18. Click: History / Cut on data 5 / “floppy disk icon”
  19. Report findings
  20. Stop Here

2.3 - Exercise # 2: FASTA & FASTQ file formats

?Describe NGS?

Q. What are FASTA & FASTQ file formats? Explain?

  1. Go to: Shared Data
  2. Data Libraries
  3. Illumina iDEA Datasets (sub-sampled)
  4. BT20 paired-end RNA-seq subsampled (end 1)
NGS Data Quality Control
  • FASTQ format
  • Examine quality in an Chip-Seq dataset
  • Trim/filter as we see ft, hopefully without breaking anything.
Assessment tools
  • NGS QC and Manipulation to FastQC
  • https://en.wikipedia.org/wiki/FASTQ_format
  • Gives you a lot of information but little control over how it is calculated or presented.

2.4.2 What is FASTQ?

Show example?

1 @SEQ_ID
2 GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
3 +
4 !''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
  • Specifies sequence (FASTA) and quality scores (PHRED)
  • https://en.wikipedia.org/wiki/Phred_quality_score
  • Text format, 4 lines per entry
FastQC - Assessment tools
  • NGS QC and Manipulation / FastQC
    • Gives you a lot of information but little control over how it is calculated or presented.
  • UpLoad your dataset to line Short read data from your current history
  • Other options not necessary now
  • Execute
  • This produces an HTML report of the data in the History panel.
  • Click on the View data (Eye icon) to view the report in the middle panel.
Fastqc Results Page
Fastqc Results Page
Base Quality Trimming - Option 1
  • NGS: QC and Manipulation / FASTQ Trimmer by column
  • Trim same number of columns from every record
  • Can specify different trim for 5 prime and 3 prime ends
  • NGS: QC and Manipulation / Trim sequences
  • Place dataset in top box
  • In second and third boxes, select which bases to keep from x to y.
  • Execute
Base Quality Trimming - Option 2
  • NGS QC and Manipulation -> Select high quality segments.
    • Select by score & length
  • Keep or discard whole reads
  • Can have different thresholds for different regions of the reads.
  • Keeps original read length.
Base Quality Trimming - Option 3

Trim as we see ft: Option 3

  • NGS QC and Manipulation -> Manipulate FASTQ
    • By sliding ‘window’ one can fine tune reads
  • e.g. Trim from both ends, using sliding windows, until you hit a high-quality section.
  • Produces variable length reads
Trim? As we see fit? - Option 4
  • NGS QC and Manipulation -> FASTQ Quality Trimmer by sliding window
  • Choice depends on downstream tools
  • Find out assumptions & requirements for downstream tools and make appropriate choice(s) now.

2.4 - Exercise # 3: ChIP-Seq analysis with MACS

Q. What is ChIP-Seq analysis with MACS? And what does it investigate?

A fast and powerful analysis algorithm, titled Model- based Analysis of Tiling-arrays (MAT), reliably detect[s] regions enriched by transcription factor chromatin immunoprecipitation (ChIP),
Model-based analysis of tiling-arrays for ChIP-chip, Proc Natl Acad Sci, U S A, 2006 Aug 15; 103(33): 12457–12462

See paper: Model-based analysis of ChIP-Seq (MACS), Genome Biol. 2008; 9, 9, Zhang Y et al

Generalized steps:

  1. Get data: Demonstration Datasets
  2. USE BOWTIE2

Specific steps:

A. Load Demonstration datasets
  1. Go to History: Create New History
  2. Rename History to: Mouse - MACS G1E_CTCF
  3. Go to Top bar: Shared Data
  4. Click: Data Libraries
  5. Click: Demonstration Datasets
    • Description: Demonstration datasets collected from various Galaxy tutorials
  6. Click Radio button: Mouse Chip-seq G1E CTCF binding
    • Description: Sample datasets from Hardison lab for ChIP-seq analysis from Hardison lab
  7. Highlight link
  8. Choose all 4 mouse data sets

Q1. What type of datasets are these?
Hint: Biostars.org Question


  1. Go to: Import to current history
Data Libraries, Demo Datasets, Mouse data
Data Libraries, Demo Datasets, Mouse data
B. Analyze Data
  1. Once loaded,
  2. Goto: Analyze Data
  3. NO, DO CONTROL: Usually we would do FASTQC quality control, however for this example we will skip it.
  4. Go to: Tools
  5. Go to: NGS Mapping
  6. Bowtie2 aligned reads sorted BAM**
    • Bowtie2 - maps reads against a reference genome
    • Bowtie2 will show in middle panel
  7. This data is single-end reads

Q2. What does single ended reads refer to? What are the advantages of using single ended pairs.
Hint: Beginner’s Handbook to Next Generation Sequencing


  1. Choose FASTQ / G1E CTCF (chr19)
  2. Select reference genome / Mouse (mus musculus): mm9
  3. Write unaligned reads (in fastq format) to separate file(s): YES
  4. Group 1: No, just use defaults
  5. Group 2: Do you want to use presets?: Very fast end-to-end (–very-fast)
  6. Save the bowtie2 mapping statistics to the history: YES
  7. Execute: Run bowtie2

Q3. What is .bam file format?
Hint: Samtools ***

  1. Find summary stats in the History box, number of reads, % unpaired, % aligned exactly 1 time, % aligned >1 time…
C. FIND PEAKS
  1. Go to: **Tools
  2. Go to: NGS: Peak Calling
  3. Go to: MACS
  4. Go to Middle Panel:.
  5. Go to: Experiment Name:
  6. Mouse - MACS G1E_CTCF
  7. Tag File: Bowtie2 on data #: aligned sorted BAM
  8. Tag Size: 36
  9. Leave MFOLD: 32
  10. Check: Perform the new peak detection method (future dir): YES
  11. Execute
  12. This will produce two files:
  13. MACS on data 5 (peaks:bed)
  14. MACS on data 5 (html report)
  15. Click on View data (Eye icon) to get more information.
  16. Download from the html report data set:
1 Additional Files:  
2 
3 - MACS_in_Galaxy_model.pdf  
4 - MACS_in_Galaxy_model.r
  1. Goto: UCSC Main Genome Browser
Running controls in very important; MACS peak detection
  1. After shift, slide windows of size 2d across genome,
  2. Model tag count for windows as a Poisson distribution, and calculate a p-value for each window,
  3. For the (lambda) parameter (~expected number of tags per window), estimate from sample or control if available,
  4. Estimates for local windows of size 1kb, 5kb, 10kb or the whole genome and uses the max,
  5. RERUN CHIP-Seq with Control:
  6. NGS Mapping: Bowtie2 on G1E_Input** File #4, Single-end
  7. NGS: Peak Calling -> MACS On the resulting mapped reads
  8. Execute
  9. Run MACS again: NGS: Peak Calling
  10. Title can be: MACS on G1E_CTCF with Control
  11. Select data file #5, .BAM file
  12. Click / Chip-Seq Control File / Choose Control; Bowtie2 on data #4, aligned reads in BAM format
  13. Change Tag size: 32 bp
  14. Check box:
    • Perform the new peak detection method (future dir): YES
  15. Execute
  16. Analyze…
  17. Goto: UCSC Main Genome Browser
  18. 2.5.2 Biases on your experiment:

In this experiment, where is bias introduced?

  1. Chromatin accessibility affects fragmentation
  2. Amplification bias
  3. Repetitive regions
  4. Solution: Controls
  5. Input DNA (after fragmentation but before IP)
  6. Non-specific IP

Summary

  • MACS is one tool, available in Galaxy, for analysis of ChIP-seq data.
  • Controls are extremely important for accurately calling ChIP-seq peaks.
  • As for most genomics problems, there are other tools that may be appropriate depending on the type of data, for example SICER for broad histone modifications.

2.5 - Exercise # 4: RNA-Seq

Q. RNA-Seq Differential Expression…

General Directions; RNA-Seq Exercise

  1. Create new history
  2. (cog) -> Create New
  3. Get data;
  4. From Shared Data -> Data Libraries
  5. Demonstration Datasets / Human RNA-seq: CHB ENCODE Exercise / Select All 5 Datasets in folder
  6. Import to current history

NOTE: We’re ignoring quality control in this example, HOWEVER, in practice this would be a good time for FASTQC

INSERT PICTURE FROM SLIDES; RNA-SEQ MAPPING, SLIDE 5

NOTE: This data can be analyzed in many different ways depending on goals of the experiment, what other data is available, etc.

INSERT PICTURE FROM SLIDES; RNA-SEQ MAPPING, SLIDE 7

2.6.2 Two Approaches

  1. USE: Align-then-assemble: potentially more sensitive, but requires a reference genome, confounded by structural variation
  2. de novo: likely to only capture highly expressed transcripts, but does not require a reference genome, robust to variation.
  3. Mapping will be done using Tophat
  4. Find Tophat: Tools / NGS: RNA-seq / Tophat for Illumina
  5. In RNA-Seq FASTQ file: Select first data set, C20-rep1
  6. Drop down menu for built-in genome, choose hg-19
  7. We have single ended data.
  8. Use default settings
  9. Execute
  10. We see 4 NEW datasets on the right in the History panel.
  11. Insertions.bed, deletions.bed, splice junctions.bed, and accepted_hits.bam
NOW, we have multiple datasets that we want to run using the same set steps.
  1. Goto: Find Tophat: Tools / NGS: RNA-seq / Tophat for Illumina
  2. INSTEAD of choosing one file at a time.
  3. Hover over the 3 buttons on the top left, and find Multiple datasets, use Shift-Click you can choose multiple data files to run.
  4. AGAIN, WE REPEAT THE LAST STEPS.
  5. Drop down menu for built-in genome, choose hg-19
  6. We have single-ended data.
  7. Use default settings
  8. Execute
  9. ALL datasets will run.
  10. We SHOULD see 12 NEW datasets on the right in the History panel.
  11. Insertions.bed, deletions.bed, splice junctions.bed, and accepted_hits.bam
  12. CONTINUE AS NEEDED…To 2.6.3 RNA-Seq Analysis: Assembly Quantitation & Differential Expression

RNA-Seq Analysis: Assembly Quantitation & Differential Expression

  1. Use: Cufflinks, Cuffmerge, Cuffdiff
  2. Assembling RNA-seq data after mapping to a reference genome
  3. Spliced alignment provides estimates of
  4. locations of exons and
  5. splice junctions
  6. Is this information enough to know what transcripts are present?
  1. NGS: RNA-seq -> Cufflinks
  2. Use reference annotation as guide:
  3. Gene Annotations (chr19)
NOTE: Need to do this for each of the four Tophat accepted_hits

outputs separately — use “run tool in parallel across multiple datasets”

  1. We will be using Accepted_hits.bam
  2. Goto: NGS: RNA-seq / Cufflinks
  3. We need to do several preliminary things:
  4. Use Reference Annotation / Use Reference Annotation as guide
  5. Select data set 5, chromosome 19-annotations.gtf
  6. This is, this is a file that came from data library and contains the gene annotation information.
  7. INSTEAD of choosing one file at a time, we want to run multiple files.
  8. Hover over the 3 buttons on the top left, and find Multiple datasets, use Shift-Click you can choose ALL 4 data files to run.
  9. Use default settings for all other options
  10. Execute - ALL 4 datasets will run with Cufflinks.
  11. It will provide sets of transcripts which are assembled using the reference genome as a guide.
  12. Each Cufflinks job will generate a couple of different data sets.
  13. Gene expression will be produced ans can be used to quantify actual the expression levels.
  14. The file reports FPKM and tracking_id which uses the .gft files annotations.
So now the question becomes which genes are differentially expressed and are they significant?
  1. However, we have one slight problem, which is that we have multiple cuffinks files making the comparison more difficult.
  2. We can solve this problem by merging the files using Cuffmerge
  3. Goto: Tools / NGS: RNA-seg / Cuffmerge
  4. Cuffmerge allows one to select multiple .gft files to merge.
  5. Use: Additional GTF input files for as many files as you want to merge.
  6. **Use: Reference Annotation / Yes / (e.g.) Chr19-annotation.gtf
  7. Execute
  8. The transcripts will merge to ONE file.
  9. Click on Eye icon to view, exons, start and end of exons, names of genes, etc.
Use Cuffdiff
  1. Goto: Tools / NGS: RNA-seg / Cuffdiff
  2. Use: Transcripts / Cuffmerge
  3. Goto: **Condition / ** At least two
  4. In our case, we have CD20 cells vs H1hesc
  5. For each condition, we have two replicates.
  6. Add replicates from Tophat2 work
  7. We will use Default values for this experiment BUT it is a good idea to read over the options…
  8. Execute
  9. WHAT FILE TYPE IS THIS?
  10. WHAT COLUMNS ARE RELEVANT? VALUE1 AND VALUE2, Log2(fold_change), p-value, q-value, significant, etc.

2.6 - Exercise # 5: RNA-Seq of 3 family members

Q. Matt Curcio’s Galaxy Project

Genomic Data Science with Galaxy
Title: RNA-Seq - Peer-graded Assignment - Galaxy Course Project
Name: Matthew Curcio
Submitted: 10/23/2016

2.9.1 Instructions:

The zip file fastq_bundle.zip contains six fastq files. These files contain targeted re-sequencing data for a father, mother and daughter trio (identified as NA12877, NA12878, and NA12880 respectively). The data consists of raw reads from an Illumina MiSeq sequencer sequenced as paired ends (R1/R2) to 125bp in length.

Data Set Relation
Coriell-NA12877 (R1/R2) Father
Coriell-NA12878 (R1/R2) Mother
Coriell-NA12880 (R1/R2) Daughter
  • Table#1: Targeted Re-sequencing Raw Data From Illumina MiSeq Sequencer

Create a Galaxy workflow to identify polymorphic sites in all three individuals. Your workflow will need to map the three sets of paired reads to the appropriate reference genome. You will then need to use a variant caller to identify sites that appear to have strong support for the presence of a polymorphism, and call the genotype at that site for each sample.

You should report your results in VCF (variant call format). You should only include sites where the chance of a false positive call is 1 in 10,000 or better according to the VCF qual field.

Using your resulting VCF determine 1) the number of single nucleotide variants, 2) the number of insertion/deletion variants, 3) the number of multi-nucleotide variants, 4) the number of variants with multiple alternate alleles, and 5) the names of the 5 genes with the largest number of polymorphic sites.

2.9.2 Results from Variant Call Format File(VCF)

1) Determine the number of single nucleotide variants: 2,327

2) Determine the number of insertion/deletion variants: 268

3) Determine the number of multi-nucleotide variants: mnp = 23

4) Determine the number of variants with multiple alternate alleles: 62

5) Determine the names of the 5 genes with the largest number of polymorphic sites:
1) RBFOX1,
2) CLCN7,
3) UNKL,
4) CACNA1H,
5) USP7

Appendix A: Installing Ubuntu Linux

  • Computer scientists over the past decades have created Unix, and its relative Linux, which are free operating systems. Both are widely used as servers and in scientific computing. They come with free software and are very dependable.

‘Ubuntu’ is about Community

We will focus on the Ubuntu distribution, its installation and start-up. Ubuntu is a good place to start.

  • ANY QUSESTIONS or need HELP, Go to the Ubuntu Forums.
  • First, Search the Forums, because it is likely your question may have already been answered.
Ubuntu Forum
Ubuntu Forum

1. Install Ubuntu on a USB drive.

The link below describes how to create a USB stick on Windows. You will make a “bootable” USB drive with Linux on it. This means that YOU can take this USB drive and use Ubuntu anywhere you have a computer that can boot from a USB drive port.

These directions are specific for Windows machines.

YOU NEED:
1. a USB drive (>= 2 gigabytes, 4 GB is preferable)
2. a Windows machine where you can install programs.

You will Download 2 items:
1. Ubuntu image, Ubuntu 16.04.2 LTS for Windows.
- FOR CONTINUITY, USE ONLY UBUNTU 16.04.2 LTS
2. Rufus - takes the Ubuntu image and your USB drive makes a ‘bootable’ (running version) on your USB drive.

1.1 Goto: https://www.ubuntu.com/download/desktop/create-a-usb-stick-on-windows
  • Follow the the Ubuntu web-page directions.
Working with Ubuntu
Working with Ubuntu
1.2 Try Ubuntu: https://www.ubuntu.com/download/desktop/try-ubuntu-before-you-install
Try Ubuntu
Try Ubuntu
1.3 If you want Ubuntu on your computer…

There are two possible ways to have a permenant version of Ubuntu.
1. Install Ubuntu, on your personal computer. This is called a “dual-boot” with Windows
2. You can also make your USB drive permenant. This is called a Persistent live USB version.

1.4 Learn at your own speed and convenience, Explore!
  • You are finished.

Appendix B: Getting Python 3.x On Your Computer

B.1 Get Anaconda Python

  • Although there are several different ways to “get” Python, I am going start by recommending Anaconda Python.
  • Anaconda Python is from a company called Continuum Analytics which specializes in open source Python software and support.
https://www.continuum.io Download Page
https://www.continuum.io Download Page
  • Continuum has easily installable Python packages for Macs, Linux, and Windows.
  • Download the Python 3.x version for your operating system.
  • Choose 32 or 64 bit versions, depending on your computer.
https://www.continuum.io Download Page
https://www.continuum.io Download Page
Once you download the Anaconda installer, Install It.
Benefits Of Anaconda
Anaconda Contains:
Package Title Example(s) / Uses
Biopython FASTA, GenBank & alignment tools
Jupyter / IPython Notebooks Work & immediately see your results!
Spyder An advanced environment for writing Python
Numpy N-dimensional arrays
MatPlotLib Plot & make graphics
Scikit Learn Machine learning tools
Pandas Work with data structures
EVEN Git Save your work!

NOTE: The websites for each package or module or library is a good place to start reading & learning about what each does.

So your alternative is to download all these packages and update them yourself or just USE: Anaconda.

What Makes The Computer Language Python Good?

  1. It is easy use & read because it uses indentation
  2. Runs on Mac, Windows & Linux …
  3. “Interpreted” NOT “Compiled”
  4. Python is interactive, therefore you can see your results right away!
  5. It has many libraries which can help you with databases, math functions, as mentioned above.

B.2 Excellent Learning Resources:

  1. Home of Python
  2. Find Python 3.x Documentation Here
  3. Learn Python
  4. Think-Python
  5. Learn Python the Hard Way # Very good site despite the name.
  6. Codecademy
  7. Dive into Python
  8. Code School
  9. This is course material suggested # Steven Salzberg
  10. Python Scientific Lecture Notes # If you don’t read anything else, read these.
  11. NumPy for Matlab users START here.
  12. Lectures on Scientific Computing # Great Python Jupyter Notebooks.
  13. A Byte of Python # A very good book, at the introductory level.
  14. StackOverflow

B.3 The Very Basics of Programming Strategies

General Steps
  1. Identify the required inputs, such as data or specifications
  2. Make an overall design for the program, including listing all the steps by which the program computes the output.
  3. Decide what will be the output of the program.
  4. Refine the overall design by specifying more detail.
  5. Write the program.
  • Adapted from Beginning Perl for Bioinformatics by James Tisdall, O’Reilly Media, Inc., 2001
Designing a Program

Write pseudocode for a program that computes the GC percentage composition of a DNA sequence:

  1. read DNA sequence from user, dna = open(“dna.txt”)
  2. count the number of C’s in DNA sequence, dna.count(“C”)
  3. count the number of G’s in DNA sequence,dna.count(“G”)
  4. determine the length of the DNA sequence, len(dna)
  5. compute the GC%, g_c_content = $ frac{(dna.count(“C”) + dna.count(“G”))}{len(dna)} /$
  6. print GC%

Appendix C: Just Enough Git

What is GIT?

  • Git is one of MANY in a series of programs called Version Control Systems. Version control software (in the simplest terms) is a program to help one keep track of and organize writing program(s).
  • Since this is a JUST ENOUGH approach, I will go over enough so that you can start working and learning git on YOUR own account.

C.1 - Get Git

NOTE: If you previously loaded Anaconda Python you already have Git installed.
Check to see if you have git.
  • Go to: Terminal shell
  • Enter: git --version
    • git version 2.7.4

Download Git

  • Otherwise, obtaining Git is easy.
git-scm.com Front Page
git-scm.com Front Page

C.2 - Set up your first Project / Repository

  1. Sign into your GitHub account.
  2. Press: Start a project
Git First Page - <press> Start a project
Git First Page - <press> Start a project
  1. On the “Create a new repository” page: Fill in…
  2. Repository name: First project
  3. Description: Add you keyowrds here,
    • e.g. Bioinformatics Assembly of RNA-Seq using Python
  4. Initialize this repository with a README:
    • This should be any length you need to fully describe your work to anyone “walking off the street”.
  5. Press: Create repository
Create and Describe You Repository - <press> Create repository
Create and Describe You Repository - <press> Create repository

C.3 - Cloning a repository

  1. Copy the web address of new repository,
  2. See: Red Highlighted: Clone with HTTPS
  3. Go to your Working Directory, AND
  4. Open your Linux/Mac/Windows terminal window or shell:
  5. Type: git clone https://github.com/YOUR-USERNAME/YOUR-REPOSITORY
  6. Enter
Cloning a Repository
Cloning a Repository

C.4 - The 5 most useful git commands

  • Add files to your working directory as needed.
  • When you are ready to upload your work to Github…
  • Use the commands below to upload your work.
Adding new material to you GitHub account
  1. git status # You will see any files the need to added or removed here.
  2. git add * # if you have added files simply use * (the asterick for all files)
  3. git rm /dir/file_name.ext # to delete files use rm (remove)
  4. git commit -m “describe change” # Use QUOTES when describing your work.
    • At this point, you will be prompted for your Username and Password
  5. git push # This pushes your changes from your local directory (Your Computer) to GitHub. This completes the addition of a file(s) to your GitHub account.

Note: This is the minimum to get you started. As you need to use more functionality you can search for it via Google or the GitHub help site.

Git Help Page
Git Help Page