This document is an incomplete list of suitable topics to study to learn the basics of DNA sequencing-based bioinformatics.
As most bioinformatics tools are written to run in a Linux environment, it is important to learn how to work on the command line. In addition, high-performance computing resources are normally accessed via a terminal interface. There is a lot to learn, but after learning the 20 or so most used commands you can start to be productive on the command line!
- Basic file handling and navigation: `ls`, `cd`, `mkdir`, `cp`, `mv`, `rm`, `cat`, `less`/`more`, `chmod`, etc.
- At least one terminal editor: `nano`, `emacs`, `vim`
- Learn to use terminal multiplexers: `screen` or preferably `tmux`
- Advanced tools:
  - Pipes, output redirection (`<`, `>`, `stdout`, and `stderr`)
  - Shell scripts
  - Regular expressions in `grep`, `sed`, `awk`...
  - Non-standard power tools, e.g. GNU `parallel`, eBay's `tsv-utils`, etc.
- Learn to work remotely over SSH
  - Connect to remote computers
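
The difference between `stdout` and `stderr` is easier to remember once you have seen the two streams captured separately. The following is a minimal sketch in Python (3.7 or later), not a replacement for learning shell redirection; the child command is a made-up one-liner used only for illustration.

```python
import subprocess
import sys

# Hypothetical child process: writes one line to stdout and one to stderr.
child_code = (
    "import sys; "
    "print('result line'); "
    "print('warning line', file=sys.stderr)"
)

# capture_output=True keeps the two streams separate, much like
# redirecting with `> out.txt 2> err.txt` in the shell.
completed = subprocess.run(
    [sys.executable, "-c", child_code],
    capture_output=True,
    text=True,
    check=True,
)

stdout_text = completed.stdout
stderr_text = completed.stderr
print("stdout:", stdout_text.strip())
print("stderr:", stderr_text.strip())
```

Many bioinformatics tools write results to `stdout` and progress messages to `stderr`, which is why a pipe normally passes on only the results.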
ExplainShell does an amazing job at explaining the different components of a command line. Try it out!
- The Linux Command Line
- Bash Pocket Reference
- Linux, MacOs, Windows and more command line reference - Nothing extra
Knowing basic programming is essential for a bioinformatician. Programming is often used to handle input and output files, pre-process data files, create plots, and create workflows that run several different tools in a specified order. A well-constructed bioinformatics data analysis is reproducible, meaning that anyone can run the same analysis on a different computer using the same input files and get the same output. This is challenging in practice, but it is important to treat all scripts written in the course of a bioinformatics analysis project as the "log book" or "lab book" of how the analysis was actually performed. And while it can be rewarding to do a quick analysis of some output files at the command line, you should make it a habit to always put everything you do to the data in a script that you can come back to in the future, when you have forgotten exactly what you did.
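
As an illustration of the "lab book" idea, here is a minimal sketch of a small pre-processing script that reports the length of each sequence in a FASTA file. The function names and the example data are made up for this illustration; in a real project the input would be a file on disk and an established parser (for example from Biopython) would probably be a better choice.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) tuples from an open FASTA file handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def sequence_lengths(handle):
    """Return a list of (header, length) pairs, in input order."""
    return [(header, len(seq)) for header, seq in read_fasta(handle)]


# Made-up example data standing in for a real input file.
example_fasta = io.StringIO(">seq1\nACGT\nAC\n>seq2\nGGG\n")
lengths = sequence_lengths(example_fasta)
for header, length in lengths:
    print(f"{header}\t{length}")
```

Because every step is written down in the script, rerunning it on the same input gives the same table, which is the point of keeping the analysis in code.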
There are several revision control systems that you can use to maintain a versioned history of, for example, program code. The most popular version control system in widespread use today is Git. Three common places to publish code are GitHub, GitLab, and Bitbucket. They all work pretty much the same.
There are some very good guides and tutorials listed below that will introduce you to the vocabulary and concepts concerning version control. Version control rocks(!) and is a crucial tool in a bioinformatician's tool belt, so take the opportunity to learn it as soon as possible. When you start writing code, you will eventually encounter a situation where you want to make changes to the code, but without losing the older version (that you know worked). Version control makes it possible to go back in time to older versions of the code, without having to mess with copies of files called `my_code_version-20181015_final_final2.py`. It will make your life so much easier and you will enjoy it!
Here are some nice introductions to version control:
- git - the simple guide
- GitHub resources for learning git
- Blog post: A visual guide to version control
- Blog post: Source control for scientists and soloists
- Online tutorial: Atlassian Git tutorials and introduction to workflows
GitHub has some nice resources for research/education. Check out their education portal! You can also get a free researcher account that enables unlimited free private repositories.
Python is the most common (and in my opinion the easiest to learn) programming language. It is typically available on all Linux systems. Some resources for learning about Python in general:
- https://automatetheboringstuff.com/
- http://rosalind.info (specifically "Python Village", but then later also "Bioinformatics Stronghold")
There was a big debate a couple of years ago about which version of Python to learn. That discussion is no longer relevant: you should learn Python 3, starting with the latest available version (3.6 or later). There are several ways to download and install Python, but I recommend learning to use conda. The conda getting started guide is OK once you are familiar with the command line.
Unfortunately, there is no de facto standard integrated development environment (IDE) for Python as there is for R (i.e. RStudio, see more below). The most common alternatives are probably Microsoft's Visual Studio Code and JetBrains' PyCharm; both are great and cross-platform. VS Code is free for everyone, and a free community edition (without professional support) is available for PyCharm. Another important Python programming tool you should learn is Jupyter. It is a tool for working with interactive programming notebooks where you can combine blocks of Markdown-formatted notes with individually executable code blocks (with inline plots!). It is actually not specific to Python: it grew out of the IPython notebook, its name refers to the languages Julia, Python, and R (JuPyteR), and it now runs more than a hundred different language kernels. It is often used in bioinformatics analyses and is becoming increasingly common as a way to share how analyses and plots were made for scientific papers.
R is by far the most commonly used language/environment for any type of data analysis that requires statistics. A bioinformatician has to be familiar with R. There is a very good Integrated Development Environment (IDE) available for R: RStudio. Ensure you become familiar with R, RStudio, and R Markdown (kind of like Jupyter notebooks, but focused on R).
There are several kinds of database systems, but the most common are relational database systems (often called SQL databases). Others, especially NoSQL databases, are gaining popularity (MongoDB is a NoSQL database that is seeing some use in bioinformatics applications). A bioinformatician can definitely benefit from learning the basics of SQL and a NoSQL system.
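
To get a feel for SQL without installing a database server, you can use the `sqlite3` module from Python's standard library. The following is a minimal sketch; the table name, column names, and rows are made up for illustration.

```python
import sqlite3

# An in-memory database; nothing is written to disk.
connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE samples ("
    "sample_id TEXT PRIMARY KEY, site TEXT, read_count INTEGER)"
)

# Made-up rows for illustration only.
rows = [
    ("S1", "gut", 120000),
    ("S2", "gut", 95000),
    ("S3", "skin", 40000),
]
# The ? placeholders let sqlite3 handle quoting of the values.
connection.executemany("INSERT INTO samples VALUES (?, ?, ?)", rows)

# Aggregate with GROUP BY: number of samples and total reads per site.
query = """
    SELECT site, COUNT(*) AS n_samples, SUM(read_count) AS total_reads
    FROM samples
    GROUP BY site
    ORDER BY site
"""
summary = connection.execute(query).fetchall()
connection.close()

for site, n_samples, total_reads in summary:
    print(site, n_samples, total_reads)
```

The same `SELECT ... GROUP BY` statement would work largely unchanged in other relational systems such as PostgreSQL, which is one reason the basics of SQL are worth learning.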
Using a consistent coding style is important for code readability (you are going to read your code much more often than you write it). Python has a style document called PEP 8 (Python Enhancement Proposal number 8), which is a great starting point for a standard Python coding style. Every Python programmer should read PEP 8 and try their best to follow it, so that other Python programmers can easily read and understand their code. In addition, have a look at The Zen of Python (i.e. PEP 20).
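
As a small illustration, the sketch below follows a few common PEP 8 conventions: lowercase names with underscores, four-space indentation, two blank lines after a top-level function, and a docstring. The function itself is a made-up example, not taken from any particular library.

```python
def gc_content(sequence):
    """Return the fraction of G and C bases in a DNA sequence.

    An empty sequence gives 0.0 rather than raising an error
    (a choice made for this example).
    """
    if not sequence:
        return 0.0
    upper_sequence = sequence.upper()
    gc_count = upper_sequence.count("G") + upper_sequence.count("C")
    return gc_count / len(upper_sequence)


example_gc = gc_content("ATGCgc")
print(example_gc)
```

Tools such as `flake8` or `pycodestyle` can check a file against these conventions automatically.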
There are style guides for R and SQL as well. A decent style guide for R is explained in the R for Data Science book (see link below).
Workflow managers are tools that help you write reliable and easy-to-use bioinformatics workflows. They make it easy to run several different programs after each other, or sometimes in parallel. This is an advanced bioinformatics topic that will be most useful after you have learnt the basics of programming (in either Python or R) and started using established bioinformatics tools to process your data.
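
The core idea behind a workflow manager, running each step only after the steps it depends on, can be sketched with Python's standard library (`graphlib`, Python 3.9 or later). This is only an illustration of the principle: the step names are hypothetical, and real workflow managers such as Snakemake or Nextflow also handle files, parallel execution, restarts, and much more.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical workflow: each step maps to the set of steps it depends on.
dependencies = {
    "quality_filter": {"download_reads"},
    "map_reads": {"quality_filter", "build_index"},
    "build_index": {"download_reference"},
    "count_reads": {"map_reads"},
}

executed = []


def run_step(name):
    """Stand-in for launching a real tool; here it only records the name."""
    executed.append(name)
    print(f"running {name}")


# static_order() yields every step after all of its dependencies.
for step in TopologicalSorter(dependencies).static_order():
    run_step(step)
```

Steps that do not depend on each other (here, for example, `quality_filter` and `build_index`) are the ones a workflow manager could run in parallel.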
- Ordination: PCA/PCoA, NMDS, t-SNE, OPLS-DA etc.
- Classification: LDA, Decision trees, Random Forests, SVM, ANN, ROC curves, supervised/unsupervised, etc.
- Clustering: Hierarchical, k-means, etc.
- GUSTAME is a very useful field guide to multivariate statistics
- Simple hypothesis tests (t-tests, etc.)
- Multiple testing correction: Bonferroni, FDR
- ANOVA
- Regression
- Differential abundance testing
- Published article: ANCOM
- Official tutorial: DESeq2 (for RNA-seq, but applicable to metagenomics as well)
- Official user's guide: edgeR
- Build your own differential abundance tool
- TileStats Videos
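
To make the multiple testing correction item above more concrete, here is a minimal sketch of Bonferroni and Benjamini-Hochberg (FDR) adjustment in plain Python. The p-values are made up, and the function names are chosen for this example; for real analyses, established implementations such as `p.adjust` in R or `multipletests` in statsmodels are a safer choice.

```python
def bonferroni(p_values):
    """Bonferroni adjustment: multiply by the number of tests, cap at 1."""
    n_tests = len(p_values)
    return [min(1.0, p * n_tests) for p in p_values]


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values (FDR), in the input order."""
    n_tests = len(p_values)
    # Indices of the p-values sorted from largest to smallest.
    order = sorted(range(n_tests), key=lambda i: p_values[i], reverse=True)
    adjusted = [0.0] * n_tests
    running_min = 1.0
    for position, index in enumerate(order):
        rank = n_tests - position  # rank 1 = smallest p-value
        candidate = p_values[index] * n_tests / rank
        # Taking the running minimum keeps adjusted values monotone.
        running_min = min(running_min, candidate)
        adjusted[index] = running_min
    return adjusted


raw_p = [0.01, 0.04, 0.03, 0.20]  # made-up p-values
bonferroni_p = bonferroni(raw_p)
bh_p = benjamini_hochberg(raw_p)
print(bonferroni_p)
print(bh_p)
```

Bonferroni controls the chance of any false positive and is therefore stricter; Benjamini-Hochberg controls the expected fraction of false positives among the reported hits, which is usually the more practical choice when testing thousands of taxa or genes.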
- Book: Python for Data Analysis
- Book: Python Data Science Handbook
- Blog post: Greg Reda's Intro to pandas data structures
- YouTube channel: Kevin Markham's pandas video series
- Package docs: Pandas' 10 minutes to pandas
- Package docs: Seaborn
- Package docs: Jupyter
- (Online) book: R for Data Science (If you're only going to read one book about R; this is the one!)
- (Online) book: Orchestrating Microbiome Analysis
- Online course: Coursera R-programming
- Online course: Data Science in a Box
- Online course: [Stanford's Mini-Courses, Practical Relational Databases and SQL](https://lagunita.stanford.edu/courses/DB/2014/SelfPaced/about)
- SQL
- Python documentation: SQLite3
- Online tutorial: PostgreSQL
- NoSQL
- Official website: MongoDB
- Rob's computational genomics manual
- sib-swiss training collection
- introduction to applied bioinformatics in Python
- Common file formats:
- Quality assessment
- Adapter trimming and quality filtering
- Mapping/aligning reads (maybe also something general on sequence alignment)
- Assembling reads
- Phylogenetic trees
- QIIME2 (again)
- Mothur
- Microbiome analyst
- The Huttenhower Galaxy server
- Take a look at Luisa's review
- Taxonomic profiling
- Using marker genes: MetaPhlAn2 and mOTU
- Using whole genome references: Kraken, Kaiju, etc.
- Functional profiling
- HUMAnN2
- SUPER-FOCUS
- Mapping to gene database (e.g. IGC)
- Metagenome assembly
- Binning
- Metagenome-assembled genomes, MAGs
- seqanswers.com
- biostars.com
- SciLifeLab Slack
- wiki for metagenomics-related terminology