Collection of Python and other data analysis resources

deeenes/bioinfo-tools

 
 


Python and R resources for beginners...

...with special regards to science and data analysis applications, visualization and graphics

This collection was started by Dávid Fazekas and is continued by Dénes Türei. Submissions are welcome.

Here you find a list of useful resources for learning Python or R, as well as a virtual environment where many tools are readily available for you. You will find this collection helpful if you are a scientist who has just realized that you need computational tools for processing your data, if you have been doing this for a while but are looking for new tools and alternatives, or if you would just like to start learning Python or R for any other purpose.

Setting up a Python environment

This repo contains a virtual environment created by Dávid Fazekas and set up for easy installation. It has been unmaintained for a couple of years, so I cannot guarantee that it works. The idea of this virtual environment is that you can start using Python without learning how to install Python and its modules, which is not always straightforward and might even require some system administration knowledge. Of course you can learn these later, but the virtual environment provides a little help for a quick start. To install this environment on your own computer, follow the description.

Alternatively you can install Python and modules the traditional way. See below.

General points

  • Python has two major versions: Python 2 and 3. These are incompatible, although it is possible to write code which runs in both. If you have both of them installed, they reside in their own directories and you install modules for them independently. It is highly recommended to use only Python 3 today. Almost all the important modules have already been ported to Python 3. Python 2 exists only to keep it possible to run old code when it's really necessary. The most important science and data analysis modules like numpy and scipy have already dropped Python 2 support.
  • On some Linux distributions and on Mac OS X, Python 2 is still the default. Sometimes the python command in the shell is a link to either python2 or python3. The same holds for pip/pip2/pip3. Check your system and be aware of where your python and pip executables are and where the site-packages or dist-packages directory is (where the modules are installed).
  • Always know which Python distribution you use and where you install the modules. If you just call pip install numpy and then start a Python shell from the Anaconda distribution, don't be surprised if import numpy gives a ModuleNotFoundError. If you are in doubt, you can find out with which python, with something like ls -l /usr/bin/python, and in similar ways.
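The last point can also be checked from inside Python itself. The following is a minimal sketch using only the standard library; the helper names describe_python and module_location are made up for this illustration, and numpy is used only as an example module name.

```python
import importlib.util
import sys
import sysconfig


def describe_python():
    """Collect the facts that tell you which Python you are actually running."""
    return {
        "executable": sys.executable,       # path of the running interpreter
        "version": sys.version.split()[0],  # version number of that interpreter
        "prefix": sys.prefix,               # root of the installation or virtualenv
        # directory where pure-Python modules of this interpreter are installed
        "site_packages": sysconfig.get_paths()["purelib"],
    }


def module_location(name):
    """Return the file a top-level module would be imported from, or None."""
    spec = importlib.util.find_spec(name)
    return None if spec is None else spec.origin


if __name__ == "__main__":
    for key, value in describe_python().items():
        print(f"{key}: {value}")
    # "numpy" is only an example; use any module you tried to install.
    print("numpy:", module_location("numpy"))
```

If the module location is None in the interpreter you actually use, the module was probably installed for another Python. Calling pip as python -m pip install numpy with the same python executable avoids this mismatch.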

Linux

In Linux distributions you will find up-to-date and well maintained packages for both Python 2 and 3 and also for many modules. One issue is that pip and the distribution's package manager don't know about each other: they will complain if files already exist (because the other package manager installed them) and won't recognize a dependency that is already installed by the other manager. Most often I just give the force option to overwrite the files.

Mac OS X

OS X comes with a built-in Python 2, which is sometimes quite old. Anyway, you will probably want to install an up-to-date Python 3 distribution. The most convenient way is to install a package manager for OS X (the most popular is HomeBrew, another one is MacPorts) and use it to install Python 3 and many modules.

Python on Windows

Install Python with the provided installer and don't forget to tick the "include in the path" box. You might also consider installing cygwin and git (actually the git installer already brings its own Bash environment). This way you will have Bash and git, which are essential for development. See more about Windows here.

pip

pip is the most often used package manager for Python. If pip does not come with your installation, you can install it with python -m ensurepip (the older easy_install pip is deprecated) or with your operating system's package manager.

Editor

If you are about to start writing code, it is important to have a good text editor. The following features make a good text editor:

  • Syntax highlighting: it automatically colors the different elements of the language, so it is easier for you to recognize them and read the code.
  • Autocompletion: it automatically offers suggestions to complete words while you are typing, so you don't need to type a long function or variable name more than once.
  • Automatic indentation and automatic closing of parentheses: these features make writing code even more convenient.
  • Nice color scheme: it is important to have a color scheme which has appropriate contrast and is gentle on your eyes (usually color schemes with a dark background).
  • Line numbering: you must have the lines numbered so you can easily find the line blamed by the error message.
  • Search and replace, also by regular expressions, and a key combination to go to a given line.

In Linux this is usually not a problem, as the default editors like gedit can be tuned to be quite good. Personally I use Kate from KDE. For Mac and Windows you need to install one; for example, Notepad++ is popular for Windows and TextMate is popular for Mac. Also see the IDEs listed below, and you can consider the new JupyterLab, which is an IDE in the web browser.

Anaconda

Anaconda is a Python and R distribution and package manager for science and data analysis. They promise standardized package management which makes collaboration and deployment easier. Some people like it; I've never used it. For me pip and the system's package manager have always been easy to use and sufficient.

Where to start?

Non Python specific but important resources

  • http://stackoverflow.com/ - This very important resource is worth a special mention. As you will see, if you google any programming issue, in 90% of the cases you will end up on this site. SO is a Q&A (question and answer) site where anybody can ask programming related questions and answer or comment on others' questions. Users collect reputation points for their contributions, which has made it a very efficient platform for community building around mutual help. Your question will likely be answered very quickly, but be careful not to ask something that is already covered by answers to other questions.
  • https://bitbucket.org/ - We suggest that you familiarize yourself with version control frameworks as soon as possible. Nowadays the most popular is git. Version control helps you to keep track of changes, keep your project files in order, back up your work often, avoid data loss, collaborate, and share your code in a standard and convenient way. BitBucket allows you to create more private repos than the most well known git server, http://github.com/.
  • https://maryrosecook.com/blog/post/git-from-the-inside-out - If you write code, please start using git. The sooner the better; you can commit even your random exercises to a git repo. Here is an in-depth introduction to git, starting from the basics, by Mary Rose Cook.

Interactive Python learning platforms

Programming exercises

When you write code with the aim of learning, it is often difficult to find a task: you want to code, but don't know what to code. In Euler Problems you find hundreds of small mathematics problems, each of which you can solve in just a few lines of code, ideal even if you have only half an hour for practicing. As you develop, you can return to already solved problems and find better and nicer implementations.
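To give an idea of the size of such an exercise, here is a sketch of a task in the style of the first, well known problem: sum all natural numbers below a limit that are multiples of 3 or 5. The function names are my own choice; the second function shows the kind of "nicer implementation" you might come back to later.

```python
def sum_of_multiples(limit):
    """First attempt: test every number below `limit` one by one."""
    return sum(n for n in range(limit) if n % 3 == 0 or n % 5 == 0)


def sum_of_multiples_fast(limit):
    """Later improvement: use the closed formula of an arithmetic series."""

    def series(k):
        # Sum of all multiples of k below `limit`: k * (1 + 2 + ... + m)
        m = (limit - 1) // k
        return k * m * (m + 1) // 2

    # Multiples of 15 are counted in both series, so subtract them once.
    return series(3) + series(5) - series(15)


print(sum_of_multiples(10))       # 3 + 5 + 6 + 9 = 23
print(sum_of_multiples_fast(10))  # same result without a loop
```

Both versions give the same result, but the second one does not need to visit every number, so it stays fast even for very large limits.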

Python tutorials

Python resources

Here we list blogs and essays which are not primarily tutorials, but give an introduction or insight into specific topics.

Online Courses

Talks

Programming languages and paradigms

Python environments

  • http://www.bpython-interpreter.org/ - Nice, colorful command line environment with smart autocompletion and built in help functions.
  • https://gist.github.com/lonetwin/5902720 - You can easily customize your Python shell by editing your ~/.pythonrc file. For example, copy the one in this git repo into yours, and you will have a colored shell with autocompletion.
  • https://www.pythonanywhere.com/ - A full Python environment in the cloud with lots of libraries and many Python versions available. You can write Python scripts in the browser, and even deploy your application as a webpage. Free plan is available.
  • https://jupyter.org/ - Interactive Python environment in the browser: Python runs in the background on your machine, and you write the code and get the output in the browser, in so-called notebooks. Note: it was formerly known as IPython and was renamed when it became language agnostic (originally it was only for Python, but now it can also be used with other languages, for example R).
  • https://blog.jupyter.org/jupyterlab-is-ready-for-users-5a6f039b8906 - A complete IDE developed from Jupyter.
  • https://www.continuum.io/why-anaconda - Python environment intended for science and data analysis, with easy availability of relevant modules (at least in theory: eventually installation might be more complicated).
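As an illustration of the ~/.pythonrc idea mentioned in the list above, here is a minimal sketch of what such a startup file could contain. It is not the file from the linked gist, and the function name enable_tab_completion is invented for this example. Python reads the file at the start of an interactive session if the PYTHONSTARTUP environment variable points to it. Recent Python 3 versions usually enable tab completion on their own, so the sketch mainly shows the mechanism.

```python
# Hypothetical contents of a startup file such as ~/.pythonrc
try:
    import readline
    import rlcompleter
except ImportError:
    # The readline module is missing on some platforms, e.g. plain Windows.
    readline = None


def enable_tab_completion():
    """Turn on tab completion; return True if it could be enabled."""
    if readline is None:
        return False
    readline.set_completer(rlcompleter.Completer().complete)
    # libedit (common on macOS) uses a different binding syntax than GNU readline.
    if "libedit" in (readline.__doc__ or ""):
        readline.parse_and_bind("bind ^I rl_complete")
    else:
        readline.parse_and_bind("tab: complete")
    return True


enable_tab_completion()
```

To try it, save the code to a file and start the shell with PYTHONSTARTUP set to its path; anything else you put in the file (imports you always need, helper functions) will be available in every interactive session.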

Generic Python modules

Python modules for data analysis

Pandas alternatives

Python visualization and plotting

We have seen a number of efforts emerging in the past years with the aim of providing powerful data visualization in Python, so that scientists and data analysts would not need to envy their colleagues who use R. Perhaps the perfect ggplot2 or lattice equivalent is still to come (although two very fresh libraries, Altair and Plotnine, are promising), but each of the frameworks listed below is very good at certain tasks and of course has its limitations. Thus, it is difficult to choose a plotting library; likely you will try several of them.

Graphs (networks)

Alternative network visualizations

Visualization in general

About particular visualization methods

Other tools for graphics and typography: post processing figures, designing slides, posters and figures, typesetting reports, papers, theses and books

  • https://www.typewolf.com/cheatsheet - Orient yourself quickly regarding the correct use of various quotation marks, hyphens, dashes and other special characters. Did you know that 5 different single quote marks and 5 different dashes exist?
  • https://practicaltypography.com/ - A more detailed, but still quite concise guide to basic typography, by Matthew Butterick. This link will first redirect you to the donation page; please consider donating to the author, though you can access the book without any donation, simply by navigating to the main page.
  • https://github.com/gztchan/awesome-design - A curated collection of graphic design resources from Tony Chan.
  • https://css-doodle.com/ - A library for drawing static or animated vector graphics with CSS within HTML files.
  • http://inkscape.org/ - Inkscape is a professional vector graphics editor and a free alternative to Adobe Illustrator. Its default format is standard SVG, and you can import and export many other formats, including of course PDF. Of note, you really should not pay $240 per year to a company for an application only to design vector graphics. Nor should you make your institute pay it for you. This way you only chain yourself to Adobe, and the more time and energy you invest in learning this sophisticated application, the more it constrains you to keep using it and keep paying for it. What's the point in this when an excellent free and open source alternative is available? Inkscape is powerful, and what you learn and achieve using Inkscape remains yours forever.
  • http://gimp.org/ - GIMP stands for GNU Image Manipulation Program. It is a professional bitmap graphics editor and a free alternative to Adobe Photoshop. What I wrote about Inkscape and Illustrator as alternatives holds completely for GIMP and Photoshop as well.
  • https://www.latex-project.org/ - LaTeX is the best tool and the state of the art standard for scientific typesetting and publishing. It was created in the 80s by Leslie Lamport and has been developed by the scientific community with the aim of having a tool which completely fits their needs. LaTeX is free and open source software built on top of TeX, which was created by Donald Knuth, also addressing the needs of scientific typography. The difference is that TeX does very basic, elementary things in the background, like how to fill a line of text with symbols, while LaTeX provides macros for more complex tasks, like how to size and position items of a list or titles on a page in order to make it look good. I cannot say LaTeX is an alternative to Adobe InDesign, but for scientific publishing it is definitely superior. You can of course use it for typesetting fiction books or entertainment magazines, but you may have more difficulties, and if you just want an open source alternative for this, there is Scribus. If you need help with LaTeX, don't go to StackOverflow but to its sister site dedicated to LaTeX. One more important thing: you cannot simply download and install LaTeX; it comes packaged in many different distributions, including different fonts, typesetting engines and macro packages. First you should look up which of these are available for your operating system. Many templates and styles are also available; for example, journals usually have their own article style, and universities their own presentation and thesis styles.
  • https://www.wikiwand.com/en/Beamer_%28LaTeX%29 - Beamer is a LaTeX package for creating presentations. In my opinion, for scientific presentations it is much better than PowerPoint and Keynote, which are the most awful applications I have ever seen, and I am really happy I could completely avoid them in the last 15 years. Beamer is also free and open source software. Most of the default themes look old-fashioned and not very nice, but you can easily modify them to have something better looking. See examples here, here, here, here, here, here, here or here. If you work at EMBL or in the Saez-Rodriguez Group at RWTH Aachen University you can find my Beamer theme, slightly modified from PaloAlto, in my git repos there: @EMBL or @Aachen. Other notes: the final format of your slides will be PDF, which is perfectly cross-platform. You should check, or ask the tech support, what the aspect ratio of the projector in your lecture room is. Prepare wide-screen (16:9 or 16:10) slides if those fit, as you have more space this way. Also check if your connection can transmit this resolution, and have an HDMI cable and adapter with you if necessary. VGA cables are sometimes limited to 4:3 aspect ratio and 1024x768 resolution, which is quite poor.
  • http://www.texstudio.org/ - TeXstudio is a great editor for LaTeX. It comes with autocompletion, built in help, embedded compilation tool, PDF viewer and many other handy tools.
  • http://gpick.org/ - Gpick is a nice little color picker and palette editor application for Linux by Albertas Vyšniauskas. I use it with great satisfaction to create palettes which I later use in R, Python, Inkscape, GIMP or wherever else. I don't know about alternatives for Mac or Windows, but there definitely are some.
  • https://www.zint.org.uk/manual/chapter/3 - Zint is a powerful and well designed QR code generator application. It has CLI, GUI (Qt) and web interfaces. QR codes are straightforward and the world is full of solutions to generate them, but most of these are not free or are very limited in features. Why not simply use one of the best and totally free solutions?

TikZ

Pictograms, icons, patterns, reusable graphics

When creating graphics, it is really time consuming to draw everything from scratch. Instead, it is convenient to use reusable elements, for example a pictogram symbolising a mouse, a dotted pattern, a Twitter logo, etc. There are plenty of resources where you can find these things for free.

  • https://fontawesome.com/ - A collection of 14k+ nicely designed pictograms (icons). Packaged as a font, which makes its use even easier in almost any medium: HTML, SVG, LaTeX.
  • https://freesvg.org/ - More than 150k free SVG drawings. Most of them are low quality, but many of them are still useful, e.g. patterns or single color pictograms.
  • https://svgsilh.com/ - Similar to the previous one, with better quality on average.
  • https://www.svgrepo.com/ - Another free SVG repo with 300k+ items.

Fonts

A good choice of fonts can really make a piece of graphics stand out. The main difficulty is finding the right font at the right time: browsing through thousands of fonts is very slow and inefficient, while good public tagging or classification systems are not really available so far.

  • https://www.fontsmith.com/blog/2019/06/24/a-guide-to-type-styles - A short introduction about font styles: quickest way to get the basic terminology (especially the PDF available at the bottom of the page).
  • https://www.fontsquirrel.com/ - A great collection of quality free fonts.
  • https://fonts.google.com/ - More than 1.5k free font families. Thanks to the API, it's easy to embed the fonts into webpages and they are readily available in some font manager apps. The web interface is convenient for browsing the fonts and taking them to a test drive.
  • https://www.myfonts.com/ - Huge database of commercial fonts. A few of them are free, but typically the less valuable ones or a few specimens from larger families.
  • https://fonts.ilovetypography.com/ - Another large commercial font store and a nice typography blog.
  • No link here, but you can find many font sharing forums on the Russian Facebook, vk.com. Just search for them with Google.
  • https://fontba.se/ - This program has a $30/y subscription fee, but it offers crowd sourced and AI enhanced tagging. It's also faster than the following two.
  • https://github.com/FontManager/font-manager - A free alternative, capable of everything that fontba.se does, except that you can tag only manually.
  • https://github.com/fontmatrix/fontmatrix - Another free font manager, great for organizing your font collection.
  • https://fontforge.org/ - A font editor full of features, fast and easy to use. Comes in handy if you want to add a missing accent, modify a symbol, etc.
  • https://birdfont.org/ - Great free and open source font editor; it looks pretty, but feels less responsive than FontForge.

R blogs and tutorials

Connecting Python and R

Statistics

These are not Python related but generic.

P-values

Debate in Nature Methods

Others

Suggestions for new p-value

IDEs (integrated development environments)

IDEs help you to keep track of files in your project, their history, dependencies, testing, outputs, etc.

Python IDEs

R IDEs

Image processing

In biology, computational analysis of images is often unavoidable. With high-throughput microscopy you can acquire hundreds of images in just one hour, and obviously you cannot do all the adjustments and measurements by hand. Luckily there are a number of easy to use tools around. You can identify structures, make measurements and finally arrive at quantitative data from images. ImageJ is very popular and can be programmed in its own macro language or in many other languages including Python. But if you want to use Python, it may be better to go for Python image processing modules like scikit-image, OpenCV or ITK.

Chemistry

Chemical typography

Here we list a couple of tools for high quality drawing of chemical formulae and schemata. A popular proprietary tool for this purpose is ChemDraw; however, it costs 600 (1 year) to 6,500 (unlimited) USD. There are a number of free GUI software packages of similar design, but I found them especially difficult to use and lacking essential features. One proprietary but free solution is Marvin Sketch (see below). Some further software are for 3D drawing; among these vimol is a nice piece (see below). However, as always, in LaTeX you can find high quality, free and full featured solutions. In addition, you can draw structures with OpenBabel or RDKit from any language they support, including Python; for this you need to create the structure or download it from a database.

LaTeX

  • https://tex.stackexchange.com/questions/52722/can-you-make-chemical-structure-diagrams-in-latex - For a general guidance please read both answers to this question. They give an overview of the alternatives with minimal examples and opinions.
  • http://xymtex.com/fujitas3/xymtex/indexe.html - XyMTeX is a software for drawing chemical formulae and schemata. Developed by an old professor in Japan, Shinsaku Fujita, it seems to be the ultimate "if you can't do something with this, probably there is no way to do it" solution. XyMTeX is huge, its manual is huge (780 pages!), its features are countless and its output is of the highest quality. Of course, as always, using such a great tool requires learning, and there are some smaller and simpler solutions available.
  • https://ctan.org/pkg/chemfig - Another great tool with a good manual, actively maintained and maybe easier to use than XyMTeX, though its drawings are maybe not so perfect. By Christian Tellechea.
  • https://github.com/aminophen/chemobabel - Generates graphics within LaTeX directly from SMILES or ChemDraw files. Uses OpenBabel for rendering.

Books

Python beginner and intermediate books

Advanced Python

R books

Lectures

Podcasts

Miscellaneous

  • https://awesomeopensource.com/ - A huge catalogue of open source projects, sometimes only short recommendations like here, sometimes great tutorials and examples.

About programming

Here we collect blogs, essays and other resources about programming, coding and software development in general: design patterns, attitudes, trends, history, etc.

Coding style

Development tools

Documentation

R package documentation

R analysis documentation

  • https://workflowr.github.io/workflowr/ - workflowr is a pkgdown wrapper for analysis projects. Its main advantage is the nice presentation: the analysis is presented as a good looking web page. It also offers entry points to organize the project and re-run specific sections of the pipeline. It presents the git history on the analysis webpage and notifies you if some parts of the presented results are outdated. A disadvantage is that it doesn't follow the R package layout, which makes it difficult to use many other tools.

Reproducibility

R

  • https://rstudio.github.io/renv/articles/renv.html - renv is a dependency management tool for R. In your scripts use the package::function notation or use library() calls. When you're about to submit your analysis, run renv::init (later for updating use renv::snapshot). Include the renv.lock file in your repo, so anyone who wants to reproduce your analysis will be able to use all the required packages with the exact same versions that worked for you.
  • https://workflowr.github.io/workflowr/ - workflowr is also related to reproducibility, see details above in the documentation section.

Performance

Most of the time scripting languages like R and Python provide sufficient performance to run our tasks on our laptop, and if that's not enough, we can just go to the big computer cluster of the institute. The performance intensive methods are often contained in extensions written in C or Fortran, and the scripting language is only the glue holding together our current particular arrangement, adding easy access and flexibility to the efficient core methods. However, you might find yourself in a situation where your code runs much slower than ideal, especially if you need a new, computationally intensive method which is not yet implemented in any package with bindings to your scripting language. So what to do about performance issues?

Profiling

The first thing you should do is profiling. In Python the cProfile module is fantastic for this purpose. At the end you get a report of the function calls, with the number of calls and the time spent in each function. Based on this you can go back to the code and try to improve the functions which are bottlenecks: those which consume most of the time or are called most often.
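A minimal sketch of such a profiling session with cProfile and pstats from the standard library might look like this. The function slow_sum_of_squares is a deliberately naive toy example and profile_call is a helper name invented here; in practice you would profile your own entry point.

```python
import cProfile
import io
import pstats


def slow_sum_of_squares(n):
    """Deliberately naive: builds a list first, then sums it."""
    squares = []
    for i in range(n):
        squares.append(i * i)
    return sum(squares)


def profile_call(func, *args):
    """Run func(*args) under cProfile; return its result and a text report."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # Sort by cumulative time and show only the ten most expensive entries.
    stats.sort_stats("cumulative").print_stats(10)
    return result, buffer.getvalue()


result, report = profile_call(slow_sum_of_squares, 100_000)
print(report)
```

In the report, the columns ncalls, tottime and cumtime show how often each function was called and how much time it took, which is usually enough to spot the bottleneck. For a whole script you can get the same kind of report without changing the code by running python -m cProfile -s cumulative your_script.py.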

Use efficient libraries

While profiling you might have found out where the slow parts of your code are. One of the most effective ways to improve efficiency is to outsource intensive parts to extensions written in C or another high performance language. Make sure especially that vectorizable operations are done by numpy. Similarly, for graph computations igraph is an efficient C extension, and for mass spec methods you can use OpenMS. These are just two examples; you should search for similar extensions for your method.
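To illustrate what "vectorizable operations done by numpy" means, here is a small sketch that centers a list of numbers around their mean, once with a plain Python loop and once with numpy. It assumes numpy is installed; the function names are chosen only for this example.

```python
import numpy as np


def mean_center_loop(values):
    """Pure-Python version: one interpreted operation per element."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def mean_center_numpy(values):
    """Vectorized version: the loop runs inside numpy's compiled code."""
    arr = np.asarray(values, dtype=float)
    return arr - arr.mean()


data = [1.0, 2.0, 3.0, 4.0]
print(mean_center_loop(data))
print(mean_center_numpy(data))
```

On four numbers there is no measurable difference, but on arrays with millions of elements the vectorized version is typically much faster, because the per-element work is no longer done by the Python interpreter.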

Writing C extensions

If sufficient performance can't be achieved due to limitations of the scripting language, and no libraries are available, you can write the computationally intensive parts in C or C++ and create the necessary bindings. It's especially easy to mix Python and C code with Cython, or R and C++ with Rcpp.

Using Rust

Writing C or C++ code is difficult, and actually there are more convenient and more modern alternatives available. One of them is Rust, a new language which has become highly popular among programmers and is also gaining popularity in science.

Regular expression resources

In data analysis we process tremendous amounts of data which are sometimes noisy, and we need to extract information from messy patterns. Sooner or later regular expressions will be among your essential tools, no matter which field and language you work with. Here are a few excellent resources to learn these small tricky things called regex:
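As a small taste of what this looks like in practice, the sketch below uses Python's built-in re module to pull a sample number and a concentration out of inconsistently typed lines. The input lines and the helper name parse_line are invented for this illustration.

```python
import re

# Hypothetical messy input: sample IDs and concentrations typed inconsistently
lines = [
    "sample_01:  12.5 ng/ul",
    "Sample-2 ; 7 ng/uL",
    "SAMPLE 003 =0.75ng/ul",
]

pattern = re.compile(
    r"sample[\s_-]*0*(\d+)"      # the word "sample", a separator, the ID without leading zeros
    r"\D+?"                      # whatever separates the ID from the value
    r"(\d+(?:\.\d+)?)\s*ng/ul",  # the number in front of the unit
    re.IGNORECASE,
)


def parse_line(line):
    """Return (sample_id, concentration) or None if the line does not match."""
    match = pattern.search(line)
    if match is None:
        return None
    return int(match.group(1)), float(match.group(2))


records = [parse_line(line) for line in lines]
print(records)  # [(1, 12.5), (2, 7.0), (3, 0.75)]
```

One pattern handles all three spellings, which is exactly why regular expressions pay off on messy data. Returning None for lines that do not match makes it easy to collect and inspect the lines the pattern missed.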

sed, awk, grep

Collections, cheatsheets

  • https://devhints.io/ - Nicely designed cheatsheets for many tools, mostly web related, but quite some of them also used in data science. For example, xpath is used in a large number of XML processing tools, and on the cheatsheet you can quickly find the syntax elements you need: https://devhints.io/xpath.
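To show how such an XPath expression is used from Python, here is a minimal sketch with xml.etree.ElementTree from the standard library, which supports a limited subset of XPath. The XML snippet and its id values are made up for this illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML document; the structure and the ids are invented.
xml_text = """
<proteins>
  <protein id="P1" organism="human"><name>EGFR</name></protein>
  <protein id="P2" organism="mouse"><name>Egfr</name></protein>
  <protein id="P3" organism="human"><name>TP53</name></protein>
</proteins>
"""

root = ET.fromstring(xml_text)

# All <name> elements below <protein> elements whose organism attribute is "human"
human_names = [el.text for el in root.findall("./protein[@organism='human']/name")]
print(human_names)  # ['EGFR', 'TP53']

# A single element selected by its id attribute, anywhere in the tree
mouse_name = root.find(".//protein[@id='P2']/name").text
print(mouse_name)  # Egfr
```

For more complex expressions than ElementTree supports, third party libraries with full XPath support exist, but the syntax elements on the cheatsheet stay the same.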

Unix, Linux and Bash

Introductory Bash

Basic Bash is essential whatever operating system you use. Linux and Mac come with Bash or a compatible shell as the command line anyway. It is extremely versatile and convenient. You can also have it in Windows if you want, see the section below. If you have access to any computing cluster or file storage server at your institute or university, Bash is most likely the easiest (and often the only) way to access them.

There are so many Bash tutorials for beginners. I think most of them are quite dry, and you will soon get bored or forget what you read. I think you should aim to learn the 10-20 most essential commands first and keep using them actively for many weeks. Then, according to your needs, you can learn more useful things, and what you read will be more digestible for you. Among the 3 materials below maybe the first one is the best in style:

If you prefer to learn from podcasts and videos, you can find a ton of them on YouTube; for example, this channel covers many topics including Linux, Bash, SSH, vim, etc.:

SSH

SSH stands for secure shell, and it gives you a Bash session on a remote computer (e.g. the computing cluster of your institute). It is also able to copy files between your computer and the remote one (scp) and to encrypt any communication between any software on different computers. It is convenient to set up key based authentication for the server so you don't need to type your password any more:

Virtual machines

If you use a Unix-like system, you most likely want to have a Windows virtual machine (VM), and conversely, if you use Windows, you most likely need a Linux VM for the very few tasks which can be performed only with one of these operating systems. Really you should not expect to have such tasks often; most probably you will start your VM only once a month.

The easiest way to create virtual machines is VirtualBox, which has a free and open source edition, but the non-open-source edition is also free. It integrates seamlessly with your main operating system (especially if you also install the guest additions): you can share your directories, USB devices, network, etc. If you switch the VM to full screen, you will have exactly the same experience as if it were your main system.

Windows

In my opinion, by using Windows you make most things more difficult for yourself, because it is designed to restrict your insight into and control over the technology. This way you do many things blindly, without understanding what is happening in the background, and if you encounter a problem, you will have less information available for thinking about a solution. Still, many people use Windows because they feel they have already invested significant time into "learning" it at school and they don't want to learn something new.

If you use Windows as your main operating system, you might want to have some Unix compatible tools to make it more convenient to use. Alternatively you might create a virtual machine with Linux. But in this case I think you need a strong reason why you don't install Linux as your main system.

  • https://www.putty.org/ - A little SSH and Telnet client for Windows. The easiest way to log in to Unix servers (usually the computing clusters of your institute or university). Note: for an SSH login you need at least 3 pieces of information: the address of the server, your user name and your password.
  • https://mingw-w64.org/ - GCC for Windows. You need this if you want to compile software using GCC.
  • https://www.cygwin.com/ - A POSIX compatible environment for Windows. You need this to have all the very helpful tools and the environment you have by default on Linux, and also to run some software which needs a POSIX environment.
  • https://gitforwindows.org/ - Git for Windows. I think an easy, quick solution is to install only this: it comes with its own Bash environment (Git Bash) and offers to set up your paths. At the end you will have a nice environment.

Alternatively you can use the "Windows Subsystem for Linux" or virtual machines to use Linux from within Windows. But in this case, why not run Linux and keep a Windows virtual machine for the rare cases when you might need something from Windows?

  • https://docs.microsoft.com/en-us/windows/wsl/about - Windows Subsystem for Linux is a compatibility layer from Microsoft to run Linux binaries in Windows.
  • https://www.virtualbox.org/ - With VirtualBox you can create virtual computers which are almost completely like real computers, and you can install and run any operating system on them. They can be made full screen and integrate seamlessly into the desktop of your main (host) operating system. VirtualBox is very popular because it is easy to use, has great features and is open source.

Linux

  • https://pat-s.me/post/arch-install-guide-for-r/ - How to set up an Arch Linux system for data science purposes? A guide from Patrick Schratz. Arch Linux is a highly customizable and efficient Linux distribution which requires advanced computer skills to configure and manage. But exactly for this reason it's an excellent environment for learning more about Linux and computers in general.

Life in academia
