Tips and tricks for pandas devs #3156

Closed
@ghost

Description

I've been working on pandas for a while now and use a bunch of tools and tricks.
Here's a list to help pandas devs slip into the zone:

Use ipdb rather than pdb with nose: --ipdb --ipdb-fail

https://github.com/flavioamieiro/nose-ipdb

Because tab-completion is not optional.

Re-running only failed tests

nosetests --with-id --failed reruns only the tests that failed the last
time you ran nosetests --with-id. If you use test_fast.sh,

test_fast.sh --failed

will do what you expect after some tests have failed.

Better integration of github and git commandline flow

hub is a wrapper around git with GitHub sugar. First and foremost:

hub checkout https://github.com/pydata/pandas/pull/1134

adds a remote, fetches it, creates a branch for it, and generally puts you right there.

Note: see comment below for a way to do this with pure git, if you don't
mind thousands of remote branches.
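
For a single pull request, plain git can get close to what hub does. The sketch below only builds the commands and does not run them. It assumes GitHub's pull/<number>/head ref, and the helper name pr_fetch_commands and the pr- branch prefix are made up for illustration. This is not the approach from the comment mentioned in the note, and unlike hub it does not add a remote for the contributor's fork.

```python
import re


def pr_fetch_commands(pr_url, branch_prefix="pr-"):
    """Return the git commands that fetch one GitHub PR into a local branch.

    Hypothetical helper: it only builds the argument lists, it runs nothing.
    """
    match = re.match(r"https://github\.com/([^/]+)/([^/]+)/pull/(\d+)/?$", pr_url)
    if match is None:
        raise ValueError("not a GitHub pull request URL: %r" % pr_url)
    owner, repo, number = match.groups()
    remote_url = "https://github.com/%s/%s.git" % (owner, repo)
    branch = "%s%s" % (branch_prefix, number)
    return [
        # GitHub publishes the head of every PR under pull/<number>/head
        ["git", "fetch", remote_url, "pull/%s/head:%s" % (number, branch)],
        ["git", "checkout", branch],
    ]


if __name__ == "__main__":
    for cmd in pr_fetch_commands("https://github.com/pydata/pandas/pull/1134"):
        print(" ".join(cmd))
    # prints:
    # git fetch https://github.com/pydata/pandas.git pull/1134/head:pr-1134
    # git checkout pr-1134
```

Pass each list to subprocess.check_call from inside the repo if you want to actually run the commands.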

GH issues from the command line

ghi lets you open and manipulate GitHub issues from the command line.

I use it to open issues when I hit a bug and want to quickly
open a reminder to fix, without breaking my focus.

Testing across Python versions locally

tox lets you run the test suite across all Python versions using virtualenvs.
Everything is set up in the repo; just install and run.
detox parallelizes tox.

Faster pandas builds/testing

Note: the build cache was baked into setup.py from roughly 0.9.1. As of 0.11.0
it's been factored out into scripts/use_build_cache.py, which rewrites setup.py
to use the build cache. The script has been tested as far back as 0.7.0.

Put the following in your .bashrc:

# Use the pandas build cache
export BUILD_CACHE_DIR="$HOME/tmp/.pandas_build_cache/"
if [ ! -e "$BUILD_CACHE_DIR" ]; then
    mkdir -p "$BUILD_CACHE_DIR"
fi

echo "$BUILD_CACHE_DIR" > [pandas repo root dir]/.build_cache_dir
function cdev {
    # reset on every call so a value from an earlier run can't leak in
    local _SUDO=""
    # any recent commit should do
    git checkout c69e3aa scripts/use_build_cache.py vb_suite/test_perf.py
    scripts/use_build_cache.py $1 # rewire setup.py with build_cache
    # outside a virtualenv, setup.py needs root to install
    if [ x"$VIRTUAL_ENV" == x"" ]; then
        _SUDO="sudo"
    fi

    # take back ownership of anything a previous root build left behind
    sudo chown -R "$USER" .
    $_SUDO python ./setup.py clean
    $_SUDO python ./setup.py develop
    sudo chown -R "$USER" .
    echo "Restoring setup.py"
    git checkout setup.py # restore setup.py
}

c69e3aa can be any recent commit; it needs to be bumped whenever the script
is updated.

The pandas build cache code caches cythonization, compilation and
2to3 artifacts for reuse in subsequent builds.
To compile, use "git reset --hard" to get to the commit you're after, then use cdev
to build pandas. setup.py will reuse what it can to speed this up.
Note that setup.py gets overwritten, but is restored when the build completes.
With a warm cache, moving to a given commit takes just a few seconds rather than
the several minutes of a full compile.
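
If the idea of reusing build artifacts is new to you, here is a minimal sketch of the general technique: key each artifact on a hash of its input and skip the expensive step on a hit. This is not the code in scripts/use_build_cache.py, whose actual keying scheme isn't described here. The names cached_build and slow_step are hypothetical, and the "build" is a stand-in that only copies text.

```python
import hashlib
import os
import shutil
import tempfile


def cached_build(src_path, out_path, cache_dir, build_func):
    """Produce out_path from src_path, reusing a cached artifact if one exists.

    Returns True on a cache hit and False when build_func had to run.
    """
    with open(src_path, "rb") as f:
        key = hashlib.sha1(f.read()).hexdigest()  # cache key = hash of the source
    cached = os.path.join(cache_dir, key)
    if os.path.exists(cached):
        shutil.copyfile(cached, out_path)  # hit: skip the expensive step
        return True
    build_func(src_path, out_path)  # miss: do the real work
    shutil.copyfile(out_path, cached)  # and remember the result
    return False


if __name__ == "__main__":
    work = tempfile.mkdtemp()
    cache = os.path.join(work, "cache")
    os.mkdir(cache)
    src = os.path.join(work, "mod.pyx")
    out = os.path.join(work, "mod.c")
    with open(src, "w") as f:
        f.write("def f(): return 1\n")

    def slow_step(src_path, out_path):
        # stand-in for cythonization/compilation
        with open(src_path) as fin, open(out_path, "w") as fout:
            fout.write("/* generated */ " + fin.read())

    print(cached_build(src, out, cache, slow_step))  # False: cold cache, builds
    print(cached_build(src, out, cache, slow_step))  # True: warm cache, reuses
    shutil.rmtree(work)
```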

You may also run scripts/use_build_cache.py prior to launching tox to speed up testing.

Use ccache

The build cache just described caches things at a very coarse level: if there's
any change to the .pyx (Cython) files, all the files will be recythonized and rebuilt.
Using ccache (an apt-get + envvar away on most distros these days) can speed
up the compilation part by caching the gcc compilation results. Yes, this overlaps
with the caching from the previous section, only it also caches the cythonized
C files.
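
The "envvar" part might look something like this. It's a sketch that assumes the build picks its compiler from the CC environment variable (distutils-based builds on Unix generally do) and that ccache is on your PATH. The helper name ccache_env is made up.

```python
import os
import shutil


def ccache_env(base_env=None, compiler="gcc", which=shutil.which):
    """Return a copy of the environment with CC routed through ccache.

    Hypothetical helper. `which` is injectable so the lookup can be faked.
    """
    env = dict(os.environ if base_env is None else base_env)
    if which("ccache") is None:
        return env  # ccache isn't installed: leave the environment alone
    current = env.get("CC", compiler)
    if not current.startswith("ccache "):
        env["CC"] = "ccache " + current  # e.g. "ccache gcc"
    return env
```

Something like subprocess.check_call([sys.executable, "setup.py", "build_ext", "--inplace"], env=ccache_env()) would then build with the wrapped compiler.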

Benchmarking commits

test_perf.sh lets you compare the performance of one commit against
another, or benchmark the current HEAD.
It produces a table of results suitable for posting in a PR, and can serialize
the results dataframe into a pickle file for analysis in pandas.

It can print summary stats over multiple runs and all sorts of things.
See test_perf.sh --help.

Easily generate dataframes of different kinds

mkdf lets you easily fabricate dataframes of varying dimensions
and arbitrary data:

from pandas.util.testing import makeCustomDataframe as mkdf
In [12]: mkdf(3,2)
Out[12]: 
C0      C_l0_g0 C_l0_g1
R0                     
R_l0_g0    R0C0    R0C1
R_l0_g1    R1C0    R1C1
R_l0_g2    R2C0    R2C1

# or even...
In [11]: mkdf(5,3,r_idx_nlevels=2,c_idx_nlevels=3,data_gen_f=lambda r,c: r*2+c)
Out[11]: 
C0               C_l0_g0  C_l0_g1  C_l0_g2
C1               C_l1_g0  C_l1_g1  C_l1_g2
C2               C_l2_g0  C_l2_g1  C_l2_g2
R0      R1                                
R_l0_g0 R_l1_g0        0        1        2
R_l0_g1 R_l1_g1        2        3        4
R_l0_g2 R_l1_g2        4        5        6
R_l0_g3 R_l1_g3        6        7        8
R_l0_g4 R_l1_g4        8        9       10

# or even
In [19]: mkdf(8,3,r_idx_nlevels=3,r_ndupe_l=[4,2])
Out[19]: 
C0                      C_l0_g0 C_l0_g1 C_l0_g2
R0      R1      R2                             
R_l0_g0 R_l1_g0 R_l2_g0    R0C0    R0C1    R0C2
                R_l2_g1    R1C0    R1C1    R1C2
        R_l1_g1 R_l2_g2    R2C0    R2C1    R2C2
                R_l2_g3    R3C0    R3C1    R3C2
R_l0_g1 R_l1_g2 R_l2_g4    R4C0    R4C1    R4C2
                R_l2_g5    R5C0    R5C1    R5C2
        R_l1_g3 R_l2_g6    R6C0    R6C1    R6C2
                R_l2_g7    R7C0    R7C1    R7C2
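
Judging from the second example's output, data_gen_f gets called with the zero-based row and column position of each cell. Assuming that's right, plain Python reproduces the same values without pandas:

```python
def data_gen_f(r, c):
    # same function as the lambda passed to mkdf in the second example
    return r * 2 + c


nrows, ncols = 5, 3
grid = [[data_gen_f(r, c) for c in range(ncols)] for r in range(nrows)]

for row in grid:
    print(row)
# prints:
# [0, 1, 2]
# [2, 3, 4]
# [4, 5, 6]
# [6, 7, 8]
# [8, 9, 10]
```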

IPython startup file

Your IPython installation has a ~/.ipython/profile_default/startup directory.
Put your imports, monkey-patches and utility functions there to have them
always available.
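
For example, a startup file could look like the sketch below. The file name and the whereis helper are made up; the only thing assumed about IPython is that it runs the .py files in that directory, in lexicographical order, when the profile starts.

```python
# Hypothetical file: ~/.ipython/profile_default/startup/00-pandas-dev.py
import os

try:
    import numpy as np
    import pandas as pd
except ImportError:
    # don't break every IPython session just because a build is half-done
    np = pd = None


def whereis(module):
    """Return the file a module was loaded from, or None if it has none.

    Handy for checking that you're importing your development checkout
    rather than some installed copy.
    """
    path = getattr(module, "__file__", None)
    return os.path.abspath(path) if path else None
```

With this in place, whereis(pd) in a fresh session tells you which pandas you're actually running.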

Speel checking github issues

Issues can quickly become a stream-of-consciousness thing once
you start writing a lot of them. If you'd like an easy way to get red squigglies
when your comment contains silly mistaces, you might consider installing
After the Deadline, available as an extension for Firefox and Chrome.

Handy git commands

There are too many git tricks to cover, but the following are both useful and less commonly known:

Generate a new hash for the current commit, without any other changes to repo state:

git commit --amend -C HEAD
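
Why does the hash change when nothing else does? As far as I understand git's object model, a commit id is the SHA-1 of the commit object, and that object includes the committer timestamp, which --amend refreshes while -C HEAD keeps the message and author info. Here's a sketch of that computation for a parentless commit. The names, timestamps and message are made up, and the tree is git's well-known empty tree.

```python
import hashlib

EMPTY_TREE = "4b825dc642cb6eb9a060e54bf8d69288fbee4904"  # git's empty tree


def commit_hash(tree, author, author_time, committer, committer_time, message):
    """SHA-1 of a parentless git commit object built from the given fields."""
    body = (
        "tree %s\n"
        "author %s %d +0000\n"
        "committer %s %d +0000\n"
        "\n"
        "%s\n"
    ) % (tree, author, author_time, committer, committer_time, message)
    data = body.encode("utf-8")
    # git hashes "<type> <size>\0" followed by the object's content
    header = ("commit %d" % len(data)).encode("ascii") + b"\x00"
    return hashlib.sha1(header + data).hexdigest()


who = "Jane Dev <jane@example.com>"  # made-up identity
before = commit_hash(EMPTY_TREE, who, 1364000000, who, 1364000000, "BUG: fix it")
# same tree, author, author date and message; only the committer date moved
after = commit_hash(EMPTY_TREE, who, 1364000000, who, 1364000060, "BUG: fix it")

print(before != after)  # True
```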

Report author of given commit hash:

function gauthor {
    # -s suppresses the diff so only the formatted author line is printed
    git show -s --format='%an <%ae>' "$@" | head -n 1
}

and properly assign authorship of a commit:

git commit --author="$(gauthor foohash)"

where foohash is any previous commit authored by that contributor.

To locate the merge commit that introduced a commit into the branch:
https://github.com/jianli/git-get-merge
