
Getting started with hashmapd

Setting up the libraries

.. get the code (clone the hashmapd repository), then:

  • cd hashmapd

Make sure you have virtualenv and virtualenvwrapper installed:

  • sudo easy_install virtualenv

  • sudo easy_install virtualenvwrapper

  • cat config/virtualenv_bashrc.txt >> ~/.bashrc

  • source ~/.bashrc

.. adds some config and aliases relating to virtualenv to your .bashrc. Be sure to check out the virtualenv_bashrc.txt file itself (and the paths therein) before appending it.

  • mkvirtualenv hashmapd

.. creates an isolated virtual environment (we do this to make sure that we're all on specific versions of referenced libs)

NB: later on, to load this isolated environment, just type 'workon hashmapd'

  • pip install numpy==1.5.1

.. unfortunately, for reasons that are not clear, you need to install numpy separately first to convince scipy to install correctly

  • pip install -r requirements.txt

.. installs the libraries listed in requirements.txt (including Theano!)

  • pip install matplotlib==1.0.1

.. again, due to stuff that just doesn't work otherwise, we have to install matplotlib separately

  • python setup.py build_ext --inplace

.. builds the Cython code for t-SNE (compiled.pyx -> compiled.c -> compiled.so)

  • pip install -e .

.. installs the hashmapd library itself in place (an 'editable' install), available from anywhere, so you can just type 'import hashmapd' and it should work
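
If you want to double-check that the install worked, a quick sanity check (run from any directory, with the hashmapd virtualenv active):

    # should print a path back inside your hashmapd checkout
    import hashmapd
    print(hashmapd.__file__)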

--

Run the test project - fake txt gen

The first thing to play with is probably the 'test' project, fake txt gen. This simply generates data of various classes using a stochastic model, then confirms that we correctly output hashcodes that put data of the same class together.

  • cd projects/fake_txt_gen

.. cd to the relevant directory

  • python generate_synthetic_raw_data.py

.. generates some example data; you can see it in the raw/*.csv files. The data simulates different classes of people talking about different words at different rates.

  • python prepare.py

.. takes the generated data (as CSV), loads it into a numpy array, and saves it as pickled, gzipped files (in the data/ directory). Ideally we might do without this step, but it is left in for speed.
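
If you want to peek at what prepare.py wrote, a minimal sketch is below (the filename is a placeholder; substitute one of the real .pkl.gz files from data/):

    import gzip
    try:
        import cPickle as pickle   # Python 2
    except ImportError:
        import pickle              # Python 3

    # placeholder filename -- use an actual file from the data/ directory
    with gzip.open('data/example.pkl.gz', 'rb') as f:
        data = pickle.load(f)
    print(type(data))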

  • python train.py

.. this is the core training step, controlled (like all the steps) by config.cfg. It runs RBM 'pre-training', then refines the network as an 'autoencoder' using standard back-propagation, and saves the weights thus generated into data/fake_txt_weights.pkl.gz.

  • python get_codes.py

.. runs the network from above, using the separate render_data saved by prepare.py (data/render_..pkl.gz) as input. Records the activations in the middle layer given those inputs and calls them 'codes' (this is our semantic hash row for each input), saving the codes to out/codes.csv. Also writes trace information into trace/*.png (depending on config).
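
Since out/codes.csv is just a numeric matrix in CSV form, a quick look at it might go like this (assuming a plain comma-separated layout with no header; adjust if the real file differs):

    import numpy as np

    codes = np.loadtxt('out/codes.csv', delimiter=',')
    print(codes.shape)   # one row of code activations per input row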

  • python get_coords.py

.. runs the t-SNE algorithm to generate a 2-d position given the code for each input row of the render_data file. This uses the Cython code built earlier (compiled.pyx -> compiled.c -> compiled.so).

  • python get_map.py

.. plots the above 2-d points. This is just a sanity check that things worked correctly; if so, you should see distinct groups of dots on the map, one per label. (The rows in the render data are labelled when they are generated, according to the stochastic model that generates them.)
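
If you want to re-plot the map yourself, a rough sketch of the idea is below. The coordinates filename and column layout are assumptions; see get_map.py and config.cfg for the real output locations.

    import numpy as np
    import matplotlib.pyplot as plt

    # hypothetical path -- check get_coords.py / config.cfg for the real one
    coords = np.loadtxt('out/coords.csv', delimiter=',')
    plt.scatter(coords[:, 0], coords[:, 1], s=5)
    plt.title('2-d map of semantic hash codes')
    plt.savefig('map_check.png')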

--

  • ./run_to_end.sh

.. runs all of the above in one step.

Other projects

mnist

This is the digit recognition problem: a harder problem, checking that we can correctly classify input rows where each row of input is the pixel values of a black-and-white digit image. We aim to get it to put all the images that represent the same digit close together in hashcode space (and hence in 2-d space).

Being a harder problem, it will take a while to train. (Currently config.cfg is set up to train for 10,000 iterations, which seems like overkill; 5,000 is probably more than sufficient in both cases.)
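
If you want to check or lower the iteration count before kicking off a long run, you can inspect config.cfg first. A sketch, assuming config.cfg is an INI-style file readable by ConfigParser (the section and option names vary per project, so print them rather than guessing):

    try:
        import ConfigParser as configparser   # Python 2
    except ImportError:
        import configparser                   # Python 3

    cfg = configparser.ConfigParser()
    cfg.read('config.cfg')
    for section in cfg.sections():
        print(section, dict(cfg.items(section)))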

--

projects/word_vectors

This is the 'actual' project, which will generate our 'semantic hash' codes for our input of word frequencies per Twitter user. To get data to run this code, you need to look for documentation elsewhere (TBD).

--

projects/fake_txt

Defunct; should be removed. It only remains because there are a couple of useful scripts relating to CouchDB that might be of some use later.

projects/tsne

An example that exercises t-SNE in a standalone setting.

--

Addendum: Not really needed but potentially useful..

Bpython and ipython don't seem to play nicely with virtualenv.

Your mileage may vary, but FWIW this is what I have in place to deal with it; not ideal, but it seems to work.

  1. Add some relevant paths to your .bashrc. I have:

    #get virtualenvwrapper to play nicely with BPYTHON
    export PYTHONSTARTUP=~/.pythonrc
    export PYTHONPATH=/Users/utunga/dev/hashmapd:/Users/utunga/dev/Theano

  2. Set up (or append to) a .pythonrc file so that it has the following in it:

    try:
        from django.core.management import setup_environ
        import settings
        setup_environ(settings)
        if settings.PINAX_ROOT:
            import sys
            from os.path import join
            sys.path.insert(0, join(settings.PINAX_ROOT, "apps"))
            sys.path.insert(0, join(settings.PROJECT_ROOT, "apps"))
    except:
        pass
