A web-scale feature imputation tool for finding K-Nearest Neighbors in a training set
- Devin Shields (@devin1shields)
This project demonstrates a scalable statistical algorithm in the Hadoop environment and highlights the 'gotchas' of moving from local development to distributed computing.
Suppose you have access to cheap data (e.g. web traffic) for a large population, and more useful but expensive data (e.g. loan default likelihood) for only a small subset of that population. Assuming the cheap and expensive data are well correlated and the training data is sufficient, you can infer the expensive value for any member of the population by looking at its nearest neighbors in the training set.
A more thorough discussion of value imputation is available here: http://www.fs.fed.us/rm/pubs_other/rmrs_2008_crookston_n001.pdf
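For intuition, here is a minimal local sketch of that idea using Python and scikit-learn (both already listed in the requirements below). The data shapes, feature values, and choice of k are made up for illustration; this is not the pipeline this repository runs.

    import numpy as np
    from sklearn.neighbors import KNeighborsRegressor

    rng = np.random.default_rng(0)

    # Cheap features (e.g. web traffic metrics) for the labeled training
    # subset, plus the expensive measurement (e.g. default likelihood)
    # known for each of its rows. Synthetic data, for illustration only.
    cheap_train = rng.normal(size=(1000, 5))
    expensive_train = cheap_train @ rng.normal(size=5) + rng.normal(scale=0.1, size=1000)

    # Cheap features for members of the wider population.
    cheap_population = rng.normal(size=(10, 5))

    # Impute each member's expensive value as the average over its
    # k nearest labeled neighbors in cheap-feature space.
    knn = KNeighborsRegressor(n_neighbors=5)
    knn.fit(cheap_train, expensive_train)
    print(knn.predict(cheap_population))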
This code demonstrates how to find the k-nearest neighbors of each member of a test set, where all assigned neighbors are drawn from a training set.
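As a small local illustration of that lookup (again a sketch with synthetic data and an arbitrary k, not this repository's distributed implementation), scikit-learn's NearestNeighbors returns, for each test row, the indices of its nearest training rows:

    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    rng = np.random.default_rng(1)
    train = rng.normal(size=(500, 4))  # training set: 500 members, 4 features
    test = rng.normal(size=(5, 4))     # test-set members to assign neighbors to

    nn = NearestNeighbors(n_neighbors=3).fit(train)
    distances, indices = nn.kneighbors(test)

    # indices[i] lists the training-set row numbers of test[i]'s 3 nearest
    # neighbors; distances[i] holds the matching Euclidean distances.
    for i, idx in enumerate(indices):
        print(f"test row {i}: neighbors {idx.tolist()}")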
The 'run.sh' bash script is a good place to start.
If you have difficulty understanding the script, a good bash reference is available here: http://www.tldp.org/LDP/abs/abs-guide.pdf
This code requires:
- Java
- R
- R's FNN package
- Python
- Python's numpy, scipy, and scikit-learn packages
- bash
Please use a *nix environment.