allow arbitrary vectors file.
mocobeta committed Nov 29, 2014
1 parent 02cd605 commit 3655a30
Showing 5 changed files with 54 additions and 12 deletions.
18 changes: 15 additions & 3 deletions README.md
@@ -44,6 +44,10 @@ Once you got Lucene index, you can now create vectors.txt file.

 $ ./demo-word2vec.sh collection1

+With -f option, you can specify arbitrary output vectors file.
+
+$ ./demo-word2vec.sh collection1 -f vectors_my.txt
+
 ## for people who has PDF file
 If you have Lucene in Action book PDF file, post the file to Solr.

@@ -68,12 +72,18 @@ Index livedoor news corpus xml files to Solr.
 ### create vectors.txt file
 Once you got Lucene index, you can now create vectors.txt file.

-$ ./demo-word2vec.sh ldcc org.apache.lucene.analysis.ja.JapaneseAnalyzer
+$ ./demo-word2vec.sh ldcc -a org.apache.lucene.analysis.ja.JapaneseAnalyzer

+With -f option, you can specify arbitrary output vectors file.
+
+$ ./demo-word2vec.sh ldcc -a org.apache.lucene.analysis.ja.JapaneseAnalyzer -f vectors_my.txt
+
 ## compute distance among word vectors
 Once you got word vectors file vectors.txt, you can find top 40 words that are closest words to the word you specified.

-$ ./demo-distance.sh
+With -f option, you can specify arbitrary input vectors file.
+
+$ ./demo-distance.sh [-f <vectors_file>]
 cat
 Word: cat
 Position in vocabulary: 2601
@@ -90,7 +100,9 @@ Once you got word vectors file vectors.txt, you can find top 40 words that are c

 Or, you can compute vector operations e.g. vector('paris') - vector('france') + vector('italy') or vector('king') - vector('man') + vector('woman')

-$ ./demo-analogy.sh
+With -f option, you can specify arbitrary input vectors file.
+
+$ ./demo-analogy.sh [-f <vectors_file>]
 france paris italy
 man king woman

10 changes: 9 additions & 1 deletion demo-analogy.sh
@@ -19,4 +19,12 @@ RHCOM_JAR=$(ls lib/RONDHUIT-COMMONS-*.jar)
 SLF4J_JAR=$(ls lib/slf4j-api-*.jar)
 SLF4J_JAR=${SLF4J_JAR}:$(ls lib/slf4j-jdk14-*.jar)

-java -cp ${RHCOM_JAR}:${SLF4J_JAR}:classes com.rondhuit.w2v.demo.WordAnalogy vectors.txt
+VECTOR_FILE=vectors.txt
+while getopts f: OPT
+do
+case $OPT in
+"f" ) VECTOR_FILE="$OPTARG" ;;
+esac
+done
+
+java -cp ${RHCOM_JAR}:${SLF4J_JAR}:classes com.rondhuit.w2v.demo.WordAnalogy ${VECTOR_FILE}
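
The `getopts f:` loop added to `demo-analogy.sh` (and repeated in the other demo scripts) can be tried on its own. The following is a minimal sketch, not code from this commit: the function name `resolve_vector_file` and the usage message are hypothetical, and the `*` branch for unknown options is an addition that the committed scripts do not have.

```shell
#!/bin/sh
# Hypothetical helper for illustration; it is not part of the repository.
# It resolves the vectors file the way the demo scripts do:
# default to vectors.txt, let -f <file> override it.
resolve_vector_file() {
    VECTOR_FILE=vectors.txt
    OPTIND=1    # reset getopts state so the function can be called repeatedly
    while getopts f: OPT
    do
        case $OPT in
            f ) VECTOR_FILE="$OPTARG" ;;
            # Assumption: reject unknown options instead of ignoring them.
            * ) echo "usage: [-f <vectors_file>]" >&2; return 1 ;;
        esac
    done
    echo "$VECTOR_FILE"
}

resolve_vector_file                     # no option: falls back to the default
resolve_vector_file -f vectors_my.txt   # -f overrides the default
```

The colon in `f:` tells `getopts` that `-f` takes an argument, which it places in `OPTARG`.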
11 changes: 10 additions & 1 deletion demo-cluster.sh
@@ -21,8 +21,17 @@ if [ -z $1 ]; then
 exit 1
 else
 K=$1
+shift
 fi

+VECTOR_FILE=vectors.txt
+while getopts f: OPT
+do
+case $OPT in
+"f" ) VECTOR_FILE="$OPTARG" ;;
+esac
+done
+
 RHCOM_JAR=$(ls lib/RONDHUIT-COMMONS-*.jar)

-java -cp ${RHCOM_JAR}:classes com.rondhuit.w2v.demo.WordCluster vectors.txt ${K} word-clusters.txt
+java -cp ${RHCOM_JAR}:classes com.rondhuit.w2v.demo.WordCluster ${VECTOR_FILE} ${K} word-clusters.txt
10 changes: 9 additions & 1 deletion demo-distance.sh
@@ -19,4 +19,12 @@ RHCOM_JAR=$(ls lib/RONDHUIT-COMMONS-*.jar)
 SLF4J_JAR=$(ls lib/slf4j-api-*.jar)
 SLF4J_JAR=${SLF4J_JAR}:$(ls lib/slf4j-jdk14-*.jar)

-java -cp ${RHCOM_JAR}:${SLF4J_JAR}:classes com.rondhuit.w2v.demo.Distance vectors.txt
+VECTOR_FILE=vectors.txt
+while getopts f: OPT
+do
+case $OPT in
+"f" ) VECTOR_FILE="$OPTARG" ;;
+esac
+done
+
+java -cp ${RHCOM_JAR}:${SLF4J_JAR}:classes com.rondhuit.w2v.demo.Distance ${VECTOR_FILE}
17 changes: 11 additions & 6 deletions demo-word2vec.sh
@@ -21,13 +21,18 @@ if [ -z $1 ]; then
 exit 1
 else
 SOLRCORE=$1
+shift
 fi

-if [ -z $2 ]; then
-ANALYZER=org.apache.lucene.analysis.core.WhitespaceAnalyzer
-else
-ANALYZER=$2
-fi
+ANALYZER=org.apache.lucene.analysis.core.WhitespaceAnalyzer
+VECTOR_FILE=vectors.txt
+while getopts a:f: OPT
+do
+case $OPT in
+"a" ) ANALYZER="$OPTARG" ;;
+"f" ) VECTOR_FILE="$OPTARG" ;;
+esac
+done

 LUCENE_JAR=$(ls lib/lucene-core-*.jar)
 LUCENE_JAR=${LUCENE_JAR}:$(ls lib/lucene-analyzers-common-*.jar)
@@ -36,4 +41,4 @@ RHCOM_JAR=$(ls lib/RONDHUIT-COMMONS-*.jar)
 SLF4J_JAR=$(ls lib/slf4j-api-*.jar)
 SLF4J_JAR=${SLF4J_JAR}:$(ls lib/slf4j-jdk14-*.jar)

-java -cp ${LUCENE_JAR}:${RHCOM_JAR}:${SLF4J_JAR}:classes com.rondhuit.w2v.demo.LuceneCreateVectors -index solrhome/${SOLRCORE}/data/index -output vectors.txt -field body -analyzer ${ANALYZER} -cbow 1 -size 200 -window 8 -negative 25 -sample 0.0001 -iter 15 -min-count 5
+java -cp ${LUCENE_JAR}:${RHCOM_JAR}:${SLF4J_JAR}:classes com.rondhuit.w2v.demo.LuceneCreateVectors -index solrhome/${SOLRCORE}/data/index -output ${VECTOR_FILE} -field body -analyzer ${ANALYZER} -cbow 1 -size 200 -window 8 -negative 25 -sample 0.0001 -iter 15 -min-count 5
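
In `demo-word2vec.sh` the Solr core name stays a positional argument, and the added `shift` matters: `getopts` stops at the first non-option argument, so without the `shift` it would stop at the core name and never reach `-a` or `-f`. The sketch below illustrates that ordering under stated assumptions: `parse_args` and its usage text are hypothetical names, and the final `echo` stands in for the `java` invocation.

```shell
#!/bin/sh
# Hypothetical wrapper for illustration; parse_args is not in the repository.
parse_args() {
    if [ -z "$1" ]; then
        echo "usage: <solrcore> [-a <analyzer>] [-f <vectors_file>]" >&2
        return 1
    fi
    SOLRCORE=$1
    shift       # drop the core name so getopts sees only the options

    # Defaults taken from the commit; -a and -f override them.
    ANALYZER=org.apache.lucene.analysis.core.WhitespaceAnalyzer
    VECTOR_FILE=vectors.txt
    OPTIND=1    # reset getopts state between calls
    while getopts a:f: OPT
    do
        case $OPT in
            a ) ANALYZER="$OPTARG" ;;
            f ) VECTOR_FILE="$OPTARG" ;;
            * ) return 1 ;;   # assumption: fail on unknown options
        esac
    done
    # Stand-in for the java command line: print the resolved settings.
    echo "$SOLRCORE $ANALYZER $VECTOR_FILE"
}

parse_args collection1
parse_args ldcc -a org.apache.lucene.analysis.ja.JapaneseAnalyzer -f vectors_my.txt
```

One consequence of this layout is that the options have to follow the core name. A call such as `-f vectors_my.txt collection1` would treat `-f` as the core name.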
