Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support Parquet data for HNSWDenseVector #2582

Merged
merged 8 commits into from
Sep 10, 2024

Conversation

valamuri2020
Copy link
Member

Overview

Adds functionality to run HNSW Indexer with Parquet data. Main functionality added:

  • json_to_parquet.py
  • ParquetDenseVectorCollection
  • ParquetDenseVectorDocumentGenerator

Tested on nfcorpus and robust04 datasets with evals matching.

Steps to reproduce

Env setup

  1. conda create -n parquet && conda activate parquet
  2. cd src/main/python/parquet
  3. pip install -r requirements.txt

Convert data to parquet format and run indexing

Run the following commands from repo root.

Download raw data

wget 'https://www.dropbox.com/scl/fi/1qnwq7s56muwudetqxgez/bge-base-en-v1.5-robust04.tar?rlkey=sd8zt0qnopwgbel43an46bggc&dl=0' -P collections/

Create parquet data

python src/main/python/parquet/json_to_parquet.py --input collections/robust04 --output collections/robust04.parquet/ --overwrite

Create index

bin/run.sh io.anserini.index.IndexHnswDenseVectors -collection ParquetDenseVectorCollection -input collections/robust04.parquet/ -generator ParquetDenseVectorDocumentGenerator -index indexes/parquet-robust04 -threads 16 -M 16 -efC 100 -memoryBuffer 65536 -noMerge

Run retrieval on index

bin/run.sh io.anserini.search.SearchHnswDenseVectors \
  -index indexes/parquet-robust04/ \
  -topics tools/topics-and-qrels/topics.beir-v1.0.0-robust04.test.tsv \
  -topicReader TsvString \
  -output runs/parquet-robust04.txt \
  -generator VectorQueryGenerator -topicField title -removeQuery -threads 16 -hits 1000 -efSearch 1000 -encoder BgeBaseEn15

Run evals

bin/trec_eval -c -m ndcg_cut.10 tools/topics-and-qrels/qrels.beir-v1.0.0-robust04.test.txt runs/parquet-robust04.txt
bin/trec_eval -c -m recall.100 tools/topics-and-qrels/qrels.beir-v1.0.0-robust04.test.txt runs/parquet-robust04.txt
bin/trec_eval -c -m recall.1000 tools/topics-and-qrels/qrels.beir-v1.0.0-robust04.test.txt runs/parquet-robust04.txt

pom.xml Outdated
@@ -535,6 +535,58 @@
<artifactId>spring-boot-starter-logging</artifactId>
</exclusion>
</exclusions>
</dependency>
<dependency>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hrm... can we pull in parquet without pulling in all of hadoop? i.e., minimize the dependencies we drag in...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah agreed, I initially tried to do it without hadoop using parquet-mr, but it was not available in any of the mirrors, and I kept getting Missing artifact org.apache.parquet:parquet-mr:jar:1.12.3.

So I had to fall back to hadoop

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what's the size of the fatjar before and after? i.e., how much are you pulling in?

maybe try suppressing the transitive inclusion of jars with exclude tags?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

master: 179MB
this branch: 265 MB (after putting in exclude tags)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yikes! that's a lot of bloat for a relatively small feature...

can you poke around to see how we might make more lean?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated dependencies, it's around 220 MB now

Copy link
Member

@lintool lintool left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also -

Submodule tools

Update to latest in master please.

@lintool
Copy link
Member

lintool commented Aug 29, 2024

@valamuri2020 good job!

Compare against the safetensor implementation? I'm interested in collection size and also indexing time.

@valamuri2020
Copy link
Member Author

Also -

Submodule tools

Update to latest in master please.

I think it's already updated to the latest? This is what I see at the top of the git log, the commit hash matches what's on the tools repo.

commit 3a2b3cc5cfd915d707408aa5c4567185e9e4544f (HEAD -> master, origin/master, origin/HEAD)
Author: Ronak <rpradeep@uwaterloo.ca>
Date:   Wed Aug 7 14:20:18 2024 -0400

    Add RAG 24 Test Topics (#81)

@arjenpdevries
Copy link

I was just briefly looking at the discussion above, and was wondering whether a subset of the dependencies might be sufficient?

See here: https://github.com/apache/parquet-java?tab=readme-ov-file#add-parquet-as-a-dependency-in-maven

They list 4 separate dependencies, maybe it can work without using the hadoop one?

@lintool
Copy link
Member

lintool commented Aug 31, 2024

Also -

Submodule tools

Update to latest in master please.

I think it's already updated to the latest? This is what I see at the top of the git log, the commit hash matches what's on the tools repo.

commit 3a2b3cc5cfd915d707408aa5c4567185e9e4544f (HEAD -> master, origin/master, origin/HEAD)
Author: Ronak <rpradeep@uwaterloo.ca>
Date:   Wed Aug 7 14:20:18 2024 -0400

    Add RAG 24 Test Topics (#81)

Yes, this is indeed HEAD. Not sure why your diff is showing this then:

Screen Shot 2024-08-31 at 1 09 38 PM

Are you missing a git submodule update somewhere?

@valamuri2020
Copy link
Member Author

Are you missing a git submodule update somewhere?

Nope, running that command didn't update anything.

@valamuri2020
Copy link
Member Author

valamuri2020 commented Sep 1, 2024

I was just briefly looking at the discussion above, and was wondering whether a subset of the dependencies might be sufficient?

See here: https://github.com/apache/parquet-java?tab=readme-ov-file#add-parquet-as-a-dependency-in-maven

They list 4 separate dependencies, maybe it can work without using the hadoop one?

Thanks for the idea @arjenpdevries! I gave that a try and it took some digging into. Unfortunately, it seems that hadoop is still tightly involved under the hood. The top level API uses LocalInputFile instead of the HadoopFile, but everything under that is still using hadoop, ex. the configuration. I read through the PR and GitHub Issues, it seems like they plan to decouple them in the future, but nothing that is low-bloat and easy to use right now.

* @return the parsed vector as an array of doubles
*/

private float[] parseVectorFromString(String contents) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In Document, we have

private final double[] vector;

Why do we need to parse from String?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Document interface doesn't expose the vector directly, but instead the contents() as a String. This convention of exposing a String, and parsing the vector in the Generator is also in JsonCollection and DenseVectorDocumentGenerator. It seems that this has been the convention for a while.

@lintool
Copy link
Member

lintool commented Sep 9, 2024

@valamuri2020

% ls -lh target/anserini-0.37.1-SNAPSHOT-fatjar.jar 
-rw-r--r--  1 jimmylin  staff   179M 26 Aug 19:23 target/anserini-0.37.1-SNAPSHOT-fatjar.jar

What's the latest update on bloat? 220MB? Is that the best we can do? And have we tried excluding as much as we can?

@valamuri2020
Copy link
Member Author

Yep, 226 MB to be exact. That is pretty close to the best we can do - I went through the mvn dependency tree and removed as many of the transitive dependencies as possible.

@lintool lintool merged commit 66d97d1 into castorini:master Sep 10, 2024
1 check passed
@arjenpdevries
Copy link

Perhaps... https://github.com/strategicblue/parquet-floor (I have not checked myself how good this is)

@lintool lintool mentioned this pull request Sep 11, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants