Support Parquet data for HNSWDenseVector #2582
Conversation
pom.xml
Outdated
@@ -535,6 +535,58 @@
<artifactId>spring-boot-starter-logging</artifactId>
</exclusion>
</exclusions>
</dependency>
<dependency>
hrm... can we pull in parquet without pulling in all of hadoop? i.e., minimize the dependencies we drag in...
Yeah agreed, I initially tried to do it without hadoop using parquet-mr, but it was not available in any of the mirrors, and I kept getting Missing artifact org.apache.parquet:parquet-mr:jar:1.12.3. So I had to fall back to hadoop.
what's the size of the fatjar before and after? i.e., how much are you pulling in?
maybe try suppressing the transitive inclusion of jars with exclude tags?
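A sketch of what that could look like in pom.xml. The artifact chosen for exclusion and the version below are illustrative, not taken from this PR; the actual candidates would come from inspecting mvn dependency:tree.

```xml
<!-- Hypothetical sketch: suppressing a heavy transitive dependency of
     parquet-hadoop via a Maven exclusion. Which artifacts are safe to
     exclude depends on what the code actually touches at runtime. -->
<dependency>
  <groupId>org.apache.parquet</groupId>
  <artifactId>parquet-hadoop</artifactId>
  <version>1.13.1</version>
  <exclusions>
    <exclusion>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-yarn-api</artifactId>
    </exclusion>
  </exclusions>
</dependency>
```

Maven (3.2.1+) also accepts a wildcard exclusion (groupId and artifactId both set to *), which drops all transitives at once; the needed ones can then be re-added as direct dependencies.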
master: 179 MB
this branch: 265 MB (after putting in exclude tags)
yikes! that's a lot of bloat for a relatively small feature...
can you poke around to see how we might make this leaner?
updated dependencies, it's around 220 MB now
@valamuri2020 good job! Compare against the safetensor implementation? I'm interested in collection size and also indexing time.
I think it's already updated to the latest? This is what I see at the top of the git log, the commit hash matches what's on the tools repo.
I was just briefly looking at the discussion above, and was wondering whether a subset of the dependencies might be sufficient? See here: https://github.com/apache/parquet-java?tab=readme-ov-file#add-parquet-as-a-dependency-in-maven. They list 4 separate dependencies; maybe it can work without using the hadoop one?
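If that route pans out, the pom could depend on the finer-grained artifacts individually. A sketch under stated assumptions: the artifact names follow the linked README section (parquet-common, parquet-encoding, parquet-column, plus parquet-hadoop, which is the one in question here), and the version is illustrative; whether the hadoop artifact can actually be omitted is unverified.

```xml
<!-- Hypothetical sketch: depending on the finer-grained Parquet artifacts
     instead of one umbrella dependency. Names per the linked README;
     version is illustrative, not from this PR. -->
<dependency>
  <groupId>org.apache.parquet</groupId>
  <artifactId>parquet-common</artifactId>
  <version>1.13.1</version>
</dependency>
<dependency>
  <groupId>org.apache.parquet</groupId>
  <artifactId>parquet-encoding</artifactId>
  <version>1.13.1</version>
</dependency>
<dependency>
  <groupId>org.apache.parquet</groupId>
  <artifactId>parquet-column</artifactId>
  <version>1.13.1</version>
</dependency>
```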
Yes, this is indeed HEAD. Not sure why your diff is showing this then: Are you missing a
Nope, running that command didn't update anything.
Thanks for the idea @arjenpdevries! I gave that a try and it took some digging into. Unfortunately, it seems that hadoop is still tightly involved under the hood. The top-level API uses LocalInputFile instead of the HadoopFile, but everything under that still uses hadoop, e.g., the configuration. I read through the PRs and GitHub issues; it seems they plan to decouple them in the future, but there is nothing low-bloat and easy to use right now.
* @return the parsed vector as an array of floats
*/

private float[] parseVectorFromString(String contents) {
In Document, we have
private final double[] vector;
Why do we need to parse from String?
The Document interface doesn't expose the vector directly, but instead exposes contents() as a String. This convention of exposing a String and parsing the vector in the Generator is also in JsonCollection and DenseVectorDocumentGenerator. It seems that this has been the convention for a while.
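As a concrete illustration of that convention, here is a minimal, self-contained sketch of parsing a JSON-array-style vector string like "[0.1, 0.2, 0.3]" into a float[]. This is not Anserini's actual implementation; the class name and method body are made up for illustration.

```java
import java.util.Arrays;

public class VectorParsing {
  // Hypothetical sketch: a document exposes its vector only via a String
  // (contents()), so the generator must parse it back into a float[].
  public static float[] parseVectorFromString(String contents) {
    String trimmed = contents.trim();
    // Strip surrounding brackets if the vector is serialized as a JSON array.
    if (trimmed.startsWith("[") && trimmed.endsWith("]")) {
      trimmed = trimmed.substring(1, trimmed.length() - 1);
    }
    String[] parts = trimmed.split(",");
    float[] vector = new float[parts.length];
    for (int i = 0; i < parts.length; i++) {
      vector[i] = Float.parseFloat(parts[i].trim());
    }
    return vector;
  }

  public static void main(String[] args) {
    // prints [0.1, 0.2, 0.3]
    System.out.println(Arrays.toString(parseVectorFromString("[0.1, 0.2, 0.3]")));
  }
}
```

The sketch only shows why a String round-trip implies a parse step in the generator; the real code presumably handles malformed input and other serialization formats.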
% ls -lh target/anserini-0.37.1-SNAPSHOT-fatjar.jar
-rw-r--r-- 1 jimmylin staff 179M 26 Aug 19:23 target/anserini-0.37.1-SNAPSHOT-fatjar.jar

What's the latest update on bloat? 220 MB? Is that the best we can do? And have we tried excluding as much as we can?
Yep, 226 MB to be exact. That is pretty close to the best we can do: I went through the mvn dependency tree and removed as many of the transitive dependencies as possible.
Perhaps... https://github.com/strategicblue/parquet-floor (I have not checked myself how good this is)
Overview
Adds functionality to run the HNSW indexer with Parquet data. Main functionality added:
json_to_parquet.py
ParquetDenseVectorCollection
ParquetDenseVectorDocumentGenerator
Tested on the nfcorpus and robust04 datasets, with evals matching.
Steps to reproduce
Env setup
conda create -n parquet && conda activate parquet
cd src/main/python/parquet
pip install -r requirements.txt
Convert data to parquet format and run indexing
Run the following commands from repo root.
Download raw data
Create parquet data
Create index
Run retrieval on index
Run evals