Releases · neo4j-field/neo4j-arrow
v4.1 - Fix KHop bug
v4 - Bulk Database Imports
✨ New Stuff!
- Bulk import jobs (`import.bulk`) that support bootstrapping a new database on a Neo4j host by streaming nodes and relationships from a `neo4j-arrow` client. See the example notebook for how it works.
- New info jobs (`info.server` and `info.jobs`) for querying the server-side plugin version and currently tracked jobs. (See the `ServerInfoHandler` class, and the client-side sketch after this list.)
- The Python client/wrapper (`neo4j_arrow.py`) now has type annotations and passes MyPy in strict mode!
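For reference, the info jobs can be exercised without the wrapper. A minimal sketch using plain PyArrow Flight, assuming the jobs are exposed as Flight actions named after the jobs above (the host, port, and JSON result encoding are assumptions, not confirmed details):

```python
# Minimal sketch using plain PyArrow Flight instead of the neo4j_arrow.py
# wrapper. Assumes the info jobs are exposed as Flight actions named
# "info.server" and "info.jobs"; the host/port and the JSON encoding of
# the results are guesses.
import json
import pyarrow.flight as flight

client = flight.FlightClient("grpc+tcp://localhost:9999")

# Ask the server-side plugin for its version.
for result in client.do_action(flight.Action("info.server", b"")):
    print(json.loads(result.body.to_pybytes()))

# List the jobs the server is currently tracking.
for result in client.do_action(flight.Action("info.jobs", b"")):
    print(json.loads(result.body.to_pybytes()))
```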
⚙️ Changes in the Guts
- Redesigned the Producer parts handling write streams (where the client pushes data to the server). Should solve some minor bugs in existing GDS Write Jobs.
- Lots of runtime type inspection in the Python client.
- Squashed some sequencing/race-condition bugs in some of the GDS Write Jobs (they previously weren't properly advancing the job's status).
🔨 Major Breaking Changes
- Python wrapper code has been shuffled around and now lives in `./python`. (Will attach versions to GH releases to make it easier.)
- Job names and parameters have been standardized: snake case for parameters (e.g. `idField` => `id_field`) and lowercase dot notation for job names (e.g. `cypherRead` => `cypher.read`). A small before/after example follows this list.
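To illustrate the renames (the surrounding payload shape here is purely illustrative; only the two renames come from the notes above):

```python
# Illustrative only: the payload shape is made up, but the renames
# (idField -> id_field, cypherRead -> cypher.read) are real.
old_style = {"job": "cypherRead", "idField": "id"}    # v3 and earlier
new_style = {"job": "cypher.read", "id_field": "id"}  # v4 onward
```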
v3.1 - Plug some Memory Leaks
Some fixes to critical reliability issues with the v3 release:
- Set up the `VectorSchemaRoot` to use the memory allocator used by the flushing task.
- Close the `VectorSchemaRoot` before closing the allocators.
- Add some delays in the busy loop when attempting to allocate memory (in `WorkBuffer.init()`)...we were failing too fast. (A sketch of the pattern follows this list.)
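The actual fix lives in the plugin's Java code, but the pattern behind that last bullet is general: back off briefly between allocation attempts instead of spinning and giving up. A language-agnostic sketch in Python (names and timings are illustrative, not the real `WorkBuffer` code):

```python
import time

def allocate_with_backoff(allocate, attempts=100, delay_s=0.01):
    """Retry a transiently-failing allocation instead of failing fast.

    The old busy loop retried with no pause and burned through its
    attempts almost instantly; a short sleep gives in-flight buffers
    time to be released back to the allocator.
    """
    for _ in range(attempts):
        buf = allocate()
        if buf is not None:
            return buf
        time.sleep(delay_s)
    raise MemoryError("allocator still exhausted after retries")
```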
This version should be used in lieu of v3.
v3 - 2-hops and New Plumbing
- 🧪 Experimental k-hop (for `k=2`) implementation...see KHOP.md for details.
- 👨‍🔧 Major replumbing of the `Producer` code for reading streams, removing semaphores and lots of lock-contention points. Still WIP, but showing promise at increasing performance of all read-related jobs.
- 👟 Snuck in some special "extra" parameters that can be passed in GDS Read actions to tweak partition count, batch size, and list-length parameters (for k-hop) on a per-job basis. (A hypothetical payload follows this list.)
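A rough picture of what those per-job extras might look like on the wire (every key name below is a guess at the partition-count, batch-size, and list-length parameters; check the source for the real spelling):

```python
# Hypothetical GDS read payload showing where the per-job tuning knobs
# would ride along. The "extra" key names are guesses, not the plugin's
# actual parameter names.
gds_read_job = {
    "db": "neo4j",
    "graph": "mygraph",
    "partitions": 8,       # parallelism for this job only
    "batch_size": 10_000,  # rows per Arrow record batch
    "list_size": 1_000,    # cap on k-hop adjacency-list length
}
```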
Next up: more performance tuning! 🏎️
v2 - TLS Support & GDS Write Improvements
New Features
- TLS support (not yet supporting mutual TLS) for both client and server. A full-chain certificate and private key can be provided to the server via the new `ARROW_TLS_CERTIFICATE` and `ARROW_TLS_PRIVATE_KEY` env vars. The Python `neo4j_arrow.py` client has been updated to allow enabling TLS and also disabling certificate validation when needed. (A client-side sketch follows.)
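On the client side, the same can be done directly with PyArrow Flight if you are not using the wrapper; the host, port, and certificate path below are placeholders:

```python
import pyarrow.flight as flight

# Connect over TLS, trusting the server's full-chain certificate.
with open("fullchain.pem", "rb") as f:
    client = flight.FlightClient(
        "grpc+tls://neo4j.example.com:9999",
        tls_root_certs=f.read(),
    )

# Or, against a self-signed certificate in development, skip validation.
dev_client = flight.FlightClient(
    "grpc+tls://localhost:9999",
    disable_server_verification=True,
)
```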
Improvements & Fixes
- Easier-to-use Arrow memory settings, supporting suffixes (e.g. `g`, `m`, `t`) like when setting JVM heap size. For instance: `MAX_MEM_GLOBAL=52g`
- Longer default timeouts for write jobs.
- Fixed a memory leak when writing GDS graphs...they now clean up properly when using `CALL gds.graph.drop()` or when shutting down the server.
- Support for passing native PyArrow `Table` instances when putting a stream via the `neo4j_arrow` client. (See the sketch after this list.)
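The PyArrow `Table` support means a stream can be pushed without converting to record batches by hand. A sketch with the raw Flight API (the descriptor naming is a placeholder; the `neo4j_arrow` client wraps these steps):

```python
import pyarrow as pa
import pyarrow.flight as flight

client = flight.FlightClient("grpc+tcp://localhost:9999")

# A native PyArrow Table of nodes to push to the server.
table = pa.table({
    "id": pa.array([1, 2, 3], type=pa.int64()),
    "labels": pa.array([["Person"], ["Person"], ["Movie"]]),
})

# Put the stream; how the descriptor should be named is a placeholder here.
descriptor = flight.FlightDescriptor.for_command(b"my-node-stream")
writer, _ = client.do_put(descriptor, table.schema)
writer.write_table(table)
writer.close()
```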
Known Issues & Future Work
- No ability to write relationship properties
- Cypher support needs some more love
- Error handling of jobs could use improvement
- GDS Writes of relationships end up using inefficient Java types for adjacency lists, etc.
- GDS Write jobs could be improved by removing the synchronous step of fully collecting the stream before processing it
v1 - The Line in the Sand
Figured I need to start "tagging" something to have a referenceable build I've personally tested.
At this point, the following should be working:
- reading nodes and their labels and properties
- reading relationships and their types and properties
- writing nodes with labels and properties (those supported by GDS)
- writing relationships and types (no properties, yet!)
There are definite perf bottlenecks in some post-processing after doing writes as well as some timing issues in the write jobs.
For instance, if you want to build a graph you need to do the following (sketched in code after this list):
- Write the nodes, supplying a new graph name (it will be created)
- Wait until you see on the server side (via logs) that it's complete, as the client will report success once the data transfers. (I need a status indicator somewhere.)
- Then write the relationships.
- Same as with nodes, keep an eye on the server and see when it completes.
- The graph should be available for use now.
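Put together as code, the sequence looks roughly like this; the helper functions are hypothetical stand-ins for the real client calls, since there is no status API yet and "complete" means watching the server logs:

```python
import time

# Hypothetical stand-ins for the neo4j_arrow client calls; the real
# wrapper's method names and signatures differ.
def write_nodes(graph, table):
    print(f"streaming nodes into {graph}...")

def write_relationships(graph, table):
    print(f"streaming relationships into {graph}...")

def server_side_complete(graph, phase):
    # Today this means watching the server logs; there is no status API.
    return True

nodes, rels = ..., ...  # placeholder node/relationship data

# 1. Write the nodes, supplying a new graph name (it will be created).
write_nodes("mygraph", nodes)

# 2. The client reports success once the data transfers, so wait until
#    the server side says node processing is actually done.
while not server_side_complete("mygraph", "nodes"):
    time.sleep(1)

# 3. Then write the relationships and wait the same way.
write_relationships("mygraph", rels)
while not server_side_complete("mygraph", "relationships"):
    time.sleep(1)

# 4. The graph should now be available for use.
```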