Octopus is a powerful and flexible data processing platform designed for large-scale analytics. By leveraging technologies like Neo4j, PySpark, and Apache Arrow Flight, it enables efficient querying, processing, and analysis of graph data. Additionally, Octopus offers a Lite execution pipeline optimized for lightweight queries that require minimal computation and have a small data footprint.
Octopus is offered as a Platform-as-a-Service (PaaS), allowing users to leverage its capabilities without concern for underlying computational resources or performance management. It seamlessly manages execution scalability, resource allocation, and performance optimization, enabling users to focus on data processing and analytics tasks.
Table of Contents
- Octopus vs Neo4j
- Features
- Server Building Blocks
- Octopus Lite
- Server Components Interaction
- Octopus Query Classifier
- Local Deployment
- Octopus Server Firefighting Guide
- Contact
Octopus is not Neo4j. It uses Neo4j as one of its components to store graphs and enable operations on them. Compared to Neo4j alone, Octopus offers higher-performance query execution and a Python-friendly output format.
Query executions in Octopus are asynchronous, meaning your code’s execution flow will not be blocked while queries are running. Queries are executed on the server without using resources from your client machine.
Query outputs in Octopus can be retrieved on demand, allowing users to decide when to pull output partitions to their client machine.
Query outputs in Octopus are PyArrow tables, which can be manipulated much like pandas DataFrames while offering high performance thanks to their columnar format. In contrast, Neo4j outputs are Neo4j driver objects, which must be converted to a Python-friendly format for downstream tasks, a time-consuming step.
Due to its design, Octopus enables you to implement high-performance workflows in your application code, such as overlapping query execution with other tasks or overlapping communication (when retrieving outputs from Octopus) with other computations.
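As a hedged sketch of the overlapping pattern described above: the client class, method names, and timings below are invented stand-ins (not the real Octopus client API), using a thread pool to simulate asynchronous server-side execution.

```python
# Illustrative stub that mimics Octopus's asynchronous query submission.
# `OctopusClient`, `submit`, and the simulated delay are assumptions for this sketch.
from concurrent.futures import ThreadPoolExecutor
import time

class OctopusClient:
    def __init__(self):
        self._pool = ThreadPoolExecutor(max_workers=2)

    def submit(self, cypher: str):
        # In the real platform the query runs on the server; here we simulate it.
        return self._pool.submit(self._run, cypher)

    def _run(self, cypher: str):
        time.sleep(0.1)                 # stand-in for server-side execution time
        return [{"n": 1}, {"n": 2}]     # stand-in for the query output

client = OctopusClient()
future = client.submit("MATCH (n) RETURN n LIMIT 2")  # non-blocking submission
local_result = sum(range(1000))    # overlap: do unrelated local work meanwhile
rows = future.result()             # block only when the output is actually needed
```

The key point is the last three lines: submission returns immediately, so the client can interleave its own computation with server-side execution and only synchronize when the results are consumed.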
- Graph Database Integration: Utilize Neo4j for efficient graph data querying and management.
- Distributed Processing: Leverage PySpark for scalable and distributed data processing.
- High-Performance Data Transfer: Use Apache Arrow Flight for fast and efficient data transport.
- Iterable-like Python API: Octopus client uses iterable Python syntax to retrieve query results from the server.
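To illustrate the iterable-like retrieval style, here is a minimal generator-based sketch; the function name, execution ID, and partition contents are invented, and plain lists stand in for Arrow partitions.

```python
# Illustrative sketch of pulling output partitions on demand (names are assumptions).
def iter_partitions(execution_id: str):
    """Yield output partitions one at a time, simulating pulls from the server."""
    partitions = ([1, 2], [3, 4], [5])   # stand-in for Arrow output partitions
    for part in partitions:
        yield part                       # each iteration transfers one partition

# Consume partitions with ordinary Python iteration syntax.
values = [x for part in iter_partitions("exec-42") for x in part]
```

Because partitions are yielded lazily, the client controls when each partition crosses the network, matching the on-demand retrieval behavior described above.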
The Octopus server is designed using a micro-services architecture to ensure flexibility, scalability, and ease of maintenance. Each core component is loosely coupled, allowing for independent development, deployment, and scaling of individual services. The key building blocks of the Octopus server are as follows:
The Job Dispatcher is responsible for managing and orchestrating execution requests. It receives queries from the Octopus client, builds the appropriate spark-submit commands, and launches the job for execution. It also tracks execution status and logs. In the current design, the Job Dispatcher service also runs the Spark Master and the Spark History Server.
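A hedged sketch of the command-building step: the helper name, paths, and configuration values below are invented for illustration, while the flags themselves (`--master`, `--deploy-mode`, `--conf`) are standard spark-submit options.

```python
# Hypothetical sketch of how a dispatcher might assemble a spark-submit command.
def build_spark_submit(master_url: str, app_path: str, query_file: str) -> list:
    """Return an argv list suitable for launching a PySpark job."""
    return [
        "spark-submit",
        "--master", master_url,              # e.g. spark://host:7077 or k8s://...
        "--deploy-mode", "client",
        "--conf", "spark.executor.memory=4g",  # illustrative resource setting
        app_path,                            # the PySpark job that runs the query
        "--query-file", query_file,          # hypothetical job argument
    ]

cmd = build_spark_submit("spark://spark-master:7077", "jobs/run_query.py", "q.cypher")
```

In practice the dispatcher would pass such an argv list to a process launcher and record the resulting job for status tracking.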
Spark is the engine behind distributed query execution in Octopus. It processes complex, large-scale queries by leveraging its powerful distributed computing framework. The Job Dispatcher submits jobs to Spark in either standalone or Kubernetes deployment mode, ensuring resource management appropriate to the environment.
Neo4j serves as the graph database backend, storing the graph structure and relationships. It enables fast, efficient querying of connected data, and its integration with Octopus ensures that users can easily retrieve insights from complex datasets.
The Arrow Flight Output Server is responsible for storing query results and making them available for consumption. After Spark processes a query, the result is stored in Arrow Flight, enabling rapid data retrieval and optimized data transfer to clients. The use of Arrow Flight ensures that query results can be streamed efficiently, improving overall performance for data-heavy applications.
Octopus Lite is an execution pipeline optimized for lightweight queries on the Octopus server. It reduces per-query overhead, making it ideal for queries with low computational cost and a minimal data footprint. Lite is not recommended for large-scale data processing.
- Optimized for Small Queries: Reduces execution overhead for faster response times.
- Seamless Integration: Works alongside the default Octopus execution model.
- Efficient Resource Utilization: Avoids unnecessary resource allocation for simple queries.
Octopus Lite modifies the default execution pipeline to minimize latency:
- Direct Execution: Queries are executed with minimal orchestration.
- Bypassing Components: Eliminates dependencies on Spark and Arrow Flight.
- Optimized Result Handling: Smaller results are returned directly without intermediate storage.
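The three points above can be sketched as a single synchronous call: the dispatcher executes the query in-process (no Spark, no Arrow Flight) and returns the Execution ID together with the output in one response. All names and values below are invented for illustration.

```python
# Illustrative sketch of the Lite flow: direct execution, direct result return.
def dispatch_lite(query: str) -> dict:
    """Run a lightweight query in-process and return ID, status, and output together."""
    rows = [{"count": 3}]   # stand-in for direct (non-Spark) query execution
    return {"execution_id": "lite-001", "status": "FINISHED", "output": rows}

response = dispatch_lite("MATCH (n) RETURN count(n)")  # blocks until complete
```

Contrast this with the default pipeline, where the response carries only an Execution ID and the output is pulled later from the Arrow Flight Output Server.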
The following sequence diagram illustrates the interactions between the different components of Octopus server for the default execution pipeline:
Notes:
- The default execution pipeline of Octopus is asynchronous. The client code does not block to wait for query execution to finish on the server.
- Job Dispatcher responds to a client execution request with an Execution ID, which can be used to check the execution status.
- Client retrieves output partitions directly from the Arrow Flight Output Server.
The following sequence diagram shows the interactions between the different components of Octopus server for the lite execution pipeline:
Notes:
- The execution pipeline of Octopus Lite is synchronous. The client calling process blocks until the query execution on the server has completed.
- Job Dispatcher executes the query, then responds to the client execution request with an Execution ID and the query output.
Octopus Query Classifier (QClassifier) helps determine the most suitable execution pipeline for a given query. It uses a classification model that assigns each query a label of 1 or 0:
- `1` (lightweight): The query requires minimal time, compute, and memory resources, making it best suited for Octopus Lite.
- `0` (heavyweight): The query requires more time, compute, and memory resources, and is therefore best executed with the default Octopus pipeline.
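As a toy illustration of this routing decision, the heuristic below is entirely invented: the real QClassifier uses a trained classification model, whereas this sketch hard-codes a row-count threshold and a few query patterns purely to show the 1/0 labeling contract.

```python
# Hypothetical heuristic stand-in for QClassifier (thresholds and hints invented).
def classify(query: str, estimated_rows: int) -> int:
    """Return 1 (lightweight -> Octopus Lite) or 0 (heavyweight -> default pipeline)."""
    heavy_hints = ("allShortestPaths", "collect(", "ORDER BY")
    if estimated_rows < 10_000 and not any(hint in query for hint in heavy_hints):
        return 1   # lightweight: route to Octopus Lite
    return 0       # heavyweight: route to the default Spark-based pipeline

light = classify("MATCH (n:User) RETURN n LIMIT 10", estimated_rows=10)
heavy = classify("MATCH (n) RETURN n ORDER BY n.name", estimated_rows=1_000_000)
```

Whatever model is used, the contract is the same: the label selects the pipeline before the query is dispatched.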
The following diagram illustrates the flow of a user request through the system: