SonyResearch/octopus

Octopus

Octopus is a powerful and flexible data processing platform designed for large-scale analytics. By leveraging technologies like Neo4j, PySpark, and Apache Arrow Flight, it enables efficient querying, processing, and analysis of graph data. Additionally, Octopus offers a Lite execution pipeline optimized for lightweight queries that require minimal computation and have a small data footprint.

Octopus is offered as a Platform-as-a-Service (PaaS), allowing users to leverage its capabilities without concern for underlying computational resources or performance management. It seamlessly manages execution scalability, resource allocation, and performance optimization, enabling users to focus on data processing and analytics tasks.

Octopus vs Neo4j

Octopus is not Neo4j. It uses Neo4j as one of its components to store graphs and run graph operations. Compared with using Neo4j directly, Octopus offers higher-performance query execution and a more Python-friendly output format.

Query executions in Octopus are asynchronous, meaning your code’s execution flow will not be blocked while queries are running. Queries are executed on the server without using resources from your client machine.

Query outputs in Octopus can be retrieved on demand, allowing users to decide when to pull output partitions to their client machine.

Query outputs in Octopus are PyArrow tables, which behave much like pandas DataFrames while offering high-performance manipulation thanks to their columnar format. In contrast, Neo4j returns Neo4j objects, which must be converted to a Python-friendly format before downstream tasks can use them, a time-consuming process.

Due to its design, Octopus enables you to implement high-performance workflows in your application code, such as overlapping query execution with other tasks or overlapping communication (when retrieving outputs from Octopus) with other computations.
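The overlap described above can be sketched with the Python standard library alone. This is a minimal illustration, not the Octopus client API: `run_remote_query` is a hypothetical stand-in that sleeps to simulate server-side execution, and `local_preprocessing` is any client-side work that does not depend on the query output.

```python
from __future__ import annotations

import time
from concurrent.futures import ThreadPoolExecutor


def run_remote_query(query: str) -> list[dict]:
    """Hypothetical stand-in for a query that executes on the server."""
    time.sleep(0.2)  # simulates server-side execution time
    return [{"query": query, "rows": 3}]


def local_preprocessing() -> int:
    """Client-side work that does not depend on the query output."""
    return sum(i * i for i in range(1000))


def overlapped_workflow(query: str) -> tuple[int, list[dict]]:
    """Start the query, do other work, and block only when the output is needed."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(run_remote_query, query)  # returns immediately
        local_result = local_preprocessing()  # runs while the query is in flight
        query_result = pending.result()  # waits here for the query output
    return local_result, query_result


if __name__ == "__main__":
    print(overlapped_workflow("MATCH (n) RETURN n LIMIT 3"))
```

The same shape applies to overlapping output retrieval with computation: request the next partition in the background while processing the current one.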

Features

  • Graph Database Integration: Utilize Neo4j for efficient graph data querying and management.
  • Distributed Processing: Leverage PySpark for scalable and distributed data processing.
  • High-Performance Data Transfer: Use Apache Arrow Flight for fast and efficient data transport.
  • Iterable-like Python API: The Octopus client exposes query results as Python iterables, so results are retrieved from the server with ordinary iteration syntax.
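To illustrate what an iterable-style result API can look like, here is a minimal sketch in which a partition is fetched only when the loop reaches it. The class name `QueryOutput`, its constructor arguments, and `fake_fetch` are assumptions made for this example; they are not taken from the Octopus client.

```python
from __future__ import annotations

from typing import Callable, Iterator


class QueryOutput:
    """Hypothetical result handle: iterating it pulls one partition at a time."""

    def __init__(self, num_partitions: int, fetch: Callable[[int], list[dict]]):
        self._num_partitions = num_partitions
        self._fetch = fetch  # called only when a partition is actually requested

    def __len__(self) -> int:
        return self._num_partitions

    def __iter__(self) -> Iterator[list[dict]]:
        for index in range(self._num_partitions):
            yield self._fetch(index)  # lazy: nothing is fetched ahead of time


def fake_fetch(index: int) -> list[dict]:
    """Stand-in for a network call that returns one output partition."""
    return [{"partition": index, "value": index * 10}]


if __name__ == "__main__":
    for partition in QueryOutput(num_partitions=3, fetch=fake_fetch):
        print(partition)
```

Because `__iter__` is a generator, a caller that stops after the first partition never triggers the remaining fetches.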

Server Building Blocks

The Octopus server is designed using a micro-services architecture to ensure flexibility, scalability, and ease of maintenance. Each core component is loosely coupled, allowing for independent development, deployment, and scaling of individual services. The key building blocks of the Octopus server are as follows:

Job Dispatcher

The Job Dispatcher is responsible for managing and orchestrating execution requests. It receives queries from the Octopus client, builds the appropriate Spark-submit commands, and launches the job for execution. It also tracks the execution status and logs. In the current design, the Job Dispatcher service also runs Spark Master and Spark History Server.
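As a rough sketch of the command-building step, the function below assembles a `spark-submit` argument list. The flags `--master`, `--name`, and `--conf` are standard `spark-submit` options; the function name, the application path `run_query.py`, the `--execution-id` argument, and the master URL are hypothetical values chosen for illustration, not the dispatcher's actual ones.

```python
from __future__ import annotations

import shlex


def build_spark_submit(
    master: str,
    app_path: str,
    app_args: list[str],
    conf: dict[str, str] | None = None,
    name: str | None = None,
) -> list[str]:
    """Return a spark-submit command as an argument list (no shell quoting needed)."""
    command = ["spark-submit", "--master", master]
    if name:
        command += ["--name", name]
    # Sort the settings so the generated command is deterministic.
    for key, value in sorted((conf or {}).items()):
        command += ["--conf", f"{key}={value}"]
    command.append(app_path)  # the application comes after all Spark options
    command += app_args  # everything after the application is passed to it
    return command


if __name__ == "__main__":
    cmd = build_spark_submit(
        master="spark://dispatcher:7077",  # hypothetical standalone master URL
        app_path="run_query.py",  # hypothetical application script
        app_args=["--execution-id", "exec-001"],
        conf={"spark.executor.memory": "2g"},
        name="octopus-exec-001",
    )
    print(shlex.join(cmd))
```

Building the command as a list rather than a single string lets it be passed to `subprocess.run` without invoking a shell, which avoids quoting problems when a query or argument contains spaces.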

Spark

Spark is the engine behind distributed query execution in Octopus. It processes complex, large-scale queries using its distributed computing framework. The Job Dispatcher submits jobs to Spark, which can run in either standalone or Kubernetes deployment mode, with resources managed according to the environment.

Neo4j

Neo4j serves as the graph database backend, storing the graph structure and relationships. It enables fast, efficient querying of connected data, and its integration with Octopus ensures that users can easily retrieve insights from complex datasets.

Arrow Flight Output Server

The Arrow Flight Output Server is responsible for storing query results and making them available for consumption. After Spark processes a query, the result is stored on the Arrow Flight Output Server, from which clients can retrieve it quickly. Because Arrow Flight streams results efficiently, it improves overall performance for data-heavy applications.

Octopus Lite

Octopus Lite is an execution pipeline for lightweight queries on the Octopus server. It reduces execution overhead, making it well suited to queries with short compute times and a small data footprint. Lite is not recommended for large-scale data processing.

Key Features

  • Optimized for Small Queries: Reduces execution overhead for faster response times.
  • Seamless Integration: Works alongside the default Octopus execution model.
  • Efficient Resource Utilization: Avoids unnecessary resource allocation for simple queries.

Execution Model

Octopus Lite modifies the default execution pipeline to minimize latency:

  1. Direct Execution: Queries are executed with minimal orchestration.
  2. Bypassing Components: Eliminates dependencies on Spark and Arrow Flight.
  3. Optimized Result Handling: Smaller results are returned directly without intermediate storage.

Server Components Interaction

Default Execution Pipeline

The following sequence diagram illustrates the interactions between the different components of Octopus server for the default execution pipeline:

Octopus Server Sequence Diagram

Notes:

  • The default execution pipeline of Octopus is asynchronous. The client code does not block to wait for query execution to finish on the server.
  • The Job Dispatcher responds to a client execution request with an Execution ID, which can be used to check the execution status.
  • The client retrieves output partitions directly from the Arrow Flight Output Server.
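The submit-then-poll pattern in these notes can be sketched with an in-memory stand-in for the Job Dispatcher. Everything here is hypothetical: the class `FakeDispatcher`, the method names `submit` and `status`, the status strings `RUNNING` and `FINISHED`, and the `exec-N` ID format are assumptions for illustration and do not describe the real service.

```python
from __future__ import annotations

import itertools
import time


class FakeDispatcher:
    """In-memory stand-in for the Job Dispatcher; not the real service."""

    def __init__(self, polls_until_done: int = 2):
        self._ids = itertools.count(1)
        self._remaining: dict[str, int] = {}
        self._polls_until_done = polls_until_done

    def submit(self, query: str) -> str:
        """Register the query and return an Execution ID without waiting for it."""
        execution_id = f"exec-{next(self._ids)}"
        self._remaining[execution_id] = self._polls_until_done
        return execution_id

    def status(self, execution_id: str) -> str:
        """Report RUNNING for the first few polls, then FINISHED."""
        if self._remaining[execution_id] > 0:
            self._remaining[execution_id] -= 1
            return "RUNNING"
        return "FINISHED"


def wait_for(dispatcher: FakeDispatcher, execution_id: str,
             interval: float = 0.01, max_polls: int = 100) -> int:
    """Poll until the execution finishes; return the number of polls made."""
    for polls in range(1, max_polls + 1):
        if dispatcher.status(execution_id) == "FINISHED":
            return polls
        time.sleep(interval)
    raise TimeoutError(f"{execution_id} did not finish after {max_polls} polls")


if __name__ == "__main__":
    dispatcher = FakeDispatcher()
    execution_id = dispatcher.submit("MATCH (n) RETURN count(n)")
    # The client is free to do other work here before checking the status.
    print(execution_id, wait_for(dispatcher, execution_id))
```

In the Lite pipeline this loop would not be needed, since the response to the execution request already carries the query output.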

Octopus Lite Execution Pipeline

The following sequence diagram shows the interactions between the different components of Octopus server for the lite execution pipeline:

Octopus Lite Server Sequence Diagram

Notes:

  • The execution pipeline of Octopus Lite is synchronous. The client calling process blocks until the query execution on the server has completed.
  • The Job Dispatcher executes the query, then responds to the client execution request with an Execution ID and the query output.

Octopus Query Classifier

Octopus Query Classifier (QClassifier) helps determine the most suitable execution pipeline for a given query. It uses a classification model that assigns each query a label of 1 or 0:

  • 1 (lightweight): The query requires minimal time, compute, and memory resources, making it best suited for Octopus Lite.

  • 0 (heavyweight): The query requires more time, compute, and memory resources, and is therefore best executed with the default Octopus pipeline.

The following diagram illustrates the flow of a user request through the system:

Octopus Query Classifier Diagram
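The routing decision described in this section can be sketched as a small dispatch function. The label values 1 and 0 come from the text above; the function names and the keyword-based `toy_classifier` are assumptions for illustration only. The actual QClassifier uses a classification model, which this sketch does not attempt to reproduce.

```python
from __future__ import annotations

from typing import Any, Callable

LIGHTWEIGHT = 1  # best suited for Octopus Lite
HEAVYWEIGHT = 0  # best executed with the default pipeline


def route_query(
    query: str,
    classify: Callable[[str], int],
    run_lite: Callable[[str], Any],
    run_default: Callable[[str], Any],
) -> tuple[str, Any]:
    """Pick a pipeline from the classifier label and run the query on it."""
    label = classify(query)
    if label == LIGHTWEIGHT:
        return "lite", run_lite(query)
    if label == HEAVYWEIGHT:
        return "default", run_default(query)
    raise ValueError(f"unexpected classifier label: {label!r}")


def toy_classifier(query: str) -> int:
    """Toy rule, NOT the real model: treat queries with a LIMIT clause as lightweight."""
    return LIGHTWEIGHT if "LIMIT" in query.upper() else HEAVYWEIGHT


if __name__ == "__main__":
    for q in ("MATCH (n) RETURN n LIMIT 5", "MATCH (a)-[*]->(b) RETURN a, b"):
        print(route_query(q, toy_classifier, lambda _: "rows", lambda _: "exec-1"))
```

Passing the classifier and both runners in as callables keeps the routing logic independent of how either pipeline is invoked.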

About

Octopus - A high performance data processing and analysis platform for graphs
