Welcome to the Apache Pinot Getting Started guide. This repository will help you set up and run a demonstration that involves streaming and batch data sources. The demonstration includes a real-time stream of movie ratings and a batch data source of movies, which can be joined in Apache Pinot for querying.
```mermaid
flowchart LR
Stream-->k[Apache Kafka]-->p[Apache Pinot]
Batch-->p
p-->mrp[Movie Ratings]
p-->Movies
```
To quickly see the demonstration in action, run:

```shell
make
```
For a detailed step-by-step setup, please refer to the Step-by-Step Details section.
If you're ready to explore the advanced features, jump directly to the Apache Pinot Advanced Usage section to run a multi-stage join between the ratings and movies tables.
## Step-by-Step Details

This section provides detailed instructions to get the demonstration up and running from scratch.
Apache Pinot queries real-time data through streaming platforms like Apache Kafka. This setup includes a mock stream producer using Python to write data into Kafka.
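The producer's job is simply to serialize rating events as JSON and publish them to Kafka. Below is a minimal sketch of that idea; the field names (`movieId`, `rating`, `ratingTime`) are illustrative assumptions, and the real producer ships in this repo's producer image. Sending requires the `kafka-python` package and a reachable broker, so the send is guarded:

```python
import json
import random
import time

def make_rating(movie_id: int) -> dict:
    """Build one rating event of the kind streamed into the movie_ratings topic.
    Field names here are assumptions; see the repo's producer for the real schema."""
    return {
        "movieId": movie_id,
        "rating": round(random.random(), 2),   # 0.00 .. 1.00
        "ratingTime": int(time.time() * 1000), # epoch millis
    }

def encode(event: dict) -> bytes:
    """Kafka messages are plain UTF-8 JSON in this sketch."""
    return json.dumps(event).encode("utf-8")

if __name__ == "__main__":
    # Guarded: only runs when a Kafka broker is reachable.
    from kafka import KafkaProducer  # pip install kafka-python
    producer = KafkaProducer(bootstrap_servers="localhost:9092")
    producer.send("movie_ratings", encode(make_rating(42)))
    producer.flush()
```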
First, build the producer image and start all services:

```shell
docker compose build --no-cache
docker compose up -d
```
The `docker-compose.yml` file configures the following services:
- ZooKeeper (dedicated to Pinot)
- Pinot Controller, Broker, and Server
- Kafka (in KRaft mode, i.e. without ZooKeeper)
- Python producer
Next, create a Kafka topic for the producer to send data to, which Pinot will then read from:
```shell
docker exec -it kafka kafka-topics.sh \
  --bootstrap-server localhost:9092 \
  --create \
  --topic movie_ratings
```
To verify the stream, check the data flowing into the Kafka topic:
```shell
docker exec -it kafka \
  kafka-console-consumer.sh \
  --bootstrap-server localhost:9092 \
  --topic movie_ratings
```
In Pinot, create two types of tables:

- A REALTIME table for streaming data (`movie_ratings`).
- An OFFLINE table for batch data (`movies`).
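Both table types are defined by a schema that declares each column's role (dimension, metric, or time). As a rough illustration, here is what the ratings schema might look like, expressed as a Python dict; the column names and types are assumptions, and the authoritative file is `/tmp/pinot/table/ratings.schema.json` in this repo:

```python
# Illustrative sketch of a Pinot schema; the real one is
# /tmp/pinot/table/ratings.schema.json. Field names are assumptions.
ratings_schema = {
    "schemaName": "movie_ratings",
    "dimensionFieldSpecs": [
        {"name": "movieId", "dataType": "INT"},
    ],
    "metricFieldSpecs": [
        {"name": "rating", "dataType": "FLOAT"},
    ],
    "dateTimeFieldSpecs": [
        {
            "name": "ratingTime",
            "dataType": "LONG",
            "format": "1:MILLISECONDS:EPOCH",
            "granularity": "1:MILLISECONDS",
        }
    ],
}
```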
To query the Kafka topic in Pinot, we add the REALTIME table using the `pinot-admin` CLI, providing it with a schema and a table configuration. The table configuration contains the connection information to Kafka.
```shell
docker exec -it pinot-controller ./bin/pinot-admin.sh \
  AddTable \
  -tableConfigFile /tmp/pinot/table/ratings.table.json \
  -schemaFile /tmp/pinot/table/ratings.schema.json \
  -exec
```
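The Kafka connection details live in the `streamConfigs` section of the REALTIME table config. The sketch below shows the usual shape of that section (key names follow Pinot's Kafka stream-ingestion conventions, but verify them against the repo's `ratings.table.json`):

```python
# Sketch of the streamConfigs section that points Pinot at Kafka.
# Key names follow Pinot's Kafka stream-ingestion docs; verify against
# the repo's /tmp/pinot/table/ratings.table.json.
stream_configs = {
    "streamType": "kafka",
    "stream.kafka.topic.name": "movie_ratings",
    "stream.kafka.broker.list": "kafka:9092",  # assumed in-network address
    "stream.kafka.consumer.type": "lowlevel",
    "stream.kafka.consumer.factory.class.name":
        "org.apache.pinot.plugin.stream.kafka20.KafkaConsumerFactory",
    "stream.kafka.decoder.class.name":
        "org.apache.pinot.plugin.stream.kafka.KafkaJSONMessageDecoder",
}
```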
At this point, you should be able to query the topic in the Pinot console.
We now do the same for the OFFLINE table using this schema and table configuration.
```shell
docker exec -it pinot-controller ./bin/pinot-admin.sh \
  AddTable \
  -tableConfigFile /tmp/pinot/table/movies.table.json \
  -schemaFile /tmp/pinot/table/movies.schema.json \
  -exec
```
Once added, the OFFLINE table will not have any data. Let's add data in the next step.
Use the following command to load data into the OFFLINE movies table:
```shell
docker exec -it pinot-controller ./bin/pinot-admin.sh \
  LaunchDataIngestionJob \
  -jobSpecFile /tmp/pinot/table/jobspec.yaml
```
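The job spec tells Pinot where to read the batch input from and which table to build segments for. As a rough Python mirror of its shape (the authoritative file is `/tmp/pinot/table/jobspec.yaml`; the paths below are assumptions):

```python
# Rough mirror of a standalone batch-ingestion job spec; the authoritative
# file is /tmp/pinot/table/jobspec.yaml. Directory paths are assumptions.
job_spec = {
    "executionFrameworkSpec": {"name": "standalone"},
    "jobType": "SegmentCreationAndTarPush",
    "inputDirURI": "/tmp/pinot/data/",       # assumed location of movies data
    "outputDirURI": "/tmp/pinot/segments/",  # assumed segment output dir
    "tableSpec": {"tableName": "movies"},
    "pinotClusterSpecs": [{"controllerURI": "http://pinot-controller:9000"}],
}
```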
Now, both the REALTIME and OFFLINE tables are queryable.
## Apache Pinot Advanced Usage

To perform complex queries such as joins, open the Pinot console and enable **Use Multi-Stage Engine**. Example query:
```sql
select
    r.rating latest_rating,
    m.rating initial_rating,
    m.title,
    m.genres,
    m.releaseYear
from movies m
left join movie_ratings r on m.movieId = r.movieId
where r.rating > 0.9
order by r.rating desc
limit 10
```
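Multi-stage queries can also be issued programmatically against the broker's SQL endpoint. A minimal sketch using only the standard library, assuming the broker's default query port (8099) and the `useMultistageEngine=true` query option; the actual HTTP call is guarded since it needs a running cluster:

```python
import json

def build_query_request(sql: str) -> dict:
    """Payload for POST /query/sql on the Pinot broker. The queryOptions
    flag requests the multi-stage (v2) engine; port 8099 is the assumed
    default broker port for this setup."""
    return {
        "sql": sql,
        "queryOptions": "useMultistageEngine=true",
    }

if __name__ == "__main__":
    # Guarded: only runs against a live Pinot broker.
    import urllib.request
    req = urllib.request.Request(
        "http://localhost:8099/query/sql",
        data=json.dumps(build_query_request(
            "select m.title, r.rating from movies m "
            "join movie_ratings r on m.movieId = r.movieId limit 10"
        )).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    print(urllib.request.urlopen(req).read().decode())
```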
To stop and remove all services related to the demonstration, run:

```shell
docker compose down
```
If you encounter "No space left on device" during the Docker build process, free up space with:

```shell
docker system prune -f
```
For more detailed tutorials and documentation, visit the StarTree developer page.