Loading data into HBase is not trivial. We want the demo to show how this can be done and to provide some guidance and best practices.
## Aims
- Load data row by row (NiFi)
- Batch processing CSV files (MapReduce)
- Direct load of HFiles
- Test the HBase Spark connector (New demo: ingest data to hbase stackablectl#71); a connector sketch follows this list
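
As a starting point for the connector aim, here is a minimal sketch of reading an HBase table into a Spark DataFrame with the Apache hbase-connectors Spark module. The table name `demo:measurements`, the column family `cf`, and the column mapping are made-up placeholders, and the sketch assumes the connector JAR and an `hbase-site.xml` pointing at the Stackable HBase cluster are on the Spark classpath.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object ConnectorReadSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("hbase-spark-connector-read")
      .getOrCreate()

    // Read the (hypothetical) table demo:measurements as a DataFrame.
    // The mapping ties DataFrame columns to the row key and to columns in family "cf".
    val df = spark.read
      .format("org.apache.hadoop.hbase.spark")
      .option("hbase.table", "demo:measurements")
      .option("hbase.columns.mapping",
        "id STRING :key, name STRING cf:name, value STRING cf:value")
      .option("hbase.spark.use.hbasecontext", value = false)
      .load()

    // Simple filters like this can be pushed down to HBase by the connector.
    df.filter(col("name") === "temperature").show(10)

    spark.stop()
  }
}
```

Writing back works the same way with `df.write`, the same format, and the same column mapping, which would also cover the connector query task further down.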
## Tasks
- Load data into HDFS from S3
- Parse CSV and create HFiles
- Load incremental HFiles into HBase (these two steps are sketched together after this list)
- Load a streaming data source into HBase
- Stackable cluster configuration
- Verify the data is there (sanity check) using HBase shell
- Create a Phoenix view over the table (see the Phoenix sketch after this list)
- Configure Phoenix as a data source in Superset
- Create a visualisation using Phoenix JDBC and Superset
- Query HBase using Spark HBase connector
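
For the CSV-to-HFiles and incremental-load tasks, the sketch below shows one common pattern: a MapReduce job that turns CSV lines into `Put`s, writes region-aligned HFiles with `HFileOutputFormat2`, and then hands the result to the region servers. The table name, column family, CSV layout (`id,name,value`), and paths are illustrative assumptions, and the final call assumes HBase 2.2+ (`BulkLoadHFiles`); older versions use `LoadIncrementalHFiles` instead.

```scala
import org.apache.hadoop.fs.Path
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Put}
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2
import org.apache.hadoop.hbase.tool.BulkLoadHFiles
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.{Job, Mapper}
import org.apache.hadoop.mapreduce.lib.input.{FileInputFormat, TextInputFormat}
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat

// Map phase: parse hypothetical "id,name,value" CSV lines into Puts keyed by the row key.
class CsvToPutMapper extends Mapper[LongWritable, Text, ImmutableBytesWritable, Put] {
  private val cf = Bytes.toBytes("cf")
  override def map(key: LongWritable, line: Text,
                   ctx: Mapper[LongWritable, Text, ImmutableBytesWritable, Put]#Context): Unit = {
    val cols = line.toString.split(',')
    val rowKey = Bytes.toBytes(cols(0))
    val put = new Put(rowKey)
    put.addColumn(cf, Bytes.toBytes("name"), Bytes.toBytes(cols(1)))
    put.addColumn(cf, Bytes.toBytes("value"), Bytes.toBytes(cols(2)))
    ctx.write(new ImmutableBytesWritable(rowKey), put)
  }
}

object CsvBulkLoadSketch {
  def main(args: Array[String]): Unit = {
    // args: <csv input dir on HDFS> <staging dir for HFiles> <table name, e.g. demo:measurements>
    val Array(input, hfileDir, tableName) = args
    val conf  = HBaseConfiguration.create()
    val conn  = ConnectionFactory.createConnection(conf)
    val table = TableName.valueOf(tableName)

    val job = Job.getInstance(conf, "csv-to-hfiles")
    job.setJarByClass(classOf[CsvToPutMapper])
    job.setMapperClass(classOf[CsvToPutMapper])
    job.setMapOutputKeyClass(classOf[ImmutableBytesWritable])
    job.setMapOutputValueClass(classOf[Put])
    job.setInputFormatClass(classOf[TextInputFormat])
    FileInputFormat.addInputPath(job, new Path(input))
    FileOutputFormat.setOutputPath(job, new Path(hfileDir))
    // Sets partitioner, reducer and output format so the HFiles line up with the table's regions.
    HFileOutputFormat2.configureIncrementalLoad(job, conn.getTable(table), conn.getRegionLocator(table))

    if (job.waitForCompletion(true)) {
      // Direct load: hand the finished HFiles over to the region servers.
      BulkLoadHFiles.create(conf).bulkLoad(table, new Path(hfileDir))
    }
    conn.close()
  }
}
```

Because the HFiles are written directly in HBase's on-disk format, this path bypasses the write-ahead log and the memstore, which is what makes it attractive for large initial loads.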
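For the Phoenix tasks, here is a hedged sketch that creates a view over the same hypothetical table and runs a sanity query through the Phoenix JDBC driver. The ZooKeeper quorum, namespace, and column names are placeholders; it assumes the Phoenix client JAR is on the classpath and, because the table lives in a namespace, that Phoenix namespace mapping is enabled.

```scala
import java.sql.DriverManager

object PhoenixViewSketch {
  def main(args: Array[String]): Unit = {
    // Placeholder ZooKeeper quorum for the Stackable HBase cluster.
    val conn = DriverManager.getConnection("jdbc:phoenix:zookeeper-node:2181")
    val stmt = conn.createStatement()

    // Map the existing HBase table (row key plus column family "cf") as a Phoenix view.
    stmt.execute(
      """CREATE VIEW IF NOT EXISTS "demo"."measurements" (
        |  "id" VARCHAR PRIMARY KEY,
        |  "cf"."name"  VARCHAR,
        |  "cf"."value" VARCHAR
        |)""".stripMargin)

    // Sanity check that the bulk-loaded rows are visible through SQL.
    val rs = stmt.executeQuery("SELECT COUNT(*) FROM \"demo\".\"measurements\"")
    while (rs.next()) println(s"row count: ${rs.getLong(1)}")
    conn.close()
  }
}
```

Superset would then query the same view, usually through the Phoenix Query Server and its SQLAlchemy dialect rather than this thick JDBC driver.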
## Learning Points and Challenges
- Where do DistCp and HBase bulk load run, given there is no YARN cluster?
- Are these jobs scalable?
- Can we build near-real-time dashboards in Grafana and see instant updates?
- Stress testing
- Test HBase region management - can we watch this in real time as part of a demo?