## Description
Loading data into HBase is not trivial. We want the demo to show how this can be done and to provide some guidance and best practices.
## Aims
- Load data row by row (NiFi); a minimal row-by-row sketch follows this list
- Batch processing CSV files (MapReduce)
- Direct load of HFiles
- Test HBase Spark connector (New demo: ingest data to hbase stackablectl#71)
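
In the demo itself the row-by-row path is a NiFi flow, which is configured in the NiFi UI rather than in code. As a rough stand-in for what each record write amounts to, here is a minimal Python sketch using happybase against the HBase Thrift server; the hostname, table name and column family are assumptions, not demo fixtures.

```python
# Illustrative row-by-row writes via the HBase Thrift server (happybase).
# Hostname, table name and schema below are placeholders.
import csv

import happybase

connection = happybase.Connection("hbase-thrift.default.svc.cluster.local", port=9090)
table = connection.table("demo_table")

with open("rows.csv", newline="") as f:
    for record in csv.DictReader(f):
        # One Put per record: row key plus a couple of columns in family "cf".
        table.put(
            record["id"].encode("utf-8"),
            {
                b"cf:timestamp": record["timestamp"].encode("utf-8"),
                b"cf:value": record["value"].encode("utf-8"),
            },
        )

connection.close()
```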
## Tasks
- Load data into HDFS from S3
- Parse CSV and create HFiles
- Load incremental HFiles into HBase (these three steps are sketched together in the bulk-load example below the list)
- Load a streaming data source into HBase
- Stackable cluster configuration
- Verify the data is there (sanity check) using the HBase shell (see the scan snippet below the list)
- Create a Phoenix view over the table (see the Phoenix sketch below the list)
- Configure Phoenix as a data source in Superset
- Create a visualisation using Phoenix JDBC and Superset
- Query HBase using the Spark HBase connector (see the PySpark sketch below the list)
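
The first three tasks map onto three standard tools: DistCp for the S3-to-HDFS copy, ImportTsv with a bulk-output directory for HFile generation, and completebulkload (LoadIncrementalHFiles) for handing the HFiles to the region servers. The sketch below simply wraps those CLIs from Python; the bucket, paths, table name and column mapping are assumptions, the target table is assumed to exist, and the hadoop/hbase binaries must be on the PATH of wherever this runs.

```python
# Hypothetical driver for the bulk-load path; all bucket, path and table
# names are placeholders. Assumes the hadoop and hbase CLIs are available.
import subprocess


def run(cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)


# 1. Copy the raw CSV files from S3 into HDFS with DistCp.
run([
    "hadoop", "distcp",
    "s3a://demo-bucket/raw/",            # assumed source bucket/prefix
    "hdfs:///data/raw/",
])

# 2. Parse the CSVs and write HFiles (no direct puts) with ImportTsv.
run([
    "hbase", "org.apache.hadoop.hbase.mapreduce.ImportTsv",
    "-Dimporttsv.separator=,",
    "-Dimporttsv.columns=HBASE_ROW_KEY,cf:timestamp,cf:value",  # assumed schema
    "-Dimporttsv.bulk.output=hdfs:///data/hfiles/",
    "demo_table",
    "hdfs:///data/raw/",
])

# 3. Hand the generated HFiles to the region servers (LoadIncrementalHFiles).
run([
    "hbase", "completebulkload",
    "hdfs:///data/hfiles/",
    "demo_table",
])
```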
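For the sanity check, a non-interactive scan through the HBase shell is enough; the table name is again a placeholder.

```python
# Non-interactive sanity check: scan a few rows through the HBase shell.
# "demo_table" is a placeholder.
import subprocess

commands = "scan 'demo_table', {LIMIT => 5}\ncount 'demo_table'\n"
subprocess.run(["hbase", "shell", "-n"], input=commands, text=True, check=True)
```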
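The Phoenix view over the existing table is a single DDL statement. Below is a sketch using the phoenixdb driver against the Phoenix Query Server; the host, table and column names are assumptions.

```python
# Hypothetical Phoenix view over an existing HBase table, created through the
# Phoenix Query Server with the phoenixdb driver. Names are placeholders.
import phoenixdb

conn = phoenixdb.connect("http://phoenix-queryserver:8765/", autocommit=True)
cursor = conn.cursor()

# Map the HBase row key and the "cf" column family onto SQL columns.
cursor.execute(
    """
    CREATE VIEW IF NOT EXISTS "demo_table" (
        "id"             VARCHAR PRIMARY KEY,
        "cf"."timestamp" VARCHAR,
        "cf"."value"     VARCHAR
    )
    """
)

# Quick read back through SQL as a second sanity check.
cursor.execute('SELECT COUNT(*) FROM "demo_table"')
print(cursor.fetchone())
conn.close()
```

For Superset, the same Query Server would then be registered as a database with a SQLAlchemy URI along the lines of `phoenix://phoenix-queryserver:8765/` (the phoenixdb package provides the dialect), after which the view can back charts like any other table.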
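For the last task, a PySpark read through the Apache hbase-connectors data source could look like the sketch below; the table name and column mapping are assumptions, and the connector jar plus hbase-site.xml need to be on the Spark job's classpath.

```python
# Hypothetical PySpark read through the hbase-connectors data source
# (org.apache.hadoop.hbase.spark). Table and column mapping are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hbase-connector-check").getOrCreate()

df = (
    spark.read.format("org.apache.hadoop.hbase.spark")
    .option("hbase.table", "demo_table")
    .option(
        "hbase.columns.mapping",
        "id STRING :key, timestamp STRING cf:timestamp, value STRING cf:value",
    )
    # Build the HBase connection from the DataFrame options rather than a
    # pre-created HBaseContext.
    .option("hbase.spark.use.hbasecontext", "false")
    .load()
)

df.show(5)
print(df.count())
```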
## Learning Points and Challenges
- Where do DistCp and the HBase bulk load run, given there is no YARN cluster?
- Are these jobs scalable?
- Can we get near-real-time dashboards in Grafana and see instant updates?
- Stress testing
- Test HBase region management: can we watch this in real time as part of a demo?