Find the latest version of each product on every run, and save the result as a snapshot.
Product data is stored under the data/homework folder.
Read data/homework/initial_data.json for the first run.
Read data/homework/cdc_data.json for the subsequent runs.
Save the results as JSON, Parquet, or another suitable format.
Note: You can use the SQL, DataFrame, or Dataset APIs, but a type-safe implementation is recommended.
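A minimal Scala/Spark sketch of the core step, assuming a hypothetical Product case class with id, name, and version fields (the real JSON schema may differ); it keeps only the newest version of each product id:

```scala
import org.apache.spark.sql.{Dataset, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Hypothetical record shape; the actual JSON schema may differ.
case class Product(id: Long, name: String, version: Long)

object Snapshot {
  // Keep only the highest-version row for each product id.
  def latestPerProduct(products: Dataset[Product]): Dataset[Product] = {
    import products.sparkSession.implicits._
    val byVersionDesc = Window.partitionBy($"id").orderBy($"version".desc)
    products
      .withColumn("rn", row_number().over(byVersionDesc))
      .where($"rn" === 1)
      .drop("rn")
      .as[Product]
  }
}
```

Reading the input and choosing between the initial and CDC data is shown in the run sketch further down.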
I assume that this job will run once a day, so there is only one JSON file in each partition_date folder.
If it needs to run more than once a day, we need to pick the latest JSON file in the partition_date folder, as sketched below.
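One way to pick the newest file is by modification time. A small sketch, assuming the CDC files land under partition_date=YYYY-MM-DD folders (the folder layout is an assumption):

```scala
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.SparkSession

// Assumed layout: .../partition_date=YYYY-MM-DD/<one or more>.json
// Returns the most recently modified JSON file in the folder, if any.
def latestJsonIn(spark: SparkSession, partitionDir: String): Option[String] = {
  val path = new Path(partitionDir)
  val fs = FileSystem.get(path.toUri, spark.sparkContext.hadoopConfiguration)
  if (!fs.exists(path)) None
  else fs.listStatus(path)
    .filter(s => s.isFile && s.getPath.getName.endsWith(".json"))
    .sortBy(_.getModificationTime)
    .lastOption
    .map(_.getPath.toString)
}
```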
First, look at the batch output folder. If there is any existing data, read the JSON as a Dataset and merge it with the new dataset.
If there is no existing data, read the initial data and merge that with the new dataset.
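A sketch of that branching logic, reusing the hypothetical Product case class and Snapshot.latestPerProduct from the first sketch; the paths and the per-run partition_date output folder are assumptions:

```scala
import java.time.LocalDate
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.{Dataset, SparkSession}

object BatchRunner {
  // Hypothetical paths; adjust to the real layout.
  val snapshotRoot = "data/homework/output/snapshot"
  val initialPath  = "data/homework/initial_data.json"
  val cdcPath      = "data/homework/cdc_data.json"

  def runBatch(spark: SparkSession): Unit = {
    import spark.implicits._

    val rootPath = new Path(snapshotRoot)
    val fs = FileSystem.get(rootPath.toUri, spark.sparkContext.hadoopConfiguration)

    // Latest existing snapshot folder, if any. Folders are named
    // partition_date=YYYY-MM-DD, so lexicographic order matches chronological order.
    val lastSnapshot: Option[String] =
      if (!fs.exists(rootPath)) None
      else fs.listStatus(rootPath)
        .filter(_.isDirectory)
        .map(_.getPath.toString)
        .sorted
        .lastOption

    // Base dataset: the previous snapshot when it exists, otherwise the initial data.
    val base: Dataset[Product] = lastSnapshot match {
      case Some(dir) => spark.read.json(dir).as[Product]
      case None      => spark.read.json(initialPath).as[Product]
    }

    val cdc = spark.read.json(cdcPath).as[Product]

    // Merge old and new records, keeping only the latest version of each product.
    val merged = Snapshot.latestPerProduct(base.unionByName(cdc))

    // Write this run's snapshot into its own partition_date folder so the
    // previous snapshot is never overwritten while it is still being read.
    merged.write.mode("overwrite").json(s"$snapshotRoot/partition_date=${LocalDate.now()}")
  }
}
```

Writing each run into a fresh partition_date folder is a deliberate choice: overwriting the same path that is being read lazily in the same job would corrupt the source before the write finishes.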