ahmetlekesiz/trendyol-bootcamp-spark

 
 

trendyol-bootcamp-spark

HOMEWORK DEFINITION

Find the latest version of each product in every run, and save it as a snapshot.

Product data is stored under the data/homework folder.
Read data/homework/initial_data.json for the first run.
Read data/homework/cdc_data.json for the next runs.

Save the results as JSON, Parquet, or a similar format.

Note: You can use the SQL, DataFrame, or Dataset APIs, but a type-safe implementation is recommended.
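To illustrate what a type-safe formulation of "latest version of each product" might look like, here is a minimal sketch in plain Scala collections, so it runs without Spark. The field names `id`, `name`, `price`, and `timestamp`, and the names `Product` and `SnapshotLogic`, are assumptions made for illustration; the real fields in data/homework may differ.

```scala
// Hypothetical record shape; adjust to the actual fields in the JSON files.
final case class Product(id: Long, name: String, price: Double, timestamp: Long)

object SnapshotLogic {
  // Keep the record with the larger timestamp; on a tie, the second argument wins.
  def newer(a: Product, b: Product): Product =
    if (b.timestamp >= a.timestamp) b else a

  // Combine the previous snapshot with new change records, one row per id.
  def latestPerProduct(snapshot: Seq[Product], changes: Seq[Product]): Seq[Product] =
    (snapshot ++ changes)
      .groupBy(_.id)          // all versions of the same product together
      .values
      .map(_.reduce(newer))   // collapse each group to its newest version
      .toSeq
      .sortBy(_.id)           // stable ordering for readable output
}
```

Because the comparison works on typed fields rather than column-name strings, a misspelled field is caught at compile time.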

HOMEWORK SOLUTION

ASSUMPTION

I assume that this job runs once a day. Under this assumption, there is only one JSON file in each partition_date folder.
If it needs to run more than once a day, we need to read the latest JSON file in the partition_date folder.
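If the job did run more than once a day, picking the latest JSON file could be sketched as below. This assumes a local filesystem and that "latest" means most recently modified; `LatestJson` and `latestJson` are hypothetical names, and on HDFS or object storage the Hadoop FileSystem API would be needed instead.

```scala
import java.io.File

object LatestJson {
  // Returns the most recently modified .json file directly under `dir`, if any.
  def latestJson(dir: File): Option[File] =
    Option(dir.listFiles())                 // null when dir is missing or not a directory
      .getOrElse(Array.empty[File])
      .filter(f => f.isFile && f.getName.endsWith(".json"))
      .sortBy(_.lastModified())             // oldest first
      .lastOption                           // newest, or None when there are no matches
}
```

If file names carry a sortable timestamp, sorting by name instead of modification time may be more reliable, since copies can reset modification times.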

SOLUTION APPROACH

First, look at the batch output folder. If it already contains data, read the JSON as a Dataset and merge it with the new dataset.
If there is no existing data, read the initial data and merge it with the new dataset.
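The approach above can be sketched with the typed Dataset API roughly as follows. This is an illustration under stated assumptions, not the repository's actual code: the `Product` fields, the three command-line arguments, the local master, and the local-filesystem existence check are all hypothetical.

```scala
import java.nio.file.{Files, Paths}
import org.apache.spark.sql.{Dataset, Encoders, SaveMode, SparkSession}

// Hypothetical schema; adjust to the actual fields in the JSON files.
final case class Product(id: Long, name: String, price: Double, timestamp: Long)

object SnapshotJob {
  // Union both datasets and keep the record with the highest timestamp per id.
  def merge(base: Dataset[Product], changes: Dataset[Product]): Dataset[Product] = {
    import base.sparkSession.implicits._
    base.union(changes)
      .groupByKey(_.id)
      .reduceGroups((a, b) => if (b.timestamp >= a.timestamp) b else a)
      .map(_._2) // drop the grouping key, keep the winning record
  }

  def main(args: Array[String]): Unit = {
    // Assumed arguments: previous snapshot folder, new data file, output folder.
    val Array(previousPath, changesPath, outputPath) = args
    val spark = SparkSession.builder()
      .appName("snapshot-merger")
      .master("local[*]") // assumption: local run
      .getOrCreate()
    import spark.implicits._

    // An explicit schema avoids depending on JSON type inference.
    val schema = Encoders.product[Product].schema
    def read(path: String): Dataset[Product] =
      spark.read.schema(schema).json(path).as[Product]

    // Use the previous snapshot when it exists, otherwise the initial data.
    val base =
      if (Files.exists(Paths.get(previousPath))) read(previousPath)
      else read("data/homework/initial_data.json")

    merge(base, read(changesPath))
      .write.mode(SaveMode.Overwrite)
      .json(outputPath)

    spark.stop()
  }
}
```

The sketch writes to a separate output folder rather than the folder it read from, because overwriting an input path while it is still being read is not safe in Spark.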

OUTPUT FOR INITIAL RUN

OUTPUT AFTER INITIAL RUN
