Find the latest version of each product on every run, and save the result as a snapshot.
Product data is stored under the data/homework folder.
Read data/homework/initial_data.json for the first run.
Read data/homework/cdc_data.json for the subsequent runs.
Save the results as JSON, Parquet, or another suitable format.
Note: You can use the SQL, DataFrame, or Dataset APIs, but a type-safe implementation is recommended.
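A minimal Scala/Spark sketch of the core step, assuming a hypothetical Product case class with id, name, and version fields (the real JSON schema may differ); it keeps only the newest version of each product id:

```scala
import org.apache.spark.sql.{Dataset, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Hypothetical record shape; the actual JSON schema may differ.
case class Product(id: Long, name: String, version: Long)

object Snapshot {
  // Keep only the highest-version row for each product id.
  def latestPerProduct(products: Dataset[Product]): Dataset[Product] = {
    import products.sparkSession.implicits._
    val byVersionDesc = Window.partitionBy($"id").orderBy($"version".desc)
    products
      .withColumn("rn", row_number().over(byVersionDesc))
      .where($"rn" === 1)
      .drop("rn")
      .as[Product]
  }
}
```

Reading the input and choosing between the initial and CDC data is shown in the run sketch further down.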
I assume that this job will run once a day, so there is only one JSON file in each partition_date folder.
If it needs to run more than once a day, we need to pick the latest JSON file in the partition_date folder, as sketched below.
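One way to pick the newest file is by modification time. A small sketch, assuming the CDC files land under partition_date=YYYY-MM-DD folders (the folder layout is an assumption):

```scala
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.SparkSession

// Assumed layout: .../partition_date=YYYY-MM-DD/<one or more>.json
// Returns the most recently modified JSON file in the folder, if any.
def latestJsonIn(spark: SparkSession, partitionDir: String): Option[String] = {
  val path = new Path(partitionDir)
  val fs = FileSystem.get(path.toUri, spark.sparkContext.hadoopConfiguration)
  if (!fs.exists(path)) None
  else fs.listStatus(path)
    .filter(s => s.isFile && s.getPath.getName.endsWith(".json"))
    .sortBy(_.getModificationTime)
    .lastOption
    .map(_.getPath.toString)
}
```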
First, look at the batch output folder. If there is any existing data, read the JSON as a Dataset and merge it with the new dataset.
If there is no existing data, read the initial data and merge that with the new dataset.
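A sketch of that branching logic, reusing the hypothetical Product case class and Snapshot.latestPerProduct from the first sketch; the paths and the per-run partition_date output folder are assumptions:

```scala
import java.time.LocalDate
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.{Dataset, SparkSession}

object BatchRunner {
  // Hypothetical paths; adjust to the real layout.
  val snapshotRoot = "data/homework/output/snapshot"
  val initialPath  = "data/homework/initial_data.json"
  val cdcPath      = "data/homework/cdc_data.json"

  def runBatch(spark: SparkSession): Unit = {
    import spark.implicits._

    val rootPath = new Path(snapshotRoot)
    val fs = FileSystem.get(rootPath.toUri, spark.sparkContext.hadoopConfiguration)

    // Latest existing snapshot folder, if any. Folders are named
    // partition_date=YYYY-MM-DD, so lexicographic order matches chronological order.
    val lastSnapshot: Option[String] =
      if (!fs.exists(rootPath)) None
      else fs.listStatus(rootPath)
        .filter(_.isDirectory)
        .map(_.getPath.toString)
        .sorted
        .lastOption

    // Base dataset: the previous snapshot when it exists, otherwise the initial data.
    val base: Dataset[Product] = lastSnapshot match {
      case Some(dir) => spark.read.json(dir).as[Product]
      case None      => spark.read.json(initialPath).as[Product]
    }

    val cdc = spark.read.json(cdcPath).as[Product]

    // Merge old and new records, keeping only the latest version of each product.
    val merged = Snapshot.latestPerProduct(base.unionByName(cdc))

    // Write this run's snapshot into its own partition_date folder so the
    // previous snapshot is never overwritten while it is still being read.
    merged.write.mode("overwrite").json(s"$snapshotRoot/partition_date=${LocalDate.now()}")
  }
}
```

Writing each run into a fresh partition_date folder is a deliberate choice: overwriting the same path that is being read lazily in the same job would corrupt the source before the write finishes.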