
segment file size much larger: 4 GB instead of 900 MB #116

Open
Gauravshah opened this issue Apr 1, 2018 · 3 comments
@Gauravshah

Trying to use this as a library from the 0.10.6 commit for a Spark Druid re-processor, using DataFrames and the segment pusher to make it an independent process. For some reason, when I use the default map-reduce task on 1 day's worth of data at "fifteen_minute" granularity I get a 700 MB file, but if I use this library I get a 4 GB file for the same spec. Is there some compression or configuration I am missing?

Verified the dimensions & metrics are the same.
Verified it's the same data that I am processing.
Verified that the data is actually at fifteen_minute granularity by looking at the footer of the smoosh file.

My guess is that it is missing dimension compression, but I am unsure how to confirm that by looking at the smoosh file.

Using Druid 0.10.1.

@Gauravshah
Author

@Igosuki the issue is because of the sorting: since the rows are unsorted, aggregations on grouping result in a higher number of rows. We should sort within partitions so that similar rows land in the same group and the aggregations roll up properly. Will open a pull request soon.
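The effect described above can be illustrated with a small self-contained sketch (plain Python, no Spark or Druid involved; `rollup` is a hypothetical stand-in for a segment writer, deliberately simplified so that it only merges adjacent rows sharing the same key): interleaved input defeats rollup entirely, while input sorted within the partition rolls up fully.

```python
from itertools import groupby

def rollup(rows):
    # Hypothetical stand-in for a segment writer that only merges
    # *adjacent* rows sharing the same (time bucket, dimensions) key.
    return [(key, sum(metric for _, metric in group))
            for key, group in groupby(rows, key=lambda r: r[0])]

# Three passes over one day of fifteen_minute buckets (96 per day) with
# two dimension values -- the repeats are interleaved, as unsorted rows
# in a Spark partition would be.
unsorted = [((bucket, dim), 1)
            for _ in range(3)
            for bucket in range(96)
            for dim in ("a", "b")]
sorted_rows = sorted(unsorted, key=lambda r: r[0])

print(len(rollup(unsorted)))     # 576 -- no adjacent duplicates, no rollup
print(len(rollup(sorted_rows)))  # 192 -- one output row per distinct key
```

The real writer rolls up within an in-memory buffer rather than only across adjacent rows, but the same inflation appears once rows for the same key land in different flushes; the sketch only shows why unsorted partitions produce more (and therefore larger) output, which is what sorting within partitions is meant to fix.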

@Igosuki

Igosuki commented May 28, 2018

@Gauravshah cool, I also noticed this, which makes it pretty unusable

@Gauravshah
Author

I tried doing sortWithinPartitions but it does not scale well: with one segment's data in one executor, the sorting phase is very slow. I don't know how to move forward.
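One way to read the scaling problem: with a whole segment's rows in one executor, sortWithinPartitions sorts the entire segment in a single task. A hedged sketch of a possible direction (an assumption on my part, not something the thread settles): range-partition on the time bucket first so each task sorts only its slice, since concatenating the range-sorted slices is still globally sorted by time. In Spark terms that would be a range repartition on the time column followed by the per-partition sort; the sketch below models the idea in plain Python.

```python
import random

# One segment's rows, shuffled: ((time_bucket, dimension), metric).
rows = [((bucket, dim), 1)
        for bucket in range(96)       # fifteen_minute buckets in a day
        for dim in ("a", "b")
        for _ in range(3)]
random.shuffle(rows)

# Range-partition into 4 slices of 24 buckets each, then sort every
# slice independently -- each "executor" sorts N/4 rows instead of N.
slices = [[] for _ in range(4)]
for row in rows:
    slices[row[0][0] // 24].append(row)
for part in slices:
    part.sort(key=lambda r: r[0])

# The concatenation of the sorted slices is globally sorted by time.
merged = [row for part in slices for row in part]
print(merged == sorted(rows, key=lambda r: r[0]))  # True
```

Whether the extra shuffle pays for itself on real data is exactly the open question here, since a segment's rows would then be written from several partitions instead of one.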
