-
Notifications
You must be signed in to change notification settings - Fork 1.1k
Add support for custom DateTime patterns for WindowedFilenamePolicy class #159
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add support for custom DateTime patterns for WindowedFilenamePolicy class #159
Conversation
sabhyankar
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Take a look at the review comments. My suggestion would be to look at a builder/fluent pattern so as to not make the constructor look unweildy.
You could look at an example of the pattern here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can this never be null?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was setting an empty String as the default value for customDateTimePattern. Added a check now to avoid the NPE.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You would get an NPI if customDatetimePattern is a null.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think what we need is a way for the user to override what the pattern would be for each of the 5 components (year, month, day, hour and minute).
For example, a user may specify that the replacement pattern for minute should be mx instead of the default mm.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What should the input format for this option be? A comma separated String of the overridden values? Something like mm=mx,HH=hx...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The fluent builder pattern can have methods that allow us to override the patterns for each of these. So something like this:
WindowedFilenamePolicy policy = WindowedFilenamePolicy.of(<existing_input_params>);
...
// Override minute pattern from default "mm" to "xx"
policy.setMinutePattern("xx");
// Override day pattern from default "DD" to "ZZ"
policy.setDayPattern("ZZ");
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So does this mean each template that uses the WindowedFilenamePolicy class would have to have a separate input parameter for each Date component? Like:
WindowedFilenamePolicy.newBuilder()
.withOutputDirectory(options.getOutputDirectory())
.withOutputFilenamePrefix(options.getOutputFilenamePrefix())
.withShardTemplate(options.getOutputShardTemplate())
.withSuffix(options.getOutputFilenameSuffix())
.withYearPattern(options.getYearPattern())
.withMonthPattern(options.getMonthPattern()) // and so on
.build();
Currently, I've changed the class to use the Builder pattern as follows:
(assuming we're taking all the patterns to be overridden in one input parameter --customDateTimePattern from the user)
WindowedFilenamePolicy.newBuilder()
.withOutputDirectory(options.getOutputDirectory())
.withOutputFilenamePrefix(options.getOutputFilenamePrefix())
.withShardTemplate(options.getOutputShardTemplate())
.withSuffix(options.getOutputFilenameSuffix())
.withPattern(options.getCustomDateTimePattern())
.build();
What should the format for options.getCustomDateTimePattern() be?
src/main/java/com/google/cloud/teleport/io/WindowedFilenamePolicy.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need a default template given that we didnt have one earlier?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I got these from the default values set in the Pubsub to Text template. I'll take them out.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here - Why would this be needed now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A newBuilder() should return a Builder and not the actual annotated class.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll change this to writeWindowedFiles() since we're not returning a Builder.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I prefer we call this (and below) yearPattern() so as to make it less ambiguous.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we have both a toBuilder() and a newBuilder()?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
toBuilder() would be called internally after a withX() method is called.
Changing newBuilder() to writeWindowedFiles()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you think it would be useful to log this at a debug level perhaps? At most we have a log every minute? See how this is done using MoreObjects.firstNotNull
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup, I'll add a few log statements for this
ced606c to
6ea64b7
Compare
|
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. |
|
Hi @sabhyankar , Would be great to get your feedback on this. |
|
Hi @melbrodrigues - I will take a look at this. |
|
I'm not sure how this change resolves the original issue #155. The problem was that, for example, "mm" in "recommendation" in the directory name was getting replaced with minutes. Now, with this change, if I want to have the directory as something like I suggest implementing a slightly different fix:
I think this would be better because this approach is more flexible (what if I don't want minutes in my directory name at all?) and there is no need to set 5 different parameters (and if you do, you may run into many other issues, e.g. what if |
|
I think the intention was to give users the ability to override default patterns and avoid unintentional changes to their file paths. For the And if the user wants to include patterns in the directory name they can still do so with specific overrides. For example, an override of Note that the user does not need to pass all parameter overrides and the overrides must be a valid pattern as documented in DateTimeFormatter. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| LOG.error("Invalid DateTime pattern: {}", e.getMessage()); | |
| throw new RuntimeException(e); | |
| throw new RuntimeException("Invalid DateTime pattern", e); |
Usually logging and re-throwing the same exception is considered an anti-pattern. Any strong reason to do it here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not really. I saw it at a few other places in the repo back then. I can take out the log statements.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should I remove the try-catch block as well? The catch block was just to log the error message and throw a new RuntimeException. If I remove the log statement all we're doing is catching an IllegalArgumentException and throwing a RuntimeException.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd keep it because IllegalArgumentException doesn't make it clear for the user what pattern is incorrect (e.g. if an empty pattern is supplied it just says "Invalid pattern specification"). Here the RuntimeException at least says that the error was caused by a DateTime pattern.
I'd go even further and include even more details in the error message, e.g. what pattern is incorrect (add a helper method to avoid try-catch around each of the 5 patterns). But it's up to you if you want to make such a change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was wondering if it's a good idea to throw a RuntimeException... Since this resolution happens after we read from Pub/Sub, users may lose some data. Maybe we should use the default values if the pattern provided is invalid? WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd still throw the Exception. If a user set a pattern override, this means the default pattern doesn't work for them for some reason. If the override was set incorrectly, simply ignoring this and falling back to the fault pattern may cause unwanted side effects (e.g. output written to a wrong folder, potentially overwriting existing data). IMO it's better to throw an exception and not process anything in this case.
As for the data loss, it shouldn't happen. Dataflow guarantees "exactly once" processing semantics (some info about it). In case of PubSub, messages read from a topic are not going to be ACKed if there is a failure in the middle of the pipeline and PubSub is going to re-deliver them on the next processing attempt.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perfect. I'll leave the RuntimeException there then
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suggest moving all these options (+ the old output* options like outputShardTemplate) to a separate class (e.g. WindowedFilenamePolicyOptions), similar to how StreamingOptions are extended here, to avoid duplication in PubsubToAvro and PubsubToText templates. I'd put it in com.google.cloud.teleport.options package.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense. Will move these to a new WindowedFilenamePolicyOptions interface
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add a test to verify that when a custom pattern is set the default pattern is not being replaced.
E.g. if the directory is something like ".../recommendations/mmmm" and the minutePattern is mmmm, then the resulting directory is ".../recommendations/0047" and not ".../reco47endations/0047".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah yes, I had added a test for this after our discussion yesterday. Will push it along with the other changes.
OK, got it, thanks. I think this is good enough. I left a few other comments, should be the last review round :) |
Taking over this review from sabhyankar.
| @@ -0,0 +1,64 @@ | |||
| package com.google.cloud.teleport.options; | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This file seems to be the cause of the import/copybara failure preventing merging. You'll need to add a license header similar to the other files. You could copy/paste from another file, though please update the year to 2021.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, I'll add that.
mvn clean install seems to complain about a missing package-info file for the new options package, so I'll add that as well.
Allows WindowedFilenamePolicy class to use a custom DateTime pattern for the output directory.
Resolves #155