Skip to content

Conversation

@melbrodrigues
Copy link
Contributor

Allows WindowedFilenamePolicy class to use a custom DateTime pattern for the output directory.

Resolves #155

@google-cla google-cla bot added the cla: yes The PR submitter has a CLA label Aug 13, 2020
Copy link
Member

@sabhyankar sabhyankar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Take a look at the review comments. My suggestion would be to look at a builder/fluent pattern so as to not make the constructor look unweildy.

You could look at an example of the pattern here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this never be null?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was setting an empty String as the default value for customDateTimePattern. Added a check now to avoid the NPE.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You would get an NPI if customDatetimePattern is a null.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think what we need is a way for the user to override what the pattern would be for each of the 5 components (year, month, day, hour and minute).

For example, a user may specify that the replacement pattern for minute should be mx instead of the default mm.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What should the input format for this option be? A comma separated String of the overridden values? Something like mm=mx,HH=hx...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The fluent builder pattern can have methods that allow us to override the patterns for each of these. So something like this:

WindowedFilenamePolicy policy = WindowedFilenamePolicy.of(<existing_input_params>);
...
// Override minute pattern from default "mm" to "xx"
policy.setMinutePattern("xx");

// Override day pattern from default "DD" to "ZZ"
policy.setDayPattern("ZZ");

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So does this mean each template that uses the WindowedFilenamePolicy class would have to have a separate input parameter for each Date component? Like:

WindowedFilenamePolicy.newBuilder()
                      .withOutputDirectory(options.getOutputDirectory())
                      .withOutputFilenamePrefix(options.getOutputFilenamePrefix())
                      .withShardTemplate(options.getOutputShardTemplate())
                      .withSuffix(options.getOutputFilenameSuffix())
                      .withYearPattern(options.getYearPattern())
                      .withMonthPattern(options.getMonthPattern()) // and so on
                      .build();


Currently, I've changed the class to use the Builder pattern as follows:
(assuming we're taking all the patterns to be overridden in one input parameter --customDateTimePattern from the user)

WindowedFilenamePolicy.newBuilder()
                      .withOutputDirectory(options.getOutputDirectory())
                      .withOutputFilenamePrefix(options.getOutputFilenamePrefix())
                      .withShardTemplate(options.getOutputShardTemplate())
                      .withSuffix(options.getOutputFilenameSuffix())
                      .withPattern(options.getCustomDateTimePattern())
                      .build();

What should the format for options.getCustomDateTimePattern() be?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need a default template given that we didnt have one earlier?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I got these from the default values set in the Pubsub to Text template. I'll take them out.

Comment on lines 54 to 57
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here - Why would this be needed now?

Comment on lines 59 to 53
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A newBuilder() should return a Builder and not the actual annotated class.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll change this to writeWindowedFiles() since we're not returning a Builder.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer we call this (and below) yearPattern() so as to make it less ambiguous.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we have both a toBuilder() and a newBuilder()?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

toBuilder() would be called internally after a withX() method is called.

Changing newBuilder() to writeWindowedFiles()

Comment on lines 401 to 396
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think it would be useful to log this at a debug level perhaps? At most we have a log every minute? See how this is done using MoreObjects.firstNotNull

https://github.com/apache/beam/blob/aa345dbeb9f6bdb36f5acf0fb8958777253acd51/sdks/java/io/splunk/src/main/java/org/apache/beam/sdk/io/splunk/SplunkEventWriter.java#L119-L129

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, I'll add a few log statements for this

@stale
Copy link

stale bot commented Oct 17, 2020

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the obsolete The PR hasn't had activity in 45 days label Oct 17, 2020
@melbrodrigues
Copy link
Contributor Author

Hi @sabhyankar , Would be great to get your feedback on this.

@sabhyankar
Copy link
Member

Hi @melbrodrigues - I will take a look at this.

@stale stale bot closed this Oct 31, 2020
@an2x an2x self-requested a review June 21, 2021 16:11
@an2x an2x reopened this Jun 21, 2021
@stale stale bot removed the obsolete The PR hasn't had activity in 45 days label Jun 21, 2021
@an2x
Copy link
Member

an2x commented Jun 21, 2021

I'm not sure how this change resolves the original issue #155. The problem was that, for example, "mm" in "recommendation" in the directory name was getting replaced with minutes.

Now, with this change, if I want to have the directory as something like gs://bucket/recommendations/<minute> what do I do?
If I set minutePattern to m or mm, it's still going to replace all the letters m in "recommendations".
If I set minutePattern to something like %mm and the directory to gs://bucket/recommendations/%mm, it still won't work properly: the result will be gs://bucket/recommendations/%12 (% won't be replaced).

I suggest implementing a slightly different fix:

  1. Introduce just one new parameter, let's call it directoryDateTimePattern, example value that the user would set: YYYY/MM/DD/HH-mm.
  2. In the directory name, replace ${dateTime} placeholder with the datetime formatted using directoryDateTimePattern. If the placeholder doesn't exist, fallback to the old logic (replace YYYY, MM, DD, etc. separately), for backwards compatibility.
  3. The directory specified by the user would look like gs://bucket/recommendations/${dateTime}.

I think this would be better because this approach is more flexible (what if I don't want minutes in my directory name at all?) and there is no need to set 5 different parameters (and if you do, you may run into many other issues, e.g. what if minutePattern is overridden but hourPattern isn't - is this a misconfiguration?).

@melbrodrigues
Copy link
Contributor Author

melbrodrigues commented Jun 22, 2021

I think the intention was to give users the ability to override default patterns and avoid unintentional changes to their file paths.

For the gs://bucket-name/recommendations/ example, the user need only pass one additional parameter --minutePattern (which can be set to say mmmm) to override the default pattern mm.

And if the user wants to include patterns in the directory name they can still do so with specific overrides. For example, an override of --minutePattern=mmmm would result in the path gs://bucket-name/recommendations/mmmm being transformed as gs://bucket-name/recommendations/<minute>.

Note that the user does not need to pass all parameter overrides and the overrides must be a valid pattern as documented in DateTimeFormatter.

Comment on lines 432 to 433
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
LOG.error("Invalid DateTime pattern: {}", e.getMessage());
throw new RuntimeException(e);
throw new RuntimeException("Invalid DateTime pattern", e);

Usually logging and re-throwing the same exception is considered an anti-pattern. Any strong reason to do it here?

Copy link
Contributor Author

@melbrodrigues melbrodrigues Jun 23, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not really. I saw it at a few other places in the repo back then. I can take out the log statements.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should I remove the try-catch block as well? The catch block was just to log the error message and throw a new RuntimeException. If I remove the log statement all we're doing is catching an IllegalArgumentException and throwing a RuntimeException.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd keep it because IllegalArgumentException doesn't make it clear for the user what pattern is incorrect (e.g. if an empty pattern is supplied it just says "Invalid pattern specification"). Here the RuntimeException at least says that the error was caused by a DateTime pattern.

I'd go even further and include even more details in the error message, e.g. what pattern is incorrect (add a helper method to avoid try-catch around each of the 5 patterns). But it's up to you if you want to make such a change.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was wondering if it's a good idea to throw a RuntimeException... Since this resolution happens after we read from Pub/Sub, users may lose some data. Maybe we should use the default values if the pattern provided is invalid? WDYT?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd still throw the Exception. If a user set a pattern override, this means the default pattern doesn't work for them for some reason. If the override was set incorrectly, simply ignoring this and falling back to the fault pattern may cause unwanted side effects (e.g. output written to a wrong folder, potentially overwriting existing data). IMO it's better to throw an exception and not process anything in this case.

As for the data loss, it shouldn't happen. Dataflow guarantees "exactly once" processing semantics (some info about it). In case of PubSub, messages read from a topic are not going to be ACKed if there is a failure in the middle of the pipeline and PubSub is going to re-deliver them on the next processing attempt.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perfect. I'll leave the RuntimeException there then

Comment on lines 122 to 146
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest moving all these options (+ the old output* options like outputShardTemplate) to a separate class (e.g. WindowedFilenamePolicyOptions), similar to how StreamingOptions are extended here, to avoid duplication in PubsubToAvro and PubsubToText templates. I'd put it in com.google.cloud.teleport.options package.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense. Will move these to a new WindowedFilenamePolicyOptions interface

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add a test to verify that when a custom pattern is set the default pattern is not being replaced.
E.g. if the directory is something like ".../recommendations/mmmm" and the minutePattern is mmmm, then the resulting directory is ".../recommendations/0047" and not ".../reco47endations/0047".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yes, I had added a test for this after our discussion yesterday. Will push it along with the other changes.

@an2x
Copy link
Member

an2x commented Jun 22, 2021

And if the user wants to include patterns in the directory name they can still do so with specific overrides. For example, an override of --minutePattern=mmmm would result in the path gs://bucket-name/recommendations/mmmm being transformed as gs://bucket-name/recommendations/<minute>.

OK, got it, thanks. I think this is good enough. I left a few other comments, should be the last review round :)

@an2x an2x dismissed sabhyankar’s stale review July 9, 2021 16:39

Taking over this review from sabhyankar.

@an2x an2x added the Google LGTM Approval of a pull request to be merged into the repository label Jul 9, 2021
@an2x an2x closed this Jul 13, 2021
@an2x an2x reopened this Jul 13, 2021
@prathapreddy123 prathapreddy123 added Google LGTM Approval of a pull request to be merged into the repository and removed Google LGTM Approval of a pull request to be merged into the repository labels Jul 13, 2021
@@ -0,0 +1,64 @@
package com.google.cloud.teleport.options;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This file seems to be the cause of the import/copybara failure preventing merging. You'll need to add a license header similar to the other files. You could copy/paste from another file, though please update the year to 2021.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I'll add that.

mvn clean install seems to complain about a missing package-info file for the new options package, so I'll add that as well.

@ryanmcdowell ryanmcdowell merged commit 297ad21 into GoogleCloudPlatform:master Jul 15, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cla: yes The PR submitter has a CLA Google LGTM Approval of a pull request to be merged into the repository

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Cloud_PubSub_to_GCS_Text replaces mm in path

6 participants