Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Example implementation of a GeneratorWrapperDoFn #1

Open
wants to merge 3 commits into
base: python-sdk
Choose a base branch
from

Conversation

elibixby
Copy link
Owner

@elibixby elibixby commented Jan 18, 2017

This PR serves as a reference implementation for a yet to be written design doc allowing the specification of DoFns as generators.

The use of re.compile in wordcount_generator.py highlights the easy start_bundle and finish_bundle capabilities implicit in a generators definition, without requiring bulky class definitions.

To highlight the difference in syntax, run:

git diff sdks/python/apache_beam/examples/wordcount_generator.py sdks/python/apache_beam/examples/wordcount.py

@elibixby elibixby changed the title Example implementation of a GeneratorDo Example implementation of a GeneratorWrapperDoFn Jan 18, 2017
@robertwb
Copy link

Thanks for the suggestion. We've actually considered syntax like this. One concern is that it would be a bit confusing, as typically one yields elements themselves, not sets of elements that get flattened.

An alternative we've toyed with would be to take a callable with an element_iterable rather than an element as input (annotated new-do-fn style https://s.apache.org/a-new-dofn) where yield would behave as normal (yielding elements to the subsequent PCollection) but the initialize/finalize could be built into a single function call.

@elibixby
Copy link
Owner Author

@robertwb How does this yield a set of elements that get flattened? It yields one element at a time (where in this case the element is a list of words, but that's parity with the non-generator example) and a new generator is created for each bundle?

@elibixby
Copy link
Owner Author

RE: the element-iterable approach, I actually think that's preferable. My first implementation of this was more along these lines (write a generator that takes a single "elements" iterable)

def wordcount_generator(element_gen):
   r = re.compile(...)
   yield from r.findall(element) for element in element_gen

The syntax is so much cleaner. However, your team didn't like it because they were worried people would accidentally read in the entire bundle and then process it, which is obviously bad practice. Still to me this reeks of protecting language users from their own languages power features, which Beam is currently ignoring, even though they are eminently suited to unbounded stream processing. @jonparrott for thoughts.

@robertwb
Copy link

ParDo.process(...) returns an iterable of words to be added to the PCollection. The output of the split operation consists of a PCollection of words. (We do provide beam.Map(...) to do a 1:1 mapping.) An iterable of iterables is conceptually harder to deal with.

I'm personally a fan of exposing power features like this, as long as there's a bit of "opt-in"-ness so they're less likely to cut beginners.

@elibixby
Copy link
Owner Author

ParDo.process(...) returns an iterable of words to be added to the PCollection

Right, I think there's still an understanding of processing in bundles that needs to happen between the dev and the framework. I don't think we should be trying to hide this notion as it leads to too many antipatterns (see: the thread we just came from). But a generator is a much cleaner abstraction for processing these bundles than the DoFn class, as it allows direct usage of objects instantiated within what would previously be the start_bundle or finish_bundle closure. Contexts managers, HTTP clients, inter-bundle throttlers and aggregators, file handles, etc etc, all become usable in a much more idiomatic way.

I'm personally a fan of exposing power features like this, as long as there's a bit of "opt-in"-ness so they're less likely to cut beginners.

I probably wouldn't have abandoned this if everyone on the email thread responded as enthusiastically as you. Hard to work up a full proposal when it feels like it's going to get shot down ;-)

@robertwb
Copy link

robertwb commented Mar 25, 2017 via email

@elibixby
Copy link
Owner Author

elibixby commented Mar 25, 2017

@robertwb implement which? I'm working up a little implementation right now...

EDIT: implemented in #2

@robertwb
Copy link

robertwb commented Mar 25, 2017 via email

@elibixby
Copy link
Owner Author

elibixby commented Nov 1, 2017

@holdenk Who was interested in this

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants