Example implementation of a GeneratorWrapperDoFn #1
base: python-sdk
Thanks for the suggestion. We've actually considered syntax like this. One concern is that it would be a bit confusing, as typically one yields elements themselves, not sets of elements that get flattened. An alternative we've toyed with would be to take a callable with an element_iterable rather than an element as input (annotated new-do-fn style https://s.apache.org/a-new-dofn) where yield would behave as normal (yielding elements to the subsequent PCollection) but the initialize/finalize could be built into a single function call. |
@robertwb How does this yield a set of elements that get flattened? It yields one element at a time (where in this case the element is a list of words, but that's parity with the non-generator example) and a new generator is created for each bundle? |
RE: the element-iterable approach, I actually think that's preferable. My first implementation of this was more along these lines (write a generator that takes a single "elements" iterable)
The syntax is so much cleaner. However, your team didn't like it because they were worried people would accidentally read in the entire bundle and then process it, which is obviously bad practice. Still, to me this reeks of protecting language users from their own language's power features, which Beam is currently ignoring even though they are eminently suited to unbounded stream processing. @jonparrott for thoughts. |
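For concreteness, a generator over a single elements iterable might look like the following plain-Python sketch. It uses no Beam APIs; process_bundle and fake_client are hypothetical names, with fake_client standing in for any per-bundle resource. Setup and teardown sit around the loop, and elements are consumed lazily rather than read in all at once.

```python
from contextlib import contextmanager

events = []  # records lifecycle events so their ordering is visible


@contextmanager
def fake_client():
    """Hypothetical stand-in for an HTTP client or file handle."""
    events.append("open")       # roughly what start_bundle would do
    try:
        yield str.upper         # the "client" just upper-cases strings
    finally:
        events.append("close")  # roughly what finish_bundle would do


def process_bundle(elements_iter):
    """Generator over one bundle; setup and teardown wrap the loop."""
    with fake_client() as client:
        for element in elements_iter:
            # One element is pulled, processed, and yielded at a time.
            yield client(element)


results = list(process_bundle(iter(["a", "b"])))
```

Nothing here stops a user from calling list(elements_iter) first, which is the concern raised above; the sketch only shows the lazy form.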
I'm personally a fan of exposing power features like this, as long as there's a bit of "opt-in"-ness so they're less likely to cut beginners. |
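Right, I think there's still an understanding of processing in bundles that needs to happen between the dev and the framework. I don't think we should be trying to hide this notion as it leads to too many antipatterns (see: the thread we just came from). But a generator is a much cleaner abstraction for processing these bundles than the DoFn class, as it allows direct usage of objects instantiated within what would previously be the start_bundle or finish_bundle closure. Context managers, HTTP clients, inter-bundle throttlers and aggregators, file handles, etc., all become usable in a much more idiomatic way.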
I probably wouldn't have abandoned this if everyone on the email thread responded as enthusiastically as you. Hard to work up a full proposal when it feels like it's going to get shot down ;-) |
Unfortunately I'm somewhat at a loss as to how to implement this without
turning the whole stack from push based to pull based.
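To illustrate the push/pull tension, here is a hedged plain-Python sketch of one possible workaround: a push-based caller drives a generator through generator.send(), so the user code receives each element as the value of a yield expression. This is not the PR's GeneratorWrapperDoFn and uses no Beam APIs; PushDrivenGenerator and split_words are hypothetical names.

```python
class PushDrivenGenerator:
    """Hypothetical wrapper: lets a push-based caller drive a generator."""

    def __init__(self, gen_fn):
        self._gen_fn = gen_fn
        self._gen = None

    def start_bundle(self):
        self._gen = self._gen_fn()
        next(self._gen)  # run the setup code up to the first yield

    def process(self, element):
        # Push one element in; get back the outputs yielded for it.
        return self._gen.send(element)

    def finish_bundle(self):
        self._gen.close()  # raises GeneratorExit inside, running finally
        self._gen = None


log = []


def split_words():
    log.append("setup")
    try:
        outputs = None
        while True:
            element = yield outputs    # receive one pushed element
            outputs = element.split()  # zero or more outputs for it
    finally:
        log.append("teardown")


runner = PushDrivenGenerator(split_words)
runner.start_bundle()
pushed = [runner.process(line) for line in ["a b", "", "c"]]
runner.finish_bundle()
```

The cost of this inversion is visible in the user code: each yield hands back a list of outputs for one input rather than a single element, and the loop is written around `element = yield outputs` instead of a plain for loop over an iterable.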
…On Mar 24, 2017 5:55 PM, "Eli Bixby" ***@***.***> wrote:
ParDo.process(...) returns an iterable of words to be added to the
PCollection
Right, I think there's still an understanding of processing in bundles
that needs to happen between the dev and the framework. I don't think we
should be trying to hide this notion as it leads to too many antipatterns
(see: the thread we just came from). But a generator is a much cleaner
abstraction for processing these bundles than the DoFn class, as it
allows direct usage of objects instantiated within what would previously be
the start_bundle or finish_bundle closure. Context managers, HTTP
clients, inter-bundle throttlers and aggregators, file handles, etc etc,
all become usable in a much more idiomatic way.
|
I meant implement
def process(elements_iter):
    ...
    for element in elements_iter:
        # yield individual elements any number of times, including 0
        yield element
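A runnable plain-Python version of this sketch, with a hypothetical word-splitting body filled in for illustration, shows one input producing zero, one, or several outputs. The regular expression and the sample inputs are assumptions, not taken from the PR.

```python
import re


def process(elements_iter):
    # One-time setup for the whole iterable (the "..." in the sketch).
    word_re = re.compile(r"[A-Za-z']+")
    for element in elements_iter:
        # Yield individual elements any number of times, including 0.
        for word in word_re.findall(element):
            yield word


# The empty string contributes no outputs at all.
words = list(process(["to be", "", "or not"]))
```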
…On Fri, Mar 24, 2017 at 6:26 PM, Eli Bixby ***@***.***> wrote:
@robertwb <https://github.com/robertwb> implement which? I'm working up a
little implementation right now...
|
@holdenk Who was interested in this |
This PR serves as a reference implementation for a yet-to-be-written design doc allowing the specification of DoFns as generators.
The use of re.compile in wordcount_generator.py highlights the easy start_bundle and finish_bundle capabilities implicit in a generator's definition, without requiring bulky class definitions.
To highlight the difference in syntax, run: