Simple pipe reader for hdfs or other service #5282
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -14,13 +14,16 @@ | |
|
||
__all__ = [ | ||
'map_readers', 'buffered', 'compose', 'chain', 'shuffle', | ||
'ComposeNotAligned', 'firstn', 'xmap_readers' | ||
'ComposeNotAligned', 'firstn', 'xmap_readers', 'pipe_reader' | ||
] | ||
|
||
from threading import Thread | ||
import subprocess | ||
|
||
from Queue import Queue | ||
import itertools | ||
import random | ||
from Queue import Queue | ||
from threading import Thread | ||
import zlib | ||
|
||
|
||
def map_readers(func, *readers): | ||
|
@@ -323,3 +326,101 @@ def xreader(): | |
yield sample | ||
|
||
return xreader | ||
|
||
|
||
def _buf2lines(buf, line_break="\n"): | ||
# FIXME: line_break should be automatically configured. | ||
lines = buf.split(line_break) | ||
return lines[:-1], lines[-1] | ||
|
||
|
||
def pipe_reader(left_cmd, | ||
parser, | ||
bufsize=8192, | ||
file_type="plain", | ||
Maybe we just need to support "plain"; the user can decompress the data outside of Paddle using a pipe.
I thought it might be inconvenient for users to decompress stream data in their parsers.
I meant the user can decompress the data using shell commands, not in the parsers, e.g.: hadoop fs -cat /path/to/some/file | gzip -d
Well, this is simpler, but I'm considering the pipe size using
I don't understand bash very well, but doesn't the pipe just "block" if it's full? gzip can probably decode in a streaming fashion and will consume the pipe buffer, so the pipe will be unblocked.
By default, pipes can block both producer and consumer:
Well, my point is that using pipes in Python code lets users define the pipe buffer size, which is critical to reader performance.
I see, ok. Thanks! |
||
cut_lines=True, | ||
line_break="\n"): | ||
""" | ||
pipe_reader reads data as a stream from a command, takes its | ||
stdout into a pipe buffer and redirects it to the parser to | ||
parse, then yields data in your desired format. | ||
|
||
You can use standard Linux commands or call another program | ||
to read data from HDFS, Ceph, a URL, AWS S3, etc.: | ||
|
||
cmd = "hadoop fs -cat /path/to/some/file" | ||
cmd = "cat sample_file.tar.gz" | ||
cmd = "curl http://someurl" | ||
cmd = "python print_s3_bucket.py" | ||
|
||
A sample parser: | ||
|
||
def sample_parser(lines): | ||
# parse each line as one sample data, | ||
# return a list of samples as batches. | ||
ret = [] | ||
for l in lines: | ||
ret.append(l.split(" ")[1:5]) | ||
return ret | ||
|
||
:param left_cmd: command to execute to get stdout from. | ||
:type left_cmd: string | ||
:param parser: parser function to parse lines of data. | ||
if cut_lines is True, parser will receive list | ||
of lines. | ||
if cut_lines is False, parser will receive a | ||
raw buffer each time. | ||
parser should return a list of parsed values. | ||
:type parser: callable | ||
:param bufsize: the buffer size used for the stdout pipe. | ||
:type bufsize: int | ||
:param file_type: type of the stream data, either plain or gzip. | ||
:type file_type: string | ||
:param cut_lines: whether to pass lines instead of raw buffer | ||
to the parser | ||
:type cut_lines: bool | ||
:param line_break: line break of the file, like \n or \r | ||
:type line_break: string | ||
|
||
:return: the reader generator. | ||
:rtype: callable | ||
""" | ||
if not isinstance(left_cmd, str): | ||
raise TypeError("left_cmd must be a string") | ||
if not callable(parser): | ||
raise TypeError("parser must be a callable object") | ||
|
||
process = subprocess.Popen( | ||
left_cmd.split(" "), bufsize=bufsize, stdout=subprocess.PIPE) | ||
# TODO(typhoonzero): add a thread to read stderr | ||
|
||
# Creating the decompress object once is better than | ||
# creating it inside the loop. | ||
dec = zlib.decompressobj( | ||
32 + zlib.MAX_WBITS) # adding 32 auto-detects a zlib or gzip header | ||
|
||
def reader(): | ||
remained = "" | ||
while True: | ||
buff = process.stdout.read(bufsize) | ||
if buff: | ||
if file_type == "gzip": | ||
decomp_buff = dec.decompress(buff) | ||
elif file_type == "plain": | ||
decomp_buff = buff | ||
else: | ||
raise TypeError("file_type %s is not allowed" % file_type) | ||
|
||
if cut_lines: | ||
lines, remained = _buf2lines(''.join( | ||
[remained, decomp_buff]), line_break) | ||
parsed_list = parser(lines) | ||
for ret in parsed_list: | ||
yield ret | ||
else: | ||
for ret in parser(decomp_buff): | ||
yield ret | ||
else: | ||
break | ||
|
||
return reader |
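The chunked reading in the diff above can be sketched in isolation as follows. This is a minimal Python 3 sketch, not the code from this PR: `read_chunks` and `iter_lines` are hypothetical helper names, the input is any file-like object rather than a subprocess pipe, and the gzip path assumes a single gzip member. Unlike the diff, the sketch also yields an unterminated last line at EOF, which is an assumption about the desired behavior.

```python
import gzip
import io
import zlib


def read_chunks(stream, bufsize=8192):
    """Yield successive byte chunks from a file-like object until EOF."""
    while True:
        chunk = stream.read(bufsize)
        if not chunk:
            break
        yield chunk


def iter_lines(chunks, line_break=b"\n", file_type="plain"):
    """Yield complete lines from an iterable of byte chunks (hypothetical helper).

    A line that straddles two chunks is held back in `remained` until its
    terminator arrives. Any unterminated tail is yielded at EOF.
    """
    if file_type not in ("plain", "gzip"):
        raise ValueError("file_type %s is not allowed" % file_type)
    # wbits = 32 + MAX_WBITS lets zlib auto-detect a zlib or gzip header.
    dec = zlib.decompressobj(32 + zlib.MAX_WBITS) if file_type == "gzip" else None
    remained = b""
    for chunk in chunks:
        data = dec.decompress(chunk) if dec else chunk
        lines = (remained + data).split(line_break)
        remained = lines.pop()  # the last piece may be an incomplete line
        for line in lines:
            yield line
    # Flush whatever the decompressor still buffers, then emit the tail.
    tail = remained + (dec.flush() if dec else b"")
    if tail:
        lines = tail.split(line_break)
        if lines[-1] == b"":
            lines.pop()
        for line in lines:
            yield line


if __name__ == "__main__":
    plain = io.BytesIO(b"a 1\nb 2\nc 3")
    print(list(iter_lines(read_chunks(plain, bufsize=4))))
    packed = io.BytesIO(gzip.compress(b"x\ny\n"))
    print(list(iter_lines(read_chunks(packed, bufsize=5), file_type="gzip")))
```

Keeping the remainder between chunks is what makes the result independent of `bufsize`: the same lines come out whether the stream is read 4 bytes or 8192 bytes at a time.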
A line break won't work for binary data; maybe we should let the parser decide when to output a new data item?
If cut_lines=False, the binary data is sent to the parser directly. Do you mean we should let the user's parser generate the data, and make pipe_reader a decorator?
Yes, I thought maybe pipe_reader should not cut the lines, since it does not have sufficient information; we might want to leave it to the user's parser to do so (cut and generate data).
Agree, will update.
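One way the direction agreed in this thread might look is sketched below. The updated code is not shown in this diff, so everything here is an assumption: `pipe_reader_sketch` and `fixed_size_records` are hypothetical names, the parser is assumed to be a generator function that receives an iterator of raw byte chunks and decides itself where one data item ends, and starting the process inside `reader()` is a design choice of this sketch (Python 3).

```python
import shlex
import subprocess
import sys


def pipe_reader_sketch(left_cmd, parser, bufsize=8192):
    """Hypothetical variant: no line cutting, the parser owns record boundaries."""
    if not callable(parser):
        raise TypeError("parser must be a callable object")
    # Accept either a command string or an argument list (sketch assumption).
    args = shlex.split(left_cmd) if isinstance(left_cmd, str) else list(left_cmd)

    def reader():
        # Start the command on each call so the reader can be iterated again.
        process = subprocess.Popen(args, bufsize=bufsize, stdout=subprocess.PIPE)
        try:
            # iter(callable, sentinel) stops when read() returns b"" at EOF.
            chunks = iter(lambda: process.stdout.read(bufsize), b"")
            for item in parser(chunks):
                yield item
        finally:
            process.stdout.close()
            process.wait()

    return reader


def fixed_size_records(record_size):
    """Example parser for binary data: cut the stream into fixed-size records."""

    def parse(chunks):
        pending = b""
        for chunk in chunks:
            pending += chunk
            while len(pending) >= record_size:
                yield pending[:record_size]
                pending = pending[record_size:]
        # A trailing partial record is dropped in this sketch.

    return parse


if __name__ == "__main__":
    cmd = [sys.executable, "-c",
           "import sys; sys.stdout.buffer.write(bytes(range(10)))"]
    reader = pipe_reader_sketch(cmd, fixed_size_records(4), bufsize=3)
    print(list(reader()))
```

Because the parser sees the raw chunk stream, the same reader works for line-oriented text and for binary formats; a text parser would keep its own remainder between chunks in the same way `fixed_size_records` keeps `pending`.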