
Conversation

daggaz
Owner

daggaz commented Jun 14, 2023

In response to #45

if self.readline_buffer:
    result, self.readline_buffer = self.readline_buffer[:size], self.readline_buffer[size:]
    return result
chunk = self.buffer or self.stream.read(size)
daggaz (Owner Author)

I've noticed a new issue here. When the buffering argument is passed, the string reader still reads DEFAULT_BUFFER_SIZE from the stream. Working on a fix...
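To illustrate the direction of a fix, here's a minimal sketch (hypothetical names, not the actual json-stream reader) of a string reader whose refill size follows the buffering argument instead of always falling back to DEFAULT_BUFFER_SIZE:

# Hypothetical sketch only: BufferedStringReader and chunk_size are
# illustrative names, not json-stream's actual API.
import io

class BufferedStringReader:
    def __init__(self, stream, buffering=-1):
        self.stream = stream
        # Mirror the io convention: buffering <= 0 means "use the default".
        self.chunk_size = buffering if buffering > 0 else io.DEFAULT_BUFFER_SIZE
        self.buffer = ""

    def read(self, size=-1):
        # Refill from the underlying stream in chunk_size pieces so a
        # caller-supplied buffering value is actually respected.
        while size < 0 or len(self.buffer) < size:
            chunk = self.stream.read(self.chunk_size)
            if not chunk:
                break
            self.buffer += chunk
        if size < 0:
            result, self.buffer = self.buffer, ""
        else:
            result, self.buffer = self.buffer[:size], self.buffer[size:]
        return result

With that shape, BufferedStringReader(io.StringIO(data), buffering=64) would never issue a read larger than 64 characters against the wrapped stream.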

@smheidrich
Contributor

smheidrich commented Aug 22, 2023

While working on a 2nd attempt at implementing this in the Rust tokenizer (smheidrich/py-json-stream-rs-tokenizer#89), I noticed that my benchmarking test (which uses large randomly generated JSON files) exhibits transient failures for the Python tokenizer from this branch:

pytest error log
self = <TransientStreamingJSONList: TRANSIENT, STREAMING>

    def _iter_items(self):
        while True:
            if not self.streaming:
                return
            self._clear_child()
            try:
                item = self._load_item()
            except StopIteration:
                if self.streaming:
>                   raise ValueError(self.INCOMPLETE_ERROR)
E                   ValueError: Unterminated list at end of file

json-stream/src/json_stream/base.py:53: ValueError
----- Captured stdout call -----
generating random json...
generated random json /tmp/tmpshsa5y6s/random.json with size 1.000e+05 bytes
running with rust tokenizer
rust time: 0.03 s
running with python tokenizer
----- Captured stderr call -----
100%|██████████| 100000.0/100000.0 [00:00<00:00, 1289610.69it/s]
100%|██████████| 100/100 [00:00<00:00, 3080.47it/s]
100%|██████████| 100/100 [00:00<00:00, 1221.22it/s]
===== short test summary info =====
FAILED tests/test_via_benchmark.py::test_via_benchmark - ValueError: Unterminated list at end of file
===== 1 failed in 0.28s =====

@daggaz Could that be related to the bug you mentioned in #45 (comment)?

@daggaz
Owner Author

daggaz commented Aug 22, 2023

hmm...I need to get back on this!

@smheidrich
Contributor

Maybe worth mentioning: while doing benchmarks to check for performance regressions in smheidrich/py-json-stream-rs-tokenizer#91, I noticed that this branch here is only ~3-4 times slower than the Rust tokenizer. I thought I had a regression at first, but testing against the other branches showed they remained 10-15 times slower. So I guess doing read(1) instead of proper buffering (as introduced in this PR) was the major bottleneck this entire time, not the "purely computational" Python instructions like I had thought.
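For anyone wondering why that one change dominates: a standalone micro-benchmark (plain io.StringIO, nothing specific to json-stream) shows the cost of one read(1) call per character versus chunked reads with in-memory iteration:

# Standalone illustration: per-character read(1) calls vs. chunked reads.
import io
import time

data = "x" * 1_000_000

start = time.perf_counter()
stream = io.StringIO(data)
while stream.read(1):  # one Python-level method call per character
    pass
print(f"read(1) : {time.perf_counter() - start:.3f}s")

start = time.perf_counter()
stream = io.StringIO(data)
while True:
    chunk = stream.read(8192)  # one call per 8 KiB chunk
    if not chunk:
        break
    for _ in chunk:  # iterate the characters in memory instead
        pass
print(f"buffered: {time.perf_counter() - start:.3f}s")

On CPython the chunked loop is typically several times faster, which is consistent with the speedup this PR's buffering gives the Python tokenizer.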
