
Conversation

daggaz
Owner

daggaz commented Jun 14, 2023

In response to #45

if self.readline_buffer:
    result, self.readline_buffer = self.readline_buffer[:size], self.readline_buffer[size:]
    return result
chunk = self.buffer or self.stream.read(size)
daggaz (Owner Author)

I've noticed a new issue here. When the buffering argument is passed, the string reader still reads DEFAULT_BUFFER_SIZE from the stream. Working on a fix...
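To illustrate the direction of a fix, here's a minimal sketch (hypothetical names, not the actual json-stream reader) of a string reader whose refill size follows the buffering argument instead of always falling back to DEFAULT_BUFFER_SIZE:

# Hypothetical sketch only: BufferedStringReader and chunk_size are
# illustrative names, not json-stream's actual API.
import io

class BufferedStringReader:
    def __init__(self, stream, buffering=-1):
        self.stream = stream
        # Mirror the io convention: buffering <= 0 means "use the default".
        self.chunk_size = buffering if buffering > 0 else io.DEFAULT_BUFFER_SIZE
        self.buffer = ""

    def read(self, size=-1):
        # Refill from the underlying stream in chunk_size pieces so a
        # caller-supplied buffering value is actually respected.
        while size < 0 or len(self.buffer) < size:
            chunk = self.stream.read(self.chunk_size)
            if not chunk:
                break
            self.buffer += chunk
        if size < 0:
            result, self.buffer = self.buffer, ""
        else:
            result, self.buffer = self.buffer[:size], self.buffer[size:]
        return result

With that shape, BufferedStringReader(io.StringIO(data), buffering=64) would never issue a read larger than 64 characters against the wrapped stream.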

@smheidrich
Contributor

smheidrich commented Aug 22, 2023

While working on a 2nd attempt at implementing this in the Rust tokenizer (smheidrich/py-json-stream-rs-tokenizer#89), I noticed that my benchmarking test (which uses large randomly generated JSON files) exhibits transient failures for the Python tokenizer from this branch:

pytest error log
self = <TransientStreamingJSONList: TRANSIENT, STREAMING>

    def _iter_items(self):
        while True:
            if not self.streaming:
                return
            self._clear_child()
            try:
                item = self._load_item()
            except StopIteration:
                if self.streaming:
>                   raise ValueError(self.INCOMPLETE_ERROR)
E                   ValueError: Unterminated list at end of file

json-stream/src/json_stream/base.py:53: ValueError
----- Captured stdout call -----
generating random json...
generated random json /tmp/tmpshsa5y6s/random.json with size 1.000e+05 bytes
running with rust tokenizer
rust time: 0.03 s
running with python tokenizer
----- Captured stderr call -----
100%|██████████| 100000.0/100000.0 [00:00<00:00, 1289610.69it/s]
100%|██████████| 100/100 [00:00<00:00, 3080.47it/s]
100%|██████████| 100/100 [00:00<00:00, 1221.22it/s]
===== short test summary info =====
FAILED tests/test_via_benchmark.py::test_via_benchmark - ValueError: Unterminated list at end of file
===== 1 failed in 0.28s =====

@daggaz Could that be related to the bug you mentioned in #45 (comment)?

@daggaz
Owner Author

daggaz commented Aug 22, 2023

hmm...I need to get back on this!

@smheidrich
Contributor

Maybe worth mentioning: while doing benchmarks to check for performance regressions in smheidrich/py-json-stream-rs-tokenizer#91, I noticed that this branch here is only ~3-4 times slower than the Rust tokenizer. I thought I had a regression at first, but testing against the other branches showed they remained 10-15 times slower. So I guess doing read(1) instead of proper buffering (as introduced in this PR) was the major bottleneck this entire time, not the "purely computational" Python instructions like I had thought.
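For anyone wondering why that one change dominates: a standalone micro-benchmark (plain io.StringIO, nothing specific to json-stream) shows the cost of one read(1) call per character versus chunked reads with in-memory iteration:

# Standalone illustration: per-character read(1) calls vs. chunked reads.
import io
import time

data = "x" * 1_000_000

start = time.perf_counter()
stream = io.StringIO(data)
while stream.read(1):  # one Python-level method call per character
    pass
print(f"read(1) : {time.perf_counter() - start:.3f}s")

start = time.perf_counter()
stream = io.StringIO(data)
while True:
    chunk = stream.read(8192)  # one call per 8 KiB chunk
    if not chunk:
        break
    for _ in chunk:  # iterate the characters in memory instead
        pass
print(f"buffered: {time.perf_counter() - start:.3f}s")

On CPython the chunked loop is typically several times faster, which is consistent with the speedup this PR's buffering gives the Python tokenizer.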
