@runarmod (Contributor) commented Aug 7, 2025

Sometimes the response from Gemini (and, I assume, from the other LLM services) is not valid JSON. When that happens, the json.loads call raises a JSONDecodeError, and the service call returns an empty dict.

This is frustrating, because the resulting markdown ends up with missing image descriptions, tables that have not been refined, etc. Since the error is ignored and the program moves on, one has to check the logs carefully during conversion and rerun the conversion whenever this happens.

IMHO, the most logical thing to do if the LLM service returns malformed JSON is to try again, in the same way we do if we are rate limited.

Note: this fix only handles the problem for the Gemini service. LMK if you want me to fix this for the other services as well.
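
The retry idea described above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code merged in this PR: the helper name call_with_json_retry, the request_fn callable, and the max_retries / retry_wait_time parameters are all hypothetical, and the linear backoff is just one possible choice.

```python
import json
import time
from typing import Callable


def call_with_json_retry(
    request_fn: Callable[[], str],
    max_retries: int = 2,
    retry_wait_time: float = 3.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Call request_fn and parse its output as JSON, retrying on malformed JSON.

    request_fn is a hypothetical stand-in for "send the prompt to the LLM
    service and return the raw response text".
    """
    total_tries = max_retries + 1
    for attempt in range(1, total_tries + 1):
        output = request_fn()
        try:
            return json.loads(output)
        except json.JSONDecodeError as exc:
            if attempt == total_tries:
                # Out of retries: fall back to the empty dict, as before.
                print(f"Malformed JSON after {total_tries} tries: {exc}")
                break
            # Wait a little longer after each failed attempt (linear backoff).
            wait = retry_wait_time * attempt
            print(f"Malformed JSON ({exc}); retrying in {wait}s "
                  f"(attempt {attempt}/{total_tries})")
            sleep(wait)
    return {}
```

With this shape, a single truncated response costs one extra request instead of silently producing an empty result, and the empty dict is only returned once every attempt has failed.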

Logs where this happens:

[...]
2025-08-07 11:10:27,331 - httpx - INFO - HTTP Request: POST https://generativelanguage.googleapis.com/v1beta/models/gemini-2.0-flash:generateContent "HTTP/1.1 200 OK"
2025-08-07 11:10:27,335 - google_genai.models - INFO - AFC remote call 1 is done.
2025-08-07 11:10:27,336 - marker - ERROR - Exception: Unterminated string starting at: line 3 column 21 (char 160)
LLMTableProcessor running: 4it [00:24,  4.96s/it]
Traceback (most recent call last):
  File "C:\Users\[...]\.venv\Lib\site-packages\marker\services\gemini.py", line 79, in __call__
    return json.loads(output)
           ^^^^^^^^^^^^^^^^^^
  File "C:\Users\[...]\AppData\Roaming\uv\python\cpython-3.12.11-windows-x86_64-none\Lib\json\__init__.py", line 346, in loads
    return _default_decoder.decode(s)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\[...]\AppData\Roaming\uv\python\cpython-3.12.11-windows-x86_64-none\Lib\json\decoder.py", line 338, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\[...]\AppData\Roaming\uv\python\cpython-3.12.11-windows-x86_64-none\Lib\json\decoder.py", line 354, in raw_decode
    obj, end = self.scan_once(s, idx)
               ^^^^^^^^^^^^^^^^^^^^^^
json.decoder.JSONDecodeError: Unterminated string starting at: line 3 column 21 (char 160)
2025-08-07 11:10:27,928 - google_genai.models - INFO - AFC is enabled with max remote calls: 10.
2025-08-07 11:10:29,877 - httpx - INFO - HTTP Request: POST https://generativelanguage.googleapis.com/v1beta/models/gemini-2.0-flash:generateContent "HTTP/1.1 200 OK"
[...]

@VikParuchuri VikParuchuri changed the base branch from master to dev August 20, 2025 16:22
@VikParuchuri (Member) commented

Thanks for the fix! Would definitely take a PR for the other services if you have time.

@VikParuchuri VikParuchuri merged commit d96ddad into datalab-to:dev Aug 20, 2025
1 check passed
@github-actions github-actions bot locked and limited conversation to collaborators Aug 20, 2025