🐛 Refactor data loaders to be lazy and use generators to prevent memory problems #103

Status: Open · wants to merge 1 commit into base: main

Conversation

@garlontas (Member) commented Jun 6, 2025

Summary by Sourcery

Refactor all data loaders (CSV, JSON, XML, YAML) to use lazy generators that yield items on demand instead of loading entire datasets into memory, and update the tests to exercise the new streaming behavior.

New Features:

  • Support reading CSV, JSON, XML, and YAML data directly from source strings via a read_from_src flag
  • Add custom context manager in tests to mock file operations uniformly across loaders

Enhancements:

  • Replace LazyFileIterable with native Python generators for all loaders
  • Unify CSV loader code into separate functions for file and string sources and a shared processing routine
  • Implement lazy JSON, XML, and YAML parsing functions that yield namedtuples progressively
  • Streamline XML parsing to yield elements or nested lists based on retrieve_children configuration
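The generator-based pattern these bullets describe can be sketched roughly as follows; the names here (`load_csv_lazy`, `Row`) are illustrative stand-ins, not the library's actual internals:

```python
import csv
import io
from collections import namedtuple
from typing import Iterator, Tuple

def load_csv_lazy(source: io.TextIOBase, delimiter: str = ',') -> Iterator[Tuple]:
    """Yield one namedtuple per row instead of materializing a full list."""
    reader = csv.reader(source, delimiter=delimiter)
    header = next(reader, None)
    if header is None:
        return  # empty source: the iterator simply stops
    Row = namedtuple('Row', header)
    for row in reader:
        yield Row(*row)

rows = load_csv_lazy(io.StringIO("name,age\nAda,36\nAlan,41\n"))
first = next(rows)  # only the first row has been parsed at this point
```

Because parsing happens inside the loop body, only one row is held in memory at a time, regardless of file size.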

Tests:

  • Update loader tests to consume data via iterators and assert StopIteration at end
  • Add tests verifying loader laziness and custom delimiters for CSV and generator type for YAML
  • Refactor tests to use a mock_csv_file context manager for consistent mocking of file operations
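The `mock_csv_file` helper mentioned above might look roughly like this minimal sketch built on `unittest.mock` (the real helper also handles the `exists`/`is_file` flags):

```python
from contextlib import contextmanager
from unittest.mock import mock_open, patch

@contextmanager
def mock_csv_file(content: str):
    """Patch builtins.open so reads from any path return the given content."""
    with patch("builtins.open", mock_open(read_data=content)):
        yield

with mock_csv_file("a,b\n1,2\n"):
    with open("any/path.csv") as f:  # no real file is touched
        data = f.read()
```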

@garlontas garlontas requested a review from Copilot June 6, 2025 17:55
@garlontas garlontas self-assigned this Jun 6, 2025

sourcery-ai bot commented Jun 6, 2025

Reviewer's Guide

This PR refactors the CSV, JSON, XML, and YAML data loaders to use plain Python generators for lazy loading instead of eagerly building lists or relying on a custom LazyFileIterable, standardizes loader function signatures to accept either file paths or raw strings, and updates the corresponding tests to consume these iterators via next() and StopIteration assertions.

Sequence Diagram for Lazy Data Loading with Generators

sequenceDiagram
    actor Client
    participant DataLoader as "Loader Module (e.g., csv_loader.csv())"
    participant InternalProcessor as "Internal Generator Function (e.g., __process_csv)"
    participant DataSource as "File/String Source"

    Client->>DataLoader: load_data(src, ...)
    DataLoader->>InternalProcessor: (initiates lazy processing of src)
    Note right of DataLoader: Returns an iterator immediately
    DataLoader-->>Client: data_iterator

    loop Client requests next item
        Client->>data_iterator: next()
        data_iterator->>InternalProcessor: (requests next item)
        activate InternalProcessor
        InternalProcessor->>DataSource: Read minimal data needed (e.g., a line)
        DataSource-->>InternalProcessor: raw_item_data
        InternalProcessor-->>InternalProcessor: Parse data, create object (e.g., namedtuple)
        InternalProcessor-->>data_iterator: processed_item
        deactivate InternalProcessor
        data_iterator-->>Client: processed_item
    end
    Client->>data_iterator: next() # After all items are processed
    data_iterator-->>Client: StopIteration
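The loop above is just the standard Python iterator protocol; a generic illustration of the behavior the diagram describes:

```python
def lazy_numbers():
    # Each value is computed only when the client calls next()
    for i in range(3):
        yield i * i

it = lazy_numbers()  # returns immediately; no work has happened yet
collected = [next(it), next(it), next(it)]
try:
    next(it)  # all items processed
    exhausted = False
except StopIteration:
    exhausted = True
```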

File-Level Changes

Change | Details | Files
Refactored CSV loader to generator-based lazy loading
  • Changed csv() signature to accept src/read_from_src instead of file_path only
  • Split loading into __load_csv_from_file and __load_csv_from_string
  • Extracted row processing into a new __process_csv generator
  • Removed LazyFileIterable usage in favor of yield-based iteration
pystreamapi/loaders/__csv/__csv_loader.py
tests/_loaders/test_csv_loader.py
Refactored JSON loader to generator-based lazy loading
  • Removed LazyFileIterable and return iterator directly
  • Added __lazy_load_json_file and __lazy_load_json_string generator functions
  • Yield parsed objects via json.loads with object_hook
pystreamapi/loaders/__json/__json_loader.py
tests/_loaders/test_json_loader.py
Refactored XML loader to generator-based lazy parsing
  • Replaced LazyFileIterable with _lazy_parse_xml_file and _lazy_parse_xml_string generators
  • Yield parsed elements lazily using _parse_xml_string_lazy
  • Flatten nested children via yield from instead of building lists
pystreamapi/loaders/__xml/__xml_loader.py
tests/_loaders/test_xml_loader.py
Refactored YAML loader to generator-based lazy loading
  • Removed LazyFileIterable and return iterator directly
  • Yield documents via yaml.safe_load_all and convert to namedtuples
  • Support multi-document output with yield from __convert_to_namedtuples
pystreamapi/loaders/__yaml/__yaml_loader.py
tests/_loaders/test_yaml_loader.py
Updated tests for lazy iteration and consolidated mocking
  • Introduced a mock_csv_file contextmanager to DRY file mocking
  • Replaced length and list-based assertions with next()/StopIteration checks
  • Added explicit tests for iterator laziness (GeneratorType)
  • Removed redundant test cases and unified test patterns
tests/_loaders/test_csv_loader.py
tests/_loaders/test_json_loader.py
tests/_loaders/test_xml_loader.py
tests/_loaders/test_yaml_loader.py
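The "flatten nested children via yield from" change in the XML loader can be illustrated with a generic sketch (`flatten` is a hypothetical helper, not the loader's actual function):

```python
from typing import Any, Iterator

def flatten(items: list) -> Iterator[Any]:
    """Lazily flatten arbitrarily nested lists using 'yield from'."""
    for item in items:
        if isinstance(item, list):
            yield from flatten(item)  # delegate to the sub-generator
        else:
            yield item

result = list(flatten([1, [2, [3]], 4]))
```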


@garlontas garlontas linked an issue Jun 6, 2025 that may be closed by this pull request

@Copilot Copilot AI left a comment

Pull Request Overview

This PR refactors the data loaders to implement lazy loading using generators in order to alleviate memory issues. Key changes include:

  • Updating YAML, XML, JSON, and CSV loaders to return generator iterators instead of materialized lists.
  • Refactoring corresponding test cases to validate lazy loading behavior using generator assertions.
  • Adjusting file-specific helper functions and context managers to support the new lazy loading pattern.

Reviewed Changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated no comments.

Show a summary per file
File | Description
tests/_loaders/test_yaml_loader.py Tests now use generator assertions and verify lazy evaluation.
tests/_loaders/test_xml_loader.py Updated to use a context manager and generator-based XML parsing.
tests/_loaders/test_json_loader.py Refactored to expect StopIteration for empty inputs in line with lazy loading.
tests/_loaders/test_csv_loader.py Updated tests to assert lazy iteration and custom delimiter support.
pystreamapi/loaders/__yaml/__yaml_loader.py Changed return types from LazyFileIterable to Iterator and used 'yield from'.
pystreamapi/loaders/__xml/__xml_loader.py Modified to return generators while maintaining XML parsing behavior.
pystreamapi/loaders/__json/__json_loader.py Refactored to use lazy parsing with generator functions.
pystreamapi/loaders/__csv/__csv_loader.py Updated CSV loader to process rows lazily using generators.
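The laziness checks referenced in the test descriptions typically assert the returned object's type before consuming anything; a generic sketch of the pattern:

```python
from types import GeneratorType

def lazy_rows():
    # Stand-in for a loader call; yields one dict per row on demand
    for i in range(2):
        yield {"id": i}

result = lazy_rows()
is_generator = isinstance(result, GeneratorType)  # True, nothing consumed yet
first = next(result)
```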
Comments suppressed due to low confidence (2)

tests/_loaders/test_xml_loader.py:35

  • [nitpick] The context manager name 'mock_csv_file' may be misleading in the context of XML tests. Consider renaming it to a more generic name like 'mock_file' to improve clarity.
@contextmanager
def mock_csv_file(self, content=None, exists=True, is_file=True):

tests/_loaders/test_json_loader.py:62

  • [nitpick] Changing the expected behavior for an empty JSON string from raising JSONDecodeError to StopIteration is a significant API change. Ensure this behavior is clearly documented for consumers of the JSON loader.
self.assertRaises(StopIteration, next, json('', read_from_src=True))


sonarqubecloud bot commented Jun 6, 2025

Quality Gate failed

Failed conditions
17.1% Duplication on New Code (required ≤ 3%)

See analysis details on SonarQube Cloud


@sourcery-ai bot left a comment


Hey @garlontas - I've reviewed your changes - here's some feedback:

  • The XML loader tests reference self.file_content and self.mock_csv_file but lack a setUp to define these, causing undefined attributes in TestXmlLoader.
  • Consider consolidating the file-mocking contextmanager used in CSV and XML tests into a shared utility to reduce duplication and avoid misnamed methods like mock_csv_file in XML tests.
  • The JSON loader’s generator uses yield-from on json.loads output, which will incorrectly iterate over the keys when the top-level JSON is a dict—wrap non-list results so they yield a single namedtuple.
Here's what I looked at during the review
  • 🟡 General issues: 6 issues found
  • 🟡 Testing: 1 issue found
  • 🟢 Documentation: all looks good


src = jsonfile.read()
if src == '':
    return
yield from jsonlib.loads(src, object_hook=__dict_to_namedtuple)

issue (bug_risk): Incorrect iteration over JSON file data for non-array top-level objects

Instead of using yield from directly, assign the loaded JSON to a variable and yield from it if it's a list, or yield it directly if not. This prevents iterating over the fields of a namedtuple when the top-level object is not a list.

def generator():
    if not json_string.strip():
        return
    yield from jsonlib.loads(json_string, object_hook=__dict_to_namedtuple)

issue (bug_risk): Incorrect iteration over JSON string data for non-array top-level objects

Handle lists and single objects separately instead of always using yield from, to avoid incorrect unpacking when the top-level object is not an array.
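A sketch of the fix both comments suggest, branching on the parsed type rather than always delegating with `yield from` (`json.loads` and a local namedtuple hook stand in for the loader's internals):

```python
import json
from collections import namedtuple

def _dict_to_namedtuple(d: dict):
    # Convert each parsed JSON object into a namedtuple
    return namedtuple('Item', d.keys())(*d.values())

def lazy_load_json_string(src: str):
    if not src.strip():
        return  # empty or whitespace-only input yields nothing
    data = json.loads(src, object_hook=_dict_to_namedtuple)
    if isinstance(data, list):
        yield from data  # top-level array: one item per element
    else:
        yield data       # single top-level object: yield it whole

single = list(lazy_load_json_string('{"a": 1}'))
many = list(lazy_load_json_string('[{"a": 1}, {"a": 2}]'))
```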

Comment on lines +33 to +34
if src == '':
    return

suggestion (bug_risk): Whitespace-only JSON files not skipped

src == '' does not handle files with only whitespace. Use if not src.strip(): to skip such files and prevent JSON decode errors.

Suggested change
- if src == '':
-     return
+ if not src.strip():
+     return

Comment on lines 41 to 42
config.cast_types = cast_types
config.retrieve_children = retrieve_children

issue (bug_risk): Module-level config mutation causes global side-effects

Passing these options as function arguments or using a per-call config object would prevent unintended side effects from shared state.

Comment on lines +51 to +55
def _lazy_parse_xml_file(file_path: str, encoding: str) -> Iterator[Any]:
    def generator():
        with open(file_path, mode='r', encoding=encoding) as xmlfile:
            xml_string = xmlfile.read()
            yield from _parse_xml_string_lazy(xml_string)

suggestion (performance): Reads entire XML file into memory, may lead to high memory usage

Consider using a streaming parser like ElementTree.iterparse to handle large files more efficiently.

Suggested change
- def _lazy_parse_xml_file(file_path: str, encoding: str) -> Iterator[Any]:
-     def generator():
-         with open(file_path, mode='r', encoding=encoding) as xmlfile:
-             xml_string = xmlfile.read()
-             yield from _parse_xml_string_lazy(xml_string)
+ def _lazy_parse_xml_file(file_path: str, encoding: str) -> Iterator[Any]:
+     # Stream with ElementTree.iterparse instead of reading the whole file.
+     # Note: iterparse takes no encoding argument, so open the file in text mode.
+     with open(file_path, mode='r', encoding=encoding) as xmlfile:
+         for event, elem in ElementTree.iterparse(xmlfile, events=("end",)):
+             yield elem
+             elem.clear()  # free the element once it has been consumed
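A self-contained, stdlib-only illustration of the iterparse approach (parsing from an in-memory string for brevity):

```python
import io
import xml.etree.ElementTree as ElementTree

xml_src = "<root><item>1</item><item>2</item></root>"
texts = []
# events=("end",) fires once each element is fully parsed
for event, elem in ElementTree.iterparse(io.StringIO(xml_src), events=("end",)):
    if elem.tag == "item":
        texts.append(elem.text)
        elem.clear()  # release children so memory stays flat
```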

Comment on lines +36 to +45
def mock_csv_file(self, content=None, exists=True, is_file=True):
    """Context manager for mocking CSV file operations.

    Args:
        content: The content of the mocked file
        exists: Whether the file exists
        is_file: Whether the path points to a file
    """
    content = content if content is not None else self.file_content
    with (patch(OPEN, mock_open(read_data=content)),

issue: XML test helper mock_csv_file has naming and potential runtime issues.

Rename the method to mock_xml_file and update the docstring to reference XML files. Also, ensure self.file_content is defined in setUp() or require the content argument to avoid potential AttributeError.

@@ -54,11 +55,20 @@ def test_yaml_loader_from_string(self):
def test_yaml_loader_from_empty_string(self):

suggestion (testing): Consider testing malformed YAML content for robust error handling.

Add a test with invalid YAML to ensure the loader raises or handles parsing errors appropriately.


Successfully merging this pull request may close these issues.

Loading big CSV files not safe
1 participant