🐛 Refactor data loaders to be lazy and use generators to prevent memory problems #103
Conversation
Reviewer's Guide

This PR refactors the CSV, JSON, XML, and YAML data loaders to use plain Python generators for lazy loading instead of eagerly building lists or relying on a custom LazyFileIterable, standardizes loader function signatures to accept either file paths or raw strings, and updates the corresponding tests to consume these iterators via next() and StopIteration assertions.

Sequence Diagram for Lazy Data Loading with Generators

```mermaid
sequenceDiagram
    actor Client
    participant DataLoader as "Loader Module (e.g., csv_loader.csv())"
    participant InternalProcessor as "Internal Generator Function (e.g., __process_csv)"
    participant DataSource as "File/String Source"
    Client->>DataLoader: load_data(src, ...)
    DataLoader->>InternalProcessor: (initiates lazy processing of src)
    Note right of DataLoader: Returns an iterator immediately
    DataLoader-->>Client: data_iterator
    loop Client requests next item
        Client->>data_iterator: next()
        data_iterator->>InternalProcessor: (requests next item)
        activate InternalProcessor
        InternalProcessor->>DataSource: Read minimal data needed (e.g., a line)
        DataSource-->>InternalProcessor: raw_item_data
        InternalProcessor-->>InternalProcessor: Parse data, create object (e.g., namedtuple)
        InternalProcessor-->>data_iterator: processed_item
        deactivate InternalProcessor
        data_iterator-->>Client: processed_item
    end
    Client->>data_iterator: next() # After all items are processed
    data_iterator-->>Client: StopIteration
```
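As a minimal, self-contained sketch of the pattern the diagram describes (the function and field names here are illustrative, not the PR's actual internals):

```python
import csv
from collections import namedtuple
from typing import Any, Iterator

def load_csv(file_path: str, delimiter: str = ',') -> Iterator[Any]:
    """Return a generator that parses one CSV row per next() call."""
    def generator():
        with open(file_path, newline='') as csvfile:
            reader = csv.reader(csvfile, delimiter=delimiter)
            header = next(reader, None)
            if header is None:
                return  # empty file: iterator raises StopIteration immediately
            Row = namedtuple('Row', header)
            for line in reader:
                # Only this row is in memory; nothing is materialized up front.
                yield Row(*line)
    return generator()

# Consumption, as in the updated tests:
# rows = load_csv('data.csv')
# first = next(rows)   # reads and parses just the first data row
# next(rows)           # raises StopIteration once the file is exhausted
```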
Pull Request Overview
This PR refactors the data loaders to implement lazy loading using generators in order to alleviate memory issues. Key changes include:
- Updating YAML, XML, JSON, and CSV loaders to return generator iterators instead of materialized lists.
- Refactoring corresponding test cases to validate lazy loading behavior using generator assertions.
- Adjusting file-specific helper functions and context managers to support the new lazy loading pattern.
Reviewed Changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| tests/_loaders/test_yaml_loader.py | Tests now use generator assertions and verify lazy evaluation. |
| tests/_loaders/test_xml_loader.py | Updated to use a context manager and generator-based XML parsing. |
| tests/_loaders/test_json_loader.py | Refactored to expect StopIteration for empty inputs in line with lazy loading. |
| tests/_loaders/test_csv_loader.py | Updated tests to assert lazy iteration and custom delimiter support. |
| pystreamapi/loaders/__yaml/__yaml_loader.py | Changed return types from LazyFileIterable to Iterator and used 'yield from'. |
| pystreamapi/loaders/__xml/__xml_loader.py | Modified to return generators while maintaining XML parsing behavior. |
| pystreamapi/loaders/__json/__json_loader.py | Refactored to use lazy parsing with generator functions. |
| pystreamapi/loaders/__csv/__csv_loader.py | Updated CSV loader to process rows lazily using generators. |
Comments suppressed due to low confidence (2)
tests/_loaders/test_xml_loader.py:35
- [nitpick] The context manager name 'mock_csv_file' may be misleading in the context of XML tests. Consider renaming it to a more generic name like 'mock_file' to improve clarity.
```python
@contextmanager
def mock_csv_file(self, content=None, exists=True, is_file=True):
```
tests/_loaders/test_json_loader.py:62
- [nitpick] Changing the expected behavior for an empty JSON string from raising JSONDecodeError to StopIteration is a significant API change. Ensure this behavior is clearly documented for consumers of the JSON loader.
```python
self.assertRaises(StopIteration, next, json('', read_from_src=True))
```
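To make the scale of this change concrete, a hedged sketch of how consumers experience the new behavior (assuming the loader import used in the tests above):

```python
# `json` here is the pystreamapi loader used in the test above (assumed import),
# not the standard-library module.
items = json('', read_from_src=True)

# Before (eager): an empty string raised json.JSONDecodeError at load time.
# After (lazy): the generator simply yields nothing.
print(next(items, None))  # -> None; StopIteration only if no default is given
```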
Hey @garlontas - I've reviewed your changes - here's some feedback:
- The XML loader tests reference self.file_content and self.mock_csv_file but lack a setUp to define these, causing undefined attributes in TestXmlLoader.
- Consider consolidating the file-mocking contextmanager used in CSV and XML tests into a shared utility to reduce duplication and avoid misnamed methods like mock_csv_file in XML tests.
- The JSON loader’s generator uses yield-from on json.loads output, which will incorrectly iterate over the keys when the top-level JSON is a dict—wrap non-list results so they yield a single namedtuple.
Here's what I looked at during the review
- 🟡 General issues: 6 issues found
- 🟡 Testing: 1 issue found
- 🟢 Documentation: all looks good
```python
src = jsonfile.read()
if src == '':
    return
yield from jsonlib.loads(src, object_hook=__dict_to_namedtuple)
```
issue (bug_risk): Incorrect iteration over JSON file data for non-array top-level objects

Instead of using `yield from` directly, assign the loaded JSON to a variable and yield from it if it's a list, or yield it directly if not. This prevents iterating over the fields of a namedtuple when the top-level object is not a list.
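A minimal, self-contained sketch of the suggested fix (`_dict_to_namedtuple` and `_lazy_parse_json_string` are hypothetical stand-ins for the loader's private helpers; the same shape applies to the string variant quoted below):

```python
import json as jsonlib
from collections import namedtuple

def _dict_to_namedtuple(d):
    # Stand-in for the loader's private __dict_to_namedtuple hook.
    return namedtuple('Item', d.keys())(*d.values())

def _lazy_parse_json_string(src):
    if not src.strip():
        return
    loaded = jsonlib.loads(src, object_hook=_dict_to_namedtuple)
    if isinstance(loaded, list):
        # Top-level array: yield each element individually.
        yield from loaded
    else:
        # Top-level object: object_hook produced a single namedtuple,
        # so yield it as one item instead of iterating over its fields.
        yield loaded

# next(_lazy_parse_json_string('{"a": 1}'))  -> Item(a=1), not the key 'a'
```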
```python
def generator():
    if not json_string.strip():
        return
    yield from jsonlib.loads(json_string, object_hook=__dict_to_namedtuple)
```
issue (bug_risk): Incorrect iteration over JSON string data for non-array top-level objects

Handle lists and single objects separately instead of always using `yield from`, to avoid incorrect unpacking when the top-level object is not an array.
```python
if src == '':
    return
```
suggestion (bug_risk): Whitespace-only JSON files not skipped

`src == ''` does not handle files that contain only whitespace. Use `if not src.strip():` to skip such files and prevent JSON decode errors.
```diff
-if src == '':
-    return
+if not src.strip():
+    return
```
```python
config.cast_types = cast_types
config.retrieve_children = retrieve_children
```
issue (bug_risk): Module-level config mutation causes global side-effects
Passing these options as function arguments or using a per-call config object would prevent unintended side effects from shared state.
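A hedged sketch of the per-call alternative (names and signatures are illustrative, not the loader's actual API):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class XmlLoaderConfig:
    # Per-call configuration instead of mutated module-level state.
    cast_types: bool = False
    retrieve_children: bool = True

def xml(src, cast_types=False, retrieve_children=True):
    # Each call gets its own config object, so concurrent or nested
    # loads cannot observe each other's settings.
    config = XmlLoaderConfig(cast_types=cast_types,
                             retrieve_children=retrieve_children)
    return _lazy_parse(src, config)

def _lazy_parse(src, config):
    ...  # parsing would thread `config` through explicitly
```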
```python
def _lazy_parse_xml_file(file_path: str, encoding: str) -> Iterator[Any]:
    def generator():
        with open(file_path, mode='r', encoding=encoding) as xmlfile:
            xml_string = xmlfile.read()
            yield from _parse_xml_string_lazy(xml_string)
```
suggestion (performance): Reads entire XML file into memory, which may lead to high memory usage

Consider using a streaming parser like `ElementTree.iterparse` to handle large files more efficiently.
```diff
 def _lazy_parse_xml_file(file_path: str, encoding: str) -> Iterator[Any]:
-    def generator():
-        with open(file_path, mode='r', encoding=encoding) as xmlfile:
-            xml_string = xmlfile.read()
-            yield from _parse_xml_string_lazy(xml_string)
+    # Use ElementTree.iterparse for efficient streaming parsing.
+    # Note: ElementTree does not support an encoding argument in iterparse,
+    # so we open the file accordingly.
+    with open(file_path, mode='r', encoding=encoding) as xmlfile:
+        # iterparse expects a file-like object opened in text mode for XML
+        context = ElementTree.iterparse(xmlfile, events=("end",))
+        for event, elem in context:
+            yield elem
+            # Optionally clear the element to free memory
+            elem.clear()
```
```python
def mock_csv_file(self, content=None, exists=True, is_file=True):
    """Context manager for mocking CSV file operations.

    Args:
        content: The content of the mocked file
        exists: Whether the file exists
        is_file: Whether the path points to a file
    """
    content = content if content is not None else self.file_content
    with (patch(OPEN, mock_open(read_data=content)),
```
issue: XML test helper `mock_csv_file` has naming and potential runtime issues.

Rename the method to `mock_xml_file` and update the docstring to reference XML files. Also, ensure `self.file_content` is defined in `setUp()` or require the `content` argument to avoid a potential `AttributeError`.
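A hedged sketch of the renamed helper with a `setUp` that defines the fallback content (the test content and patch targets are assumptions, not taken from the PR):

```python
from contextlib import contextmanager
from unittest import TestCase
from unittest.mock import mock_open, patch

class TestXmlLoader(TestCase):

    def setUp(self):
        # Define the default content the helper falls back to,
        # so `self.file_content` can never raise AttributeError.
        self.file_content = '<root><item>1</item></root>'

    @contextmanager
    def mock_xml_file(self, content=None, exists=True, is_file=True):
        """Context manager for mocking XML file operations."""
        content = content if content is not None else self.file_content
        with patch('builtins.open', mock_open(read_data=content)), \
                patch('os.path.exists', return_value=exists), \
                patch('os.path.isfile', return_value=is_file):
            yield
```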
```diff
@@ -54,11 +55,20 @@ def test_yaml_loader_from_string(self):
     def test_yaml_loader_from_empty_string(self):
```
suggestion (testing): Consider testing malformed YAML content for robust error handling.
Add a test with invalid YAML to ensure the loader raises or handles parsing errors appropriately.
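A minimal sketch of such a test (the loader's `yaml` import and `read_from_src` flag mirror the JSON test above and are assumptions; PyYAML raises `yaml.YAMLError` subclasses on malformed input):

```python
import yaml as pyyaml  # PyYAML, assumed to back the loader; aliased to
                       # avoid clashing with the `yaml` loader function

def test_yaml_loader_from_malformed_string(self):
    # Malformed YAML: an unclosed flow mapping should trigger a parse error.
    malformed = 'key: {unclosed'
    with self.assertRaises(pyyaml.YAMLError):
        # Parsing is lazy, so the error surfaces on the first next() call.
        next(yaml(malformed, read_from_src=True))
```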
Summary by Sourcery
Refactor all data loaders (CSV, JSON, XML, YAML) to use lazy generators that yield items on demand instead of loading entire datasets into memory and update tests to exercise the new streaming behavior.