---
title: "Efficiently Processing Large JSON Files in Python Without Loading Everything Into Memory"
date: 2025-06-11T12:00:00+01:00
draft: false
language: en
summary: Learn how to process large JSON datasets efficiently in Python using streaming and minimal memory, with practical code and profiling tips.
description: A practical guide to efficiently processing large JSON files in Python without loading the entire file into memory. Covers streaming with ijson, memory profiling, and best practices for handling big data.
author: raydak
tags: [
  "Python",
  "JSON",
  "Big Data",
  "Streaming",
  "ijson",
  "memory-profiler"
]
categories: [
  "Development",
  "Data Processing"
]
---

## Introduction

Processing large JSON files can quickly exhaust your system's memory if you try to load the entire file at once. This is a common challenge in data engineering, ETL, and analytics workflows. Fortunately, Python offers tools to process such files efficiently by streaming the data and keeping only what's necessary in memory.

This post demonstrates how to use [`ijson`](https://pypi.org/project/ijson/) for streaming JSON parsing and [`memory-profiler`](https://pypi.org/project/memory-profiler/) to monitor memory usage. We'll also show how to set up your environment with [`uv`](https://github.com/astral-sh/uv) for reproducible installs.

## Why Not Just Use `json.load()`?

The standard `json` module's `json.load()` reads and parses the entire file into memory at once. For files larger than your available RAM, this leads to crashes or severe slowdowns. Streaming parsers like `ijson` instead process the file incrementally, yielding one item at a time.
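
For a concrete contrast, here is a minimal sketch of both approaches, assuming the input is a top-level JSON array stored in a hypothetical `data.json`:

```python
import json

import ijson

# json.load: parses the whole array into one Python list held in RAM.
with open("data.json", "rb") as f:
    data = json.load(f)
    print(len(data))

# ijson.items: walks the same array one object at a time, so memory
# stays roughly constant no matter how large the file is.
with open("data.json", "rb") as f:
    count = sum(1 for _ in ijson.items(f, "item"))
    print(count)
```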

## Setting Up Your Environment

First, initialize your Python project with `uv` and add the required dependencies:

```sh
uv init
uv add ijson memory-profiler
```

## Streaming JSON Processing Example

Suppose you have a large JSON array of objects and want to filter items matching a specific field (e.g., `"vpn": "ABC"`), writing only those to an output file. Here's how you can do it efficiently:

```python
from memory_profiler import profile
import json
import ijson
import time

backend = ijson  # You can also use ijson.get_backend("yajl2_c") for speed

objects_num = 0

# References:
# https://pythonspeed.com/articles/json-memory-streaming/
# https://www.dataquest.io/blog/python-json-tutorial/
# https://pytutorial.com/python-json-streaming-handle-large-datasets-efficiently/
# https://github.com/kashifrazzaqui/json-streamer

@profile
def filter_large_json(input_file, output_file, target_vpn):
    """Stream a top-level JSON array and write matching objects to output_file."""
    global objects_num
    with open(input_file, "rb") as infile, open(output_file, "w") as outfile:
        outfile.write("[")
        first = True
        # use_float=True makes ijson yield plain floats instead of Decimal,
        # which json.dump cannot serialize.
        for obj in backend.items(infile, "item", use_float=True):
            if obj.get("vpn") == target_vpn:
                if not first:
                    outfile.write(",")
                json.dump(obj, outfile)
                first = False
                objects_num += 1
        outfile.write("]")
    print(f"Filtered {objects_num} objects with vpn '{target_vpn}'.")

start_time = time.time()
filter_large_json("test.json", "output.json", "ABC")
print("--- %s seconds ---" % (time.time() - start_time))
```
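
If you want to try the script end to end, you first need a suitably large input file. Below is a minimal sketch that generates a synthetic `test.json`; the record layout and object count are assumptions for illustration, not part of any real dataset:

```python
import json
import random

# Write a large JSON array incrementally, so generating the test data
# does not itself require holding everything in memory.
with open("test.json", "w") as f:
    f.write("[")
    for i in range(1_000_000):
        if i:
            f.write(",")
        record = {"id": i, "vpn": random.choice(["ABC", "DEF", "XYZ"]), "payload": "x" * 64}
        json.dump(record, f)
    f.write("]")
```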

## Profiling Memory Usage

The `@profile` decorator from `memory-profiler` reports memory usage line by line when you run the script with `python -m memory_profiler your_script.py`. To record memory consumption over time and visualize it, use the bundled `mprof` tool:

```sh
mprof run your_script.py
mprof plot
```
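
If you prefer to measure memory from within Python instead of via `mprof`, `memory-profiler` also provides a `memory_usage()` helper. A minimal sketch, assuming `filter_large_json` from the example above is importable:

```python
from memory_profiler import memory_usage

from your_script import filter_large_json  # hypothetical module name

# Sample the process's memory while the function runs and report the peak (in MiB).
samples = memory_usage(
    (filter_large_json, ("test.json", "output.json", "ABC")),
    interval=0.1,  # seconds between samples
)
print(f"Peak memory: {max(samples):.1f} MiB")
```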

## Conclusion

By streaming your JSON processing, you can handle files of virtually any size, limited only by disk space, not RAM. This approach is essential for scalable data pipelines and analytics.

## References

- [Streaming large JSON files in Python](https://pythonspeed.com/articles/json-memory-streaming/)
- [Python JSON streaming: handle large datasets efficiently](https://pytutorial.com/python-json-streaming-handle-large-datasets-efficiently/)
- [Dataquest: Working with JSON in Python](https://www.dataquest.io/blog/python-json-tutorial/)
- [json-streamer GitHub](https://github.com/kashifrazzaqui/json-streamer)