
Commit 5adf3f9

Merge pull request #129 from raydak-labs/chore/docs/python-large
chore: add blog entry python large json
2 parents 7af3f56 + a4f2162

1 file changed: 101 additions, 0 deletions
@@ -0,0 +1,101 @@
---
title: "Efficiently Processing Large JSON Files in Python Without Loading Everything Into Memory"
date: 2025-06-11T12:00:00+01:00
draft: false
language: en
summary: Learn how to process large JSON datasets efficiently in Python using streaming and minimal memory, with practical code and profiling tips.
description: A practical guide to efficiently processing large JSON files in Python without loading the entire file into memory. Covers streaming with ijson, memory profiling, and best practices for handling big data.
author: raydak
tags: [
  "Python",
  "JSON",
  "Big Data",
  "Streaming",
  "ijson",
  "memory-profiler"
]
categories: [
  "Development",
  "Data Processing"
]
---

## Introduction

Processing large JSON files can quickly exhaust your system's memory if you try to load the entire file at once. This is a common challenge in data engineering, ETL, and analytics workflows. Fortunately, Python offers tools to process such files efficiently by streaming the data and keeping only what's necessary in memory.

This post demonstrates how to use [`ijson`](https://pypi.org/project/ijson/) for streaming JSON parsing and [`memory-profiler`](https://pypi.org/project/memory-profiler/) to monitor memory usage. We'll also show how to set up your environment with [`uv`](https://github.com/astral-sh/uv) for reproducible installs.
## Why Not Just Use `json.load()`?

The standard `json` module's `json.load()` reads the entire file into memory. For files larger than your available RAM, this leads to crashes or severe slowdowns. Instead, streaming parsers like `ijson` process the file incrementally.
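For contrast, here's a minimal sketch of the load-everything approach (the file name `data.json` and the filter value are placeholders, not part of the example later in this post):

```python
import json

# Naive approach: json.load materializes the whole document as one Python
# object before any filtering can happen, so peak memory tracks file size.
with open("data.json", "r", encoding="utf-8") as f:
    data = json.load(f)

matches = [obj for obj in data if obj.get("vpn") == "ABC"]
print(f"Found {len(matches)} matching objects.")
```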
## Setting Up Your Environment

First, initialize your Python project with `uv` and add the required dependencies:

```sh
uv init
uv pip install ijson memory-profiler
```
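If you don't already have a large file to test with, here's a small, hypothetical helper for generating a synthetic `test.json` in the shape the next example expects (a top-level array of objects with a `vpn` field); the field values and object count are arbitrary:

```python
# generate_test_data.py -- hypothetical helper, not part of the original example
import json
import random

def generate_test_json(path="test.json", num_objects=1_000_000):
    """Stream a JSON array of small objects to disk, one object at a time."""
    vpns = ["ABC", "DEF", "GHI"]
    with open(path, "w") as f:
        f.write("[")
        for i in range(num_objects):
            if i:
                f.write(",")
            json.dump({"id": i, "vpn": random.choice(vpns), "payload": "x" * 100}, f)
        f.write("]")

if __name__ == "__main__":
    generate_test_json()
```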
## Streaming JSON Processing Example

Suppose you have a large JSON array of objects and want to filter items matching a specific field (e.g., `"vpn": "ABC"`), writing only those to an output file. Here's how you can do it efficiently:

```python
from memory_profiler import profile
import json
import ijson
import time

backend = ijson  # You can also use ijson.get_backend("yajl2_c") for speed

objects_num = 0

# References:
# https://pythonspeed.com/articles/json-memory-streaming/
# https://www.dataquest.io/blog/python-json-tutorial/
# https://pytutorial.com/python-json-streaming-handle-large-datasets-efficiently/
# https://github.com/kashifrazzaqui/json-streamer

@profile
def filter_large_json(input_file, output_file, target_vpn):
    global objects_num
    with open(input_file, "rb") as infile, open(output_file, "w") as outfile:
        outfile.write("[")
        first = True
        for obj in backend.items(infile, "item"):
            if obj.get("vpn") == target_vpn:
                if not first:
                    outfile.write(",")
                json.dump(obj, outfile)
                first = False
                objects_num += 1
        outfile.write("]")
    print(f"Filtered {objects_num} objects with vpn '{target_vpn}'.")

start_time = time.time()
filter_large_json("test.json", "output.json", "ABC")
print("--- %s seconds ---" % (time.time() - start_time))
```
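One detail to watch for: by default `ijson` yields non-integer JSON numbers as `decimal.Decimal`, which the standard `json.dump` refuses to serialize. If your objects contain floats (an assumption about your data, not something the example above requires), a small tweak to the dump call converts them on the way out:

```python
# Decimal values from ijson are not JSON-serializable by default;
# default=float converts them when writing the filtered object back out.
json.dump(obj, outfile, default=float)
```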
## Profiling Memory Usage

The `@profile` decorator from `memory-profiler` reports memory usage line by line for the decorated function. To record your script's memory consumption over time and plot it, run it with:

```sh
mprof run your_script.py
mprof plot
```
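If you only want the line-by-line table from the `@profile` decorator printed to the terminal, without a plot, running the script under `memory-profiler` directly also works (assuming the script above is saved as `your_script.py`):

```sh
python -m memory_profiler your_script.py
```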
## Conclusion

By streaming your JSON processing, you can handle files of virtually any size, limited only by disk space, not RAM. This approach is essential for scalable data pipelines and analytics.
## References

- [Streaming large JSON files in Python](https://pythonspeed.com/articles/json-memory-streaming/)
- [Python JSON streaming: handle large datasets efficiently](https://pytutorial.com/python-json-streaming-handle-large-datasets-efficiently/)
- [Dataquest: Working with JSON in Python](https://www.dataquest.io/blog/python-json-tutorial/)
- [json-streamer GitHub](https://github.com/kashifrazzaqui/json-streamer)
