Description
The `trigger_batch` and `trigger` functions in the library use `sys.getsizeof(event['data'])` to measure the size of the event data. However, `sys.getsizeof()` returns the size of the object in memory, which includes interpreter overhead and doesn't accurately represent the actual byte size of the data when encoded for transmission over HTTP. This can lead to false positives when checking against the 10KB limit, resulting in `ValueError: Too much data` exceptions even when the data is within acceptable limits.
The relevant code is in `pusher-http-python/pusher/pusher_client.py`, lines 117 to 143 at commit `239d67b`.
Steps to Reproduce:
- Prepare a batch of events containing a good number of non-ASCII characters, where the data size is below 10KB when encoded in UTF-8.
- Use the `trigger_batch` function to send the batch (a minimal reproduction sketch follows this list).
- Observe that a `ValueError` is raised despite the data being within the size limit.
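For concreteness, here is a minimal reproduction sketch. The credentials, channel, and event name are placeholders; the payload is mostly ASCII plus a single non-ASCII character, which keeps the UTF-8 size well under 10KB while pushing the in-memory size past it:

```python
import pusher

# Placeholder credentials -- substitute your own app's values.
client = pusher.Pusher(app_id="123456", key="key", secret="secret", cluster="eu")

# 6000 ASCII characters plus one "≥": 6003 bytes when UTF-8 encoded, well
# under the 10KB limit. But the single non-ASCII character forces CPython
# to store the whole string at two bytes per character, so sys.getsizeof()
# reports roughly 12KB.
data = "a" * 6000 + "≥"

batch = [{"channel": "test-channel", "name": "test-event", "data": data}]

# Raises ValueError: Too much data, despite the encoded payload being < 10KB.
client.trigger_batch(batch)
```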
Upon modifying the `trigger_batch` function to add some logging as follows:
```python
@request_method
def trigger_batch(self, batch=[], already_encoded=False):
    """Trigger multiple events with a single HTTP call.

    http://pusher.com/docs/rest_api#method-post-batch-events
    """
    if not already_encoded:
        for event in batch:
            validate_channel(event['channel'])

            event_name = ensure_text(event['name'], "event_name")
            if len(event['name']) > 200:
                raise ValueError("event_name too long")

            event['data'] = data_to_string(event['data'], self._json_encoder)

            # Debug logging: compare the three candidate size measurements.
            print("---- EVENT SIZE DETAILS ----")
            print("len(event['data'])")
            print(len(event['data']))
            print()
            print("sys.getsizeof(event['data'])")
            print(sys.getsizeof(event['data']))
            print()
            print("len(event['data'].encode('utf-8'))")
            print(len(event['data'].encode('utf-8')))
            print()

            # The problematic check: in-memory object size, not encoded size.
            if sys.getsizeof(event['data']) > 10240:
                raise ValueError("Too much data")

            if is_encrypted_channel(event['channel']):
                event['data'] = json.dumps(
                    encrypt(event['channel'], event['data'], self._encryption_master_key),
                    ensure_ascii=False)

    params = {'batch': batch}
    return Request(
        self, POST, "/apps/%s/batch_events" % self.app_id, params)
```
I get the following output:
```
---- EVENT SIZE DETAILS ----
len(event['data'])
5778 <-- Length of the string

sys.getsizeof(event['data'])
5827 <-- In-memory object size

len(event['data'].encode('utf-8'))
5778 <-- Actual byte size of the data

---- EVENT SIZE DETAILS ----
len(event['data'])
5671 <-- Length of the string

sys.getsizeof(event['data'])
11416 <-- In-memory object size

len(event['data'].encode('utf-8'))
5673 <-- Actual byte size of the data

ERROR:root:Too much data
```
Notice how the result of `sys.getsizeof` and the UTF-8 encoded byte size differ drastically for the last event, just because it contains a single non-ASCII character (≥).
Expected Behavior:
The function should allow sending event data that is under the 10KB limit when encoded, without raising an exception.
Actual Behavior:
A ValueError is raised stating "Too much data" even when the actual encoded data size is under 10KB.
Analysis:
Using `sys.getsizeof()` is not reliable for measuring the size of data to be sent over the network. It measures the memory footprint of the object in Python, which includes interpreter overhead and doesn't correspond to the actual size of the encoded data. In particular, under CPython's flexible string representation (PEP 393), a pure-ASCII string is stored at one byte per character, while a single non-ASCII character forces the entire string into a two- or four-byte-per-character representation; that is why the one ≥ above roughly doubled the reported size.
Here is some more proof of how `sys.getsizeof` can be wildly inaccurate for calculating the byte size of data:
```python
>>> import sys
>>> a = '2!=1'
>>> b = '2≥1'
>>>
>>> len(a)
4
>>> len(b)
3
>>> sys.getsizeof(a)
53
>>> sys.getsizeof(b)
80
>>> len(a.encode('utf-8'))
4
>>> len(b.encode('utf-8'))
5
```
Proposed Solution:
Replace the size check using `sys.getsizeof(event['data'])` with `len(event['data'].encode('utf-8'))` to accurately measure the byte size of the data when encoded.
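Applied to the check inside `trigger_batch` shown above, this is a one-line change:

```python
# Before: measures the in-memory footprint of the Python string object.
if sys.getsizeof(event['data']) > 10240:
    raise ValueError("Too much data")

# After: measures the number of bytes actually sent over the wire.
if len(event['data'].encode('utf-8')) > 10240:
    raise ValueError("Too much data")
```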
Additional Information:
- Python's sys.getsizeof() documentation: https://docs.python.org/3/library/sys.html#sys.getsizeof