Incorrect usage of sys.getsizeof to calculate the byte size of event data #236

@asymness

Description
The trigger_batch and trigger functions in the library are using sys.getsizeof(event['data']) to measure the size of the event data. However, sys.getsizeof() returns the size of the object in memory, which includes overhead and doesn't accurately represent the actual byte size of the data when encoded for transmission over HTTP. This can lead to inconsistencies and false positives when checking against the 10KB limit, resulting in ValueError: Too much data exceptions even when the data is within acceptable limits.

    @request_method
    def trigger_batch(self, batch=[], already_encoded=False):
        """Trigger multiple events with a single HTTP call.

        http://pusher.com/docs/rest_api#method-post-batch-events
        """
        if not already_encoded:
            for event in batch:
                validate_channel(event['channel'])

                event_name = ensure_text(event['name'], "event_name")
                if len(event['name']) > 200:
                    raise ValueError("event_name too long")

                event['data'] = data_to_string(event['data'], self._json_encoder)
                if sys.getsizeof(event['data']) > 10240:
                    raise ValueError("Too much data")

                if is_encrypted_channel(event['channel']):
                    event['data'] = json.dumps(encrypt(event['channel'], event['data'], self._encryption_master_key), ensure_ascii=False)

        params = {
            'batch': batch}

        return Request(
            self, POST, "/apps/%s/batch_events" % self.app_id, params)

Steps to Reproduce:

  • Prepare a batch of events whose data contains a good number of non-ASCII characters but is below 10KB when encoded in UTF-8 (see the minimal sketch after this list).
  • Use the trigger_batch function to send the batch.
  • Observe that a ValueError is raised despite the data being within the size limit.
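
For reference, a minimal reproduction sketch along these lines (the app credentials, cluster, channel and event names are placeholders) hits the error before any HTTP request is made:

    import json
    import pusher

    # Placeholder credentials -- substitute a real app_id/key/secret/cluster.
    client = pusher.Pusher(
        app_id='APP_ID', key='KEY', secret='SECRET', cluster='CLUSTER')

    # ~6000 ASCII characters plus a single non-ASCII one: about 6KB when
    # UTF-8 encoded, but CPython stores the whole string at 2 bytes per
    # character, so sys.getsizeof reports roughly 12KB.
    payload = json.dumps({'message': 'x' * 6000 + '≥'}, ensure_ascii=False)

    client.trigger_batch([
        {'channel': 'my-channel', 'name': 'my-event', 'data': payload},
    ])  # raises ValueError: Too much data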

Upon modifying the trigger_batch function to add some logging as follows:

    @request_method
    def trigger_batch(self, batch=[], already_encoded=False):
        """Trigger multiple events with a single HTTP call.

        http://pusher.com/docs/rest_api#method-post-batch-events
        """
        if not already_encoded:
            for event in batch:
                validate_channel(event['channel'])

                event_name = ensure_text(event['name'], "event_name")
                if len(event['name']) > 200:
                    raise ValueError("event_name too long")

                event['data'] = data_to_string(event['data'], self._json_encoder)
                print("---- EVENT SIZE DETAILS ----")
                print("len(event['data'])")
                print(len(event['data']))
                print()
                print("sys.getsizeof(event['data'])")
                print(sys.getsizeof(event['data']))
                print()
                print("len(event['data'].encode('utf-8'))")
                print(len(event['data'].encode('utf-8')))
                print()
                
                if sys.getsizeof(event['data']) > 10240:
                    raise ValueError("Too much data")

                if is_encrypted_channel(event['channel']):
                    event['data'] = json.dumps(encrypt(event['channel'], event['data'], self._encryption_master_key), ensure_ascii=False)

        params = {
            'batch': batch}

        return Request(
            self, POST, "/apps/%s/batch_events" % self.app_id, params)

I get the following output:

---- EVENT SIZE DETAILS ----
len(event['data'])
5778  <-- Length of the string

sys.getsizeof(event['data'])  
5827  <-- In-memory object size

len(event['data'].encode('utf-8'))
5778  <-- Actual byte-size of the data

---- EVENT SIZE DETAILS ----
len(event['data'])
5671  <-- Length of the string

sys.getsizeof(event['data'])
11416  <-- In-memory object size

len(event['data'].encode('utf-8'))
5673  <-- Actual byte-size of the data

ERROR:root:Too much data

Notice how the result of sys.getsizeof and the UTF-8 encoded byte size differ drastically for the last event, just because it contains a single non-ASCII character.

Expected Behavior:
The function should allow sending event data that is under the 10KB limit when encoded, without raising an exception.

Actual Behavior:
A ValueError is raised stating "Too much data" even when the actual encoded data size is under 10KB.

Analysis:
Using sys.getsizeof() is not reliable for measuring the size of the data to be sent over the network. It measures the in-memory footprint of the Python object, which includes object overhead and reflects CPython's internal string representation: as soon as a string contains a single character outside the Latin-1 range, CPython stores every code point in 2 (or 4) bytes, so the reported size can be nearly double the UTF-8 encoded size. It does not correspond to the number of bytes actually transmitted.

Here is some more evidence of how sys.getsizeof can be wildly inaccurate for calculating the byte size of data:

>>> a = '2!=1'
>>> b = '2≥1'
>>> 
>>> len(a)
4
>>> len(b)
3
>>> sys.getsizeof(a)
53
>>> sys.getsizeof(b)
80
>>> len(a.encode('utf-8'))
4
>>> len(b.encode('utf-8'))
5

Proposed Solution:
Replace the size check using sys.getsizeof(event['data']) with len(event['data'].encode('utf-8')) to accurately measure the byte size of the data when encoded.
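
Concretely, a minimal sketch of the changed check inside trigger_batch (the corresponding check in trigger would need the same treatment) could look like this:

    event['data'] = data_to_string(event['data'], self._json_encoder)
    # Compare the UTF-8 encoded byte length against the 10KB limit instead of
    # the in-memory object size reported by sys.getsizeof.
    if len(event['data'].encode('utf-8')) > 10240:
        raise ValueError("Too much data")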
