Skip to content

Deserializing JSON to an AnalyzeResult is 20x slower in azure-ai-documentintelligence than in azure-ai-formrecognizer  #36765

Closed
@stevesimmons

Description

@stevesimmons

Hi,

I noticed an unexpected and quite serious performance degradation while migrating from azure-ai-formrecognizer (version 3.3.3) to azure-ai-documentintelligence (version 1.0.0b3), both the latest versions. I am using Python 3.11.4.

Reloading a saved JSON OCR result is 21x slower in the new beta version of azure-ai-documentintelligence compared with azure-ai-formrecognizer. For me, this is slow enough to make it the new code unusable, which also block use of the new 2024-02-29-preview models.

Here are test results for a 50 page document, whose serialised OCR JSON output is 12MB in size:

import json, pathlib
import azure.ai.documentintelligence.models
import azure.ai.formrecognizer

obj = json.loads(pathlib.Path("src.ocr.json").read_bytes())

# Form Recognizer 3.3.3 takes 1.66 secs to reload AnalyzeResult
result_fr = azure.ai.formrecognizer.AnalyzeResult.from_dict(obj)

# Document Intelligence 1.0.0b takes 35.6 secs (21.5x longer)
result_di = azure.ai.documentintelligence.models.AnalyzeResult(obj)

I'd expect the two load times to be similar.

I used the Python profiler to get an idea where all the time is spent. It looks like the _model_base.Model is doing a ton of work repeatedly figuring out dynamically generated base classes at the start of _model_base.__new__() for every single object instance in the OCR results. Can some of this be cached?

Here is profiling output (slower than above because of the profiling hooks):

          112692731 function calls (99259951 primitive calls) in 138.527 seconds

   Ordered by: internal time

   ncalls       tottime  percall  cumtime  percall filename:lineno(function)
8456260/166821   26.026    0.000   87.971    0.001 copy.py:128(deepcopy)
        532080   23.577    0.000   27.177    0.000 _model_base.py:518(<dictcomp>)
1625780/166821    8.324    0.000   86.489    0.001 copy.py:227(_deepcopy_dict)
        532080    7.293    0.000   44.040    0.000 _model_base.py:515(__new__)
1208569/409967    6.768    0.000   13.617    0.000 _model_base.py:434(_serialize)
        532080    5.109    0.000    7.003    0.000 _model_base.py:521(<dictcomp>)
      18289276    4.771    0.000    4.771    0.000 {method 'get' of 'dict' objects}
      17230602    4.501    0.000    4.501    0.000 {built-in method builtins.hasattr}
       1011497    3.782    0.000    4.337    0.000 _model_base.py:597(_get_deserialize_callable_from_annotation)
      10796013    3.349    0.000    3.349    0.000 {built-in method builtins.isinstance}
 365259/166820    3.285    0.000   70.740    0.000 copy.py:259(_reconstruct)
       2303084    3.225    0.000    4.282    0.000 copy.py:243(_keep_alive)
      13229249    3.152    0.000    3.152    0.000 {built-in method builtins.id}
      537948/6    2.753    0.000  103.032   17.172 _model_base.py:749(_deserialize_with_callable)
        860113    1.995    0.000    2.234    0.000 _model_base.py:106(_serialize_datetime)
  312045/98139    1.795    0.000   77.118    0.001 copy.py:201(_deepcopy_list)
        473301    1.677    0.000    4.015    0.000 _model_base.py:463(_get_rest_field)
      166821/1    1.618    0.000  138.212  138.212 _model_base.py:485(__init__)
        365259    1.582    0.000    1.582    0.000 {method '__reduce_ex__' of 'object' objects}
        371128    1.489    0.000    2.242    0.000 typing.py:1611(__getitem__)
       6847519    1.475    0.000    1.475    0.000 {method 'items' of 'dict' objects}
        860113    1.317    0.000    1.317    0.000 _model_base.py:42(_timedelta_as_isostr)
        896701    1.314    0.000    1.785    0.000 _model_base.py:467(<genexpr>)
       6153176    1.276    0.000    1.276    0.000 copy.py:182(_deepcopy_atomic)
      166821/1    1.200    0.000  103.033  103.033 _model_base.py:496(<dictcomp>)
       3724560    1.192    0.000    1.192    0.000 {method 'items' of 'mappingproxy' objects}
        844677    1.103    0.000    1.767    0.000 _model_base.py:844(_get_deserialize_callable_from_annotation)
      370155/6    1.055    0.000  103.032   17.172 _model_base.py:771(_deserialize)
       1064158    1.040    0.000    1.040    0.000 {built-in method builtins.getattr}
        365259    0.815    0.000   31.077    0.000 copyreg.py:104(__newobj__)
 266198/183654    0.784    0.000    7.622    0.000 _model_base.py:438(<dictcomp>)
       3149490    0.760    0.000    0.760    0.000 {method 'append' of 'list' objects}
  166820/14561    0.746    0.000  101.664    0.007 _model_base.py:551(_deserialize)
     473301/10    0.706    0.000  103.032   10.303 _model_base.py:472(_create_value)
        166821    0.660    0.000   88.631    0.001 _model_base.py:341(__init__)
        371128    0.522    0.000    0.522    0.000 typing.py:349(inner)
        730518    0.514    0.000    2.198    0.000 copy.py:264(<genexpr>)
        897338    0.476    0.000    0.762    0.000 {built-in method builtins.issubclass}
       1262047    0.471    0.000    0.471    0.000 _model_base.py:813(_rest_name)
        532080    0.456    0.000    0.456    0.000 {method 'update' of 'dict' objects}
        473301    0.411    0.000    2.015    0.000 {built-in method builtins.next}
  166820/14561    0.394    0.000  102.626    0.007 _model_base.py:624(_deserialize_model)
        532080    0.359    0.000    0.359    0.000 {built-in method __new__ of type object at 0x7f2dc68eba20}
   82258/61624    0.319    0.000    8.557    0.000 _model_base.py:436(<listcomp>)
        104899    0.308    0.000    9.988    0.000 _models.py:2425(__init__)
  176516/14567    0.249    0.000  103.020    0.007 _model_base.py:725(<genexpr>)
       36515/6    0.246    0.000  103.032   17.172 _model_base.py:719(_deserialize_sequence)
        166821    0.226    0.000    0.226    0.000 _model_base.py:489(<dictcomp>)
             1    0.199    0.199  138.411  138.411 _models.py:259(__init__)
        640122    0.196    0.000    0.196    0.000 {method 'values' of 'dict' objects}
        166820    0.189    0.000    0.250    0.000 _model_base.py:309(get_deserializer)
           242    0.167    0.001   60.233    0.249 _models.py:2249(__init__)
  166820/14561    0.166    0.000  101.836    0.007 _model_base.py:731(_deserialize_default)
        166820    0.162    0.000    0.162    0.000 {built-in method _abc._abc_subclasscheck}
        537949    0.151    0.000    0.151    0.000 {built-in method builtins.len}
         26819    0.139    0.000   22.247    0.001 _models.py:2730(__init__)
        166820    0.127    0.000    0.191    0.000 _model_base.py:430(_is_model)
        166820    0.125    0.000    0.287    0.000 <frozen abc>:121(__subclasscheck__)
             1    0.115    0.115  138.527  138.527 <string>:1(<module>)
        371128    0.097    0.000    0.097    0.000 typing.py:2246(cast)
         13582    0.084    0.000   14.761    0.001 _models.py:2298(__init__)
         13936    0.076    0.000   11.766    0.001 _models.py:1787(__init__)
          6605    0.032    0.000    5.923    0.001 _models.py:2629(__init__)
           439    0.020    0.000   10.620    0.024 _models.py:2556(__init__)
           298    0.014    0.000   14.492    0.049 _models.py:2495(__init__)
           973    0.001    0.000   83.690    0.086 _model_base.py:647(_deserialize_with_optional)
             1    0.000    0.000  138.527  138.527 {built-in method builtins.exec}
             1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}

Metadata

Metadata

Labels

ClientThis issue points to a problem in the data-plane of the library.Document Intelligencecustomer-reportedIssues that are reported by GitHub users external to the Azure organization.needs-team-attentionWorkflow: This issue needs attention from Azure service team or SDK teamquestionThe issue doesn't require a change to the product in order to be resolved. Most issues start as that

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions