Description
Hi,
I noticed an unexpected and quite serious performance degradation while migrating from azure-ai-formrecognizer
(version 3.3.3) to azure-ai-documentintelligence
(version 1.0.0b3), both the latest versions. I am using Python 3.11.4.
Reloading a saved JSON OCR result is 21x slower in the new beta version of azure-ai-documentintelligence
compared with azure-ai-formrecognizer
. For me, this is slow enough to make it the new code unusable, which also block use of the new 2024-02-29-preview
models.
Here are test results for a 50 page document, whose serialised OCR JSON output is 12MB in size:
import json, pathlib
import azure.ai.documentintelligence.models
import azure.ai.formrecognizer
obj = json.loads(pathlib.Path("src.ocr.json").read_bytes())
# Form Recognizer 3.3.3 takes 1.66 secs to reload AnalyzeResult
result_fr = azure.ai.formrecognizer.AnalyzeResult.from_dict(obj)
# Document Intelligence 1.0.0b takes 35.6 secs (21.5x longer)
result_di = azure.ai.documentintelligence.models.AnalyzeResult(obj)
I'd expect the two load times to be similar.
I used the Python profiler to get an idea where all the time is spent. It looks like the _model_base.Model
is doing a ton of work repeatedly figuring out dynamically generated base classes at the start of _model_base.__new__()
for every single object instance in the OCR results. Can some of this be cached?
Here is profiling output (slower than above because of the profiling hooks):
112692731 function calls (99259951 primitive calls) in 138.527 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
8456260/166821 26.026 0.000 87.971 0.001 copy.py:128(deepcopy)
532080 23.577 0.000 27.177 0.000 _model_base.py:518(<dictcomp>)
1625780/166821 8.324 0.000 86.489 0.001 copy.py:227(_deepcopy_dict)
532080 7.293 0.000 44.040 0.000 _model_base.py:515(__new__)
1208569/409967 6.768 0.000 13.617 0.000 _model_base.py:434(_serialize)
532080 5.109 0.000 7.003 0.000 _model_base.py:521(<dictcomp>)
18289276 4.771 0.000 4.771 0.000 {method 'get' of 'dict' objects}
17230602 4.501 0.000 4.501 0.000 {built-in method builtins.hasattr}
1011497 3.782 0.000 4.337 0.000 _model_base.py:597(_get_deserialize_callable_from_annotation)
10796013 3.349 0.000 3.349 0.000 {built-in method builtins.isinstance}
365259/166820 3.285 0.000 70.740 0.000 copy.py:259(_reconstruct)
2303084 3.225 0.000 4.282 0.000 copy.py:243(_keep_alive)
13229249 3.152 0.000 3.152 0.000 {built-in method builtins.id}
537948/6 2.753 0.000 103.032 17.172 _model_base.py:749(_deserialize_with_callable)
860113 1.995 0.000 2.234 0.000 _model_base.py:106(_serialize_datetime)
312045/98139 1.795 0.000 77.118 0.001 copy.py:201(_deepcopy_list)
473301 1.677 0.000 4.015 0.000 _model_base.py:463(_get_rest_field)
166821/1 1.618 0.000 138.212 138.212 _model_base.py:485(__init__)
365259 1.582 0.000 1.582 0.000 {method '__reduce_ex__' of 'object' objects}
371128 1.489 0.000 2.242 0.000 typing.py:1611(__getitem__)
6847519 1.475 0.000 1.475 0.000 {method 'items' of 'dict' objects}
860113 1.317 0.000 1.317 0.000 _model_base.py:42(_timedelta_as_isostr)
896701 1.314 0.000 1.785 0.000 _model_base.py:467(<genexpr>)
6153176 1.276 0.000 1.276 0.000 copy.py:182(_deepcopy_atomic)
166821/1 1.200 0.000 103.033 103.033 _model_base.py:496(<dictcomp>)
3724560 1.192 0.000 1.192 0.000 {method 'items' of 'mappingproxy' objects}
844677 1.103 0.000 1.767 0.000 _model_base.py:844(_get_deserialize_callable_from_annotation)
370155/6 1.055 0.000 103.032 17.172 _model_base.py:771(_deserialize)
1064158 1.040 0.000 1.040 0.000 {built-in method builtins.getattr}
365259 0.815 0.000 31.077 0.000 copyreg.py:104(__newobj__)
266198/183654 0.784 0.000 7.622 0.000 _model_base.py:438(<dictcomp>)
3149490 0.760 0.000 0.760 0.000 {method 'append' of 'list' objects}
166820/14561 0.746 0.000 101.664 0.007 _model_base.py:551(_deserialize)
473301/10 0.706 0.000 103.032 10.303 _model_base.py:472(_create_value)
166821 0.660 0.000 88.631 0.001 _model_base.py:341(__init__)
371128 0.522 0.000 0.522 0.000 typing.py:349(inner)
730518 0.514 0.000 2.198 0.000 copy.py:264(<genexpr>)
897338 0.476 0.000 0.762 0.000 {built-in method builtins.issubclass}
1262047 0.471 0.000 0.471 0.000 _model_base.py:813(_rest_name)
532080 0.456 0.000 0.456 0.000 {method 'update' of 'dict' objects}
473301 0.411 0.000 2.015 0.000 {built-in method builtins.next}
166820/14561 0.394 0.000 102.626 0.007 _model_base.py:624(_deserialize_model)
532080 0.359 0.000 0.359 0.000 {built-in method __new__ of type object at 0x7f2dc68eba20}
82258/61624 0.319 0.000 8.557 0.000 _model_base.py:436(<listcomp>)
104899 0.308 0.000 9.988 0.000 _models.py:2425(__init__)
176516/14567 0.249 0.000 103.020 0.007 _model_base.py:725(<genexpr>)
36515/6 0.246 0.000 103.032 17.172 _model_base.py:719(_deserialize_sequence)
166821 0.226 0.000 0.226 0.000 _model_base.py:489(<dictcomp>)
1 0.199 0.199 138.411 138.411 _models.py:259(__init__)
640122 0.196 0.000 0.196 0.000 {method 'values' of 'dict' objects}
166820 0.189 0.000 0.250 0.000 _model_base.py:309(get_deserializer)
242 0.167 0.001 60.233 0.249 _models.py:2249(__init__)
166820/14561 0.166 0.000 101.836 0.007 _model_base.py:731(_deserialize_default)
166820 0.162 0.000 0.162 0.000 {built-in method _abc._abc_subclasscheck}
537949 0.151 0.000 0.151 0.000 {built-in method builtins.len}
26819 0.139 0.000 22.247 0.001 _models.py:2730(__init__)
166820 0.127 0.000 0.191 0.000 _model_base.py:430(_is_model)
166820 0.125 0.000 0.287 0.000 <frozen abc>:121(__subclasscheck__)
1 0.115 0.115 138.527 138.527 <string>:1(<module>)
371128 0.097 0.000 0.097 0.000 typing.py:2246(cast)
13582 0.084 0.000 14.761 0.001 _models.py:2298(__init__)
13936 0.076 0.000 11.766 0.001 _models.py:1787(__init__)
6605 0.032 0.000 5.923 0.001 _models.py:2629(__init__)
439 0.020 0.000 10.620 0.024 _models.py:2556(__init__)
298 0.014 0.000 14.492 0.049 _models.py:2495(__init__)
973 0.001 0.000 83.690 0.086 _model_base.py:647(_deserialize_with_optional)
1 0.000 0.000 138.527 138.527 {built-in method builtins.exec}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}