Description
Bug report
Bug description:
Lately, I've been testing AI code on various Python interpreters and their corresponding profilers across multiple platforms. After several attempts, I've noticed that the CPython profilers consistently fail to analyze the following code.
import tensorflow as tf
mnist = tf.keras.datasets.mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train = tf.keras.utils.normalize(x_train, axis=1)
x_test = tf.keras.utils.normalize(x_test, axis=1)
model = tf.keras.models.Sequential()
model.add(tf.keras.layers.Flatten())
model.add(tf.keras.layers.Dense(128, activation=tf.nn.relu))
model.add(tf.keras.layers.Dense(128, activation=tf.nn.relu))
model.add(tf.keras.layers.Dense(10, activation=tf.nn.softmax))
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
model.fit(x_train, y_train, epochs=2)
val_loss, val_acc = model.evaluate(x_test, y_test)
print(val_loss, val_acc)
model.save("epic_num_reader.keras")
predictions = model.predict([x_test])
print(predictions)
Specifically, the test program is a simple TensorFlow+Keras script that classifies images of handwritten digits. Interestingly, the code works fine with IntelPython, whose latest release is based on Python 3.9.19. On CPython, the profiler also works and returns its report for versions earlier than 3.12.1. Starting with CPython 3.12.1, however, profiling the script crashes with the following error.
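For reference, the profiler is invoked from the command line as python -m profile mnist_number.py, which matches the runpy and profile.py frames in the traceback below. Since the crash already occurs while importing TensorFlow, a one-line in-process equivalent should reproduce it just as well; this is a minimal sketch, assuming only that the import is the trigger:

import profile

# Rough in-process equivalent of `python -m profile mnist_number.py`:
# the AssertionError below is raised while the profiler is tracing the
# `import tensorflow` machinery itself, before any Keras code runs.
profile.run("import tensorflow as tf")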
2024-07-19 14:53:51.996897: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2024-07-19 14:53:51.997294: I external/local_xla/xla/tsl/cuda/cudart_stub.cc:32] Could not find cuda drivers on your machine, GPU will not be used.
2024-07-19 14:53:51.999400: I external/local_xla/xla/tsl/cuda/cudart_stub.cc:32] Could not find cuda drivers on your machine, GPU will not be used.
2024-07-19 14:53:52.005233: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:485] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-07-19 14:53:52.014843: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:8454] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-07-19 14:53:52.017590: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1452] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-07-19 14:53:52.025009: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/home/user/.pyenv/versions/3.12.1/lib/python3.12/profile.py", line 615, in <module>
    main()
  File "/home/user/.pyenv/versions/3.12.1/lib/python3.12/profile.py", line 604, in main
    runctx(code, globs, None, options.outfile, options.sort)
  File "/home/user/.pyenv/versions/3.12.1/lib/python3.12/profile.py", line 101, in runctx
    return _Utils(Profile).runctx(statement, globals, locals, filename, sort)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/.pyenv/versions/3.12.1/lib/python3.12/profile.py", line 64, in runctx
    prof.runctx(statement, globals, locals)
  File "/home/user/.pyenv/versions/3.12.1/lib/python3.12/profile.py", line 424, in runctx
    exec(cmd, globals, locals)
  File "mnist_number.py", line 1, in <module>
    import tensorflow as tf
  File "/home/user/.pyenv/versions/3.12.1/lib/python3.12/site-packages/tensorflow/__init__.py", line 47, in <module>
    from tensorflow._api.v2 import __internal__
  File "/home/user/.pyenv/versions/3.12.1/lib/python3.12/site-packages/tensorflow/_api/v2/__internal__/__init__.py", line 8, in <module>
    from tensorflow._api.v2.__internal__ import autograph
  File "/home/user/.pyenv/versions/3.12.1/lib/python3.12/site-packages/tensorflow/_api/v2/__internal__/autograph/__init__.py", line 8, in <module>
    from tensorflow.python.autograph.core.ag_ctx import control_status_ctx # line: 34
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/.pyenv/versions/3.12.1/lib/python3.12/site-packages/tensorflow/python/autograph/core/ag_ctx.py", line 21, in <module>
    from tensorflow.python.autograph.utils import ag_logging
  File "/home/user/.pyenv/versions/3.12.1/lib/python3.12/site-packages/tensorflow/python/autograph/utils/__init__.py", line 17, in <module>
    from tensorflow.python.autograph.utils.context_managers import control_dependency_on_returns
  File "/home/user/.pyenv/versions/3.12.1/lib/python3.12/site-packages/tensorflow/python/autograph/utils/context_managers.py", line 19, in <module>
    from tensorflow.python.framework import ops
  File "/home/user/.pyenv/versions/3.12.1/lib/python3.12/site-packages/tensorflow/python/framework/ops.py", line 50, in <module>
    from tensorflow.python.eager import context
  File "/home/user/.pyenv/versions/3.12.1/lib/python3.12/site-packages/tensorflow/python/eager/context.py", line 37, in <module>
    from tensorflow.python.eager import execute
  File "/home/user/.pyenv/versions/3.12.1/lib/python3.12/site-packages/tensorflow/python/eager/execute.py", line 21, in <module>
    from tensorflow.python.framework import dtypes
  File "/home/user/.pyenv/versions/3.12.1/lib/python3.12/site-packages/tensorflow/python/framework/dtypes.py", line 308, in <module>
    resource = DType(types_pb2.DT_RESOURCE)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/.pyenv/versions/3.12.1/lib/python3.12/site-packages/tensorflow/python/framework/dtypes.py", line 81, in __init__
    self._handle_data = handle_data
    ^^^^^^^^^^^^^^^^^
  File "/home/user/.pyenv/versions/3.12.1/lib/python3.12/profile.py", line 209, in trace_dispatch_i
    if self.dispatch[event](self, frame, t):
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/.pyenv/versions/3.12.1/lib/python3.12/profile.py", line 293, in trace_dispatch_return
    assert frame is self.cur[-2].f_back, ("Bad return", self.cur[-3])
AssertionError: ('Bad return', ('/home/user/.pyenv/versions/3.12.1/lib/python3.12/site-packages/tensorflow/python/framework/dtypes.py', 1, '<module>'))
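For context on the assertion itself: profile.Profile keeps its own shadow of the interpreter call stack in self.cur, pushing an entry on every 'call' event and checking on every 'return' event that the returning frame matches the entry recorded at call time. The sketch below is my simplification, not the actual profile.py code, but it shows why a single skipped call event is enough to make a later, perfectly normal return fail the check:

shadow = []  # simplified stand-in for profile.Profile's self.cur

def on_call(frame):
    shadow.append(frame)

def on_return(frame):
    # Mirrors the failing assertion: if the interpreter ever delivers a
    # 'return' without having delivered the matching 'call', the shadow
    # stack is out of sync and the check fails.
    assert frame is shadow[-1], "Bad return"
    shadow.pop()

frame_a, frame_b = object(), object()
on_call(frame_a)
on_return(frame_a)        # matched pair: fine
on_call(frame_a)
# ...but suppose the 'call' event for frame_b is never delivered:
try:
    on_return(frame_b)    # trips, like profile.py's trace_dispatch_return
except AssertionError as e:
    print(e)              # Bad return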
I ran the code on my laptop, which has a Tiger Lake CPU and no NVIDIA GPU or Tensor Cores, and I didn't recompile the TensorFlow library, so the warning messages about the unused AVX512 instructions and the missing GPU acceleration are expected.
Test Environment:
WSL2 Ubuntu 20.04
- Python 3.12.4
Manjaro
- Python 3.9.19
- Python 3.10.14
- Python 3.11.9
- Python 3.12.0
- Python 3.12.1
- Python 3.12.2
- Python 3.12.3
- Python 3.12.4
Fedora
- Python 3.12.4
For Python versions prior to 3.12.1, I only received the warning messages and the profiler worked as expected. Since 3.12.1, however, I've been getting this AssertionError. Interestingly, I compared the profile.py file between versions 3.12.0 and 3.12.1 and they appear to be identical, which suggests the regression lies in the interpreter itself rather than in the profile module. It's possible that the introduction of PEP 695 in Python 3.12 is causing this error.
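The comparison was a plain byte-for-byte check along these lines (the pyenv paths are illustrative of my setup):

import filecmp

# shallow=False compares actual file contents rather than just
# os.stat() metadata; consistent with my observation above, this
# prints True, i.e. profile.py did not change between the two versions.
print(filecmp.cmp(
    "/home/user/.pyenv/versions/3.12.0/lib/python3.12/profile.py",
    "/home/user/.pyenv/versions/3.12.1/lib/python3.12/profile.py",
    shallow=False,
))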
While waiting for your response, I wish you a good day.
Aaron SU
CPython versions tested on:
3.9, 3.10, 3.11, 3.12
Operating systems tested on:
Linux, Windows
Linked PRs
- gh-122029: Log call events in sys.setprofile when it's a method with c function #122072
- GH-122029: Break INSTRUMENTED_CALL into micro-ops, so that its behavior is consistent with CALL #122177
- [3.13] gh-122029: Log call events in sys.setprofile when it's a method with c function (GH-122072) #122205
- [3.12] gh-122029: Log call events in sys.setprofile when it's a method with c function (GH-122072) #122206
- gh-122029: Move monitoring after method expand for CALL_KW #130488
- gh-122029: Do not unpack method for legacy tracing anymore #130898