
RetinaNet predict() is too slow #2014

Open
axelusarov opened this issue Aug 6, 2023 · 9 comments

@axelusarov

After upgrading keras-cv to v0.6.1, I noticed that the predict() method of the RetinaNet model became really slow compared with v0.5.1; as a result, it complicates COCO metrics evaluation.

inputs = [SINGLE_IMAGE]
model = keras_cv.models.RetinaNet.from_preset(...)
model.predict(inputs)

When getting predictions for a single image in 0.6.1:
1/1 [==============================] - 42s 42s/step

And in 0.5.1:
1/1 [==============================] - 6s 6s/step

Another problem is that predict() throws an exception when a generator is passed as an argument. Again, in 0.5.1 this worked perfectly.

y_pred = model.predict(image_generator(...))
TypeError: in user code:

    File "/usr/local/lib/python3.10/dist-packages/keras/engine/training.py", line 2169, in predict_function  *
        return step_function(self, iterator)
    File "/usr/local/lib/python3.10/dist-packages/keras/engine/training.py", line 2155, in step_function  **
        outputs = model.distribute_strategy.run(run_step, args=(data,))
    File "/usr/local/lib/python3.10/dist-packages/keras/engine/training.py", line 2143, in run_step  **
        outputs = model.predict_step(data)
    File "/usr/local/lib/python3.10/dist-packages/keras_cv/models/object_detection/retinanet/retinanet.py", line 248, in predict_step
        return self.decode_predictions(outputs, args[-1])
    File "/usr/local/lib/python3.10/dist-packages/keras_cv/models/object_detection/retinanet/retinanet.py", line 289, in decode_predictions
        anchors = self.anchor_generator(image_shape=image_shape)
    File "/usr/local/lib/python3.10/dist-packages/keras_cv/layers/object_detection/anchor_generator.py", line 179, in __call__
        generator(image_shape),
    File "/usr/local/lib/python3.10/dist-packages/keras_cv/layers/object_detection/anchor_generator.py", line 264, in __call__
        0.5 * stride, math.ceil(image_width / stride) * stride, stride

    TypeError: unsupported operand type(s) for /: 'NoneType' and 'int'

I'd stick with the older version until the issue is resolved, but unfortunately COCO metrics seem to be broken in 0.5.1, as reported in issue #1994.

Colab for reproducing the issue: https://colab.research.google.com/drive/1dzJFiVIxXtJCoj-ShjRyu-ZPkf6ClCdj?usp=sharing

@axelusarov axelusarov changed the title from "RetinaNet prediction() is too slow" to "RetinaNet predict() is too slow" on Aug 6, 2023
@sampathweb
Collaborator

Thanks for reporting the issue. keras-cv was updated to support keras-core along with the existing tf.keras backend, which might have introduced this regression. I will take a look and get back.

@sampathweb
Collaborator

For the above error, you can wrap your generator function in tf.data.Dataset.from_generator and provide it the proper image shape:

from functools import partial

import tensorflow as tf

def image_generator(image_path):
    # preprocess_image is assumed to return a float32 batch of shape
    # (batch, 640, 640, 3).
    image_batch = preprocess_image(image_path)
    for _ in range(3):
        yield image_batch


# Wrapping the generator in a tf.data.Dataset lets you declare a static
# image shape via output_shapes, which the anchor generator needs.
ds = tf.data.Dataset.from_generator(
    partial(image_generator, image_path),
    output_types=tf.float32,
    output_shapes=(None, 640, 640, 3),
)

y_pred = model.predict(ds)

This will resolve the error: the dataset's output_shapes gives the model a static image shape, so the anchor generator no longer receives None for the image width.

@axelusarov
Author

Thanks for the help, the trick with the dataset worked fine.

With regard to the slow prediction: it becomes much faster after assigning a custom NMS with any thresholds:

model.prediction_decoder = keras_cv.layers.MultiClassNonMaxSuppression(
    bounding_box_format="xywh",
    from_logits=True,
    iou_threshold=1.0,
    confidence_threshold=0.0
)

However, it cannot be used for training, because the COCO metrics validation callback raises an exception. I'm new to object detection, so I'm not sure whether this is expected behavior.

Traceback (most recent call last):
  File ".../object_detection/model.py", line 570, in <module>
    run_training()
  File ".../object_detection/model.py", line 334, in run_training
    model.fit(
  File "/Users/axel/miniforge3/envs/ml/lib/python3.11/site-packages/keras/src/utils/traceback_utils.py", line 70, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/Users/axel/miniforge3/envs/ml/lib/python3.11/site-packages/keras_cv/callbacks/pycoco_callback.py", line 132, in on_epoch_end
    metrics = compute_pycoco_metrics(ground_truth, predictions)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/axel/miniforge3/envs/ml/lib/python3.11/site-packages/keras_cv/metrics/coco/pycoco_wrapper.py", line 219, in compute_pycoco_metrics
    coco_predictions = _convert_predictions_to_coco_annotations(predictions)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/axel/miniforge3/envs/ml/lib/python3.11/site-packages/keras_cv/metrics/coco/pycoco_wrapper.py", line 131, in _convert_predictions_to_coco_annotations
    predictions["detection_boxes"][i][j]
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^
IndexError: too many indices for array: array is 0-dimensional, but 1 were indexed

I have this validation callback in my code:

keras_cv.callbacks.PyCOCOCallback(validation_data=eval_ds, bounding_box_format="xywh", cache=False)

@ianstenbit
Contributor

@axelusarov is the slower prediction just for the first step (e.g. due to graph tracing), or is it also that every step is slower?
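
One way to check is to time the first call separately from later ones. A minimal sketch (assuming `model` is the RetinaNet from this issue; the batch shape is made up):

import time
import numpy as np

batch = np.random.uniform(size=(1, 640, 640, 3)).astype("float32")

start = time.perf_counter()
model.predict(batch, verbose=0)  # the first call includes graph tracing
print("first call:", time.perf_counter() - start, "s")

start = time.perf_counter()
for _ in range(5):
    model.predict(batch, verbose=0)  # later calls reuse the traced function
print("steady state:", (time.perf_counter() - start) / 5, "s/step")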

I suspect the single-class NMS is the likely culprit here in terms of performance.

I don't think this error is expected behavior -- this looks like a bug. I believe that #2030 should fix it. Sorry for the regression and thanks for the issue report!

@axelusarov
Author

@ianstenbit prediction is slow for each step. Here is a predict() example for 5 images, before and after assigning a new MultiClassNonMaxSuppression with default parameters to the model.

(ml) % python test.py
Using TensorFlow backend
5/5 [==============================] - 121s 23s/step
(ml) % python test.py
Using TensorFlow backend
5/5 [==============================] - 2s 119ms/step

After using the code from master, the validation step is much faster when using the custom NMS, and I didn't notice errors this time. I didn't check training quality, though.

@ianstenbit
Contributor

Yeah, this looks like there's some significant slowdown with the single-class NMS for TensorFlow, and it's probably just due to graph tracing. This is something we should look into further, but I can't prioritize it right now.

@EdIzaguirre
Contributor

EdIzaguirre commented Oct 19, 2023

I was about to write a GitHub issue about this, but yes, single-class NMS is much slower than multi-class. And it's not just prediction; training is 10x slower. I haven't spent much time looking at the source code, graph tracing, etc., but I can take a look to see what is going on.

@EdIzaguirre
Contributor

EdIzaguirre commented Oct 20, 2023

After looking into this, it seems that NonMaxSuppression calls tf.image.non_max_suppression_padded() in image_ops_impl.py. This calls non_max_suppression_padded_v2 in the same file, which proceeds to conduct NMS in pure Python ops. This is in contrast to MultiClassNonMaxSuppression, which calls tf.image.combined_non_max_suppression() in image_ops_impl.py; that in turn calls gen_image_ops.combined_non_max_suppression(), which I believe runs the NMS in C++. This makes the 10x speedup possible. Does anyone have any idea why this was done? For reference, here is tf.image.non_max_suppression_padded():

@tf_export('image.non_max_suppression_padded')
@dispatch.add_dispatch_support
def non_max_suppression_padded(boxes,
                               scores,
                               max_output_size,
                               iou_threshold=0.5,
                               score_threshold=float('-inf'),
                               pad_to_max_output_size=False,
                               name=None,
                               sorted_input=False,
                               canonicalized_coordinates=False,
                               tile_size=512):
  with ops.name_scope(name, 'non_max_suppression_padded'):
    if not pad_to_max_output_size:
      # pad_to_max_output_size may be set to False only when the shape of
      # boxes is [num_boxes, 4], i.e., a single image. We make best effort to
      # detect violations at compile time. If `boxes` does not have a static
      # rank, the check allows computation to proceed.
      if boxes.get_shape().rank is not None and boxes.get_shape().rank > 2:
        raise ValueError("'pad_to_max_output_size' (value {}) must be True for "
                         'batched input'.format(pad_to_max_output_size))
    if name is None:
      name = ''
    # idx, num_valid = non_max_suppression_padded_v2(
    #     boxes, scores, max_output_size, iou_threshold, score_threshold,
    #     sorted_input, canonicalized_coordinates, tile_size)
    
    idx, num_valid = non_max_suppression_padded_v1(
      boxes, scores, max_output_size, iou_threshold, score_threshold,
      pad_to_max_output_size, name)
    
    # def_function.function seems to lose shape information, so set it here.
    if not pad_to_max_output_size:
      idx = idx[0, :num_valid]
    else:
      batch_dims = array_ops.concat([
          array_ops.shape(boxes)[:-2],
          array_ops.expand_dims(max_output_size, 0)
      ], 0)
      idx = array_ops.reshape(idx, batch_dims)
    return idx, num_valid

# TODO(b/158709815): Improve performance regression due to
# def_function.function.
# @def_function.function(
#     experimental_implements='non_max_suppression_padded_v2')
def non_max_suppression_padded_v2(boxes,
                                  scores,
                                  max_output_size,
                                  iou_threshold=0.5,
                                  score_threshold=float('-inf'),
                                  sorted_input=False,
                                  canonicalized_coordinates=False,
                                  tile_size=512):

  def _sort_scores_and_boxes(scores, boxes):
    with ops.name_scope('sort_scores_and_boxes'):
      sorted_scores_indices = sort_ops.argsort(
          scores, axis=1, direction='DESCENDING')
      sorted_scores = array_ops.gather(
          scores, sorted_scores_indices, axis=1, batch_dims=1
      )
      sorted_boxes = array_ops.gather(
          boxes, sorted_scores_indices, axis=1, batch_dims=1
      )
    return sorted_scores, sorted_boxes, sorted_scores_indices

  batch_dims = array_ops.shape(boxes)[:-2]
  num_boxes = array_ops.shape(boxes)[-2]
  boxes = array_ops.reshape(boxes, [-1, num_boxes, 4])
  scores = array_ops.reshape(scores, [-1, num_boxes])
  batch_size = array_ops.shape(boxes)[0]
  if score_threshold != float('-inf'):
    with ops.name_scope('filter_by_score'):
      score_mask = math_ops.cast(scores > score_threshold, scores.dtype)
      scores *= score_mask
      box_mask = array_ops.expand_dims(
          math_ops.cast(score_mask, boxes.dtype), 2)
      boxes *= box_mask
  if not canonicalized_coordinates:
    with ops.name_scope('canonicalize_coordinates'):
      y_1, x_1, y_2, x_2 = array_ops.split(
          value=boxes, num_or_size_splits=4, axis=2)
      y_1_is_min = math_ops.reduce_all(
          math_ops.less_equal(y_1[0, 0, 0], y_2[0, 0, 0]))
      y_min, y_max = tf_cond.cond(
          y_1_is_min, lambda: (y_1, y_2), lambda: (y_2, y_1))
      x_1_is_min = math_ops.reduce_all(
          math_ops.less_equal(x_1[0, 0, 0], x_2[0, 0, 0]))
      x_min, x_max = tf_cond.cond(
          x_1_is_min, lambda: (x_1, x_2), lambda: (x_2, x_1))
      boxes = array_ops.concat([y_min, x_min, y_max, x_max], axis=2)
  # TODO(@bhack): https://github.com/tensorflow/tensorflow/issues/56089
  # this will be required after deprecation
  #else:
  #  y_1, x_1, y_2, x_2 = array_ops.split(
  #      value=boxes, num_or_size_splits=4, axis=2)

  if not sorted_input:
    scores, boxes, sorted_indices = _sort_scores_and_boxes(scores, boxes)
  else:
    # Default value required for Autograph.
    sorted_indices = array_ops.zeros_like(scores, dtype=dtypes.int32)

  pad = math_ops.cast(
      math_ops.ceil(
          math_ops.cast(
              math_ops.maximum(num_boxes, max_output_size), dtypes.float32) /
          math_ops.cast(tile_size, dtypes.float32)),
      dtypes.int32) * tile_size - num_boxes
  boxes = array_ops.pad(
      math_ops.cast(boxes, dtypes.float32), [[0, 0], [0, pad], [0, 0]])
  scores = array_ops.pad(
      math_ops.cast(scores, dtypes.float32), [[0, 0], [0, pad]])
  num_boxes_after_padding = num_boxes + pad
  num_iterations = num_boxes_after_padding // tile_size
  def _loop_cond(unused_boxes, unused_threshold, output_size, idx):
    return math_ops.logical_and(
        math_ops.reduce_min(output_size) < max_output_size,
        idx < num_iterations)

  def suppression_loop_body(boxes, iou_threshold, output_size, idx):
    return _suppression_loop_body(
        boxes, iou_threshold, output_size, idx, tile_size)
  
  selected_boxes, _, output_size, _ = while_loop.while_loop(
      _loop_cond,
      suppression_loop_body,
      [
          boxes, iou_threshold,
          array_ops.zeros([batch_size], dtypes.int32),
          constant_op.constant(0)
      ],
      shape_invariants=[
          tensor_shape.TensorShape([None, None, 4]),
          tensor_shape.TensorShape([]),
          tensor_shape.TensorShape([None]),
          tensor_shape.TensorShape([]),
      ],
  )

  num_valid = math_ops.minimum(output_size, max_output_size)
  idx = num_boxes_after_padding - math_ops.cast(
      nn_ops.top_k(
          math_ops.cast(math_ops.reduce_any(
              selected_boxes > 0, [2]), dtypes.int32) *
          array_ops.expand_dims(
              math_ops.range(num_boxes_after_padding, 0, -1), 0),
          max_output_size)[0], dtypes.int32)
  idx = math_ops.minimum(idx, num_boxes - 1)

  if not sorted_input:
    index_offsets = math_ops.range(batch_size) * num_boxes
    gather_idx = array_ops.reshape(
        idx + array_ops.expand_dims(index_offsets, 1), [-1])
    idx = array_ops.reshape(
        array_ops.gather(array_ops.reshape(sorted_indices, [-1]),
                         gather_idx),
        [batch_size, -1])
  invalid_index = array_ops.zeros([batch_size, max_output_size],
                                  dtype=dtypes.int32)
  idx_index = array_ops.expand_dims(math_ops.range(max_output_size), 0)
  num_valid_expanded = array_ops.expand_dims(num_valid, 1)
  idx = array_ops.where(idx_index < num_valid_expanded,
                        idx, invalid_index)

  num_valid = array_ops.reshape(num_valid, batch_dims)
  return idx, num_valid

Note that the non-max suppression here is done entirely by hand in Python ops. Interestingly enough, non_max_suppression_padded_v1 did refer to a C++ implementation; only v2 does it in Python. And here is tf.image.combined_non_max_suppression():

@tf_export('image.combined_non_max_suppression')
@dispatch.add_dispatch_support
def combined_non_max_suppression(boxes,
                                 scores,
                                 max_output_size_per_class,
                                 max_total_size,
                                 iou_threshold=0.5,
                                 score_threshold=float('-inf'),
                                 pad_per_class=False,
                                 clip_boxes=True,
                                 name=None):
  with ops.name_scope(name, 'combined_non_max_suppression'):
    iou_threshold = ops.convert_to_tensor(
        iou_threshold, dtype=dtypes.float32, name='iou_threshold')
    score_threshold = ops.convert_to_tensor(
        score_threshold, dtype=dtypes.float32, name='score_threshold')

    # Convert `max_total_size` to tensor *without* setting the `dtype` param.
    # This allows us to catch `int32` overflow case with `max_total_size`
    # whose expected dtype is `int32` by the op registration. Any number within
    # `int32` will get converted to `int32` tensor. Anything larger will get
    # converted to `int64`. Passing in `int64` for `max_total_size` to the op
    # will throw dtype mismatch exception.
    # TODO(b/173251596): Once there is a more general solution to warn against
    # int overflow conversions, revisit this check.
    max_total_size = ops.convert_to_tensor(max_total_size)

    return gen_image_ops.combined_non_max_suppression(
        boxes, scores, max_output_size_per_class, max_total_size, iou_threshold,
        score_threshold, pad_per_class, clip_boxes)

Does anyone know why regular non-max suppression is being done in Python? This seems to imply that the solution will have to come from updating TensorFlow, not keras-cv; let me know if I am wrong about this.
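
For anyone who wants to reproduce the gap outside keras-cv, here is a rough micro-benchmark sketch of the two entry points (batch size, box count, and thresholds are arbitrary; with random boxes the detections are meaningless, this only measures runtime):

import time
import tensorflow as tf

boxes = tf.random.uniform([8, 1000, 4])   # [batch, num_boxes, 4]
scores = tf.random.uniform([8, 1000])     # [batch, num_boxes]

@tf.function
def padded_nms(boxes, scores):
    # Python-graph-op path: dispatches to non_max_suppression_padded_v2.
    return tf.image.non_max_suppression_padded(
        boxes, scores, max_output_size=100,
        iou_threshold=0.5, pad_to_max_output_size=True)

@tf.function
def combined_nms(boxes, scores):
    # C++ kernel path. Combined NMS expects per-class scores
    # [batch, num_boxes, num_classes] and boxes [batch, num_boxes, 1, 4]
    # when the same boxes are shared across classes.
    return tf.image.combined_non_max_suppression(
        boxes[:, :, tf.newaxis, :], scores[:, :, tf.newaxis],
        max_output_size_per_class=100, max_total_size=100,
        iou_threshold=0.5)

for name, fn in [("padded", padded_nms), ("combined", combined_nms)]:
    fn(boxes, scores)  # warm up: trace each graph once
    start = time.perf_counter()
    for _ in range(10):
        fn(boxes, scores)
    print(name, (time.perf_counter() - start) / 10, "s/call")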

@getStRiCtd

And this is still an issue. :)
