Warn & replace dataframes with non-unique indexes #691

dagardner-nv · 2023-02-11T00:55:10Z

- Add has_unique_index & replace_non_unique_index helper methods to MessageMeta
- Add a number of unit tests
- Only use the index for slicing if it is unique; otherwise use a boolean mask
- The file_type argument of read_file_to_df now defaults to Auto
- DeserializeStage checks for non-unique indexes and replaces them if needed
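
To make the index handling above concrete, here is a minimal pandas-only sketch. The function names mirror the helpers listed above, but the bodies are assumptions written for illustration and are not the Morpheus implementation; the `_index_` column name and the `slice_rows` helper are hypothetical.

```python
import numpy as np
import pandas as pd


def has_unique_index(df: pd.DataFrame) -> bool:
    """Return True when every index label appears exactly once."""
    return bool(df.index.is_unique)


def replace_non_unique_index(df: pd.DataFrame, old_index_col: str = "_index_") -> pd.DataFrame:
    """Keep the old labels as a column and install a fresh 0..n-1 index.

    The `_index_` column name is a placeholder chosen for this sketch.
    """
    out = df.copy()
    out.insert(0, old_index_col, out.index.to_numpy())
    return out.reset_index(drop=True)


def slice_rows(df: pd.DataFrame, start: int, stop: int) -> pd.DataFrame:
    """Return the rows at positions [start, stop)."""
    if has_unique_index(df):
        # Label-based lookup is unambiguous only when labels are unique.
        return df.loc[df.index[start:stop]]
    # Otherwise select by position with a boolean mask.
    mask = np.zeros(len(df), dtype=bool)
    mask[start:stop] = True
    return df[mask]


df = pd.DataFrame({"v": [10, 20, 30, 40]}, index=[0, 1, 1, 2])
ambiguous = df.loc[[1]]        # label 1 matches two rows
sliced = slice_rows(df, 1, 3)  # positions 1 and 2 only
fixed = replace_non_unique_index(df)
```

With the duplicated label `1`, a label lookup returns both matching rows, which is why the sketch falls back to a positional mask whenever the index is not unique.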

This comes at a performance cost: the DeserializeStage needs to acquire the GIL to check whether the DataFrame has a unique index, which impacts users who never run into this issue. We could work around this by providing a no-check argument to the stage, or we could do the check in the constructor of MessageMeta, where we already hold the GIL and can perform the check quite cheaply.
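
The second option could look roughly like the sketch below, which records the result once at construction time so later stages only read a cached flag. `MetaSketch` is a hypothetical stand-in, not the actual MessageMeta class, and the caching strategy is an assumption.

```python
import pandas as pd


class MetaSketch:
    """Hypothetical stand-in for MessageMeta used only for illustration."""

    def __init__(self, df: pd.DataFrame):
        self._df = df
        # Computed once, at a point where Python code is already running.
        self._unique_index = bool(df.index.is_unique)

    def has_unique_index(self) -> bool:
        # Cheap lookup of the cached flag; the DataFrame is not inspected again.
        return self._unique_index


meta = MetaSketch(pd.DataFrame({"a": [1, 2]}, index=[5, 5]))
```

One trade-off of this approach is that the cached flag becomes stale if the DataFrame's index is changed after construction, so any code path that mutates the index would need to refresh it.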

Fixes #689
Fixes #686
Fixes #687
Fixes #286
Fixes #626
Fixes #393

mdemoret-nv

Change the destructor of MutableTableInfo to LOG(ERROR) instead of LOG(FATAL).

morpheus/_lib/include/morpheus/messages/meta.hpp

morpheus/_lib/include/morpheus/utilities/cudf_util.hpp

morpheus/_lib/src/messages/multi.cpp

morpheus/_lib/src/messages/multi_tensor.cpp

morpheus/_lib/src/objects/table_info.cpp

morpheus/_lib/src/messages/multi_tensor.cpp

morpheus/_lib/include/morpheus/objects/tensor_object.hpp

mdemoret-nv · 2023-03-17T16:58:08Z

After many additions to this PR, here is the final list of changes.

Breaking Changes

Changes to MultiMessage
- Consistency checks are now performed on creation. Invalid offset and count configurations raise errors. The following constraints are enforced:
  - meta cannot be None
  - mess_offset must be in the range [0, meta.count)
  - mess_count must be in the range (0, meta.count - mess_offset]
- Derived classes must define their own __init__ to enforce keyword-only arguments
  - More info here
- get_meta and set_meta use .iloc instead of .loc
  - .iloc is faster but less flexible. This may cause some issues with uncommon data types
- Bounds are enforced on get_slice
  - Previously, the start/stop parameters were not checked, so you could get a slice that was larger than the current message
  - Now, slicing ensures that start is in [0, mess_count) and stop is in (start, mess_count]; otherwise an error is raised
- All __init__ parameters must be passed by keyword
  - This allows get_slice and copy_ranges to create the correct derived type. Previously, each derived type needed to implement these individually, leading to multiple conflicting implementations
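
The MultiMessage rules listed above can be sketched as follows. `MultiMessageSketch` is a hypothetical, simplified class that uses a plain pandas DataFrame for `meta`; the `-1` sentinel for "all remaining rows" and the exception types are assumptions made for this example.

```python
import pandas as pd


class MultiMessageSketch:
    """Hypothetical, simplified stand-in for MultiMessage."""

    def __init__(self, *, meta: pd.DataFrame, mess_offset: int = 0, mess_count: int = -1):
        # The bare `*` forces every argument to be passed by keyword.
        if meta is None:
            raise ValueError("meta cannot be None")
        total = len(meta)
        if not 0 <= mess_offset < total:
            raise ValueError("mess_offset must be in [0, meta.count)")
        if mess_count == -1:
            # Assumed sentinel: use every row from mess_offset onward.
            mess_count = total - mess_offset
        if not 0 < mess_count <= total - mess_offset:
            raise ValueError("mess_count must be in (0, meta.count - mess_offset]")
        self.meta = meta
        self.mess_offset = mess_offset
        self.mess_count = mess_count

    def get_meta(self) -> pd.DataFrame:
        # Positional selection, so duplicate index labels cannot add rows.
        return self.meta.iloc[self.mess_offset:self.mess_offset + self.mess_count]

    def get_slice(self, start: int, stop: int) -> "MultiMessageSketch":
        if not 0 <= start < self.mess_count:
            raise IndexError("start must be in [0, mess_count)")
        if not start < stop <= self.mess_count:
            raise IndexError("stop must be in (start, mess_count]")
        # type(self) preserves the derived type because all arguments are keywords.
        return type(self)(meta=self.meta,
                          mess_offset=self.mess_offset + start,
                          mess_count=stop - start)


df = pd.DataFrame({"v": range(10)})
msg = MultiMessageSketch(meta=df, mess_offset=2, mess_count=5)
sub = msg.get_slice(1, 3)
```

Here `sub` covers rows 3 and 4 of the underlying frame, while `msg.get_slice(0, 6)` raises because the requested stop exceeds `mess_count`.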

Changes to MultiTensorMessage
- Consistency checks are now performed on creation. Invalid offset and count configurations raise errors. The following constraints are enforced:
  - memory cannot be None
  - offset must be in the range [0, memory.count)
  - count must be in the range (0, memory.count - offset]
  - count must be >= mess_count
- If a seq_ids tensor is supplied on creation, it must contain absolute message IDs rather than relative ones. This standardizes whether seq_ids is relative to mess_offset or not, and also guarantees the above checks will not fail due to incorrect seq_ids. This check amounts to:
  - The first element, seq_ids[offset], must be equal to mess_offset
  - The last element, seq_ids[offset + count - 1], must be equal to mess_offset + mess_count - 1
- Bounds are enforced on get_slice
  - Similar to the bounds check on MultiMessage, this checks count and offset
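
The offset and count constraints for MultiTensorMessage can be written as a small standalone check. This is only a sketch: `check_tensor_bounds` is a hypothetical helper, `memory_count` stands in for memory.count, and the use of ValueError is an assumption.

```python
def check_tensor_bounds(*, memory_count: int, offset: int, count: int, mess_count: int) -> None:
    """Raise ValueError if the tensor window violates the listed constraints."""
    if not 0 <= offset < memory_count:
        raise ValueError("offset must be in [0, memory.count)")
    if not 0 < count <= memory_count - offset:
        raise ValueError("count must be in (0, memory.count - offset]")
    if count < mess_count:
        raise ValueError("count must be >= mess_count")


# A valid window: 6 tensor rows starting at row 2, covering 4 messages.
check_tensor_bounds(memory_count=8, offset=2, count=6, mess_count=4)
```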

Additional Changes

C++ and Python message classes have significantly improved tests to ensure consistent behavior
- This was responsible for fixing several issues listed above
- Tests are also parameterized on cudf/pandas and C++/Python to increase coverage
- Tests now check non-unique and non-monotonic index types
Correct inheritance has been implemented for all MultiInferenceXXX and MultiResponseXXX messages
New Tensor.from_cupy() and Tensor.to_cupy() to test converting between Python/C++ data structures
New MultiMessage.from_message() utility class to streamline creating messages from other messages
- This allows for things like MultiInferenceMessage.from_message(incoming_message, count=10), where all properties from incoming_message are used to create a new MultiInferenceMessage. Any additional keyword arguments (count in the example) override the values taken from the source message, incoming_message
- This should be used as much as possible to set consistent offset and count values
All MultiMessage implementations now take default values for mess_offset, mess_count, offset, count to make creation easier
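
The from_message pattern described above can be sketched with two hypothetical classes. The class names, the list used for `meta`, the string used for `memory`, and the `-1` default sentinels are all assumptions for illustration; only the idea of copying properties from a source message and letting keyword arguments override them comes from the description.

```python
class MultiMessageSketch:
    """Hypothetical base message holding keyword-only properties."""

    def __init__(self, *, meta, mess_offset=0, mess_count=-1):
        self.meta = meta
        self.mess_offset = mess_offset
        # Assumed sentinel: -1 means "all rows from mess_offset onward".
        self.mess_count = len(meta) - mess_offset if mess_count == -1 else mess_count

    @classmethod
    def from_message(cls, message, **overrides):
        # Start from the source message's properties...
        kwargs = {
            "meta": message.meta,
            "mess_offset": message.mess_offset,
            "mess_count": message.mess_count,
        }
        # ...then let explicitly supplied keyword arguments win.
        kwargs.update(overrides)
        return cls(**kwargs)


class MultiInferenceSketch(MultiMessageSketch):
    """Hypothetical derived message with extra tensor-related properties."""

    def __init__(self, *, meta, mess_offset=0, mess_count=-1, memory=None, offset=0, count=-1):
        super().__init__(meta=meta, mess_offset=mess_offset, mess_count=mess_count)
        self.memory = memory
        self.offset = offset
        self.count = count


incoming = MultiMessageSketch(meta=list(range(20)), mess_offset=2, mess_count=4)
derived = MultiInferenceSketch.from_message(incoming, memory="tensors", count=10)
```

Because every constructor argument is keyword-only, the same `from_message` body works for any derived type without each subclass reimplementing it.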

mdemoret-nv · 2023-03-17T18:05:57Z

/merge

dagardner-nv added 16 commits February 9, 2023 14:24

Add unittest for issue nv-morpheus#686

5b90244

wip

1f168a3

wip

f38a07f

Add 'has_unique_index' helper method to MessageMeta

eca479a

Add integration test for desrialization stage, along with test for is…

d824440

…sue 686

Test for has_unique_index method

5bf99bf

Remove parametrize variables not needed for this test

3add91b

First pass at replacing a non-unique index

e3be4cf

Add cpp impl for has_unique_index

e43ac89

wip

e60742c

Move index reset to MutableTableInfo so that the column & index names…

53ee170

… can be updated

use logger.warning instead of logger.warn

3fd0ea3

Update multi-segment test

c651744

Select only the columns in the view when writing json

1d41fd6

Log and ignore include_index_col=false, otherwise cudf will throw an …

f9396be

…exception

wip

fb141c9

dagardner-nv added the bug, help wanted, non-breaking, and 2 - In Progress labels Feb 11, 2023

dagardner-nv requested a review from a team as a code owner February 11, 2023 00:55

dagardner-nv added 2 commits February 10, 2023 17:01

Document work-around

c77360f

Fix casing for cuDF

ef3eb30

mdemoret-nv requested changes Feb 13, 2023

dagardner-nv and others added 6 commits February 13, 2023 11:20

Merge branch 'branch-23.03' into david-warn-non-unique-686

0554785

Change fatal log to an error log

ae7d4af

Only set include_index_col=False when writing CSV

ccc6e6c

Merge branch 'branch-23.03' into david-warn-non-unique-686

d9669e2

Merge branch 'branch-23.03' into david-warn-non-unique-686

bf4d4e6

wip

7acff20

dagardner-nv and others added 6 commits February 27, 2023 12:02

Merge branch 'branch-23.03' into david-warn-non-unique-686

e4976db

Merge branch 'branch-23.03' into david-warn-non-unique-686

0b2d13c

Merge branch 'branch-23.03' into david-warn-non-unique-686

4b99d1e

Adding additional tests to MultiMessage and fixing the bugs it discovers

482fd45

All multi message tests passing

bb54dad

Most tests now passing

d4b8761

mdemoret-nv force-pushed the david-warn-non-unique-686 branch from 96a2ee0 to d4b8761 Compare March 15, 2023 00:03

mdemoret-nv requested a review from a team as a code owner March 15, 2023 00:03

mdemoret-nv added 10 commits March 14, 2023 18:45

Merge branch 'branch-23.03' into david-warn-non-unique-686

c002a93

Removing files that should not have been committed

f4fb726

Removing stub generation

51e4e71

Fixing up post merge failures

76921d3

Large cleanup and added multi tensor tests

65e7edb

Merge branch 'branch-23.03' into david-warn-non-unique-686

b55f50d

Style cleanup

4e92c8b

Merge branch 'branch-23.03' into david-warn-non-unique-686

68ff815

Cleaning up the code

77e2db0

Large cleanup

1ac0c6a

dagardner-nv commented Mar 16, 2023

mdemoret-nv added 5 commits March 16, 2023 20:48

Non-slow tests passing

39beb1f

Large cleanup. All tests passing locally

42a70b9

Merge branch 'branch-23.03' into david-warn-non-unique-686

1cfa57d

Removing stubs from the build in CI

5bf02e9

IWYU fixes

345fa78

mdemoret-nv added 2 commits March 17, 2023 10:59

Final changes to get CI to pass

365f583

Style fixes

1d9fe36

mdemoret-nv approved these changes Mar 17, 2023

rapids-bot bot merged commit 7aa6a7f into nv-morpheus:branch-23.03 Mar 17, 2023
