Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Warn & replace dataframes with non-unique indexes #691

Merged

Conversation

dagardner-nv
Copy link
Contributor

@dagardner-nv dagardner-nv commented Feb 11, 2023

  • Add has_unique_index & replace_non_unique_index helper methods to MessageMeta
  • Create a bunch of unittests
  • Only use the index for slicing if it is unique, otherwise use a boolean mask
  • The file_type argument of read_file_to_df now has a default value of Auto
  • DeserializeStage checks for non-unique indexes and replaces them if needed.

This comes at a performance cost in that the DeserializeStage needs to acquire the GIL in order to check if the Dataframe has a unique index, impacting users who never run into this issue. We could work around this by providing a no-check argument to the stage, or we could do the check in the constructor of MessageMeta when we already have the GIL and can perform the check quite cheapely.

Fixes #689
Fixes #686
Fixes #687
Fixes #286
Fixes #626
Fixes #393

@dagardner-nv dagardner-nv added bug Something isn't working help wanted Extra attention is needed non-breaking Non-breaking change 2 - In Progress labels Feb 11, 2023
@dagardner-nv dagardner-nv requested a review from a team as a code owner February 11, 2023 00:55
Copy link
Contributor

@mdemoret-nv mdemoret-nv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Change the destructor of MutableTableInfo to LOG(ERROR) instead of LOG(FATAL).

@mdemoret-nv mdemoret-nv force-pushed the david-warn-non-unique-686 branch from 96a2ee0 to d4b8761 Compare March 15, 2023 00:03
@mdemoret-nv mdemoret-nv requested a review from a team as a code owner March 15, 2023 00:03
@mdemoret-nv
Copy link
Contributor

After many additions to this PR, here is the final list of changes.

Breaking Changes

  • Changes to MultiMessage
    • Consistency checks are now performed on creation. This will raise errors for invalid offset and count configurations. Such as:
      • meta cannot be None
      • mess_offset must be in the range [0, meta.count)
      • mess_count must be in the range (0, meta.count - mess_offset]
    • Derived classes must define their own __init__ to enforce keyword-only arguments
    • get_meta and set_meta use .iloc instead of .loc
      • .iloc is faster but less flexible. This may cause some issues with uncommon data types
    • Bounds are enforced on get_slice
      • Before, there was no checking on the start/stop parameters. So you could get a slice that was larger than your current one
      • Now, slicing ensures that start is [0, mess_count) and stop is (start, mess_count]. Errors are thrown otherwise
    • All __init__ parameters must be passed by keyword
      • This allows properly creating the right derived types in get_slice and copy_ranges. Before each derived type needed to implement these individually, leading to multiple conflicting implementations
  • Changes to MultiTensorMessage
    • Consistency checks are now performed on creation. This will raise errors for invalid offset and count configurations. Such as:
      • memory cannot be None
      • offset must be in the range [0, memory.count)
      • count must be in the range (0, memory.count - offset]
      • count must be >= mess_count
    • If a seq_ids tensor is supplied on creation, it must have absolute message IDs instead of relative. This is to standardize whether seq_ids is relative to mess_offset or not. This also guarantees the above checks will not fail due to incorrect seq_ids. This check amounts to:
      • The first element, seq_ids[offset] must be equal to mess_offset
      • The last element, seq_ids[offset + count - 1] must be equal to mess_offset + mess_count
    • Bounds are enforced on get_slice
      • Similar to the bounds check on MultiMessage, this checks count and offset

Additional Changes

  • C++ and Python message classes have significantly improved tests to ensure consistent behavior
    • This was responsible for fixing several issues listed above
    • Tests are also parameterized on cudf/pandas and C++/Python to increase coverage
    • Tests now check non-unique and non-monotonic index types
  • Corrent inheritance has been implemented for all MultiInferenceXXX and MultiResponseXXX messages
  • New Tensor.from_cupy() and Tensor.to_cupy() to test converting between Python/C++ data structures
  • New MultiMessage.from_message() utility class to streamline creating messages from other messages
    • This allows for things like MultiInferenceMessage.from_message(incoming_message, count=10) where all properties from incoming_message will be used to create a new MultiInferenceMessage. Additional supplied keyword arguments, count in the example, will override the values in the from message, incoming_message
    • This should be used as much as possible to set consistent offset and count values
  • All MultiMessage implementations now take default values for mess_offset, mess_count, offset, count to make creation easier

@mdemoret-nv
Copy link
Contributor

/merge

@rapids-bot rapids-bot bot merged commit 7aa6a7f into nv-morpheus:branch-23.03 Mar 17, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment