
[BUG]: In DFP, inconsistent training time on a large training job #932


Description

Version

23.07

Which installation method(s) does this occur on?

Docker

Describe the bug.

Training on a large data set with ~400 users (the user column is called "source_ip", so each username is a standard IP address) and ~20 million rows.

Noticed that while, per the log output, a typical model takes 100 ms - 1 s to train, sometimes there is a delay of up to ~20 minutes between log entries (see the "Relevant log output" section below, where some typical per-user models are followed at 22:04:44 by a ~16-minute gap). The delay is not associated with a particularly big subset of the data being modeled (according to the logs).

I'm using an AWS g4dn.4xlarge instance, and it looks like (see the monitoring sketch after this list):

  • memory is not depleted, swap is not used
  • GPU memory is not depleted
  • the GPU is not working particularly hard (~20-30% utilization)
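To double-check these numbers, here is a minimal monitoring sketch that samples the same metrics while the pipeline runs (assuming pynvml and psutil are installed; the device index and one-second interval are arbitrary choices):

# Monitoring sketch: run in a separate terminal while the pipeline trains.
# Assumes `pynvml` and `psutil` are installed; device 0 is the only GPU on a g4dn.4xlarge.
import time

import psutil
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

try:
    while True:
        mem = psutil.virtual_memory()
        swap = psutil.swap_memory()
        gpu_mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
        gpu_util = pynvml.nvmlDeviceGetUtilizationRates(handle)
        print(f"RAM: {mem.percent}% | swap: {swap.percent}% | "
              f"GPU mem: {gpu_mem.used / gpu_mem.total:.0%} | GPU util: {gpu_util.gpu}%")
        time.sleep(1)
finally:
    pynvml.nvmlShutdown()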

There are two issues here, I think: first, the delay itself; second, that the logs "pause" for ~16 minutes. If the delay is justified (say it simply takes this much time to train this user), then perhaps more detailed log output, capturing what happens during those minutes, should be considered?
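Until then, a cheap diagnostic is Python's built-in faulthandler, which can periodically dump the tracebacks of all threads, so a silent stretch at least shows where each thread is sitting (a sketch; the 300-second interval is an arbitrary choice):

# Diagnostic sketch: dump all thread tracebacks every 5 minutes so the
# "silent" stretches show what the process is doing.
# Add near the top of the training script, before running the pipeline.
import faulthandler
import sys

faulthandler.dump_traceback_later(timeout=300, repeat=True, file=sys.stderr)

# Raising the Morpheus log level may also surface more detail; this mirrors
# the call used in the production DFP examples (an assumption, not verified
# against 23.07):
# import logging
# from morpheus.utils.logger import configure_logging
# configure_logging(log_level=logging.DEBUG)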

Minimum reproducible example

# This is based on examples/digital_fingerprinting/production/dfp_duo_training.ipynb,
# running with a big data set. In particular the pipeline is built and run using the
# following (unchanged example code; imports and variables such as input_files,
# source_schema and cache_dir come from the notebook).

# Create a linear pipeline object
pipeline = LinearPipeline(config)

# Source stage
pipeline.set_source(MultiFileSource(config, filenames=input_files))

# Batch files into buckets by time. Use the default ISO date extractor from the filename
pipeline.add_stage(
    DFPFileBatcherStage(config,
                        period="D",
                        date_conversion_func=functools.partial(date_extractor, filename_regex=iso_date_regex)))

# Output is S3 Buckets. Convert to DataFrames. This caches downloaded S3 data
pipeline.add_stage(
    DFPFileToDataFrameStage(config,
                            schema=source_schema,
                            file_type=FileTypes.JSON,
                            parser_kwargs={
                                "lines": False, "orient": "records"
                            },
                            cache_dir=cache_dir))


# This will split users or just use one single user
pipeline.add_stage(
    DFPSplitUsersStage(config,
                        include_generic=include_generic,
                        include_individual=include_individual,
                        skip_users=skip_users))

# Next, have a stage that will create rolling windows
pipeline.add_stage(
    DFPRollingWindowStage(
        config,
        min_history=300 if is_training else 1,
        min_increment=300 if is_training else 0,
        # For inference, we only ever want 1 day max
        max_history="60d" if is_training else "1d",
        cache_dir=cache_dir))

# Output is UserMessageMeta -- Cached frame set
pipeline.add_stage(DFPPreprocessingStage(config, input_schema=preprocess_schema))

# Finally, perform training which will output a model
pipeline.add_stage(DFPTraining(config, validation_size=0.10))

# Write that model to MLFlow
pipeline.add_stage(
    DFPMLFlowModelWriterStage(config,
                              model_name_formatter=model_name_formatter,
                              experiment_name_formatter=experiment_name_formatter))

# Run the pipeline
await pipeline.run_async()
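To narrow down which stage owns the gap, one option would be to insert throughput counters between the stages. A sketch, assuming MonitorStage accepts the same description/smoothing arguments it takes in other Morpheus examples:

# Sketch: per-stage throughput counters, so the progress output reveals
# which stage stalls during the ~16-minute gaps. MonitorStage usage is
# assumed to match the other Morpheus examples.
from morpheus.stages.general.monitor_stage import MonitorStage

pipeline.add_stage(DFPPreprocessingStage(config, input_schema=preprocess_schema))
pipeline.add_stage(MonitorStage(config, description="Preprocessing rate", smoothing=0.001))

pipeline.add_stage(DFPTraining(config, validation_size=0.10))
pipeline.add_stage(MonitorStage(config, description="Training rate", smoothing=0.001))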

Relevant log output

2023/04/28 22:04:43 INFO mlflow.tracking.fluent: Experiment with name 'dfp/tcp-open/training/source-ip-130.129.48.88' does not exist. Creating a new experiment.
Preprocessed 688 data for logs in 2022-10-26 04:08:08+00:00 to 2022-11-01 04:59:48+00:00 in 55.34172058105469 ms
Rolling window complete for 91.229.45.70 in 66.14 ms. Input: 2740 rows from 2022-10-26 10:09:55+00:00 to 2022-10-31 18:22:29+00:00. Output: 2740 rows from 2022-10-26 10:09:55+00:00 to 2022-10-31 18:22:29+00:00
Successfully registered model 'source-ip-130.129.48.88'.
2023/04/28 22:04:44 INFO mlflow.tracking._model_registry.client: Waiting up to 300 seconds for model version to finish creation.                     Model name: source-ip-130.129.48.88, version 1
ML Flow model upload complete: 130.129.48.88:source-ip-130.129.48.88:1
Training AE model for user: '130.129.48.89'...
Training AE model for user: '130.129.48.89'... Complete.
Training AE model for user: '130.129.48.92'...
Preprocessed 641 data for logs in 2022-10-26 00:47:10+00:00 to 2022-11-01 06:46:04+00:00 in 110.55374145507812 ms
2023/04/28 22:20:53 INFO mlflow.tracking.fluent: Experiment with name 'dfp/tcp-open/training/source-ip-130.129.48.89' does not exist. Creating a new experiment.
Rolling window complete for 91.229.45.75 in 102.77 ms. Input: 3010 rows from 2022-11-01 01:10:38+00:00 to 2022-11-01 01:20:54+00:00. Output: 3010 rows from 2022-11-01 01:10:38+00:00 to 2022-11-01 01:20:54+00:00
Successfully registered model 'source-ip-130.129.48.89'.
2023/04/28 22:20:55 INFO mlflow.tracking._model_registry.client: Waiting up to 300 seconds for model version to finish creation.                     Model name: source-ip-130.129.48.89, version 1
ML Flow model upload complete: 130.129.48.89:source-ip-130.129.48.89:1
Training AE model for user: '130.129.48.92'... Complete.
Training AE model for user: '130.129.48.93'...
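For what it's worth, the gaps above were spotted by eye; a small script like this one (a sketch; the regex only matches the mlflow-style timestamped lines shown above) can scan a full log for pauses:

# Sketch: scan a pipeline log for gaps between timestamped lines.
# Only lines with a "YYYY/MM/DD HH:MM:SS" prefix (the mlflow format above) are considered.
import re
import sys
from datetime import datetime

TS = re.compile(r"^(\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2})")

prev = None
for line in open(sys.argv[1]):
    m = TS.match(line)
    if not m:
        continue
    ts = datetime.strptime(m.group(1), "%Y/%m/%d %H:%M:%S")
    if prev is not None and (ts - prev).total_seconds() > 300:  # flag gaps over 5 minutes
        print(f"{(ts - prev).total_seconds() / 60:.1f} min gap ending at {ts}")
    prev = ts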

Full env printout

No response

Other/Misc.

Maybe related to #816? But that one was about consistently large training times, I think; here, some models train fast while others train slowly.

Code of Conduct

  • I agree to follow Morpheus' Code of Conduct
  • I have searched the open bugs and have found no duplicates for this bug report