Description
Version
23.07
Which installation method(s) does this occur on?
Docker
Describe the bug.
Training on a large data set: ~20 million rows across ~400 users (the user column is called "source_ip", so each username is a standard IP address).
According to the log output, a typical model takes 100 ms - 1 s to train, but sometimes there is a delay of ~20 minutes between log entries (see the "Relevant log output" section below, where some typical per-user models are followed, at 22:04:44, by a 16-minute gap). This is not associated with a particularly large subset of the data being modeled (according to the logs).
I'm using an AWS g4dn.4xlarge instance, and it looks like:
- memory is not depleted, swap not used.
- gpu memory is not depleted
- gpu is not working particularly hard (~20-30% utilization)
There are two issues here, I think: first, the delay itself, and second, that the logs "pause" for 15 minutes. If the delay is justified (say it simply takes this much time to train that user), then perhaps more detailed log output, capturing what happens in those 15 minutes, should be considered?
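To illustrate what I mean by "more detailed log output": a minimal sketch, using only the Python standard library, that raises the Morpheus logger to DEBUG and flags long silences between consecutive log records. The logger name "morpheus", and the assumption that the DFP stages emit finer-grained messages at DEBUG level, are guesses on my part.

import logging
import time

class GapFilter(logging.Filter):
    """Annotate each record with the time elapsed since the previous one, if it is long."""

    def __init__(self, threshold_s: float = 60.0):
        super().__init__()
        self._threshold_s = threshold_s
        self._last = time.monotonic()

    def filter(self, record: logging.LogRecord) -> bool:
        now = time.monotonic()
        gap = now - self._last
        self._last = now
        record.gap_note = f" [{gap:.0f}s since previous record]" if gap > self._threshold_s else ""
        return True

# Assumed logger name; assumes DEBUG level yields more detail from the DFP stages.
morpheus_logger = logging.getLogger("morpheus")
morpheus_logger.setLevel(logging.DEBUG)

handler = logging.StreamHandler()
handler.addFilter(GapFilter(threshold_s=60.0))
handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s%(gap_note)s"))
morpheus_logger.addHandler(handler)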
Minimum reproducible example
# This is based on the examples/digital_fingerprinting/production/dfp_duo_training.ipynb
# notebook, run against a big data set. In particular, the pipeline is built and run using
# the following (unchanged example code):
# Create a linear pipeline object
pipeline = LinearPipeline(config)

# Source stage
pipeline.set_source(MultiFileSource(config, filenames=input_files))

# Batch files into buckets by time. Use the default ISO date extractor from the filename
pipeline.add_stage(
    DFPFileBatcherStage(config,
                        period="D",
                        date_conversion_func=functools.partial(date_extractor, filename_regex=iso_date_regex)))

# Output is S3 Buckets. Convert to DataFrames. This caches downloaded S3 data
pipeline.add_stage(
    DFPFileToDataFrameStage(config,
                            schema=source_schema,
                            file_type=FileTypes.JSON,  # originally FileTypes.JSON
                            parser_kwargs={
                                "lines": False, "orient": "records"
                            },
                            cache_dir=cache_dir))

# This will split users or just use one single user
pipeline.add_stage(
    DFPSplitUsersStage(config,
                       include_generic=include_generic,
                       include_individual=include_individual,
                       skip_users=skip_users))

# Next, have a stage that will create rolling windows
pipeline.add_stage(
    DFPRollingWindowStage(
        config,
        min_history=300 if is_training else 1,
        min_increment=300 if is_training else 0,
        # For inference, we only ever want 1 day max
        max_history="60d" if is_training else "1d",
        cache_dir=cache_dir))

# Output is UserMessageMeta -- Cached frame set
pipeline.add_stage(DFPPreprocessingStage(config, input_schema=preprocess_schema))

# Finally, perform training which will output a model
pipeline.add_stage(DFPTraining(config, validation_size=0.10))

# Write that model to MLFlow
pipeline.add_stage(
    DFPMLFlowModelWriterStage(config,
                              model_name_formatter=model_name_formatter,
                              experiment_name_formatter=experiment_name_formatter))

# Run the pipeline
await pipeline.run_async()
Relevant log output
2023/04/28 22:04:43 INFO mlflow.tracking.fluent: Experiment with name 'dfp/tcp-open/training/source-ip-130.129.48.88' does not exist. Creating a new experiment.
Preprocessed 688 data for logs in 2022-10-26 04:08:08+00:00 to 2022-11-01 04:59:48+00:00 in 55.34172058105469 ms
Rolling window complete for 91.229.45.70 in 66.14 ms. Input: 2740 rows from 2022-10-26 10:09:55+00:00 to 2022-10-31 18:22:29+00:00. Output: 2740 rows from 2022-10-26 10:09:55+00:00 to 2022-10-31 18:22:29+00:00
Successfully registered model 'source-ip-130.129.48.88'.
2023/04/28 22:04:44 INFO mlflow.tracking._model_registry.client: Waiting up to 300 seconds for model version to finish creation. Model name: source-ip-130.129.48.88, version 1
ML Flow model upload complete: 130.129.48.88:source-ip-130.129.48.88:1
Training AE model for user: '130.129.48.89'...
Training AE model for user: '130.129.48.89'... Complete.
Training AE model for user: '130.129.48.92'...
Preprocessed 641 data for logs in 2022-10-26 00:47:10+00:00 to 2022-11-01 06:46:04+00:00 in 110.55374145507812 ms
2023/04/28 22:20:53 INFO mlflow.tracking.fluent: Experiment with name 'dfp/tcp-open/training/source-ip-130.129.48.89' does not exist. Creating a new experiment.
Rolling window complete for 91.229.45.75 in 102.77 ms. Input: 3010 rows from 2022-11-01 01:10:38+00:00 to 2022-11-01 01:20:54+00:00. Output: 3010 rows from 2022-11-01 01:10:38+00:00 to 2022-11-01 01:20:54+00:00
Successfully registered model 'source-ip-130.129.48.89'.
2023/04/28 22:20:55 INFO mlflow.tracking._model_registry.client: Waiting up to 300 seconds for model version to finish creation. Model name: source-ip-130.129.48.89, version 1
ML Flow model upload complete: 130.129.48.89:source-ip-130.129.48.89:1
Training AE model for user: '130.129.48.92'... Complete.
Training AE model for user: '130.129.48.93'...
Full env printout
No response
Other/Misc.
Maybe related to #816? But that one was about consistently large training times, I think. Here, some models train fast, but others train slowly.
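For what it's worth, this is roughly how I spotted the gaps: a small scan over the captured log output that reports any long silence between consecutive timestamped lines. It only parses mlflow-style "YYYY/MM/DD HH:MM:SS" prefixes, and the file name pipeline.log is hypothetical.

import re
from datetime import datetime

# Lines starting with an mlflow-style timestamp, e.g. "2023/04/28 22:04:43 INFO ..."
TS_RE = re.compile(r"^(\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2})")

def find_gaps(log_path: str, threshold_s: float = 300.0) -> None:
    """Print every gap longer than threshold_s between consecutive timestamped log lines."""
    prev_ts, prev_line = None, None
    with open(log_path, encoding="utf-8") as fh:
        for line in fh:
            match = TS_RE.match(line)
            if not match:
                continue
            ts = datetime.strptime(match.group(1), "%Y/%m/%d %H:%M:%S")
            if prev_ts is not None and (ts - prev_ts).total_seconds() > threshold_s:
                print(f"{ts - prev_ts} gap after: {prev_line.strip()}")
            prev_ts, prev_line = ts, line

find_gaps("pipeline.log")  # hypothetical path to the captured pipeline log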
Code of Conduct
- I agree to follow Morpheus' Code of Conduct
- I have searched the open bugs and have found no duplicates for this bug report