Description
Version
23.07
Which installation method(s) does this occur on?
Docker
Describe the bug.
Training on a large data set: ~20 million rows across ~400 users (the user column is called "source_ip", so each username is a standard IP address).
According to the log output, a typical model takes 100 ms - 1 s to train, but sometimes there is a delay of ~20 minutes between log entries (see the "Relevant log output" section below, where some typical per-user models are followed, at 22:04:44, by a 16-minute gap). This is not associated with a particularly large subset of the data being modeled (according to the logs).
I'm using an AWS g4dn.4xlarge instance, and it looks like:
- memory is not depleted, swap not used.
- gpu memory is not depleted
- gpu is not working particularly hard (~20-30% utilization)
There are two issues here, I think: first, the delay itself, and second, that the logs "pause" for 15 minutes. If the delay is justified (say it simply takes this much time to train that user), then perhaps more detailed log output, capturing what happens in those 15 minutes, should be considered?
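To illustrate what I mean by "more detailed log output": a minimal sketch, using only the Python standard library, that raises the Morpheus logger to DEBUG and flags long silences between consecutive log records. The logger name "morpheus", and the assumption that the DFP stages emit finer-grained messages at DEBUG level, are guesses on my part.

import logging
import time

class GapFilter(logging.Filter):
    """Annotate each record with the time elapsed since the previous one, if it is long."""

    def __init__(self, threshold_s: float = 60.0):
        super().__init__()
        self._threshold_s = threshold_s
        self._last = time.monotonic()

    def filter(self, record: logging.LogRecord) -> bool:
        now = time.monotonic()
        gap = now - self._last
        self._last = now
        record.gap_note = f" [{gap:.0f}s since previous record]" if gap > self._threshold_s else ""
        return True

# Assumed logger name; assumes DEBUG level yields more detail from the DFP stages.
morpheus_logger = logging.getLogger("morpheus")
morpheus_logger.setLevel(logging.DEBUG)

handler = logging.StreamHandler()
handler.addFilter(GapFilter(threshold_s=60.0))
handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s%(gap_note)s"))
morpheus_logger.addHandler(handler)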
Minimum reproducible example
# This is based on the examples/digital_fingerprinting/production/dfp_duo_training.ipynb
# notebook, run against a big data set. In particular, the pipeline is built and run using
# the following (unchanged example code):
# Create a linear pipeline object
pipeline = LinearPipeline(config)

# Source stage
pipeline.set_source(MultiFileSource(config, filenames=input_files))

# Batch files into buckets by time. Use the default ISO date extractor from the filename
pipeline.add_stage(
    DFPFileBatcherStage(config,
                        period="D",
                        date_conversion_func=functools.partial(date_extractor, filename_regex=iso_date_regex)))

# Output is S3 Buckets. Convert to DataFrames. This caches downloaded S3 data
pipeline.add_stage(
    DFPFileToDataFrameStage(config,
                            schema=source_schema,
                            file_type=FileTypes.JSON,  # originally FileTypes.JSON
                            parser_kwargs={
                                "lines": False, "orient": "records"
                            },
                            cache_dir=cache_dir))

# This will split users or just use one single user
pipeline.add_stage(
    DFPSplitUsersStage(config,
                       include_generic=include_generic,
                       include_individual=include_individual,
                       skip_users=skip_users))

# Next, have a stage that will create rolling windows
pipeline.add_stage(
    DFPRollingWindowStage(
        config,
        min_history=300 if is_training else 1,
        min_increment=300 if is_training else 0,
        # For inference, we only ever want 1 day max
        max_history="60d" if is_training else "1d",
        cache_dir=cache_dir))

# Output is UserMessageMeta -- Cached frame set
pipeline.add_stage(DFPPreprocessingStage(config, input_schema=preprocess_schema))

# Finally, perform training which will output a model
pipeline.add_stage(DFPTraining(config, validation_size=0.10))

# Write that model to MLFlow
pipeline.add_stage(
    DFPMLFlowModelWriterStage(config,
                              model_name_formatter=model_name_formatter,
                              experiment_name_formatter=experiment_name_formatter))

# Run the pipeline
await pipeline.run_async()
Relevant log output
2023/04/28 22:04:43 INFO mlflow.tracking.fluent: Experiment with name 'dfp/tcp-open/training/source-ip-130.129.48.88' does not exist. Creating a new experiment.
Preprocessed 688 data for logs in 2022-10-26 04:08:08+00:00 to 2022-11-01 04:59:48+00:00 in 55.34172058105469 ms
Rolling window complete for 91.229.45.70 in 66.14 ms. Input: 2740 rows from 2022-10-26 10:09:55+00:00 to 2022-10-31 18:22:29+00:00. Output: 2740 rows from 2022-10-26 10:09:55+00:00 to 2022-10-31 18:22:29+00:00
Successfully registered model 'source-ip-130.129.48.88'.
2023/04/28 22:04:44 INFO mlflow.tracking._model_registry.client: Waiting up to 300 seconds for model version to finish creation. Model name: source-ip-130.129.48.88, version 1
ML Flow model upload complete: 130.129.48.88:source-ip-130.129.48.88:1
Training AE model for user: '130.129.48.89'...
Training AE model for user: '130.129.48.89'... Complete.
Training AE model for user: '130.129.48.92'...
Preprocessed 641 data for logs in 2022-10-26 00:47:10+00:00 to 2022-11-01 06:46:04+00:00 in 110.55374145507812 ms
2023/04/28 22:20:53 INFO mlflow.tracking.fluent: Experiment with name 'dfp/tcp-open/training/source-ip-130.129.48.89' does not exist. Creating a new experiment.
Rolling window complete for 91.229.45.75 in 102.77 ms. Input: 3010 rows from 2022-11-01 01:10:38+00:00 to 2022-11-01 01:20:54+00:00. Output: 3010 rows from 2022-11-01 01:10:38+00:00 to 2022-11-01 01:20:54+00:00
Successfully registered model 'source-ip-130.129.48.89'.
2023/04/28 22:20:55 INFO mlflow.tracking._model_registry.client: Waiting up to 300 seconds for model version to finish creation. Model name: source-ip-130.129.48.89, version 1
ML Flow model upload complete: 130.129.48.89:source-ip-130.129.48.89:1
Training AE model for user: '130.129.48.92'... Complete.
Training AE model for user: '130.129.48.93'...
Full env printout
No response
Other/Misc.
Maybe related to #816? But that one was about consistently large training times, I think. Here, some models train fast, but others train slowly.
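For what it's worth, this is roughly how I spotted the gaps: a small scan over the captured log output that reports any long silence between consecutive timestamped lines. It only parses mlflow-style "YYYY/MM/DD HH:MM:SS" prefixes, and the file name pipeline.log is hypothetical.

import re
from datetime import datetime

# Lines starting with an mlflow-style timestamp, e.g. "2023/04/28 22:04:43 INFO ..."
TS_RE = re.compile(r"^(\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2})")

def find_gaps(log_path: str, threshold_s: float = 300.0) -> None:
    """Print every gap longer than threshold_s between consecutive timestamped log lines."""
    prev_ts, prev_line = None, None
    with open(log_path, encoding="utf-8") as fh:
        for line in fh:
            match = TS_RE.match(line)
            if not match:
                continue
            ts = datetime.strptime(match.group(1), "%Y/%m/%d %H:%M:%S")
            if prev_ts is not None and (ts - prev_ts).total_seconds() > threshold_s:
                print(f"{ts - prev_ts} gap after: {prev_line.strip()}")
            prev_ts, prev_line = ts, line

find_gaps("pipeline.log")  # hypothetical path to the captured pipeline log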
Code of Conduct
- I agree to follow Morpheus' Code of Conduct
- I have searched the open bugs and have found no duplicates for this bug report