Split data dir, moving large files into examples/data #130

Merged · 43 commits · Jun 7, 2022
Changes from 1 commit
61680dc
Add/update unittests to check for issue #60
dagardner-nv Apr 27, 2022
f840074
Ensure default path values are no longer relative to the current dir,…
dagardner-nv Apr 27, 2022
edc75bd
Move simple file reads to a helper function
dagardner-nv May 3, 2022
a7263fc
Merge branch 'branch-22.06' into david-cli-rel-paths
dagardner-nv May 16, 2022
06fb137
WIP
dagardner-nv May 16, 2022
c2c467b
Move data
dagardner-nv May 16, 2022
ce01a4a
Add missing dep for pybind11-stubgen
dagardner-nv May 17, 2022
0b6d959
Don't add deps for pybind11 stub files when we aren't doing an inplac…
dagardner-nv May 17, 2022
827ee41
Add MANIFEST.in to list of installed files
dagardner-nv May 17, 2022
4ef5624
Copy data dir, and files previously set by package_data
dagardner-nv May 17, 2022
c2c5975
Remove package_data, unfortunately the setuptools docs are vague and …
dagardner-nv May 17, 2022
4186357
Remove unused MORPHEUS_ROOT attr
dagardner-nv May 17, 2022
65473c6
Update path in examples for new data location
dagardner-nv May 17, 2022
be44798
Merge branch 'branch-22.06' into david-cli-rel-paths
dagardner-nv May 17, 2022
7ae1e30
Fix import path
dagardner-nv May 17, 2022
329a6a6
Update paths in examples
dagardner-nv May 17, 2022
405b539
Update data path in docs
dagardner-nv May 17, 2022
1c7f421
fix path
dagardner-nv May 17, 2022
c0d5281
Update lfs to reflect data dir move
dagardner-nv May 17, 2022
ce37b33
Remove unneded fea_length
dagardner-nv May 17, 2022
61ebfcf
Style fixes
dagardner-nv May 18, 2022
5a84ff2
Update docs/source/basics/examples.rst
dagardner-nv May 18, 2022
dfdeacc
Merge branch 'branch-22.06' into david-cli-rel-paths
dagardner-nv May 23, 2022
f59dcac
Fixing non-inplace builds install of stub files
mdemoret-nv May 23, 2022
7801803
Move data into previous install command
dagardner-nv May 23, 2022
f398f78
Merge branch 'david-cli-rel-paths' of github.com:dagardner-nv/Morpheu…
dagardner-nv May 23, 2022
798953a
Remove lfs filter for old data location
dagardner-nv May 23, 2022
a94dd62
Merge branch 'branch-22.06' into david-cli-rel-paths
dagardner-nv May 24, 2022
7cfafcf
examples/data/with_data_len.json,examples/data/without_data_len.json:…
dagardner-nv May 27, 2022
950b0d4
Move larger files from morpheus/data into examples/data
dagardner-nv May 27, 2022
4627709
Add new glob path to lfs
dagardner-nv May 27, 2022
0219ae0
Update path in launcher
dagardner-nv May 27, 2022
e32a3c6
Update paths for example data in examples & docs
dagardner-nv May 27, 2022
dac94a9
Add email_with_addresses.jsonlines used in the phishing developer gui…
dagardner-nv May 27, 2022
6448e7f
Merge branch 'branch-22.06' into david-split-data-dir
dagardner-nv May 31, 2022
fc3f06f
Merge branch 'branch-22.06' into david-split-data-dir
dagardner-nv Jun 2, 2022
203c6d6
Remove unused data files
dagardner-nv Jun 3, 2022
e036f7b
Merge branch 'branch-22.06' into david-split-data-dir
dagardner-nv Jun 3, 2022
435c74a
Pin to older neo
dagardner-nv Jun 3, 2022
f13da0b
Merge branch 'david-split-data-dir' of github.com:dagardner-nv/Morphe…
dagardner-nv Jun 3, 2022
5da2d94
Revert "Pin to older neo"
dagardner-nv Jun 6, 2022
d2d35e8
Manually ensure that the build is clean
dagardner-nv Jun 6, 2022
c989b15
Re-source the conda env
dagardner-nv Jun 6, 2022
Merge branch 'branch-22.06' into david-cli-rel-paths
dagardner-nv committed May 24, 2022
commit a94dd6280ad549d9d20076d2bb8c2c4f6ef791ce
25 changes: 15 additions & 10 deletions docs/source/developer_guide/guides/2_real_world_phishing.md
@@ -185,12 +185,13 @@ From this information, we can see that the expected shape of the model inputs is
Let's set up the paths for our input and output files. For simplicity, we assume that the `MORPHEUS_ROOT` environment variable is set to the root of the Morpheus project repository. In a production deployment, it may be more prudent to replace our usage of environment variables with command-line flags or a dedicated configuration management library.

```python
import morpheus

root_dir = os.environ['MORPHEUS_ROOT']
out_dir = os.environ.get('OUT_DIR', '/tmp')

data_dir = os.path.join(root_dir, 'data')
labels_file = os.path.join(data_dir, 'labels_phishing.txt')
vocab_file = os.path.join(data_dir, 'bert-base-uncased-hash.txt')
labels_file = os.path.join(morpheus.DATA_DIR, 'labels_phishing.txt')
vocab_file = os.path.join(morpheus.DATA_DIR, 'bert-base-uncased-hash.txt')

input_file = os.path.join(root_dir, 'examples/data/email.jsonlines')
results_file = os.path.join(out_dir, 'detections.jsonlines')
Expand All @@ -215,7 +216,11 @@ The `feature_length` property needs to match the length of the model inputs, whi

Ground truth classification labels are read from the `morpheus/data/labels_phishing.txt` file included in Morpheus.

Now that our config object is populated we move on to the pipeline itself. We are using the same input file from the previous examples, and to tokenize the input data we add Morpheus' `PreprocessNLPStage` with the `morpheus/data/bert-base-uncased-hash.txt` vocabulary file.
Now that our config object is populated, we move on to the pipeline itself. We will be using the same input file from the previous examples, and to tokenize the input data we will use Morpheus' `PreprocessNLPStage`.

This stage uses the [cudf subword tokenizer](https://docs.rapids.ai/api/cudf/stable/api_docs/api/cudf.core.subword_tokenizer.SubwordTokenizer.__call__.html) to transform strings into a tensor of numbers to be fed into the neural network model. Rather than splitting the strings by characters or whitespace, the tokenizer splits them into meaningful subwords based upon the occurrence of those subwords in a large training corpus. You can find more details here: [https://arxiv.org/abs/1810.04805v2](https://arxiv.org/abs/1810.04805v2). All we need to know for now is that the text will be converted to subword token ids based on the vocabulary file that we provide (`vocab_hash_file=vocab_file`).
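To build intuition for what subword tokenization does, here is a toy, pure-Python sketch of WordPiece-style greedy longest-match splitting. The `wordpiece_tokenize` helper and the tiny vocabulary are illustrative stand-ins only, not the cudf implementation, which is GPU-accelerated and hash-table based:

```python
def wordpiece_tokenize(word: str, vocab: set) -> list:
    """Greedy longest-match-first subword split, WordPiece-style.

    Continuation pieces are prefixed with '##', mirroring BERT vocabularies.
    """
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            sub = word[start:end]
            if start > 0:
                sub = "##" + sub  # mark pieces that continue a word
            if sub in vocab:
                piece = sub
                break
            end -= 1  # shrink the candidate until it matches the vocab
        if piece is None:
            return ["[UNK]"]  # no known subword covers this span
        tokens.append(piece)
        start = end
    return tokens


# Tiny illustrative vocabulary
vocab = {"phish", "##ing", "un", "##aff", "##able"}
print(wordpiece_tokenize("phishing", vocab))   # ['phish', '##ing']
print(wordpiece_tokenize("unaffable", vocab))  # ['un', '##aff', '##able']
```

In Morpheus, the equivalent lookup is driven by the hash table in `bert-base-uncased-hash.txt`, and the stage emits numeric token ids rather than subword strings.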

Let's go ahead and instantiate our `PreprocessNLPStage` and add it to the pipeline:

```python
pipeline.add_stage(
@@ -228,7 +233,7 @@ pipeline.add_stage(
```

In addition to providing the `Config` object that we defined above, we also configure this stage to:
* Use the `data/bert-base-uncased-hash.txt` vocabulary file for its subword token ids (`vocab_hash_file=vocab_file`).
* Use the `morpheus/data/bert-base-uncased-hash.txt` vocabulary file for its subword token ids (`vocab_hash_file=vocab_file`).
* Truncate the length of the text to a max number of tokens (`truncation=True`).
* Change the casing to all lowercase (`do_lower_case=True`).
* Refrain from adding the default BERT special tokens like `[SEP]` for separation between two sentences and `[CLS]` at the start of the text (`add_special_tokens=False`).
Expand All @@ -254,7 +259,7 @@ pipeline.add_stage(MonitorStage(config, description="Inference Rate", smoothing=
pipeline.add_stage(FilterDetectionsStage(config, threshold=0.9))
```

Lastly, we will save our results to disk. For this purpose, we are using two stages that are often used in conjunction with each other: `SerializeStage` and `WriteToFileStage`.

The `SerializeStage` is used to include and exclude columns as desired in the output. Importantly, it also handles conversion from the `MultiMessage`-derived output type that is used by the `FilterDetectionsStage` to the `MessageMeta` class that is expected as input by the `WriteToFileStage`.
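The include/exclude idea behind the `SerializeStage` can be pictured with a small, framework-free sketch. The `select_columns` helper and the regex patterns below are hypothetical illustrations of the concept, not the actual `SerializeStage` API:

```python
import re


def select_columns(columns, include=None, exclude=None):
    """Keep columns matching any include pattern, then drop any matching exclude."""
    # With no include patterns, start from every column.
    selected = [c for c in columns
                if include is None or any(re.fullmatch(p, c) for p in include)]
    if exclude:
        selected = [c for c in selected
                    if not any(re.fullmatch(p, c) for p in exclude)]
    return selected


cols = ["timestamp", "data", "score", "_index_"]
# Keep everything except internal bookkeeping columns such as '_index_'.
print(select_columns(cols, include=[r".*"], exclude=[r"_.*_"]))
# ['timestamp', 'data', 'score']
```

Applying excludes after includes lets a broad include pattern be narrowed by a few targeted exclusions, which is a common shape for this kind of column filtering.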

@@ -281,6 +286,7 @@ import os

import psutil

import morpheus
from morpheus.config import Config
from morpheus.config import PipelineModes
from morpheus.pipeline import LinearPipeline
@@ -304,9 +310,8 @@ def run_pipeline():
root_dir = os.environ['MORPHEUS_ROOT']
out_dir = os.environ.get('OUT_DIR', '/tmp')

data_dir = os.path.join(root_dir, 'data')
labels_file = os.path.join(data_dir, 'labels_phishing.txt')
vocab_file = os.path.join(data_dir, 'bert-base-uncased-hash.txt')
labels_file = os.path.join(morpheus.DATA_DIR, 'labels_phishing.txt')
vocab_file = os.path.join(morpheus.DATA_DIR, 'bert-base-uncased-hash.txt')

input_file = os.path.join(root_dir, 'examples/data/email.jsonlines')
results_file = os.path.join(out_dir, 'detections.jsonlines')
@@ -373,7 +378,7 @@

In our previous examples, we didn't define a constructor for the Python classes that we were building for our stages. However, there are many cases where we will need to receive configuration parameters. Every stage constructor must receive an instance of a `morpheus.config.Config` object as its first argument and is then free to define additional stage-specific arguments after that. The Morpheus config object will contain configuration parameters needed by multiple stages in the pipeline, and the constructor in each Morpheus stage is free to inspect these. In contrast, parameters specific to a single stage are typically defined as constructor arguments.

Note that it is a best practice to perform any necessary validation checks in the constructor. This allows us to fail early rather than after the pipeline has started.
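As a minimal sketch of this fail-early pattern, the snippet below uses stand-in `Config` and `PipelineMode` classes rather than the real `morpheus.config` types, so the idea can be seen in isolation:

```python
from enum import Enum


class PipelineMode(Enum):
    # Stand-in for morpheus.config.PipelineModes
    NLP = "NLP"
    FIL = "FIL"


class Config:
    # Minimal stand-in for morpheus.config.Config
    def __init__(self, mode: PipelineMode):
        self.mode = mode


class RecipientFeaturesStage:
    def __init__(self, config: Config, sep_token: str = "[SEP]"):
        # Validate in the constructor so a misconfigured pipeline
        # fails before any data starts flowing.
        if config.mode != PipelineMode.NLP:
            raise RuntimeError("RecipientFeaturesStage requires an NLP pipeline")
        if not sep_token:
            raise ValueError("sep_token cannot be empty")
        self._sep_token = sep_token


stage = RecipientFeaturesStage(Config(PipelineMode.NLP))  # constructs fine
```

Constructing the stage with a non-NLP config raises immediately, long before any stage's processing method would have run.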

In our `RecipientFeaturesStage` example, we hard-coded the Bert separator token. Let's instead refactor the code to receive that as a constructor argument. Let's also take the opportunity to verify that the pipeline mode is set to `morpheus.config.PipelineModes.NLP`. Our refactored class definition now looks like:
