The primary goal of this example is to showcase the creation of a pipeline that integrates an LLM service with Morpheus. Although this example features a single implementation, the pipeline and its components are versatile and can be adapted to various scenarios with unique requirements. The following highlights different customization points within the pipeline and the specific choices made for this example:
- The pipeline is designed to support any LLM service that adheres to our LLMService interface. Compatible services include OpenAI, NeMo, or even local execution using llama-cpp-python. In this demonstration, we focus on using NeMo as the LLM service, highlighting the advantages it offers over other LLM services and its seamless integration with the NeMo ecosystem. Furthermore, the pipeline can accommodate more complex configurations using NeMo + Inform without requiring changes to the core pipeline (a brief sketch of swapping services behind this interface follows this list).
- Post LLM execution, the model's output can be leveraged for various tasks, including model training, analysis, or simulating an attack. In this particular example, we have simplified the implementation and focused solely on the LLMEngine.
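To make this concrete, the sketch below shows one way a client could be obtained through an LLMService-style abstraction and swapped between backends. It is illustrative only: the module paths, class names, get_client() signature, and model names are assumptions based on recent Morpheus releases and may not match the example's actual code.

```python
# Illustrative only: the import paths, class names, and signatures below are
# assumptions; check the morpheus.llm.services package in your release.
from morpheus.llm.services.nemo_llm_service import NeMoLLMService
from morpheus.llm.services.openai_chat_service import OpenAIChatService


def build_llm_client(llm_service: str = "NemoLLM"):
    """Return an LLM client for the requested backend; the rest of the pipeline is unchanged."""
    if llm_service == "NemoLLM":
        # Assumed to read NGC_API_KEY (and optionally NGC_ORG_ID) from the environment.
        service = NeMoLLMService()
        return service.get_client(model_name="gpt-43b-002")  # hypothetical model name
    if llm_service == "OpenAI":
        # Assumed to read OPENAI_API_KEY from the environment.
        service = OpenAIChatService()
        return service.get_client(model_name="gpt-3.5-turbo")  # hypothetical model name
    raise ValueError(f"Unsupported LLM service: {llm_service}")
```

Because downstream nodes only depend on the client interface, swapping NeMo for OpenAI (or a local llama-cpp-python backend wrapped in the same interface) should not require changes to the core pipeline.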
This example Morpheus pipeline is built using the following components (a condensed sketch of how they fit together follows the list):
- InMemorySourceStage: Manages LLM queries in a DataFrame.
- DeserializationStage: Converts MessageMeta objects into ControlMessages required by the LLMEngine.
- LLMEngineStage: Encompasses the core LLMEngine functionality.
  - An ExtracterNode extracts the questions from the DataFrame.
  - A PromptTemplateNode converts data and a template into the final inputs for the LLM.
  - The LLM executes using an LLMGenerateNode to run the LLM queries.
  - Finally, the responses are incorporated back into the ControlMessage using a SimpleTaskHandler.
- InMemorySinkStage: Stores the results.
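The condensed sketch below shows how these stages and nodes might be wired together. It is illustrative rather than a verbatim copy of the example's code: the import paths, keyword arguments (such as the task payload passed to DeserializeStage), node names, sample data, and prompt template are assumptions modeled on recent Morpheus releases.

```python
import cudf

from morpheus.config import Config
from morpheus.llm import LLMEngine
from morpheus.llm.nodes.extracter_node import ExtracterNode
from morpheus.llm.nodes.llm_generate_node import LLMGenerateNode
from morpheus.llm.nodes.prompt_template_node import PromptTemplateNode
from morpheus.llm.services.nemo_llm_service import NeMoLLMService
from morpheus.llm.task_handlers.simple_task_handler import SimpleTaskHandler
from morpheus.pipeline.linear_pipeline import LinearPipeline
from morpheus.stages.input.in_memory_source_stage import InMemorySourceStage
from morpheus.stages.llm.llm_engine_stage import LLMEngineStage
from morpheus.stages.output.in_memory_sink_stage import InMemorySinkStage
from morpheus.stages.preprocess.deserialize_stage import DeserializeStage

config = Config()

# Questions the pipeline will send to the LLM (hypothetical sample data).
source_df = cudf.DataFrame({"questions": ["What is Morpheus?", "What is NeMo?"]})

# Client for the LLM backend (assumed API; see the service sketch above).
llm_client = NeMoLLMService().get_client(model_name="gpt-43b-002")

# Node graph executed inside the LLMEngineStage.
engine = LLMEngine()
engine.add_node("extracter", node=ExtracterNode())
engine.add_node("prompts",
                inputs=["/extracter"],
                node=PromptTemplateNode(template="Answer this question: {{ questions }}",
                                        template_format="jinja"))
engine.add_node("completion", inputs=["/prompts"], node=LLMGenerateNode(llm_client=llm_client))
engine.add_task_handler(inputs=["/completion"], handler=SimpleTaskHandler())

# Outer pipeline: source -> deserialize -> LLM engine -> in-memory sink.
pipeline = LinearPipeline(config)
pipeline.set_source(InMemorySourceStage(config, dataframes=[source_df]))
pipeline.add_stage(DeserializeStage(config,
                                    task_type="llm_engine",
                                    task_payload={"task_type": "completion",
                                                  "task_dict": {"input_keys": ["questions"]}}))
pipeline.add_stage(LLMEngineStage(config, engine=engine))
sink = pipeline.add_stage(InMemorySinkStage(config))

pipeline.run()
# The generated responses are written back into the messages collected by the sink.
```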
Before running the pipeline, ensure that the NGC_API_KEY environment variable is set.
Install the required dependencies:

```bash
export CUDA_VER=11.8
mamba install -n base -c conda-forge conda-merge
conda run -n base --live-stream conda-merge docker/conda/environments/cuda${CUDA_VER}_dev.yml \
  docker/conda/environments/cuda${CUDA_VER}_examples.yml > .tmp/merged.yml \
  && mamba env update -n ${CONDA_DEFAULT_ENV} --file .tmp/merged.yml
```
For this example, we utilize the NeMo Service within NGC. To gain access, an NGC API key is required. Follow the instructions outlined here to generate your NGC API key.
Configure the following environment variables, with NGC_ORG_ID being optional:
```bash
export NGC_API_KEY=<YOUR_API_KEY>
export NGC_ORG_ID=<YOUR_NGC_ORG_ID>
```
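As a quick sanity check before launching anything, a short snippet along these lines (purely illustrative, not part of the example code) can confirm the credentials are visible to the process:

```python
import os

# Fail fast if the required NGC credential is missing; NGC_ORG_ID remains optional.
if "NGC_API_KEY" not in os.environ:
    raise RuntimeError("NGC_API_KEY is not set; export it before running the pipeline.")

print("Using NGC org:", os.environ.get("NGC_ORG_ID", "<account default>"))
```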
The top-level entrypoint to each of the LLM example pipelines is examples/llm/main.py. This script accepts a set of options and a pipeline to run. Baseline options are listed below; for the purposes of this document we'll assume a pipeline option of completion:
```bash
python examples/llm/main.py completion [OPTIONS] COMMAND [ARGS]...
```

The pipeline command accepts the following options:
- --num_threads INTEGER RANGE
  - Description: Number of internal pipeline threads to use.
  - Default: 12
- --pipeline_batch_size INTEGER RANGE
  - Description: Internal batch size for the pipeline. Can be much larger than the model batch size. Also used for Kafka consumers.
  - Default: 1024
- --model_max_batch_size INTEGER RANGE
  - Description: Max batch size to use for the model.
  - Default: 64
- --repeat_count INTEGER RANGE
  - Description: Number of times to repeat the input query. Useful for testing performance.
  - Default: 64
- --llm_service [NemoLLM|OpenAI]
  - Description: LLM service to issue requests to.
  - Default: NemoLLM
- --help
  - Description: Show the help message with options and commands details.
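For context, options such as these are typically copied onto the shared Morpheus Config object that every stage receives. The attribute names below are assumptions based on recent Morpheus releases, shown only to illustrate what the flags control:

```python
from morpheus.config import Config

config = Config()
config.num_threads = 12            # --num_threads: internal pipeline threads
config.pipeline_batch_size = 1024  # --pipeline_batch_size: messages per pipeline batch
config.model_max_batch_size = 64   # --model_max_batch_size: cap on model batch size
# --repeat_count and --llm_service are assumed to be consumed by the example script
# itself (query repetition and service selection) rather than by Config.
```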
For example, to run the completion pipeline against the OpenAI service instead of the default NeMo service:

```bash
python examples/llm/main.py completion pipeline --llm_service OpenAI
```