It is convenient to build a Docker container to set up the environment for development:
$ cd LLM4LOG/docker
$ docker build -t llm4log:latest .
$ cd LLM4LOG
$ docker compose up -d
After the container is up, you can enter the dev environment either via bash or via other dev tools, such as Visual Studio Code.
- bash
$ docker exec -it llm4log bash
- visual studio code
  - choose `Open a Remote Window` at the left-bottom corner
  - choose `Attach to Running Container...` from the pop-up list and choose the container `llm4log`
  - choose the project directory `/app`
  - then, you can edit and debug the project source code
Unlike the OpenAI API or other APIs that use an API key, the Gemini API (on Vertex AI) requires the gcloud CLI for authentication.
https://cloud.google.com/vertex-ai/generative-ai/docs/start/quickstarts/quickstart-multimodal
- install gcloud cli
- gcloud init and install components
  gcloud init
  gcloud components update
  gcloud components install beta
- set account and project
  gcloud auth application-default login
  If needed, switch the account and project:
  https://stackoverflow.com/questions/46770900/how-to-change-the-project-in-gcp-using-cli-commands
  gcloud auth list  # account list
  # account 1
  # account 2
  gcloud config set account `ACCOUNT`
  gcloud projects list
  # project 1
  # project 2
  gcloud config set project `PROJECT_ID`
  Using the API requires specifying the project in the Google Cloud account.
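Once `gcloud auth application-default login` has run, a quick way to confirm that Application Default Credentials are picked up is a short Python check like the one below. This is only a sketch; it assumes the google-auth package is available (it is installed together with google-cloud-aiplatform).

```python
# Quick sanity check that Application Default Credentials (ADC) are set up.
# Assumes the google-auth package is installed (ships with google-cloud-aiplatform).
import google.auth

credentials, project_id = google.auth.default()
print("ADC loaded, active project:", project_id)
```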
Due to network problems, the Google service is sometimes unstable; run test.ipynb to test the connection to the Vertex AI API (the default model is Gemini 1.5 Flash).
If the connection is fine but the experiment gets no response, the request may be limited by the quota (filter for the model gemini-1.5-pro).
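As a rough sketch of what such a connectivity test looks like with the Vertex AI SDK (the project ID and location below are copied from the config example further down; adjust them to your own project):

```python
# Minimal Vertex AI connectivity check (rough sketch of what test.ipynb does).
import vertexai
from vertexai.generative_models import GenerativeModel

# PROJECT_ID and LOCATION taken from the config example below; adjust as needed.
vertexai.init(project="log-analysis-433902", location="asia-east1")

model = GenerativeModel("gemini-1.5-flash")  # default test model
response = model.generate_content("Reply with the single word: pong")
print(response.text)  # if this prints, the Vertex AI connection works
```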
Here is the project structure:
|-- config
|-- data
|-- docker
|-- llm4dos
|-- llm4extract
|-- comparision
|-- extraction
|-- llms
|-- .gitignore
|-- docker-compose.yml
|-- readme.md

The config file (in the config folder) looks like:

dataset:
  log_file: "/data/Linux.txt"
project:
  PROJECT_ID : "log-analysis-433902"
  # LOCATION : "us-central1"
  LOCATION : "asia-east1"
inference:
  save_path: "result"
  chunk_size: 128*1024
test:
  llm_result_file: "/result/result_2024-09-03_14-46-46_192k file1_counts/inference_output/output0.json"
  re_result_file: "/data/splited data/12-192x1024/new_result_human_Linux_1.json"
  human_result_file: "/data/splited data/12-192x1024/new_result_human_Linux_1.json"
comment: "192k file1 count"
gemini:
  MODEL_ID : "gemini-1.5-pro"
  LOCATION : "asia-east1"
  temperature: 0.2
  top_p : 0.8
  top_k : 32
  candidate_count : 1
  max_output_tokens : 8192
llama:
claude:
mistral:

Parameter descriptions:

dataset
- log_file: input server log file path
inference
- save_path: save folder path for inference result
- chunk_size: chunk size for splitting the log file, 192k=192*1024
- count: whether to count entity appearances or only extract results with the LLM
- prompt: prompt file path
test
- llm_result_file: inference result file path
- re_result_file: regular expression result file path
- human_result_file: human evaluation result file path
comment: experiment comments, shown in the result file name
model(gemini, llama, claude, mistral):
- PROJECT_ID: google cloud project id
- LOCATION: google cloud location
- MODEL_NAME: model name, gemini-1.5-pro for this project
- temperature: controls the diversity of output tokens; smaller values give less diversity
- top_p (nucleus): the cumulative probability cutoff for token selection. Lower values mean sampling from a smaller, more top-weighted nucleus (0.8 means tokens are chosen from a total probability mass of 0.8)
- top_k: Sample from the k most likely next tokens at each step. Lower k focuses on higher probability tokens. Some models may not need this parameter
- candidate_count: Number of candidates to generate, some models may not need this parameter
- max_output_tokens: The maximum number of output tokens to generate per message, some models may not need this parameter
For the Google Cloud project ID setting, see Google Cloud Vertex AI.
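config.py in llm4extract loads these parameters; as a hedged sketch of how a YAML config like the one above can be read (the actual loader and file path may differ, and note that chunk_size is written as an arithmetic expression such as 128*1024, which has to be evaluated explicitly):

```python
# Sketch of loading the YAML config shown above (the real config.py may differ).
import yaml

def load_config(path="config/config.yaml"):  # hypothetical path
    with open(path) as f:
        cfg = yaml.safe_load(f)
    # chunk_size is written as an expression like 128*1024, so parse it explicitly
    raw = str(cfg["inference"]["chunk_size"])
    chunk_size = 1
    for factor in raw.split("*"):
        chunk_size *= int(factor)
    cfg["inference"]["chunk_size"] = chunk_size
    return cfg

if __name__ == "__main__":
    cfg = load_config()
    print(cfg["gemini"]["MODEL_ID"], cfg["inference"]["chunk_size"])
```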
In the data folder:
├── Linux.txt
├── splited data
├── result-re_Linux.json
├── count_human_evaluation.json
├── count_result-re_Linux.json
├── human_evaluation.json
└── dos_data
The original log file is data/Linux.txt; by token count, the file (1152k = 1.152M tokens in total) is split into chunks.
splited data stores the chunked log files; each folder includes the split log files, the regular expression results, and the human evaluation results.
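As a minimal sketch of chunking a log file by an approximate token budget (tokens are approximated here by whitespace-separated words, which is an assumption; the project may count tokens differently):

```python
# Sketch: split a log file into chunks of roughly `chunk_size` tokens.
# Tokens are approximated by whitespace-separated words; the project
# may use a model tokenizer instead.
def split_log(path, chunk_size=192 * 1024):
    chunks, current, count = [], [], 0
    with open(path, errors="ignore") as f:
        for line in f:
            n = len(line.split())
            if count + n > chunk_size and current:
                chunks.append("".join(current))
                current, count = [], 0
            current.append(line)
            count += n
    if current:
        chunks.append("".join(current))
    return chunks

# chunks = split_log("data/Linux.txt")
```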
count_human_evaluation.json and count_result-re_Linux.json are the standard results: they are manually extracted and counted with regular expressions.
human_evaluation.json and result-re_Linux.json are the standard results: they are manually extracted.
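result-re_Linux.json is produced by regular-expression matching; a rough sketch of the kind of patterns involved (the exact expressions used to build the reference files may differ):

```python
# Sketch of regex-based IP/URL extraction with appearance counts
# (the exact patterns behind result-re_Linux.json may differ).
import json
import re
from collections import Counter

IP_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
URL_RE = re.compile(r"\bhttps?://[^\s\"'<>]+|\bwww\.[^\s\"'<>]+")

def extract_entities(text):
    ips = Counter(IP_RE.findall(text))
    urls = Counter(URL_RE.findall(text))
    return {"ip": dict(ips), "url": dict(urls)}

# with open("data/Linux.txt", errors="ignore") as f:
#     print(json.dumps(extract_entities(f.read()), indent=2)[:500])
```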
dos_data stores the data for DoS attack event analysis.
- dos_prompt.py: the designed prompt to query the LLM for JSON output of DoS attack event analysis, including an instruction, a reasoning chain, and few-shot (example) in-context learning
- dos_log.py: 4 chunks of logs from Linux.txt for testing LLM outputs following the designed prompt
- dos_baseline.py: the baseline for the 4 chunks of logs in dos_log.py, used to compare the LLM's outputs
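As a hedged illustration of how such a prompt is typically assembled (instruction, reasoning chain, few-shot examples); the actual text and JSON schema in dos_prompt.py differ:

```python
# Sketch of assembling an instruction + reasoning-chain + few-shot prompt
# for DoS event analysis; the real dos_prompt.py content differs.
INSTRUCTION = (
    "You are a log analyst. Identify DoS attack events in the log chunk "
    "and answer in JSON with fields: attacker_ip, target, start_time, evidence."
)
REASONING = (
    "Think step by step: group repeated connection attempts by source IP, "
    "check request rates, then decide whether they indicate a DoS attack."
)
FEW_SHOTS = [
    # hypothetical example pair; real few-shots come from the Linux.txt chunks
    {"log": "<example log lines>", "answer": '{"attacker_ip": "10.0.0.1"}'},
]

def build_dos_prompt(log_chunk):
    shots = "\n\n".join(f"Log:\n{s['log']}\nAnswer:\n{s['answer']}" for s in FEW_SHOTS)
    return f"{INSTRUCTION}\n\n{REASONING}\n\n{shots}\n\nLog:\n{log_chunk}\nAnswer:"
```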
In the docker folder: files needed to build the container for the dev environment.
- Dockerfile: docker build file
- requirements: required Python packages
In the llm4dos folder: files for DoS attack analysis with LLMs.
- dos_analyze.py: compares different models' behaviour on DoS attack analysis based on the designed prompt; the ROUGE metrics ([rouge1, rouge2, rougeL]) of each model will be evaluated (code just finished, not tested yet)
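The ROUGE evaluation can be done with the rouge-score package; a minimal sketch, assuming each model's output and the baseline are plain strings:

```python
# Sketch: ROUGE-1/2/L between a model's DoS analysis and the baseline text.
# Requires the rouge-score package (pip install rouge-score).
from rouge_score import rouge_scorer

def rouge_against_baseline(model_output: str, baseline: str):
    scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
    # returns a dict of Score tuples with precision, recall, and fmeasure
    return scorer.score(baseline, model_output)

# scores = rouge_against_baseline(llm_answer, dos_baseline_text)
# print(scores["rougeL"].fmeasure)
```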
In the llm4extract folder: entity extraction experiments.
The comparision subfolder compares different LLM models' behaviour on entity extraction:
- config.py: loads the config parameters for the different LLM models
- example_prompt.py: the designed prompt for the LLM model to follow, outputting the extracted entities in JSON format
- extraction_result.py: the generated results of the different LLM models on the 1st chunk of Linux.txt via Google Cloud Vertex AI
The comparison results show that Gemini 1.5 Pro currently performs best on entity (IP/URL) extraction; here are the experiment results:
Extraction results on the 1st 128k chunk:

| LLM Model | IP precision | IP recall | URL precision | URL recall |
|---|---|---|---|---|
| gemini_pro_1_5 | 0.96 | 0.94 | 0.94 | 0.85 |
| claude_3_5_sonnet_20240620 | 0.64 | 0.19 | 0.50 | 0.35 |
| claude_3_opus_20240229 | 0.87 | 0.96 | 0.67 | 0.70 |
| llama_3_1_405B | 0.74 | 0.36 | 0.75 | 0.45 |
| mistral_large | 1.00 | 0.53 | 1.00 | 0.10 |
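Precision and recall here are set-based comparisons of the extracted entities against the ground-truth lists; a minimal sketch of the computation, assuming both sides are plain Python lists of IP or URL strings:

```python
# Sketch: precision/recall of extracted entities against the reference set.
def precision_recall(predicted, reference):
    pred, ref = set(predicted), set(reference)
    tp = len(pred & ref)  # entities found by the LLM that are also in the reference
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(ref) if ref else 0.0
    return precision, recall

# p, r = precision_recall(llm_ips, human_ips)
```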
The extraction subfolder covers extraction of entities (IP/URL) and their appearance counts by Gemini 1.5 Pro; please find the detailed description in the extraction readme.
In the llms folder:
- infer.py: an encapsulated interface llm_infer to call a specific LLM model for inference
- claude.py, gemimi.py, llama.py, mistral.py: call the different LLMs via the Google Cloud Vertex AI platform to do the inference
- mistral_small: calls mistralai/Mistral-Small-Instruct-2409 from Hugging Face to do the inference; a Hugging Face access token is required for the 1st run to cache the model to local disk, which will be located in /root/.cache/huggingface
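For mistral_small, a rough sketch of the local inference path with the transformers library (the exact generation settings in the project may differ; the Hugging Face token must be configured beforehand, e.g. via huggingface-cli login, and accelerate is needed for device_map="auto"):

```python
# Sketch: local inference with mistralai/Mistral-Small-Instruct-2409 via transformers.
# The first run downloads the weights to /root/.cache/huggingface (requires an HF token).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "mistralai/Mistral-Small-Instruct-2409"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.bfloat16, device_map="auto"
)

def llm_infer_local(prompt: str, max_new_tokens: int = 1024) -> str:
    """Hypothetical helper mirroring the llm_infer interface for the local model."""
    messages = [{"role": "user", "content": prompt}]
    inputs = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    output = model.generate(inputs, max_new_tokens=max_new_tokens)
    # strip the prompt tokens and decode only the newly generated text
    return tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True)
```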
docker-compose.yml is the Docker Compose file to launch the llm4log container as the dev environment.
- volumes: directory mappings from the local disk into the container
  - /app: the directory for the project files in the container
  - /root/.cache/huggingface: the directory for caching Hugging Face models in the container
- deploy: settings to utilize local GPU resources; comment it out if there is no GPU resource.
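To verify from inside the container that the GPU mapping from the deploy section works, a quick check like the following can be used (it assumes PyTorch is installed in the image):

```python
# Quick in-container check that the GPU is visible (assumes PyTorch is installed).
import torch

if torch.cuda.is_available():
    print("GPU available:", torch.cuda.get_device_name(0))
else:
    print("No GPU visible; comment out the deploy section in docker-compose.yml")
```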