This repo contains the data generator for LogosDB, which equips small language models with the knowledge and accuracy of a large language model.
LogosDB is:
- Blazingly fast
- Super lightweight
- Simple to set up
- Fully offline

Requirements:
- A working machine with an Nvidia GPU (1080 Ti or better), capable of running a small language model such as Mixtral 8x7B, Llama3 7B, etc.
- 2 GB of free space for the database
- 16 GB of RAM
Phase 1: Data Gen - generate mock data to store in the database
- Clone this repo
- Run `pip install -r requirements.txt`
- Copy `.env.example` to `.env` and fill in the necessary information
- Set up your desired topics in `mock_data.py` (a sketch of how the pieces fit together follows this list)
- Run `python question.py` to generate questions
- At the same time, run `python answer.py` to generate answers
- Check the installed PostgreSQL database for the generated data
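For orientation, the flow is roughly: pick a topic, ask Gemini 1.5 Flash for a question, answer it, and insert the row into PostgreSQL. The snippet below is a minimal, hypothetical sketch of that loop; the `TOPICS` list, the environment variable names, and the `qa_pairs` table are illustrative assumptions and do not necessarily match the actual contents of `mock_data.py`, `question.py`, or `answer.py`.

```python
# Hypothetical sketch of the Phase 1 flow; names and schema are assumptions,
# not the repo's actual scripts.
import os

import google.generativeai as genai
import psycopg2
from dotenv import load_dotenv

load_dotenv()  # reads the .env you copied from .env.example

# Assumed topic list; in this repo the topics live in mock_data.py.
TOPICS = ["linear algebra", "organic chemistry", "world history"]

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])  # assumed env var name
model = genai.GenerativeModel("gemini-1.5-flash")

conn = psycopg2.connect(os.environ["POSTGRES_DSN"])    # assumed env var name
with conn, conn.cursor() as cur:
    # Assumed table layout, for illustration only.
    cur.execute(
        "CREATE TABLE IF NOT EXISTS qa_pairs ("
        "id SERIAL PRIMARY KEY, topic TEXT, question TEXT, answer TEXT)"
    )
    for topic in TOPICS:
        question = model.generate_content(
            f"Write one exam-style question about {topic}."
        ).text
        answer = model.generate_content(
            f"Answer this question concisely:\n{question}"
        ).text
        cur.execute(
            "INSERT INTO qa_pairs (topic, question, answer) VALUES (%s, %s, %s)",
            (topic, question, answer),
        )
conn.close()
```

In the repo itself this loop is split across `question.py` and `answer.py` so that question and answer generation can run at the same time.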
Phase 2: Store Data - store the data across a distributed database cluster with K8s, Helm, Docker, and PostgreSQL
- Create a k8s secret from the `.env` file: `kubectl create secret generic postgres-secret --from-env-file=.env`
- Take a look at `cluster.py` and replace the necessary information (a rough sketch of what such a deploy script can look like follows this list)
- Run `python3 cluster.py` to deploy the database cluster
- Check the status of the database cluster with `kubectl get pods`
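As a point of reference, here is a minimal, hypothetical sketch of a Helm-based deploy script. The chart (`bitnami/postgresql-ha`), release name, and namespace are assumptions for illustration; the real `cluster.py` in this repo may build the cluster differently.

```python
# Hypothetical sketch of a Helm-based deploy script; the real cluster.py
# in this repo may differ.
import subprocess


def run(cmd: list[str]) -> None:
    """Run a shell command, echo it, and raise if it fails."""
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)


# Assumed chart, release, and namespace names, for illustration only.
run(["helm", "repo", "add", "bitnami", "https://charts.bitnami.com/bitnami"])
run(["helm", "repo", "update"])
run([
    "helm", "upgrade", "--install",
    "logosdb-postgres",        # assumed release name
    "bitnami/postgresql-ha",   # assumed chart; provides a replicated Postgres cluster
    "--namespace", "logosdb", "--create-namespace",
    # Credentials and replica count would be passed via --set flags or a values
    # file, referencing the postgres-secret created from .env in the step above.
])
```

Once `kubectl get pods` shows the pods as `Running`, the Phase 1 scripts can point at the cluster's service instead of a local database.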
We use Gemini 1.5 Flash to generate both the questions and the answers for our system. Google enforces several harm categories that the model is supposed to avoid. However, we found that the model's safety filter still flags its own generated text as harmful when that text is fed back in (for example, when a generated question is passed back to produce an answer). We are still investigating the reason behind this.
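When this happens, the blocked request's safety feedback can be inspected and, where appropriate, the block thresholds relaxed. Below is a hedged sketch using the `google-generativeai` Python SDK; whether loosening thresholds is acceptable depends on your content policy, and the categories shown are only a subset.

```python
# Sketch of inspecting safety feedback and relaxing block thresholds with the
# google-generativeai SDK. Changing thresholds is a policy decision; this only
# shows the mechanics.
import os

import google.generativeai as genai
from google.generativeai.types import HarmBlockThreshold, HarmCategory

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])  # assumed env var name

model = genai.GenerativeModel(
    "gemini-1.5-flash",
    safety_settings={
        # Only block clearly high-risk content instead of the default thresholds.
        HarmCategory.HARM_CATEGORY_HARASSMENT: HarmBlockThreshold.BLOCK_ONLY_HIGH,
        HarmCategory.HARM_CATEGORY_DANGEROUS_CONTENT: HarmBlockThreshold.BLOCK_ONLY_HIGH,
    },
)

response = model.generate_content("Answer the previously generated question here.")
try:
    print(response.text)
except ValueError:
    # Generation was blocked; the feedback explains which category fired.
    print("Blocked:", response.prompt_feedback)
```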