Welcome to the Bemba Large Language Model (LLM) project! This initiative builds and fine-tunes an open-source language model for the Bemba language using modern AI techniques. The model aims to empower the Zambian community by making AI accessible and linguistically inclusive.
GitHub Repository: Uniplexity AI - Bemba LLM
- Extensive Corpus: The model is trained on a Bemba corpus containing thousands of sentences, ensuring rich linguistic diversity.
- Google Colab Notebook: Accessible and easy-to-use training scripts are available in Google Colab for anyone to contribute or experiment.
- Integration with wandb: Training progress and metrics are tracked using Weights & Biases, making collaboration and performance monitoring seamless (see the logging sketch after this list).
- Open Source: Contributions from developers, linguists, and AI enthusiasts are highly encouraged!
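As a quick illustration of the wandb integration mentioned above, here is a minimal sketch of manual metric logging; the project name and logged values are illustrative placeholders, not this project's actual configuration:

```python
import wandb

# Start a tracked run (project name is a hypothetical example)
run = wandb.init(project="bemba-llm")

for step in range(100):
    loss = 1.0 / (step + 1)  # placeholder metric for demonstration
    wandb.log({"train/loss": loss})  # streams the metric to the wandb dashboard

run.finish()
```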
```bash
git clone <repository-url>
cd <repository-directory>
```

Ensure the `lora_model` directory contains your fine-tuned model and tokenizer files.

Install the required dependencies:

```bash
pip install torch transformers wandb
```

To generate text in Bemba, use the example script below:
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the fine-tuned model and tokenizer from the local LoRA directory
model_name = "./lora_model"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# GPT-style tokenizers ship without a pad token, so reuse the EOS token
tokenizer.pad_token_id = tokenizer.eos_token_id
model.eval()

# Input text in Bemba (roughly: "in the beginning God created the world")
input_text = "ukutendeka lesa ali pangile isonde"
input_ids = tokenizer.encode(input_text, return_tensors='pt')

# Generate text
with torch.no_grad():
    output = model.generate(
        input_ids,
        max_length=50,
        num_return_sequences=1,
        do_sample=True,
        top_k=50,
        top_p=0.95,
        temperature=0.7,
        pad_token_id=tokenizer.eos_token_id,
    )

# Decode and print the generated text
generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
print("Generated Text:", generated_text)
```

Example output for the input `ukutendeka lesa ali pangile isonde`:

```
Generated Text: Ukutendeka lesa ali pangile isonde … (Generated continuation)
```
- Preserves and promotes the use of Bemba in digital spaces, ensuring linguistic heritage remains accessible in the modern age.
- Facilitates language learning by providing AI-generated stories, exercises, and other educational content.
- Empowers businesses and media outlets to create content in Bemba, fostering deeper connections with audiences.
- Supports the generation of localized, accessible communication for public health and administrative purposes.
- Helps writers and creators produce unique content, enriching Zambia's cultural and literary scene.
We welcome your involvement! Here's how you can contribute:
- Extend the Dataset: Add more sentences to the corpus to improve language coverage.
- Model Optimization: Experiment with fine-tuning and share results via pull requests.
- Documentation: Help enhance tutorials and guides for the community.
- Feedback: Test the model and suggest improvements.
To start, fork the repository: Uniplexity AI - Bemba LLM.
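If you are new to the pull-request workflow, a typical contribution might look like the following; the branch name and commit message are illustrative only:

```bash
# Fork on GitHub first, then clone your fork
git clone <your-fork-url>
cd <repository-directory>

# Work on a feature branch (name is an example)
git checkout -b extend-corpus
git add .
git commit -m "Add new Bemba sentences to the corpus"
git push origin extend-corpus
# Finally, open a pull request against the main repository on GitHub
```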
- Hugging Face Transformers: For model architecture and training.
- Google Colab: Accessible GPU resources for training.
- Weights & Biases: Integrated for tracking experiments and visualizing performance metrics.
- Load the Bemba corpus.
- Preprocess the data for tokenization and batching.
- Fine-tune the model using LoRA (Low-Rank Adaptation) for efficient training.
- Monitor progress and results using wandb.
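Put together, the pipeline above might look like the minimal sketch below. It assumes the Hugging Face `transformers`, `datasets`, and `peft` libraries; the base checkpoint, corpus path, and hyperparameters are illustrative placeholders rather than the project's actual settings:

```python
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

base_model = "gpt2"  # hypothetical base checkpoint
tokenizer = AutoTokenizer.from_pretrained(base_model)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(base_model)

# Wrap the base model with low-rank adapters (LoRA) for efficient training
lora_config = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05,
                         task_type="CAUSAL_LM")
model = get_peft_model(model, lora_config)

# Load and tokenize the Bemba corpus (file path is an example)
dataset = load_dataset("text", data_files={"train": "bemba_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

tokenized = dataset["train"].map(tokenize, batched=True,
                                 remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="lora_model",
                           per_device_train_batch_size=8,
                           num_train_epochs=3,
                           report_to="wandb"),  # stream metrics to wandb
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
trainer.save_model("lora_model")  # adapters land where the inference script expects them
```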
- Extended Token Generation: Enhance the model's ability to generate up to 500 tokens (see the sketch after this list).
- Multilingual Support: Expand to other Zambian languages, promoting inclusivity.
- Improved Dataset Diversity: Add domain-specific text, such as healthcare and education materials.
- AI-Powered Applications: Build tools and apps leveraging this model for local communities.
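As a sketch of the first roadmap item, longer outputs can already be requested from the inference script above by swapping `max_length` for `max_new_tokens` (a standard `transformers` generation argument); whether 500-token continuations stay coherent depends on further fine-tuning:

```python
# Reuses model, tokenizer, and input_ids from the inference script above
import torch

with torch.no_grad():
    output = model.generate(
        input_ids,
        max_new_tokens=500,  # budget counts newly generated tokens only
        do_sample=True,
        top_p=0.95,
        temperature=0.7,
        pad_token_id=tokenizer.eos_token_id,
    )
print(tokenizer.decode(output[0], skip_special_tokens=True))
```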
This project is licensed under the MIT License. See the LICENSE file for details.