
Bemba Language Model (LLM)

Welcome to the Bemba Language Model (LLM) project! This initiative is dedicated to building and fine-tuning an open-source language model for the Bemba language, leveraging cutting-edge AI techniques. The model aims to empower the Zambian community by making AI accessible and linguistically inclusive.

GitHub Repository: https://github.com/Uniplexity-AI/Bemba-LLM-Model


Project Highlights

  • Extensive Corpus: The model is trained on a Bemba corpus of thousands of sentences, giving it broad coverage of the language.
  • Google Colab Notebook: Accessible and easy-to-use training scripts are available in Google Colab for anyone to contribute or experiment.
  • Integration with wandb: Training progress and metrics are tracked using Weights & Biases, making collaboration and performance monitoring seamless.
  • Open Source: Contributions from developers, linguists, and AI enthusiasts are highly encouraged!

Setup

Clone the Repository

git clone https://github.com/Uniplexity-AI/Bemba-LLM-Model.git
cd Bemba-LLM-Model

Install Dependencies

pip install torch transformers wandb

Load Your Fine-Tuned Model

Ensure the lora_model directory contains your fine-tuned model and tokenizer files.
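Note: the Usage script below assumes lora_model holds a full merged model that AutoModelForCausalLM can load directly. If the directory instead contains only LoRA adapter weights, you would first load the base model and attach the adapter with the peft library (pip install peft). A minimal sketch, where "base-model-name" is a placeholder for whichever base checkpoint the adapter was trained from:

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Placeholder: the base checkpoint the LoRA adapter was trained from
base = AutoModelForCausalLM.from_pretrained("base-model-name")
tokenizer = AutoTokenizer.from_pretrained("base-model-name")

# Attach the LoRA adapter weights stored in ./lora_model
model = PeftModel.from_pretrained(base, "./lora_model")

# Optional: fold the adapter into the base weights for plain inference
model = model.merge_and_unload()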

Usage

To generate text in Bemba, use the example script below:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the fine-tuned model and tokenizer from the local directory
model_name = "./lora_model"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Use the end-of-sequence token for padding
tokenizer.pad_token_id = tokenizer.eos_token_id
model.eval()

# Input text in Bemba ("In the beginning God created the world")
input_text = "ukutendeka lesa ali pangile isonde"
inputs = tokenizer(input_text, return_tensors="pt")

# Generate text with top-k and nucleus (top-p) sampling;
# passing the attention mask avoids warnings since pad == eos here
with torch.no_grad():
    output = model.generate(
        **inputs,
        max_length=50,
        num_return_sequences=1,
        do_sample=True,
        top_k=50,
        top_p=0.95,
        temperature=0.7,
        pad_token_id=tokenizer.eos_token_id,
    )

# Decode and print the generated text
generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
print("Generated Text:", generated_text)

Example

Input

ukutendeka lesa ali pangile isonde  

Output

Generated Text: Ukutendeka lesa ali pangile isonde … (Generated continuation)  

Community Impact

1. Cultural Enrichment

Preserves and promotes the use of Bemba in digital spaces, ensuring linguistic heritage remains accessible in the modern age.

2. Enhanced Education

Facilitates language learning by providing AI-generated stories, exercises, and other educational content.

3. Local Business and Media

Empowers businesses and media outlets to create content in Bemba, fostering deeper connections with audiences.

4. Health and Public Services

Supports the generation of localized, accessible communication for public health and administrative purposes.

5. Inspiring Content Creators

Helps writers and creators produce unique content, enriching Zambia's cultural and literary scene.


Contributing

We welcome your involvement! Here's how you can contribute:

  • Extend the Dataset: Add more sentences to the corpus to improve language coverage.
  • Model Optimization: Experiment with fine-tuning and share results via pull requests.
  • Documentation: Help enhance tutorials and guides for the community.
  • Feedback: Test the model and suggest improvements.

To start, fork the repository: https://github.com/Uniplexity-AI/Bemba-LLM-Model.


Fine-Tuning Notes

Tools and Frameworks

  • Hugging Face Transformers: For model architecture and training.
  • Google Colab: Accessible GPU resources for training.
  • Weights & Biases: Integrated for tracking experiments and visualizing performance metrics.

Training Workflow

  1. Load the Bemba corpus.
  2. Preprocess the data for tokenization and batching.
  3. Fine-tune the model using LoRA (Low-Rank Adaptation) for efficient training; a minimal sketch follows this list.
  4. Monitor progress and results using wandb.
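As a rough illustration of steps 2 and 3, the sketch below sets up LoRA fine-tuning with the peft library. The base checkpoint, corpus path, and hyperparameters are placeholders rather than the project's actual configuration, which lives in the Colab notebook (requires pip install peft datasets):

from datasets import load_dataset
from peft import LoraConfig, get_peft_model, TaskType
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

base_model = "gpt2"  # placeholder; substitute the project's base checkpoint
tokenizer = AutoTokenizer.from_pretrained(base_model)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(base_model)

# Wrap the base model with low-rank adapters; only these small matrices are trained
lora_config = LoraConfig(task_type=TaskType.CAUSAL_LM, r=8, lora_alpha=16, lora_dropout=0.05)
model = get_peft_model(model, lora_config)

# Load and tokenize the Bemba corpus (file name is illustrative)
dataset = load_dataset("text", data_files={"train": "bemba_corpus.txt"})
tokenized = dataset["train"].map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True, remove_columns=["text"],
)

# Train and log metrics to Weights & Biases
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="lora_model", num_train_epochs=3,
                           per_device_train_batch_size=8, report_to="wandb"),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
model.save_pretrained("lora_model")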

Future Directions

  1. Extended Token Generation: Enhance the model's ability to generate up to 500 tokens (see the sketch after this list).
  2. Multilingual Support: Expand to other Zambian languages, promoting inclusivity.
  3. Improved Dataset Diversity: Add domain-specific text, such as healthcare and education materials.
  4. AI-Powered Applications: Build tools and apps leveraging this model for local communities.
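For the first item, a natural starting point is the max_new_tokens argument that transformers' generate already supports; a minimal sketch, reusing the model and tokenizer loaded in the Usage section:

import torch

# Assumes `model` and `tokenizer` are already loaded as in the Usage section
inputs = tokenizer("ukutendeka lesa ali pangile isonde", return_tensors="pt")
with torch.no_grad():
    output = model.generate(
        **inputs,
        max_new_tokens=500,  # request up to 500 newly generated tokens
        do_sample=True,
        top_p=0.95,
        temperature=0.7,
        pad_token_id=tokenizer.eos_token_id,
    )
print(tokenizer.decode(output[0], skip_special_tokens=True))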

License

This project is licensed under the MIT License. See the LICENSE file for details.
