
Bemba Language Model (LLM)

Welcome to the Bemba Language Model (LLM) project! This initiative is dedicated to building and fine-tuning an open-source language model for the Bemba language, leveraging cutting-edge AI techniques. The model aims to empower the Zambian community by making AI accessible and linguistically inclusive.

GitHub Repository: https://github.com/Uniplexity-AI/Bemba-LLM-Model


Project Highlights

  • Extensive Corpus: The model is trained on a Bemba corpus of thousands of sentences, giving it broad coverage of the language.
  • Google Colab Notebook: Accessible and easy-to-use training scripts are available in Google Colab for anyone to contribute or experiment.
  • Integration with wandb: Training progress and metrics are tracked using Weights & Biases, making collaboration and performance monitoring seamless.
  • Open Source: Contributions from developers, linguists, and AI enthusiasts are highly encouraged!

Setup

Clone the Repository

git clone https://github.com/Uniplexity-AI/Bemba-LLM-Model.git
cd Bemba-LLM-Model

Install Dependencies

pip install torch transformers wandb

Load Your Fine-Tuned Model

Ensure the lora_model directory contains your fine-tuned model and tokenizer files.
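Note: the Usage script below assumes lora_model holds a full merged model that AutoModelForCausalLM can load directly. If the directory instead contains only LoRA adapter weights, you would first load the base model and attach the adapter with the peft library (pip install peft). A minimal sketch, where "base-model-name" is a placeholder for whichever base checkpoint the adapter was trained from:

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Placeholder: the base checkpoint the LoRA adapter was trained from
base = AutoModelForCausalLM.from_pretrained("base-model-name")
tokenizer = AutoTokenizer.from_pretrained("base-model-name")

# Attach the LoRA adapter weights stored in ./lora_model
model = PeftModel.from_pretrained(base, "./lora_model")

# Optional: fold the adapter into the base weights for plain inference
model = model.merge_and_unload()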

Usage

To generate text in Bemba, use the example script below:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the fine-tuned model and tokenizer from the local directory
model_name = "./lora_model"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Use the end-of-sequence token for padding
tokenizer.pad_token_id = tokenizer.eos_token_id
model.eval()

# Input text in Bemba ("In the beginning God created the world")
input_text = "ukutendeka lesa ali pangile isonde"
inputs = tokenizer(input_text, return_tensors="pt")

# Generate text with top-k and nucleus (top-p) sampling;
# passing the attention mask avoids warnings since pad == eos here
with torch.no_grad():
    output = model.generate(
        **inputs,
        max_length=50,
        num_return_sequences=1,
        do_sample=True,
        top_k=50,
        top_p=0.95,
        temperature=0.7,
        pad_token_id=tokenizer.eos_token_id,
    )

# Decode and print the generated text
generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
print("Generated Text:", generated_text)

Example

Input

ukutendeka lesa ali pangile isonde  

Output

Generated Text: Ukutendeka lesa ali pangile isonde … (Generated continuation)  

Community Impact

1. Cultural Enrichment

Preserves and promotes the use of Bemba in digital spaces, ensuring linguistic heritage remains accessible in the modern age.

2. Enhanced Education

Facilitates language learning by providing AI-generated stories, exercises, and other educational content.

3. Local Business and Media

Empowers businesses and media outlets to create content in Bemba, fostering deeper connections with audiences.

4. Health and Public Services

Supports the generation of localized, accessible communication for public health and administrative purposes.

5. Inspiring Content Creators

Helps writers and creators produce unique content, enriching Zambia's cultural and literary scene.


Contributing

We welcome your involvement! Here's how you can contribute:

  • Extend the Dataset: Add more sentences to the corpus to improve language coverage.
  • Model Optimization: Experiment with fine-tuning and share results via pull requests.
  • Documentation: Help enhance tutorials and guides for the community.
  • Feedback: Test the model and suggest improvements.

To start, fork the repository: https://github.com/Uniplexity-AI/Bemba-LLM-Model.


Fine-Tuning Notes

Tools and Frameworks

  • Hugging Face Transformers: For model architecture and training.
  • Google Colab: Accessible GPU resources for training.
  • Weights & Biases: Integrated for tracking experiments and visualizing performance metrics.

Training Workflow

  1. Load the Bemba corpus.
  2. Preprocess the data for tokenization and batching.
  3. Fine-tune the model using LoRA (Low-Rank Adaptation) for efficient training; a minimal sketch follows this list.
  4. Monitor progress and results using wandb.
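As a rough illustration of steps 2 and 3, the sketch below sets up LoRA fine-tuning with the peft library. The base checkpoint, corpus path, and hyperparameters are placeholders rather than the project's actual configuration, which lives in the Colab notebook (requires pip install peft datasets):

from datasets import load_dataset
from peft import LoraConfig, get_peft_model, TaskType
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

base_model = "gpt2"  # placeholder; substitute the project's base checkpoint
tokenizer = AutoTokenizer.from_pretrained(base_model)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(base_model)

# Wrap the base model with low-rank adapters; only these small matrices are trained
lora_config = LoraConfig(task_type=TaskType.CAUSAL_LM, r=8, lora_alpha=16, lora_dropout=0.05)
model = get_peft_model(model, lora_config)

# Load and tokenize the Bemba corpus (file name is illustrative)
dataset = load_dataset("text", data_files={"train": "bemba_corpus.txt"})
tokenized = dataset["train"].map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True, remove_columns=["text"],
)

# Train and log metrics to Weights & Biases
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="lora_model", num_train_epochs=3,
                           per_device_train_batch_size=8, report_to="wandb"),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
model.save_pretrained("lora_model")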

Future Directions

  1. Extended Token Generation: Enhance the model's ability to generate up to 500 tokens (see the sketch after this list).
  2. Multilingual Support: Expand to other Zambian languages, promoting inclusivity.
  3. Improved Dataset Diversity: Add domain-specific text, such as healthcare and education materials.
  4. AI-Powered Applications: Build tools and apps leveraging this model for local communities.
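For the first item, a natural starting point is the max_new_tokens argument that transformers' generate already supports; a minimal sketch, reusing the model and tokenizer loaded in the Usage section:

import torch

# Assumes `model` and `tokenizer` are already loaded as in the Usage section
inputs = tokenizer("ukutendeka lesa ali pangile isonde", return_tensors="pt")
with torch.no_grad():
    output = model.generate(
        **inputs,
        max_new_tokens=500,  # request up to 500 newly generated tokens
        do_sample=True,
        top_p=0.95,
        temperature=0.7,
        pad_token_id=tokenizer.eos_token_id,
    )
print(tokenizer.decode(output[0], skip_special_tokens=True))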

License

This project is licensed under the MIT License. See the LICENSE file for details.
