Request for Guidance on Adding Punjabi Language Support to NLLB #5630

@braghome

Hello NLLB/fairseq team,

I’m reaching out to ask how to fine-tune the NLLB model so that it better supports Punjabi, a vibrant language spoken by over 100 million people worldwide, including a historic Sikh community in California that has thrived since 1909.

As part of efforts to preserve and promote Punjabi in digital spaces, I’d like to understand:

Requirements for fine-tuning NLLB for Punjabi – Are there specific considerations for its Gurmukhi script or dialectal variations (e.g., Eastern vs. Western Punjabi)?

Existing tutorials – Is there a guide for adding new languages, particularly those with rich literary traditions, such as Punjabi?

Data needs – What type/amount of parallel data (e.g., Punjabi-English) would be optimal? Could community-translated datasets (e.g., religious texts, literature, or news) supplement existing resources?

Leveraging seed datasets – Are there templates (such as the NLLB-Seed dataset) that we could adapt for Punjabi?
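
To make the script question concrete, here is a minimal sketch of the kind of pre-filtering I imagine would be needed before any fine-tuning: it separates Gurmukhi text from Shahmukhi (Arabic-script) text by Unicode block and normalizes the Punjabi side of each pair. The helper names `script_of` and `filter_pairs` are hypothetical and of my own invention, not part of fairseq or NLLB, and the block ranges are a rough heuristic rather than a full script detector. As far as I can tell, NLLB-200 already lists Eastern Punjabi under the code `pan_Guru`, so I assume Gurmukhi is the script to keep.

```python
import unicodedata

# Unicode block ranges (assumption: block membership is a good-enough
# proxy for script; digits and punctuation outside the blocks are ignored).
GURMUKHI = (0x0A00, 0x0A7F)   # Gurmukhi block
ARABIC = (0x0600, 0x06FF)     # Arabic block, used by Shahmukhi


def script_of(text):
    """Return 'Guru', 'Arab', 'mixed', or 'other' for a piece of text."""
    guru = arab = 0
    for ch in text:
        cp = ord(ch)
        if GURMUKHI[0] <= cp <= GURMUKHI[1]:
            guru += 1
        elif ARABIC[0] <= cp <= ARABIC[1]:
            arab += 1
    if guru and arab:
        return "mixed"
    if guru:
        return "Guru"
    if arab:
        return "Arab"
    return "other"


def filter_pairs(pairs, want="Guru"):
    """Keep (punjabi, english) pairs whose Punjabi side is in the wanted script.

    Hypothetical helper: strips whitespace, applies NFC normalization to the
    Punjabi side, and drops pairs with an empty side or the wrong script.
    """
    kept = []
    for pa, en in pairs:
        pa = unicodedata.normalize("NFC", pa.strip())
        en = en.strip()
        if pa and en and script_of(pa) == want:
            kept.append((pa, en))
    return kept


if __name__ == "__main__":
    sample = [
        ("ਪੰਜਾਬੀ", "Punjabi"),   # Gurmukhi
        ("پنجابی", "Punjabi"),   # Shahmukhi
        ("", "empty source"),    # dropped: empty Punjabi side
    ]
    for pa, en in filter_pairs(sample):
        print(script_of(pa), pa, en)
```

If this is the wrong approach, or if NLLB's own preprocessing already handles script filtering and normalization, a pointer to the relevant code would be very helpful.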

Punjabi is a culturally significant language with deep roots in California’s Sikh diaspora, and I’d love to contribute to its inclusion in NLLB. Any advice or resources you could share would be invaluable!

Thank you for your time and for working on multilingual AI.

Best regards,
Manav
