South-African-Language-Identification-2022

my_take

Overview

South Africa has a multilingual population, so it is natural for its systems and devices to communicate in multiple languages as well.

In this challenge, the task is to identify the language of a piece of text written in any of South Africa's 11 official languages. This is an example of language identification in NLP: determining the natural language in which a piece of text is written.
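To make the task concrete, here is a minimal sketch of one common approach to language identification: a Naive Bayes classifier over character trigrams. It is not necessarily the method used in Model.ipynb. The class name `NgramLanguageIdentifier`, the helper `char_ngrams`, and the toy training sentences are illustrative assumptions and do not come from the challenge data.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=3):
    """Return overlapping character n-grams of a lowercased, space-padded string."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class NgramLanguageIdentifier:
    """Multinomial Naive Bayes over character n-grams (uniform prior, add-one smoothing)."""

    def __init__(self, n=3):
        self.n = n
        self.counts = defaultdict(Counter)  # language -> n-gram counts
        self.totals = Counter()             # language -> number of n-grams seen
        self.vocab = set()                  # all n-grams seen in training

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            grams = char_ngrams(text, self.n)
            self.counts[label].update(grams)
            self.totals[label] += len(grams)
            self.vocab.update(grams)
        return self

    def predict(self, text):
        grams = char_ngrams(text, self.n)
        vocab_size = len(self.vocab)

        def log_likelihood(lang):
            # Add-one smoothing keeps unseen n-grams from zeroing the score.
            return sum(
                math.log((self.counts[lang][g] + 1) / (self.totals[lang] + vocab_size))
                for g in grams
            )

        return max(self.counts, key=log_likelihood)


# Toy sentences written for this example only (not taken from train_set.csv).
texts = [
    "die kat sit op die mat", "ek is lief vir jou",
    "the cat sits on the mat", "i love you very much",
    "ngiyakuthanda kakhulu", "sawubona unjani namhlanje",
]
labels = ["afr", "afr", "eng", "eng", "zul", "zul"]

model = NgramLanguageIdentifier(n=3).fit(texts, labels)
print(model.predict("the cat is on the mat"))
print(model.predict("ngiyakuthanda"))
```

Character n-grams are a reasonable starting point here because several of the 11 languages are closely related, and sub-word patterns tend to separate them better than whole-word vocabularies on short texts.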

Dataset

The dataset used for this challenge is the NCHLT Text Corpora, collected by the South African Department of Arts and Culture and the Centre for Text Technology (CTexT, North-West University, South Africa). The training set was improved through additional cleaning done by Praekelt.

From Kaggle

Each row of the data has the form Language ID, Text. The text is in various states of cleanliness, so some NLP preprocessing is needed to clean it up.
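As a rough illustration of the cleaning step, the sketch below lowercases the text, removes URLs, digits, and punctuation, and collapses repeated whitespace. The column names `lang_id` and `text` are assumptions about the CSV header, and the two sample rows are invented for the example; check both against train_set.csv before reusing this.

```python
import csv
import io
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")
NON_LETTER_RE = re.compile(r"[^\w\s]|\d|_")  # punctuation, digits, underscores
WHITESPACE_RE = re.compile(r"\s+")


def clean_text(text):
    """Lowercase, drop URLs, digits and punctuation, then collapse whitespace."""
    text = text.lower()
    text = URL_RE.sub(" ", text)
    text = NON_LETTER_RE.sub(" ", text)
    return WHITESPACE_RE.sub(" ", text).strip()


def load_rows(csv_file):
    """Read (language ID, cleaned text) pairs from an open CSV file.

    Assumes header columns named 'lang_id' and 'text'.
    """
    reader = csv.DictReader(csv_file)
    return [(row["lang_id"], clean_text(row["text"])) for row in reader]


# In-memory stand-in for train_set.csv so the example runs without the data.
sample = io.StringIO(
    "lang_id,text\n"
    'eng,"Visit https://example.org for the 2022 report!"\n'
    'afr,"Die  verslag   is beskikbaar."\n'
)
rows = load_rows(sample)
print(rows)
```

To run this on the real file, replace `sample` with `open("train_set.csv", newline="", encoding="utf-8")`. Stripping all punctuation is a deliberate simplification: apostrophes and hyphens can carry signal in some of these languages, so it may be worth comparing results with and without that step.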

File descriptions

train_set.csv - the training set
test_set.csv - the test set
sample_submission.csv - a sample submission file in the correct format

Language IDs

afr - Afrikaans
eng - English
nbl - isiNdebele
nso - Sepedi
sot - Sesotho
ssw - siSwati
tsn - Setswana
tso - Xitsonga
ven - Tshivenda
xho - isiXhosa
zul - isiZulu
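The codes above can be kept in a small lookup table, which is handy for labelling plots and for checking that predictions contain only valid IDs. This is a convenience sketch; `LANGUAGE_NAMES` and `language_name` are hypothetical names and are not part of the repository.

```python
# Language IDs and names as listed in this README.
LANGUAGE_NAMES = {
    "afr": "Afrikaans",
    "eng": "English",
    "nbl": "isiNdebele",
    "nso": "Sepedi",
    "sot": "Sesotho",
    "ssw": "siSwati",
    "tsn": "Setswana",
    "tso": "Xitsonga",
    "ven": "Tshivenda",
    "xho": "isiXhosa",
    "zul": "isiZulu",
}


def language_name(lang_id):
    """Return the language name for an ID, ignoring case and surrounding spaces."""
    key = lang_id.strip().lower()
    if key not in LANGUAGE_NAMES:
        raise ValueError(f"unknown language ID: {lang_id!r}")
    return LANGUAGE_NAMES[key]


print(language_name("xho"))
```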

