CommonLit Readability Challenge

The notebooks rely on local paths; for local runs, point them at your clone of the GitHub repo.

A Baseline?

word2vec-umap.ipynb: uses pretrained Word2Vec (W2V) embeddings for feature extraction and UMAP for visualization; basic regression with Gradient Boosting Regression, validation MSE = 0.44
- ToDos: K-Fold Cross Validation; W2V vector augmentation (I'd try applying small perturbations to the training set :) ); a smart way to use the std; ensembling several word embedding algorithms (W2V + GloVe + FastText)
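
Two of the ToDos (K-Fold Cross Validation and perturbation-based augmentation) can be sketched with the standard library alone. This is a minimal sketch and not the notebook's code: the helper names `augment_with_noise` and `kfold_indices`, the noise scale `sigma`, and the toy vectors are all assumptions made for illustration. The toy vectors stand in for averaged W2V document vectors.

```python
import random


def augment_with_noise(vectors, targets, n_copies=1, sigma=0.01, seed=0):
    """Return the original samples plus n_copies noisy copies of each vector.

    Each copy adds independent Gaussian noise (std = sigma) to every
    component and keeps the original target unchanged.
    """
    rng = random.Random(seed)
    aug_vectors = [list(v) for v in vectors]
    aug_targets = list(targets)
    for _ in range(n_copies):
        for vec, target in zip(vectors, targets):
            aug_vectors.append([x + rng.gauss(0.0, sigma) for x in vec])
            aug_targets.append(target)
    return aug_vectors, aug_targets


def kfold_indices(n_samples, n_splits=5, seed=0):
    """Yield (train_idx, val_idx) pairs; each sample is in validation exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::n_splits] for i in range(n_splits)]
    for k in range(n_splits):
        val_idx = folds[k]
        train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train_idx, val_idx


# Toy stand-ins (hypothetical values) for document vectors and readability targets.
X = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]
y = [-1.0, 0.0, 0.5, 1.0]

for train_idx, val_idx in kfold_indices(len(X), n_splits=2):
    # Augment only the training fold, so noisy copies never leak into validation.
    X_train, y_train = augment_with_noise(
        [X[i] for i in train_idx], [y[i] for i in train_idx], n_copies=2
    )
    X_val, y_val = [X[i] for i in val_idx], [y[i] for i in val_idx]
    # A regressor would be fit on (X_train, y_train) and scored on (X_val, y_val) here.
```

Augmenting inside the loop, after the split, is the design choice that matters: perturbing the whole dataset first would put near-duplicates of validation samples into the training fold.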

Limits of statistical features

Using only information about the number of words, their length, and similar statistics about the sentences or the text as a whole does not give a decent result: RMSE is about 0.81 in most cases, even with a minimal amount of augmentation. However, such features could be useful to further boost more advanced models. Time will tell :)
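
For reference, features of this kind can be computed roughly as follows. This is a sketch under assumptions: the exact feature set used in the notebooks is not listed here, so the four features, the function name `statistical_features`, and the regex-based tokenization are illustrative choices.

```python
import re


def statistical_features(text):
    """Compute simple count-based features for one text excerpt."""
    # Naive sentence split on ., ! and ?; empty fragments are dropped.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Words are runs of letters and apostrophes.
    words = re.findall(r"[A-Za-z']+", text)
    n_words = len(words)
    n_sentences = len(sentences)
    return {
        "n_words": n_words,
        "n_sentences": n_sentences,
        "mean_word_length": sum(len(w) for w in words) / n_words if n_words else 0.0,
        "mean_sentence_length": n_words / n_sentences if n_sentences else 0.0,
    }


features = statistical_features("Dinosaurs were big. They lived long ago!")
```

The resulting dictionary can be appended as extra columns to the inputs of a stronger model, which is the "boost" use suggested above.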

Proposed approaches

From a quick look at the data, it seems that the most readable texts talk about things well known to a kid (for example, dinosaurs) in a simple way, while the least readable ones are about highly technical subjects (such as metalworking) and contain a lot of technical words, which of course are hard for a kid to read. This suggests two approaches. The first is to use the topic(s) of the text as an extra training variable (maybe spaCy + W2V). The second is to estimate word frequencies in a large corpus of WELL BALANCED text and look for words that are highly specialized and unusual (maybe also accounting for their intrinsic "intension"; see "Il software del linguaggio" by R. Simone).
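
The second approach could start from something like the sketch below, which scores a text by how rare its words are in a reference corpus. Everything here is an assumption for illustration: the function names, the add-one smoothing, and the two-sentence toy corpus, which stands in for the large, well balanced corpus that the idea really needs.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def build_frequencies(reference_texts):
    """Count word occurrences over the reference corpus."""
    counts = Counter()
    for text in reference_texts:
        counts.update(tokenize(text))
    return counts


def rarity_score(text, counts):
    """Mean negative log relative frequency of the words; higher means rarer.

    Add-one smoothing gives words missing from the corpus a small, nonzero
    probability instead of an infinite penalty.
    """
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    total = sum(counts.values())
    vocab = len(counts) + 1  # one extra slot for unseen words
    return sum(
        -math.log((counts[t] + 1) / (total + vocab)) for t in tokens
    ) / len(tokens)


# Toy reference corpus (hypothetical); a real run needs a large balanced one.
counts = build_frequencies([
    "the dinosaur was big and the dinosaur was old",
    "the cat sat on the mat",
])
easy = rarity_score("the dinosaur was big", counts)
hard = rarity_score("annealing tempers the alloy", counts)
```

With this toy corpus, `hard` comes out larger than `easy` because three of its four words never occur in the reference texts. The score could be fed to the regressor as one more feature.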

Folders and files (33 commits):
.ipynb_checkpoints
commonlitreadabilityprize
.gitattributes
README.md
clrp-ensemble-tda.ipynb
clrp-ensemble-umapviz-w-o-cv.ipynb
clrp-ensemble-w-ntk.ipynb
ensembling_x_MLJCSpring.ipynb
environment.yml
ntk-augmentation-scraping.ipynb
ntk-tda-megaensemble.ipynb
requirements.txt
roberta-base-augtrainset-inference.ipynb
roberta-fit-stratkfold-augtrain.ipynb
roberta-linguistic-features-xgb.ipynb
roberta-ntk.ipynb
train_data_with_features_token.csv
two-roberta-s-augmented-data.ipynb
visual_notes.pdf
w2v_features.csv
word2vec-umap.ipynb
word2vec_kaggle_submission.ipynb

MachineLearningJournalClub/CommonLitReadabilityChallenge
