Future Work on the NLP Apps Notebook #890

Open
@antmarakis

Description

Recently I added a section on the Federalist Papers to the nlp_apps notebook, writing a simple workflow from start to finish. There is still a lot of work to be done, and I am opening this issue to community contributions, since I believe it is a great way to get started with the applications notebooks.

A few ways you can improve the section:

  • DONE - One big issue with the Naive Bayes Classifier in this problem is that multiplying many probabilities causes underflow: because the examples are long texts, the products all come out as 0.0. To avoid this we are currently using Python's decimal module, but I believe the problem can be solved more elegantly with the logarithms of the probabilities: instead of multiplying the probabilities, we add their logarithms.

  • Do some pre-processing. Currently I only added a sample pre-processing step (removing one common word from each paper). I would like to see other pre-processing tasks plus some analysis of the text. Which are the most common words for each author? Would the results improve if we removed the most popular words?

  • Right now we are using unigram word models, but other options are available and I would like to explore them in the notebooks. Maybe one author likes using two particular words together; maybe another spells some words a bit differently. I would like to see different models used and compared in the notebook, to show readers that they shouldn't rely on just one model all the time. We can even combine models.

  • At the end of the notebook I note that the dataset is lopsided: we have far more material from Hamilton than from the other two authors. It may be worth adding more writings by Jay and Madison to balance this out. I think it would be interesting to see whether we could improve the results by using external data. This could come after the current section, so that we can compare the results.

  • Finally, maybe we can take a step back and try to classify all the Federalist Papers, not just the disputed ones. Add a new section where we use external data to train our model and then classify the papers.
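To make the first point concrete, here is a minimal sketch of the log-probability idea. The `word_probs` table and its toy values are made up for illustration; in the notebook the per-author probabilities would come from the trained word models.

```python
import math

# Hypothetical per-author word probabilities (toy values for illustration).
word_probs = {
    'hamilton': {'upon': 0.02, 'the': 0.06},
    'madison': {'upon': 0.001, 'the': 0.05},
}

def log_score(words, probs, unseen=1e-8):
    # Sum log-probabilities instead of multiplying raw probabilities;
    # a sum of logs cannot underflow the way a product of hundreds
    # of small numbers does.
    return sum(math.log(probs.get(w, unseen)) for w in words)

doc = ['upon', 'the', 'upon'] * 200  # a long "paper" of 600 words

# The naive product underflows to exactly 0.0 for every author...
raw = math.prod(word_probs['hamilton'].get(w, 1e-8) for w in doc)
assert raw == 0.0

# ...while log-scores stay finite and comparable.
scores = {author: log_score(doc, p) for author, p in word_probs.items()}
best = max(scores, key=scores.get)
```

Since log is monotonic, the author with the highest log-score is exactly the one the (underflowing) product would have picked, so no decimal arithmetic is needed.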
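For the pre-processing point, one simple analysis is comparing each author's most frequent words; words that top every author's list carry little signal and are candidates for removal. The two text snippets below are toy stand-ins, not the actual papers:

```python
from collections import Counter

# Toy stand-ins for the corpora; the notebook would use the real papers.
hamilton_text = "the powers of the union upon the whole are necessary"
madison_text = "the states and the people retain the residuary powers"

def top_words(text, n=3):
    # Most frequent words in a text, as (word, count) pairs.
    return Counter(text.split()).most_common(n)

common_h = top_words(hamilton_text)
common_m = top_words(madison_text)

# Words frequent for every author are weak discriminators; dropping
# them before training is one pre-processing step worth measuring.
shared = {w for w, _ in common_h} & {w for w, _ in common_m}
```

Whether removing `shared` actually helps is an empirical question, which is exactly the kind of before/after comparison the notebook could show.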
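And for the point about models beyond unigrams, a sketch of how bigram counts surface word pairings that unigram counts miss (the sample sentence is just illustrative):

```python
from collections import Counter

def ngrams(words, n):
    # All n-word tuples in order of appearance.
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

text = "in order to form a more perfect union in order to establish justice"
words = text.split()

unigrams = Counter(ngrams(words, 1))
bigrams = Counter(ngrams(words, 2))

# A phrase an author favours ('in order') shows up as a frequent
# bigram even when its component unigrams are unremarkable.
```

Combining models, e.g. summing the log-scores from a unigram and a bigram model, is one way to "use models together" as suggested above.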


This is a big undertaking, and it doesn't need to happen on this particular problem. If you have a problem in mind, you can instead use the above ideas to tackle it! Sentiment Analysis is trending right now, so maybe that is a place to explore some of the above.

All in all, I think this is a good project to chip in on every once in a while, and I hope it will serve as an introduction to the repository. It might also interest GSoC students looking for something to tackle.

In any case, feel free to post here with ideas + if you want to start working on something.
