Skip to content

tf-idf added(#890) #1000

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

thesagarsehgal
Copy link
Contributor

As mentioned in Issue #890, TF-IDF has been added as a separate section with implementation in text.py and implementation details and explanation in nlp_apps.ipynb.
The following things have been added:-

  1. Making a TF-IDF from a given set of documents.
  2. Getting the relevance of a word in a document.
  3. Getting top n most important words in that document.
  4. Query Searching and Ranking using TF-IDF

@thesagarsehgal
Copy link
Contributor Author

@MrDupin Sir, can you also review this PR. It has been a long time since I have submitted this PR. Thank You.

Copy link
Collaborator

@antmarakis antmarakis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a nice PR, but it needs some fixes first. I will review the actual code once these are done.

nlp_apps.ipynb Outdated
"metadata": {
"collapsed": true
},
"source": [
"## Text Analysis using TF-IDF\n",
"Since we know that computers are great with numbers but they cannot work out with the natural language. So in-order to overcome that text can be directly converted to numbers for analyzing.One of the most common and popular technique for this is TF-IDF which stands for Term Frequency and Inverse Document Frequency. \n",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The first sentence is a bit awkward to read and not entirely grammatically correct. Remove the 'since' and try to restructure it a bit. In the second sentence, add a comma after 'that' and leave a space after the period. In the third sentence, it is 'techniques' instead of 'technique'.

nlp_apps.ipynb Outdated
"## Text Analysis using TF-IDF\n",
"Since we know that computers are great with numbers but they cannot work out with the natural language. So in-order to overcome that text can be directly converted to numbers for analyzing.One of the most common and popular technique for this is TF-IDF which stands for Term Frequency and Inverse Document Frequency. \n",
"\n",
"1. **TF(Term -Frequency):-**Gives the frequency of each word in a document. As the number of occurances of a word increases in a document its value increases for that document. Basically, if a word appears more times in a docuemnt then that word is important to that document. \n",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is 'occurrences'.

nlp_apps.ipynb Outdated
" **TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document)**\n",
"\n",
"\n",
"2. **IDF(Inverse Document Frequency):-** It calculates, how much a word is important for a given document. The words that occur in less documents are more important as compared to the words that occur in more number of documents. It helps to find the important words across the documents .\n",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The first comma is not needed. Then, it is 'fewer documents' instead of 'less documents', the 'as' part in 'as compared' is a bit awkward and can be removed, and you should say 'in more documents' and not 'in more number of documents'. Finally, you added a space before the final period.

nlp_apps.ipynb Outdated
"\n",
" **IDF(t) = log(Total number of documents / (1+Number of documents with term t in it))**\n",
" \n",
" (1 is added to the base for coding purposes. It helps to deal with the cases when the term does not appears in any document)\n",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should also mention that the +1 is used to avoid zero division.

nlp_apps.ipynb Outdated
" (1 is added to the base for coding purposes. It helps to deal with the cases when the term does not appears in any document)\n",
"\n",
"\n",
"*TF-IDF* finally gives the importance to a single word in a collection of documents by multiplying the TF of that word in that document with the IDF of that word across the documents.\n",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's 'of a single' instead of 'to a single'. Also, you have some double spaces in there (before 'IDF' and before 'across').

text.py Outdated
# doc_tf(list of dict)=a list containting the tf of each word in each document
# terms_df(dict of list)= gives the df of ach term
def __init__(self,docs):
'''input: list of all documents'''
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Space between arguments.

'''input: list of all documents'''
self.docs=docs
self.tf_idf_score=self.make_tf_idf()
def make_tf_idf(self):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add an empty line between functions.

text.py Outdated
self.tf_idf_score=self.make_tf_idf()
def make_tf_idf(self):
'''makes the tf-idf score of all the words in all the documents'''
terms_df={}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can initialize variables in the same line to save space:

terms_df, doc_tf, counter = {}, [], 0

text.py Outdated
self.docs=docs
self.tf_idf_score=self.make_tf_idf()
def make_tf_idf(self):
'''makes the tf-idf score of all the words in all the documents'''
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Double quotes + capitalization.

text.py Outdated
doc_tf=[]
counter=0
for i in self.docs:
counter+=1
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general, you need to add more space to your code to make it more readable. Add spaces around all operators, arguments and add empty lines every once in a while (around loops, for example), to break the code up a bit. Right now it looks like a huge wall of text.

This applies to all the code below this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@MrDupin Thank You for your detailed review of my code. I am working on the changes asked by you and update the PR soon. Thank You!

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome, thanks!

Copy link
Collaborator

@antmarakis antmarakis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice changes! I have some more minor things you can take care of if you have the time.

text.py Outdated
@@ -8,7 +8,7 @@
from learning import CountingProbDist
import search

from math import log, exp
import math
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be best if you left it as from math import log, exp, unless it interferes with code.

text.py Outdated
self.tf_idf_score=self.make_tf_idf()
"""A class to perform TF-IDF analysis on a given set of documents and search a query from the given set of documents.

variabels(type) = Values contained in the variable.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should be 'variables'.

text.py Outdated

docs(list of strings) = Contains a list of all the documents as a string.
terms_tf_idf_score(list of dict) = TF-IDF-Score of all the documents with all words.
doc_tf(list of dict)= A list containing the Term Frequency of every word in the corresponding document.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor edit, but you can add a space before the equal sign for consistency.

text.py Outdated
docs(list of strings) = Contains a list of all the documents as a string.
terms_tf_idf_score(list of dict) = TF-IDF-Score of all the documents with all words.
doc_tf(list of dict)= A list containing the Term Frequency of every word in the corresponding document.
terms_df(dict of list)= Gives the Document Frequency of each term.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants