-
Notifications
You must be signed in to change notification settings - Fork 3.9k
tf-idf added(#890) #1000
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
tf-idf added(#890) #1000
Conversation
@MrDupin Sir, can you also review this PR. It has been a long time since I have submitted this PR. Thank You. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a nice PR, but it needs some fixes first. I will review the actual code once these are done.
nlp_apps.ipynb
Outdated
"metadata": { | ||
"collapsed": true | ||
}, | ||
"source": [ | ||
"## Text Analysis using TF-IDF\n", | ||
"Since we know that computers are great with numbers but they cannot work out with the natural language. So in-order to overcome that text can be directly converted to numbers for analyzing.One of the most common and popular technique for this is TF-IDF which stands for Term Frequency and Inverse Document Frequency. \n", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The first sentence is a bit awkward to read and not entirely grammatically correct. Remove the 'since' and try to restructure it a bit. In the second sentence, add a comma after 'that' and leave a space after the period. In the third sentence, it is 'techniques' instead of 'technique'.
nlp_apps.ipynb
Outdated
"## Text Analysis using TF-IDF\n", | ||
"Since we know that computers are great with numbers but they cannot work out with the natural language. So in-order to overcome that text can be directly converted to numbers for analyzing.One of the most common and popular technique for this is TF-IDF which stands for Term Frequency and Inverse Document Frequency. \n", | ||
"\n", | ||
"1. **TF(Term -Frequency):-**Gives the frequency of each word in a document. As the number of occurances of a word increases in a document its value increases for that document. Basically, if a word appears more times in a docuemnt then that word is important to that document. \n", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is 'occurrences'.
nlp_apps.ipynb
Outdated
" **TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document)**\n", | ||
"\n", | ||
"\n", | ||
"2. **IDF(Inverse Document Frequency):-** It calculates, how much a word is important for a given document. The words that occur in less documents are more important as compared to the words that occur in more number of documents. It helps to find the important words across the documents .\n", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The first comma is not needed. Then, it is 'fewer documents' instead of 'less documents', the 'as' part in 'as compared' is a bit awkward and can be removed, and you should say 'in more documents' and not 'in more number of documents'. Finally, you added a space before the final period.
nlp_apps.ipynb
Outdated
"\n", | ||
" **IDF(t) = log(Total number of documents / (1+Number of documents with term t in it))**\n", | ||
" \n", | ||
" (1 is added to the base for coding purposes. It helps to deal with the cases when the term does not appears in any document)\n", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You should also mention that the +1 is used to avoid zero division.
nlp_apps.ipynb
Outdated
" (1 is added to the base for coding purposes. It helps to deal with the cases when the term does not appears in any document)\n", | ||
"\n", | ||
"\n", | ||
"*TF-IDF* finally gives the importance to a single word in a collection of documents by multiplying the TF of that word in that document with the IDF of that word across the documents.\n", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's 'of a single' instead of 'to a single'. Also, you have some double spaces in there (before 'IDF' and before 'across').
text.py
Outdated
# doc_tf(list of dict)=a list containting the tf of each word in each document | ||
# terms_df(dict of list)= gives the df of ach term | ||
def __init__(self,docs): | ||
'''input: list of all documents''' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Space between arguments.
'''input: list of all documents''' | ||
self.docs=docs | ||
self.tf_idf_score=self.make_tf_idf() | ||
def make_tf_idf(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add an empty line between functions.
text.py
Outdated
self.tf_idf_score=self.make_tf_idf() | ||
def make_tf_idf(self): | ||
'''makes the tf-idf score of all the words in all the documents''' | ||
terms_df={} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can initialize variables in the same line to save space:
terms_df, doc_tf, counter = {}, [], 0
text.py
Outdated
self.docs=docs | ||
self.tf_idf_score=self.make_tf_idf() | ||
def make_tf_idf(self): | ||
'''makes the tf-idf score of all the words in all the documents''' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Double quotes + capitalization.
text.py
Outdated
doc_tf=[] | ||
counter=0 | ||
for i in self.docs: | ||
counter+=1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general, you need to add more space to your code to make it more readable. Add spaces around all operators, arguments and add empty lines every once in a while (around loops, for example), to break the code up a bit. Right now it looks like a huge wall of text.
This applies to all the code below this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@MrDupin Thank You for your detailed review of my code. I am working on the changes asked by you and update the PR soon. Thank You!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome, thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice changes! I have some more minor things you can take care of if you have the time.
text.py
Outdated
@@ -8,7 +8,7 @@ | |||
from learning import CountingProbDist | |||
import search | |||
|
|||
from math import log, exp | |||
import math |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be best if you left it as from math import log, exp
, unless it interferes with code.
text.py
Outdated
self.tf_idf_score=self.make_tf_idf() | ||
"""A class to perform TF-IDF analysis on a given set of documents and search a query from the given set of documents. | ||
|
||
variabels(type) = Values contained in the variable. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It should be 'variables'.
text.py
Outdated
|
||
docs(list of strings) = Contains a list of all the documents as a string. | ||
terms_tf_idf_score(list of dict) = TF-IDF-Score of all the documents with all words. | ||
doc_tf(list of dict)= A list containing the Term Frequency of every word in the corresponding document. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor edit, but you can add a space before the equal sign for consistency.
text.py
Outdated
docs(list of strings) = Contains a list of all the documents as a string. | ||
terms_tf_idf_score(list of dict) = TF-IDF-Score of all the documents with all words. | ||
doc_tf(list of dict)= A list containing the Term Frequency of every word in the corresponding document. | ||
terms_df(dict of list)= Gives the Document Frequency of each term. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same.
As mentioned in Issue #890,
TF-IDF
has been added as a separate section with implementation intext.py
and implementation details and explanation innlp_apps.ipynb
.The following things have been added:-
TF-IDF
from a given set of documents.TF-IDF