tf-idf added(#890) #1000

thesagarsehgal · 2018-12-22T08:07:21Z

As mentioned in Issue #890, TF-IDF has been added as a separate section, with the implementation in text.py and the implementation details and explanation in nlp_apps.ipynb.
The following have been added:

Making a TF-IDF from a given set of documents.
Getting the relevance of a word in a document.
Getting top n most important words in that document.
Query Searching and Ranking using TF-IDF
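
For readers who have not opened the diff, the four features above could look roughly like the sketch below. This is not the code added to text.py: the class name `TfIdfSketch` and all of its method names are hypothetical, tokenization is a plain lowercase whitespace split, and the formulas are the ones quoted from the notebook later in this review (TF as count over document length, IDF as log(N / (1 + df))).

```python
from collections import Counter
from math import log


class TfIdfSketch:
    """Hypothetical illustration, not the class added to text.py."""

    def __init__(self, docs):
        # Assumption: lowercase whitespace tokenization is good enough here.
        self.docs = [doc.lower().split() for doc in docs]
        self.doc_tf = [self._tf(words) for words in self.docs]
        # Document frequency: number of documents that contain each term.
        self.terms_df = Counter(term for words in self.docs for term in set(words))

    @staticmethod
    def _tf(words):
        counts = Counter(words)
        total = len(words)
        return {term: count / total for term, count in counts.items()}

    def idf(self, term):
        # The +1 keeps the denominator non-zero for unseen terms.
        return log(len(self.docs) / (1 + self.terms_df.get(term, 0)))

    def relevance(self, term, doc_index):
        """TF-IDF score of a term in one document."""
        return self.doc_tf[doc_index].get(term, 0.0) * self.idf(term)

    def top_words(self, doc_index, n):
        """The n highest-scoring terms of one document."""
        scores = {t: self.relevance(t, doc_index) for t in self.doc_tf[doc_index]}
        return sorted(scores, key=scores.get, reverse=True)[:n]

    def search(self, query):
        """Document indices ranked by summed TF-IDF of the query terms."""
        terms = query.lower().split()
        scores = [sum(self.relevance(t, i) for t in terms)
                  for i in range(len(self.docs))]
        return sorted(range(len(self.docs)), key=lambda i: scores[i], reverse=True)
```

One consequence of this particular IDF formula is that a term appearing in every document gets a slightly negative score, since log(N / (N + 1)) is below zero.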

thesagarsehgal · 2019-01-30T20:51:21Z

@MrDupin Sir, could you also review this PR? It has been a long time since I submitted it. Thank you.

antmarakis

This is a nice PR, but it needs some fixes first. I will review the actual code once these are done.

antmarakis · 2019-03-03T11:37:08Z

nlp_apps.ipynb

   "metadata": {
    "collapsed": true
   },
+   "source": [
+    "## Text Analysis using TF-IDF\n",
+    "Since we know that computers are great with numbers but they cannot work out with the natural language. So in-order to overcome that text can be directly converted to numbers for analyzing.One of the most common and popular technique for this is TF-IDF which stands for Term Frequency and Inverse Document Frequency. \n",


The first sentence is a bit awkward to read and not entirely grammatically correct. Remove the 'since' and try to restructure it a bit. In the second sentence, add a comma after 'that' and leave a space after the period. In the third sentence, it is 'techniques' instead of 'technique'.

antmarakis · 2019-03-03T11:37:42Z

nlp_apps.ipynb

+    "## Text Analysis using TF-IDF\n",
+    "Since we know that computers are great with numbers but they cannot work out with the natural language. So in-order to overcome that text can be directly converted to numbers for analyzing.One of the most common and popular technique for this is TF-IDF which stands for Term Frequency and Inverse Document Frequency. \n",
+    "\n",
+    "1. **TF(Term -Frequency):-**Gives the frequency of each word in a document. As the number of occurances of a word increases in a document its value increases for that document. Basically, if a word appears more times in a docuemnt then that word is important to that document. \n",


It is 'occurrences'.

antmarakis · 2019-03-03T11:39:42Z

nlp_apps.ipynb

+    "    **TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document)**\n",
+    "\n",
+    "\n",
+    "2. **IDF(Inverse Document Frequency):-** It calculates, how much a word is important for a given document. The words that occur in less documents are more important as compared to the words that occur in more number of documents. It helps to find the important words across the documents .\n",


The first comma is not needed. Then, it is 'fewer documents' instead of 'less documents', the 'as' part in 'as compared' is a bit awkward and can be removed, and you should say 'in more documents' and not 'in more number of documents'. Finally, you added a space before the final period.

antmarakis · 2019-03-03T11:40:35Z

nlp_apps.ipynb

+    "\n",
+    "    **IDF(t) = log(Total number of documents / (1+Number of documents with term t in it))**\n",
+    "    \n",
+    "    (1 is added to the base for coding purposes. It helps to deal with the cases when the term does not appears in any document)\n",


You should also mention that the +1 is used to avoid zero division.
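
To make the zero-division point concrete, here is a small sketch comparing the plain formula with the notebook's version for a term that appears in no document. The function names `idf_unsmoothed` and `idf_smoothed` are hypothetical and do not come from text.py.

```python
from math import log


def idf_unsmoothed(total_docs, docs_with_term):
    # Plain log(N / df): raises ZeroDivisionError when df is 0.
    return log(total_docs / docs_with_term)


def idf_smoothed(total_docs, docs_with_term):
    # Formula quoted from the notebook: log(N / (1 + df)).
    return log(total_docs / (1 + docs_with_term))


try:
    idf_unsmoothed(10, 0)
except ZeroDivisionError:
    print("unsmoothed IDF fails for a term that appears in no document")

# With the +1, the same call is well defined and equals log(10 / 1).
print(idf_smoothed(10, 0))
```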

antmarakis · 2019-03-03T11:41:15Z

nlp_apps.ipynb

+    "    (1 is added to the base for coding purposes. It helps to deal with the cases when the term does not appears in any document)\n",
+    "\n",
+    "\n",
+    "*TF-IDF* finally gives the importance to a single word in a collection of documents by multiplying the TF of that word in that document with the  IDF of that word  across the documents.\n",


It's 'of a single' instead of 'to a single'. Also, you have some double spaces in there (before 'IDF' and before 'across').

antmarakis · 2019-03-03T11:50:21Z

text.py

+#     doc_tf(list of dict)=a list containting the tf of each word in each document
+#     terms_df(dict of list)= gives the df of ach term
+    def __init__(self,docs):
+        '''input: list of all documents'''


Space between arguments.

antmarakis · 2019-03-03T11:50:43Z

text.py

+        '''input: list of all documents'''
+        self.docs=docs
+        self.tf_idf_score=self.make_tf_idf()
+    def make_tf_idf(self):


Add an empty line between functions.

antmarakis · 2019-03-03T11:51:25Z

text.py

+        self.tf_idf_score=self.make_tf_idf()
+    def make_tf_idf(self):
+        '''makes the tf-idf score of all the words in all the documents'''
+        terms_df={}


You can initialize variables in the same line to save space:

terms_df, doc_tf, counter = {}, [], 0

antmarakis · 2019-03-03T11:51:31Z

text.py

+        self.docs=docs
+        self.tf_idf_score=self.make_tf_idf()
+    def make_tf_idf(self):
+        '''makes the tf-idf score of all the words in all the documents'''


Double quotes + capitalization.

antmarakis · 2019-03-03T11:51:45Z

text.py

+        doc_tf=[]
+        counter=0
+        for i in self.docs:
+            counter+=1


In general, you need to add more space to your code to make it more readable. Add spaces around all operators and between arguments, and add empty lines every once in a while (around loops, for example) to break the code up a bit. Right now it looks like a huge wall of text.

This applies to all the code below this.
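
As an illustration of the spacing being asked for, a term-counting loop might be laid out as below. `count_terms` is a hypothetical stand-in written for this example, not the body of `make_tf_idf`.

```python
def count_terms(docs):
    """Hypothetical helper: count word occurrences in each document."""
    doc_counts, counter = [], 0

    for doc in docs:
        counter += 1
        counts = {}

        for word in doc.split():
            counts[word] = counts.get(word, 0) + 1

        doc_counts.append(counts)

    return doc_counts, counter
```

Spaces around `=` and `+=`, a space after each comma, and a blank line before and after each loop are the main changes compared with the quoted diff.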

@MrDupin Thank you for your detailed review of my code. I am working on the changes you asked for and will update the PR soon. Thank you!

Awesome, thanks!

antmarakis

Nice changes! I have some more minor things you can take care of if you have the time.

antmarakis · 2019-03-03T23:12:02Z

text.py

@@ -8,7 +8,7 @@
 from learning import CountingProbDist
 import search

-from math import log, exp
+import math


I think it would be best if you left it as 'from math import log, exp', unless it interferes with the code.

antmarakis · 2019-03-03T23:12:21Z

text.py

-        self.tf_idf_score=self.make_tf_idf()
+    """A class to perform TF-IDF analysis on a given set of documents and search a query from the given set of documents.
+
+        variabels(type) = Values contained in the variable.


It should be 'variables'.

antmarakis · 2019-03-03T23:12:55Z

text.py

+
+        docs(list of strings) = Contains a list of all the documents as a string.
+        terms_tf_idf_score(list of dict) = TF-IDF-Score of all the documents with all words.
+        doc_tf(list of dict)= A list containing the Term Frequency of every word in the corresponding document.


Minor edit, but you can add a space before the equal sign for consistency.

antmarakis · 2019-03-03T23:13:01Z

text.py

+        docs(list of strings) = Contains a list of all the documents as a string.
+        terms_tf_idf_score(list of dict) = TF-IDF-Score of all the documents with all words.
+        doc_tf(list of dict)= A list containing the Term Frequency of every word in the corresponding document.
+        terms_df(dict of list)= Gives the Document Frequency of each term.


added tf-idf in nlp_apps.ipynb(aimacode#890)

cb290ac

ad71 approved these changes Dec 30, 2018

antmarakis requested changes Mar 3, 2019

Improved the readablity of code and some other improvements suggested

f266bc5

antmarakis reviewed Mar 3, 2019

antmarakis approved these changes Mar 3, 2019

Updated the code as suggested.

c1f303f
