Commit

Updates
trajanov committed May 14, 2024
1 parent d4385b0 commit 50d042e
Showing 9 changed files with 609 additions and 1,248 deletions.
452 changes: 213 additions & 239 deletions Notebooks/Spark-Example-02-RDD Basics Toutorial.ipynb

Large diffs are not rendered by default.

121 changes: 97 additions & 24 deletions Notebooks/Spark-Example-03-PySpark vs Python.ipynb
@@ -10,8 +10,10 @@
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"execution_count": 1,
"metadata": {
"metadata": {}
},
"outputs": [],
"source": [
"import findspark\n",
@@ -31,8 +33,10 @@
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"execution_count": 2,
"metadata": {
"metadata": {}
},
"outputs": [
{
"name": "stdout",
@@ -54,7 +58,9 @@
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"metadata": {
"metadata": {}
},
"outputs": [
{
"name": "stdout",
@@ -81,7 +87,9 @@
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"metadata": {
"metadata": {}
},
"outputs": [
{
"name": "stdout",
@@ -103,7 +111,9 @@
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"metadata": {
"metadata": {}
},
"outputs": [
{
"name": "stdout",
@@ -134,7 +144,9 @@
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"metadata": {
"metadata": {}
},
"outputs": [
{
"name": "stdout",
@@ -157,7 +169,9 @@
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"metadata": {
"metadata": {}
},
"outputs": [
{
"name": "stdout",
@@ -184,7 +198,9 @@
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"metadata": {
"metadata": {}
},
"outputs": [
{
"name": "stdout",
@@ -209,7 +225,9 @@
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"metadata": {
"metadata": {}
},
"outputs": [
{
"name": "stdout",
@@ -237,7 +255,9 @@
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"metadata": {
"metadata": {}
},
"outputs": [
{
"name": "stdout",
@@ -268,7 +288,9 @@
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"metadata": {
"metadata": {}
},
"outputs": [
{
"name": "stdout",
@@ -296,7 +318,9 @@
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"metadata": {
"metadata": {}
},
"outputs": [
{
"name": "stdout",
@@ -317,7 +341,9 @@
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"metadata": {
"metadata": {}
},
"outputs": [
{
"name": "stdout",
@@ -342,8 +368,10 @@
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"execution_count": 14,
"metadata": {
"metadata": {}
},
"outputs": [
{
"name": "stdout",
@@ -364,8 +392,10 @@
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"execution_count": 15,
"metadata": {
"metadata": {}
},
"outputs": [
{
"name": "stdout",
@@ -386,13 +416,54 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"# TF-IDF (Term Frequency-Inverse Document Frequency)"
"# TF-IDF (Term Frequency-Inverse Document Frequency)\n",
"\n",
"TF-IDF is a statistical measure of how important a word is to a document within a collection, or corpus. A word's weight increases in proportion to the number of times it appears in the document and is offset by how frequently the word occurs across the corpus. The measure is widely used in information retrieval and text mining.\n",
"\n",
"**Term Frequency (TF)**\n",
"\n",
"Term Frequency measures how often a term occurs in a document. Because documents differ in length, a term may appear many more times in a long document than in a short one. The raw count is therefore often divided by the document length (the total number of terms in the document) to normalize it:\n",
"\n",
"$$\n",
"TF(t, d) = \\frac{\\text{Number of times term } t \\text{ appears in document } d}{\\text{Total number of terms in document } d}\n",
"$$\n",
"\n",
"**Inverse Document Frequency (IDF)**\n",
"\n",
"Inverse Document Frequency measures how informative a term is across the corpus. TF alone treats all terms as equally important, yet common terms such as \"is\", \"of\", and \"that\" may appear many times while carrying little meaning. IDF therefore down-weights frequent terms and up-weights rare ones:\n",
"\n",
"$$\n",
"IDF(t) = \\log\\left(\\frac{1 + \\text{Total number of documents}}{1 + \\text{Number of documents with term } t}\\right) + 1\n",
"$$\n",
"\n",
"Adding 1 to both the numerator and the denominator smooths the ratio and avoids division by zero for a term that appears in no document. The final + 1 ensures that a term appearing in every document still gets a nonzero IDF. The logarithm keeps the IDF from growing too quickly as the number of documents increases.\n",
"\n",
"**TF-IDF Calculation**\n",
"\n",
"TF-IDF is simply the product of TF and IDF:\n",
"\n",
"$$\n",
"TFIDF(t, d) = TF(t, d) \\times IDF(t)\n",
"$$\n",
"\n",
"This value is high when a term is frequent in a specific document but rare across the corpus, which suggests the term is significant for that document.\n",
"\n",
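"As a worked illustration with hypothetical numbers (not taken from this notebook's data), and assuming the logarithm is the natural log: suppose a term appears 2 times in a 10-term document and occurs in 1 of the 3 documents in the corpus. Then:\n",
"\n",
"$$\n",
"TF = \\frac{2}{10} = 0.2, \\qquad IDF = \\ln\\left(\\frac{1 + 3}{1 + 1}\\right) + 1 \\approx 1.693, \\qquad TFIDF \\approx 0.2 \\times 1.693 \\approx 0.339\n",
"$$\n",
"\n",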
"**Application in Text Mining**\n",
"\n",
"TF-IDF is used mainly in natural language processing (NLP) and information retrieval systems, for example:\n",
"\n",
"- **Search engines**: Ranking documents based on query terms.\n",
"- **Document clustering**: Grouping similar documents.\n",
"- **Text summarization**: Extracting key terms that reflect the most relevant information in documents.\n",
"- **Feature extraction**: Transforming textual data into a format suitable for machine learning algorithms."
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"execution_count": 16,
"metadata": {
"metadata": {}
},
"outputs": [
{
"name": "stdout",
@@ -434,8 +505,10 @@
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"execution_count": 17,
"metadata": {
"metadata": {}
},
"outputs": [
{
"name": "stdout",
