KNN

ArmaanSeth · ArmaanSeth · commit 4f1c42170c7c · 2022-10-07T22:01:49.000+05:30
diff --git a/MLSelfImplementedAlgos/KNN/Documentation.md b/MLSelfImplementedAlgos/KNN/Documentation.md
@@ -0,0 +1,59 @@
+# K Nearest Neighbours (KNN)
+The intuition behind KNN is simple: for a particular datapoint we look at its K nearest neighbours and predict the class that occurs most commonly among them.
+
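The voting step can be sketched in a few lines. This is a minimal illustration, not the notebook's code: the helper name `majority_vote` and the example labels are assumptions made up for this sketch.

```python
from collections import Counter

def majority_vote(neighbour_labels):
    """Return the most frequent label among the neighbours."""
    # most_common(1) returns a one-element list of (label, count)
    return Counter(neighbour_labels).most_common(1)[0][0]

# Hypothetical labels of the 5 nearest neighbours of a query point
labels = ["red", "blue", "red", "red", "blue"]
print(majority_vote(labels))  # red
```

Choosing an odd K for a two-class problem avoids tied votes.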
+</br>
+
+<img src="https://miro.medium.com/max/587/1*hncgU7vWLBsRvc8WJhxlkQ.png" width =400>
+
+</br>
+
+### Distance between two datapoints
+The distance metric quantifies how similar two datapoints are: the more similar the two points, the smaller the distance between them. It is an important criterion in predicting our results, because it decides which points count as neighbours.
+
+</br>
+
+<img src="https://miro.medium.com/max/1220/0*WrVc0CpxoStXpACy.png" width=300>
+
+The distance metric of our model can be chosen freely. Common choices are the Euclidean, Manhattan and Minkowski distances; Minkowski generalizes the other two (order 2 gives Euclidean, order 1 gives Manhattan).
+
+* Euclidean distance formula :
+
+    <img src="https://miro.medium.com/max/1400/1*9LeaMTcOXxeTPN-VCbKloQ.png" width=200>
+
+</br>
+
+* Manhattan distance formula :
+
+    <img src="https://miro.medium.com/max/1018/1*KDgfdK6SooXtaUvlnXdpaA.png" width=210>
+
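As a rough sketch, the three metrics named above could be written in plain Python as follows. The function names and the example points are assumptions for illustration; the notebook itself only uses the squared Euclidean distance.

```python
import math

def euclidean(a, b):
    """Square root of the summed squared coordinate differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """Sum of the absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def minkowski(a, b, p):
    """Minkowski distance of order p (p=1: Manhattan, p=2: Euclidean)."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

a, b = (0, 0), (3, 4)
print(euclidean(a, b))  # 5.0
print(manhattan(a, b))  # 7
```

For the same pair of points the Manhattan distance is never smaller than the Euclidean one, so the two metrics can rank neighbours differently.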
+### Feature Scaling
+
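+Before applying KNN it is very important to apply feature scaling, which brings all features into a comparable range, for example [0,1] with min-max normalization. Without it, features with large numeric values dominate the distance computation.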
+
+</br>
+
+<img src="https://media-exp1.licdn.com/dms/image/C4E12AQFPqF6qfXYOvQ/article-cover_image-shrink_600_2000/0/1624324925880?e=2147483647&v=beta&t=C5ghUcvwlIFvEqyfLrB5bb4cL5z4mFYwQxzZscULq8c" width=400>
+
+</br>
+
+<img src="https://qph.cf2.quoracdn.net/main-qimg-63a05d8898505f9c84ba2c6427c9c78c" width=400>
+
+</br>
+
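A minimal sketch of min-max scaling with NumPy is shown here, assuming the minimum and maximum are taken from the training data only and then reused for the test data. The helper `min_max_scale` and the sample values are hypothetical; the notebook in this commit does not include a scaling step.

```python
import numpy as np

def min_max_scale(x_train, x_test):
    """Scale each feature to [0, 1] using the training minimum and maximum."""
    x_train = np.asarray(x_train, dtype=float)
    x_test = np.asarray(x_test, dtype=float)
    lo = x_train.min(axis=0)
    span = x_train.max(axis=0) - lo
    span[span == 0] = 1.0  # constant feature: avoid division by zero
    return (x_train - lo) / span, (x_test - lo) / span

# Hypothetical data: feature 0 is in the thousands, feature 1 is below 1
x_train = [[1000.0, 0.1], [3000.0, 0.5], [2000.0, 0.3]]
x_test = [[2500.0, 0.2]]
train_scaled, test_scaled = min_max_scale(x_train, x_test)
print(train_scaled)
print(test_scaled)
```

Test values can fall slightly outside [0, 1] when they lie beyond the training range; that is expected with this approach.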
+### Value of K
+K is the number of nearest neighbours to look at when making a prediction. In sklearn's `KNeighborsClassifier` it is 5 by default (`n_neighbors=5`), though you can change it according to your needs.
+
+**Finding Optimal K**
+* A large value of k leads to underfitting.
+* A small value of k leads to overfitting.
+
+</br>
+
+<img src="https://media.geeksforgeeks.org/wp-content/cdn-uploads/20190523171258/overfitting_2.png" width=600>
+
+There is a sweet spot in between, where the optimal value of k lies; it leads to good performance on the testing data.
+
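One way to look for that sweet spot is to try several values of k on a held-out validation set and keep the one with the best accuracy. The sketch below assumes synthetic two-class data and hypothetical helpers `knn_predict` and `accuracy_per_k`; it is not the procedure used in the notebook, which fixes k at 5.

```python
import numpy as np
from collections import Counter

def knn_predict(x_train, y_train, x_query, k):
    """Predict labels by majority vote among the k nearest training points."""
    preds = []
    for x in x_query:
        # Squared Euclidean distance to every training point
        dists = ((x_train - x) ** 2).sum(axis=1)
        nearest = np.argsort(dists, kind="stable")[:k]
        preds.append(Counter(y_train[nearest].tolist()).most_common(1)[0][0])
    return np.array(preds)

def accuracy_per_k(x_train, y_train, x_val, y_val, k_values):
    """Return {k: validation accuracy} for each candidate k."""
    return {k: float((knn_predict(x_train, y_train, x_val, k) == y_val).mean())
            for k in k_values}

# Synthetic two-class data (an assumption for illustration only)
rng = np.random.default_rng(0)
x = np.vstack([rng.normal(0.0, 1.0, (60, 2)), rng.normal(2.5, 1.0, (60, 2))])
y = np.array([0] * 60 + [1] * 60)
idx = rng.permutation(len(x))
train, val = idx[:90], idx[90:]

scores = accuracy_per_k(x[train], y[train], x[val], y[val],
                        k_values=[1, 3, 5, 7, 9, 15])
best_k = max(scores, key=scores.get)
print(scores, best_k)
```

Selecting k on a validation split rather than on the test split keeps the final test score an honest estimate of performance.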
+### Code Description
+ Our code contains a `predict` function which takes `x_train`, `y_train` and `x_test` as arguments along with K; for each datapoint in `x_test` it returns the most common class among the K nearest neighbours of that datapoint in the training set.
+
+ After getting the predictions (`y_pred`) we evaluated our model using sklearn's classification report and confusion matrix. With K=5 the model reaches an accuracy of 0.94 on the 143 test samples.
diff --git a/MLSelfImplementedAlgos/KNN/KNN_Implementation.ipynb b/MLSelfImplementedAlgos/KNN/KNN_Implementation.ipynb
@@ -0,0 +1,133 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "KNN Implementation"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import pandas as pd\n",
+    "import numpy as np\n",
+    "from collections import Counter"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "metadata": {},
+   "outputs": [],
+   "source": [
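+    "def predict_single_point(x_train,y_train,x,k):\n",
+    "    # squared Euclidean distance from x to every training point\n",
+    "    # (the square root is skipped because it does not change the ordering)\n",
+    "    distances=[]\n",
+    "    for i in range(len(x_train)):\n",
+    "        distance=((x_train[i,:]-x)**2).sum()\n",
+    "        distances.append([distance,i])\n",
+    "    distances=sorted(distances)\n",
+    "    # collect the labels of the k closest training points\n",
+    "    target=[]\n",
+    "    for i in range(k):\n",
+    "        target.append(y_train[distances[i][1]])\n",
+    "    # majority vote among the k labels\n",
+    "    prediction=Counter(target).most_common(1)[0][0]\n",
+    "    return prediction"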
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def predict(x_train,y_train,x_test,k):\n",
+    "    predictions=[]\n",
+    "    for x in x_test:\n",
+    "        predictions.append(predict_single_point(x_train,y_train,x,k))\n",
+    "    return predictions"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 7,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from sklearn.datasets import load_breast_cancer\n",
+    "from sklearn.model_selection import train_test_split\n",
+    "from sklearn.metrics import confusion_matrix,classification_report\n",
+    "\n",
+    "cancer=load_breast_cancer()\n",
+    "data=pd.DataFrame(cancer.data)\n",
+    "x_train,x_test,y_train,y_test=train_test_split(cancer.data,cancer.target,random_state=0)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 8,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "y_pred=predict(x_train,y_train,x_test,5)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 9,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "[[49  4]\n",
+      " [ 5 85]]\n",
+      "              precision    recall  f1-score   support\n",
+      "\n",
+      "           0       0.91      0.92      0.92        53\n",
+      "           1       0.96      0.94      0.95        90\n",
+      "\n",
+      "    accuracy                           0.94       143\n",
+      "   macro avg       0.93      0.93      0.93       143\n",
+      "weighted avg       0.94      0.94      0.94       143\n",
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "print(confusion_matrix(y_test,y_pred))\n",
+    "print(classification_report(y_test,y_pred))"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3.9.12 ('base')",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.9.12"
+  },
+  "orig_nbformat": 4,
+  "vscode": {
+   "interpreter": {
+    "hash": "c19b36fe549fe2dce1ac32d0dd317d0a363043eb1c14a547f46436cb05190cdf"
+   }
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}