
Commit 4f1c421

KNN
1 parent 45f4ddc commit 4f1c421

2 files changed

+192
-0
lines changed

Lines changed: 59 additions & 0 deletions
@@ -0,0 +1,59 @@
# K Nearest Neighbours (KNN)
The intuition behind KNN is that, for a particular data point, we look at its nearest neighbours and predict that it belongs to the most commonly occurring class among them.

</br>

<img src="https://miro.medium.com/max/587/1*hncgU7vWLBsRvc8WJhxlkQ.png" width=400>

</br>

### Distance between two data points
The distance metric defines the relationship between two data points: the more similar two points are, the smaller the distance between them. It is an important criterion in predicting our results.

</br>

<img src="https://miro.medium.com/max/1220/0*WrVc0CpxoStXpACy.png" width=300>

The distance metric of our model can be any metric we choose, such as Euclidean, Manhattan, or Minkowski distance.

* Euclidean distance formula:

<img src="https://miro.medium.com/max/1400/1*9LeaMTcOXxeTPN-VCbKloQ.png" width=200>

</br>

* Manhattan distance formula:

<img src="https://miro.medium.com/max/1018/1*KDgfdK6SooXtaUvlnXdpaA.png" width=210>

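As a minimal sketch of the three metrics named above, the following uses only the Python standard library. The function names are illustrative and are not taken from the notebook; Minkowski distance of order `p` reduces to Manhattan distance for `p = 1` and to Euclidean distance for `p = 2`.

```python
import math


def minkowski_distance(a, b, p):
    """Minkowski distance of order p between two equal-length points."""
    if len(a) != len(b):
        raise ValueError("points must have the same number of features")
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)


def euclidean_distance(a, b):
    """Straight-line distance (Minkowski with p = 2)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan_distance(a, b):
    """Sum of absolute coordinate differences (Minkowski with p = 1)."""
    return sum(abs(x - y) for x, y in zip(a, b))


a, b = (1.0, 2.0), (4.0, 6.0)
print(euclidean_distance(a, b))  # 5.0
print(manhattan_distance(a, b))  # 7.0
```

The notebook itself ranks neighbours by squared Euclidean distance, which gives the same ordering as the Euclidean distance shown here.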
### Feature Scaling

Before applying KNN, it is important to apply feature scaling, which transforms every feature to a common range such as [0,1]. Otherwise, features with large numeric ranges dominate the distance.

</br>

<img src="https://media-exp1.licdn.com/dms/image/C4E12AQFPqF6qfXYOvQ/article-cover_image-shrink_600_2000/0/1624324925880?e=2147483647&v=beta&t=C5ghUcvwlIFvEqyfLrB5bb4cL5z4mFYwQxzZscULq8c" width=400>

</br>

<img src="https://qph.cf2.quoracdn.net/main-qimg-63a05d8898505f9c84ba2c6427c9c78c" width=400>

</br>

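The effect of scaling on the nearest neighbour can be sketched with min-max scaling, `x' = (x - min) / (max - min)`, applied per feature. The data below (age, income) and the helper names are hypothetical and chosen only to show how an unscaled large-range feature can decide which neighbour is nearest.

```python
import math


def min_max_scale(rows):
    """Rescale each feature (column) of `rows` to the range [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [
        # a constant column has no spread, so it is mapped to 0.0
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]


def nearest_index(points, query_index):
    """Index of the point closest (Euclidean) to points[query_index]."""
    query = points[query_index]
    others = [i for i in range(len(points)) if i != query_index]
    return min(others, key=lambda i: math.dist(points[i], query))


# Hypothetical rows: [age in years, yearly income]
raw = [[25, 50000], [60, 50500], [26, 52000], [40, 90000]]
scaled = min_max_scale(raw)

print(nearest_index(raw, 0))     # 1: income differences dominate the raw distance
print(nearest_index(scaled, 0))  # 2: after scaling, the similar age counts as well
```

Min-max scaling is only one option; the minimum and maximum should be computed on the training data and reused for the test data.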
### Value of K
K is the number of nearest neighbours to consider when making a prediction. In sklearn it is 5 by default, though you can change it according to your needs.

**Finding Optimal K**
* A large value of k leads to underfitting.
* A small value of k leads to overfitting.

</br>

<img src="https://media.geeksforgeeks.org/wp-content/cdn-uploads/20190523171258/overfitting_2.png" width=600>

There is a sweet spot in between where the optimal value of k resides, which leads to good performance on the test data.

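One common way to look for that sweet spot is to score several values of k on held-out data and keep the best one. The sketch below is an illustration under stated assumptions, not the notebook's code: the 1-D dataset is hypothetical, it contains one deliberately noisy point, and `knn_predict` and `accuracy` are helper names introduced here.

```python
from collections import Counter


def knn_predict(train, query, k):
    """Majority label among the k training points closest to `query` (1-D)."""
    nearest = sorted(train, key=lambda item: abs(item[0] - query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def accuracy(train, validation, k):
    """Fraction of validation points whose label is predicted correctly."""
    hits = sum(knn_predict(train, x, k) == label for x, label in validation)
    return hits / len(validation)


# Hypothetical (feature, label) pairs; (3.4, 1) is a noisy point inside class 0.
train = [(0.0, 0), (1.0, 0), (2.0, 0), (3.0, 0), (4.0, 0), (4.5, 0),
         (3.4, 1), (6.0, 1), (7.0, 1), (8.0, 1), (9.0, 1)]
validation = [(1.5, 0), (3.3, 0), (3.6, 0), (6.5, 1), (7.5, 1), (8.5, 1)]

# Odd values of k avoid tied votes in a two-class problem.
accuracy_by_k = {k: accuracy(train, validation, k) for k in (1, 3, 5, 7, 9, 11)}
best_k = max(accuracy_by_k, key=accuracy_by_k.get)

for k, acc in accuracy_by_k.items():
    print(f"k={k:2d}  validation accuracy={acc:.2f}")
print("best k:", best_k)
```

In this toy data, k=1 follows the noisy point (overfitting), while k=11 uses every training point and always predicts the majority class (underfitting); the values in between score best.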
### Code Description
Our code contains a `predict` function that takes `x_train`, `y_train`, and `x_test` as arguments along with `k`; for each test point, it returns the most common class among the K nearest neighbours of that point.

After getting `y_pred`, we evaluate our model using sklearn's classification report and confusion matrix. The model reaches about 94% accuracy on the test data.
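To see how the reported scores follow from the confusion matrix, the sketch below recomputes them by hand from the matrix printed in the notebook output, `[[49 4] [5 85]]`. It assumes the sklearn convention that rows are true classes and columns are predicted classes; `class_metrics` is a helper name introduced here for illustration.

```python
# Confusion matrix from the notebook output (rows: true class, columns: predicted class).
matrix = [[49, 4],
          [5, 85]]


def class_metrics(matrix, cls):
    """Precision, recall and F1 score for one class of a confusion matrix."""
    true_positive = matrix[cls][cls]
    predicted = sum(row[cls] for row in matrix)  # column sum
    actual = sum(matrix[cls])                    # row sum (support)
    precision = true_positive / predicted
    recall = true_positive / actual
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


total = sum(sum(row) for row in matrix)
accuracy = sum(matrix[i][i] for i in range(len(matrix))) / total

print(f"accuracy: {accuracy:.2f}")  # 134 correct out of 143
for cls in range(len(matrix)):
    precision, recall, f1 = class_metrics(matrix, cls)
    print(f"class {cls}: precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

The rounded values agree with the classification report in the notebook output.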
Lines changed: 133 additions & 0 deletions
@@ -0,0 +1,133 @@
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "KNN Implementation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import numpy as np\n",
    "from collections import Counter"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "def predict_single_point(x_train,y_train,x,k):\n",
    "    # squared Euclidean distance to every training point (the square root is not needed for ranking)\n",
    "    distances=[]\n",
    "    for i in range(len(x_train)):\n",
    "        distance=((x_train[i,:]-x)**2).sum()\n",
    "        distances.append([distance,i])\n",
    "    distances=sorted(distances)\n",
    "    # majority vote over the labels of the k nearest training points\n",
    "    target=[]\n",
    "    for i in range(k):\n",
    "        target.append(y_train[distances[i][1]])\n",
    "    prediction=Counter(target).most_common(1)[0][0]\n",
    "    return prediction"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "def predict(x_train,y_train,x_test,k):\n",
    "    # classify each test point independently\n",
    "    predictions=[]\n",
    "    for x in x_test:\n",
    "        predictions.append(predict_single_point(x_train,y_train,x,k))\n",
    "    return predictions"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.datasets import load_breast_cancer\n",
    "from sklearn.model_selection import train_test_split\n",
    "from sklearn.metrics import confusion_matrix,classification_report\n",
    "\n",
    "cancer=load_breast_cancer()\n",
    "data=pd.DataFrame(cancer.data)\n",
    "x_train,x_test,y_train,y_test=train_test_split(cancer.data,cancer.target,random_state=0)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "y_pred=predict(x_train,y_train,x_test,5)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[[49  4]\n",
      " [ 5 85]]\n",
      "              precision    recall  f1-score   support\n",
      "\n",
      "           0       0.91      0.92      0.92        53\n",
      "           1       0.96      0.94      0.95        90\n",
      "\n",
      "    accuracy                           0.94       143\n",
      "   macro avg       0.93      0.93      0.93       143\n",
      "weighted avg       0.94      0.94      0.94       143\n",
      "\n"
     ]
    }
   ],
   "source": [
    "print(confusion_matrix(y_test,y_pred))\n",
    "print(classification_report(y_test,y_pred))"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3.9.12 ('base')",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.12"
  },
  "orig_nbformat": 4,
  "vscode": {
   "interpreter": {
    "hash": "c19b36fe549fe2dce1ac32d0dd317d0a363043eb1c14a547f46436cb05190cdf"
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
