
Python through Coding / Modeling

[Python] 13. KNN Classification

from sklearn.datasets import make_classification, load_iris
import numpy as np
import pandas as pd
import seaborn as sb
import matplotlib.pyplot as plt

import mglearn
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

from sklearn.model_selection import GridSearchCV
import warnings
warnings.simplefilter('ignore')

 

X,y = mglearn.datasets.make_forge()
x_train, x_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.2)  # default split is 75% train / 25% test

KNN Classification

 

mglearn.discrete_scatter(X[:,0], X[:,1], y)
plt.legend(['0class', '1class'])
plt.title('KNN Example')
plt.xlabel('x')
plt.ylabel('y')
plt.show()

 

mglearn.plots.plot_knn_classification(n_neighbors=1)

 

mglearn.plots.plot_knn_classification(n_neighbors=3)
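The plots above illustrate the voting rule: a query point gets the majority class among its k nearest training points. A minimal sketch of that rule in plain NumPy (the toy arrays and the `knn_predict` helper are my own, for illustration only):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=3):
    # Euclidean distance from the query to every training point
    dists = np.linalg.norm(X_train - x_query, axis=1)
    # indices of the k closest training points
    nearest = np.argsort(dists)[:k]
    # majority vote among their labels
    return Counter(y_train[nearest]).most_common(1)[0][0]

X_train = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
y_train = np.array([0, 0, 1, 1])

print(knn_predict(X_train, y_train, np.array([0.5, 0.5]), k=3))  # near the class-0 cluster -> 0
print(knn_predict(X_train, y_train, np.array([5.0, 5.5]), k=3))  # near the class-1 cluster -> 1
```

With k=1 the model simply copies the label of the single closest point, which is why n_neighbors=1 tends to overfit.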

 

model_knn = KNeighborsClassifier(n_neighbors=1)
model_knn.fit(x_train,y_train)
[OUT]:

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=1, p=2,
                     weights='uniform')

 

x_test
[OUT]:

array([[ 8.7337095 ,  2.49162431],
       [ 8.68937095,  1.48709629],
       [ 8.92229526, -0.63993225],
       [ 8.69289001,  1.54322016],
       [ 8.67494727,  4.47573059],
       [ 9.15072323,  5.49832246]])

 

model_knn.predict(x_test)
[OUT]:

array([0, 0, 0, 0, 1, 1])

 

y_test
[OUT]:

array([0, 0, 0, 0, 1, 1])

 

model_knn.score(x_test,y_test)
[OUT]:

1.0
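`score` for a classifier is plain accuracy, i.e. the fraction of correct predictions. A quick sketch of that equivalence with `sklearn.metrics.accuracy_score`, using made-up label arrays shaped like the output above:

```python
import numpy as np
from sklearn.metrics import accuracy_score

y_true = np.array([0, 0, 0, 0, 1, 1])
y_pred = np.array([0, 0, 0, 0, 1, 1])

# accuracy is just the mean of elementwise matches
print(accuracy_score(y_true, y_pred))  # 1.0
print((y_true == y_pred).mean())       # 1.0
```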

Finding the optimal n_neighbors with GridSearchCV

 

param_value = {'n_neighbors':[1,2,3,4,5]}
gridSearch = GridSearchCV(KNeighborsClassifier(),param_grid=param_value)
gridSearch.fit(x_train,y_train)
[OUT]:

GridSearchCV(cv=None, error_score=nan,
             estimator=KNeighborsClassifier(algorithm='auto', leaf_size=30,
                                            metric='minkowski',
                                            metric_params=None, n_jobs=None,
                                            n_neighbors=5, p=2,
                                            weights='uniform'),
             iid='deprecated', n_jobs=None,
             param_grid={'n_neighbors': [1, 2, 3, 4, 5]},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring=None, verbose=0)

 

gridSearch.best_params_
[OUT]:

{'n_neighbors': 2}

 

gridSearch.best_score_
[OUT]:

0.9

 

gridSearch.best_estimator_.predict(x_test)
[OUT]:

array([0, 0, 0, 0, 0, 1])

 

gridSearch.predict(x_test)
[OUT]:

array([0, 0, 0, 0, 0, 1])
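After `refit` (the default), calling `predict` on the GridSearchCV object delegates to `best_estimator_`, which is why the two outputs above match. The per-candidate scores behind `best_params_` are available in `cv_results_`; a sketch on a synthetic dataset (the `make_classification` call and its parameters are illustrative, not from the post):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=100, n_features=4, random_state=0)
gs = GridSearchCV(KNeighborsClassifier(), param_grid={'n_neighbors': [1, 2, 3, 4, 5]})
gs.fit(X, y)

# mean cross-validated accuracy for each candidate n_neighbors
for params, score in zip(gs.cv_results_['params'], gs.cv_results_['mean_test_score']):
    print(params['n_neighbors'], round(score, 3))
```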

Exercise

 

Classify the iris dataset using KNN.


Solution

 

iris = load_iris()
iris_df = pd.DataFrame(iris.data)
iris_df.columns = iris['feature_names']
iris_df['species'] = iris.target
iris_df

 

x_data = iris_df.iloc[:,:-1]
y_data = iris_df.iloc[:,-1]
x_train, x_test, y_train, y_test = train_test_split(x_data, y_data, test_size=0.2,
                                                    random_state=42, stratify=y_data)
param_value = {'n_neighbors': range(1, 10)}  # n_neighbors must be >= 1; 3-5 candidate values are typical
knn_model = GridSearchCV(KNeighborsClassifier(), param_grid=param_value)  # cv=None defaults to 5-fold CV
knn_model.fit(x_train, y_train)
[OUT]:

GridSearchCV(cv=None, error_score=nan,
             estimator=KNeighborsClassifier(algorithm='auto', leaf_size=30,
                                            metric='minkowski',
                                            metric_params=None, n_jobs=None,
                                            n_neighbors=5, p=2,
                                            weights='uniform'),
             iid='deprecated', n_jobs=None,
             param_grid={'n_neighbors': range(1, 10)}, pre_dispatch='2*n_jobs',
             refit=True, return_train_score=False, scoring=None, verbose=0)

 

print("best parameter :",knn_model.best_params_)
print("best score :",knn_model.best_score_)
print("predict :",knn_model.predict(x_test).tolist())
print("y_test  :",y_test.tolist())
[OUT]:

best parameter : {'n_neighbors': 6}
best score : 0.9833333333333334
predict : [0, 2, 1, 1, 0, 1, 0, 0, 2, 1, 2, 2, 2, 1, 0, 0, 0, 1, 1, 1, 0, 2, 1, 2, 2, 1, 1, 0, 2, 0]
y_test  : [0, 2, 1, 1, 0, 1, 0, 0, 2, 1, 2, 2, 2, 1, 0, 0, 0, 1, 1, 2, 0, 2, 1, 2, 2, 1, 1, 0, 2, 0]

Bonus Exercise

 

Find the number of mismatches between the predicted and actual values.


Solution

 

x = knn_model.predict(x_test).tolist()
y = y_test.tolist()
cnt = 0
for i in range(len(x)):
    if x[i] != y[i]:
        cnt += 1
        print(i, x[i], y[i])
print('wrong predictions :', cnt)
[OUT]:

19 1 2
wrong predictions : 1

review
- The iris dataset is clean enough that normalization is not strictly required here, but since KNN is distance-based, feature scaling is essential on most datasets.
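When scaling is needed, the usual pattern is to put the scaler and the classifier into one Pipeline so the scaling statistics are fit only on the training folds of each CV split. A sketch of that setup on the iris data (the step names 'scale' and 'knn' are my own choices):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

iris = load_iris()
x_train, x_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42, stratify=iris.target)

# scaler + classifier in one estimator; grid params use the <step>__<param> naming
pipe = Pipeline([('scale', StandardScaler()), ('knn', KNeighborsClassifier())])
grid = GridSearchCV(pipe, param_grid={'knn__n_neighbors': range(1, 10)})
grid.fit(x_train, y_train)

print(grid.best_params_, round(grid.score(x_test, y_test), 3))
```

Tuning the scaled pipeline this way avoids leaking test-fold statistics into the scaler, which a standalone `StandardScaler().fit(x_data)` before the split would do.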