[Python] 18. Ensemble: breast_cancer, wine examples

import pandas as pd
import matplotlib.pyplot as plt

from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_validate, cross_val_score
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
import sklearn.metrics as metrics

from sklearn.datasets import load_breast_cancer,load_wine,load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.pipeline import make_pipeline

from warnings import filterwarnings
filterwarnings('ignore')

Types of ensemble learning: voting, bagging, boosting, stacking, and others

Voting

Several different algorithms each make a prediction, and the final prediction is decided by a vote over their results.

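Bagging

Bagging is short for bootstrap aggregating.
Bootstrap: just as the properties of a population can be estimated from a sample, the properties of a sample can be estimated by resampling it. That is, from a given sample we draw new samples (resamples) many times (1,000 to 10,000 times, or more) to find out how statistics such as the sample mean or variance are distributed. (Wikipedia)
In bagging, the same algorithm is trained on different data samples and the resulting models are combined by voting.
The data samples are drawn with replacement, so they may overlap. For example, when 10 models are bagged over 10,000 data points, each model's sample of 1,000 points may contain duplicate data points. Random Forest is the best-known bagging method.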
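The resampling step described above can be sketched with the standard library alone. This is a minimal illustration, not what BaggingClassifier does internally; the helper name `bootstrap_sample`, the seed, and the integer stand-in data are assumptions made for the example.

```python
import random

def bootstrap_sample(data, size, rng):
    """Draw `size` items from `data` with replacement (hypothetical helper)."""
    return [rng.choice(data) for _ in range(size)]

rng = random.Random(0)        # fixed seed so the run is repeatable
data = list(range(10_000))    # stand-in for 10,000 training rows

# One resample per model, as when bagging 10 estimators.
samples = [bootstrap_sample(data, 1_000, rng) for _ in range(10)]

for i, sample in enumerate(samples):
    repeated = len(sample) - len(set(sample))
    print(f"model {i}: {len(sample)} rows, {repeated} repeated draws")
```

Because each draw is independent, the same row can appear more than once within a sample, and different samples can share rows.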

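Boosting

Several learners are trained sequentially so that each one corrects the data points its predecessor predicted incorrectly. AdaBoost does this by giving more weight to the misclassified data points before the next learner is trained; gradient boosting has the next learner fit the residual errors of the previous ones.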
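As a rough sketch of the reweighting idea, the following shows one round of the classic discrete AdaBoost update for binary labels. It is a textbook form written for illustration; the function name `adaboost_round` is an assumption, and scikit-learn's AdaBoostClassifier differs in its details.

```python
import math

def adaboost_round(weights, y_true, y_pred):
    """One reweighting step of discrete AdaBoost (illustrative helper).

    Assumes the weights sum to 1 and the weighted error is strictly
    between 0 and 1. Returns (alpha, new_weights).
    """
    # Weighted error: total weight of the misclassified points.
    err = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p)
    # Learner weight: larger when the learner makes fewer mistakes.
    alpha = 0.5 * math.log((1 - err) / err)
    # Raise the weight of mistakes, lower the weight of correct points.
    updated = [
        w * math.exp(alpha if t != p else -alpha)
        for w, t, p in zip(weights, y_true, y_pred)
    ]
    total = sum(updated)
    return alpha, [w / total for w in updated]

weights = [0.2] * 5            # start with uniform weights
y_true = [1, 1, 0, 0, 1]
y_pred = [1, 0, 0, 0, 1]       # the second point is misclassified

alpha, new_weights = adaboost_round(weights, y_true, y_pred)
print(round(alpha, 3), [round(w, 3) for w in new_weights])
```

After the update, the misclassified point carries half of the total weight, so the next learner is pushed to get it right.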

(Representative boosting algorithms include the following.)

AdaBoost
Gradient Boosting Machine (GBM)
XGBoost
LightGBM
CatBoost
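The residual-fitting idea behind the gradient boosting family can be sketched as follows. To keep it short, this assumes squared-error regression on one-dimensional inputs with single-split stumps; the helper names, the toy data, and the hyperparameter values are all made up for the illustration, and none of the libraries listed above is implemented this way in detail.

```python
def fit_stump(xs, residuals):
    """Best single-split regression stump on 1-D inputs (squared error)."""
    best = None
    for split in sorted(set(xs))[1:]:   # skip the smallest so both sides are non-empty
        left = [r for x, r in zip(xs, residuals) if x < split]
        right = [r for x, r in zip(xs, residuals) if x >= split]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, split, left_mean, right_mean)
    _, split, left_mean, right_mean = best
    return lambda x: left_mean if x < split else right_mean

def gradient_boost(xs, ys, n_rounds=20, learning_rate=0.5):
    """Fit stumps one after another, each on the current residuals."""
    base = sum(ys) / len(ys)            # start from the mean prediction
    stumps = []
    preds = [base] * len(xs)
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump(x) for p, x in zip(preds, xs)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)

def mse(model, xs, ys):
    return sum((y - model(x)) ** 2 for x, y in zip(xs, ys)) / len(xs)

xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.2, 0.9, 3.1, 3.0, 2.9, 5.2, 5.0]

mean_y = sum(ys) / len(ys)
mse_baseline = mse(lambda x: mean_y, xs, ys)
mse_boosted = mse(gradient_boost(xs, ys), xs, ys)
print(mse_baseline, mse_boosted)
```

Each round only has to explain what the earlier rounds left unexplained, which is why the training error keeps shrinking as stumps are added.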

Voting

bcancer = load_breast_cancer()

x_data = bcancer.data
y_data = bcancer.target

x_train, x_test, y_train, y_test = train_test_split(bcancer['data'],
  bcancer['target'], test_size=0.2,random_state=11,stratify=bcancer['target'] )

model_logi = LogisticRegression()
model_knn = KNeighborsClassifier()
model_tree = DecisionTreeClassifier()

model_vote = VotingClassifier(estimators = [('LogisticRegression',model_logi),
                              ('KNeighborsClassifier',model_knn),
                              ('DecisionTreeClassifier',model_tree)])
# default  : voting = 'hard'

model_cross = cross_validate(model_vote,X=x_train,y=y_train,cv=5)
print(model_cross)
print(model_cross['test_score'].mean())

[OUT] :

{'fit_time': array([0.03175044, 0.02892351, 0.02720141, 0.0289793 , 0.02693224]), 
'score_time': array([0.0039885 , 0.00302219, 0.00395799, 0.00296211, 0.00399232]), 
'test_score': array([0.93406593, 0.94505495, 0.97802198, 0.93406593, 0.94505495])}
0.9472527472527472

model_vote.fit(x_train,y_train)
model_vote.predict(x_test)

[OUT] :

array([0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1, 0, 0,
       1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1,
       0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1,
       0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1,
       1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0,
       1, 0, 1, 0])

model_logi.fit(x_train,y_train)
model_logi.predict(x_test)

[OUT] :

array([0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1, 0, 0,
       1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1,
       0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1,
       0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1,
       1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0,
       1, 0, 1, 0])

model_vote.score(x_test,y_test)

[OUT] :

0.9122807017543859

model_logi.score(x_test,y_test)

[OUT] :

0.9122807017543859

for c in [model_logi,model_knn,model_tree]:
    c.fit(x_train,y_train)
    print(c.__class__.__name__, c.score(x_test,y_test))

[OUT] :

LogisticRegression 0.9122807017543859
KNeighborsClassifier 0.9122807017543859
DecisionTreeClassifier 0.8859649122807017

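Conclusion

- On the test set, hard voting (0.912) matched logistic regression and KNN (0.912 each) and beat the decision tree (0.886); the prediction is a majority vote over the predicted labels (hard voting)

- Cross-validation score: 0.9472527472527472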

Voting methods

Hard Vote

Taking classification as an example, suppose the predicted classes are 1, 0, 0, 1, 1. Class 1 receives 3 votes and class 0 receives 2, so hard voting predicts 1.

Soft Vote

Soft voting averages the class probabilities from each model and predicts the class with the highest average.

For example, if the probabilities for class 0 are (0.4, 0.9, 0.9, 0.4, 0.4) and the probabilities for class 1 are (0.6, 0.1, 0.1, 0.6, 0.6),

the final probability of class 0 is (0.4+0.9+0.9+0.4+0.4) / 5 = 0.6 and the final probability of class 1 is

(0.6+0.1+0.1+0.6+0.6) / 5 = 0.4,

so soft voting selects class 0, a different result from the Hard Vote above.
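The five-model example above can be checked with a few lines of plain Python. This is only a sketch of the arithmetic, not the VotingClassifier implementation; the variable names are chosen for the illustration.

```python
from collections import Counter

# Per-model probabilities from the example above.
p_class0 = [0.4, 0.9, 0.9, 0.4, 0.4]
p_class1 = [0.6, 0.1, 0.1, 0.6, 0.6]

# Hard voting: each model votes for its more probable class,
# and the class with the most votes wins.
hard_votes = [1 if p1 > p0 else 0 for p0, p1 in zip(p_class0, p_class1)]
hard_winner = Counter(hard_votes).most_common(1)[0][0]

# Soft voting: average the probabilities per class,
# then pick the class with the larger average.
avg_class0 = sum(p_class0) / len(p_class0)
avg_class1 = sum(p_class1) / len(p_class1)
soft_winner = 0 if avg_class0 > avg_class1 else 1

print("hard votes:", hard_votes, "-> class", hard_winner)
print(f"soft averages: {avg_class0:.2f} vs {avg_class1:.2f} -> class {soft_winner}")
```

The two confident models (0.9 for class 0) outweigh the three lukewarm ones (0.6 for class 1) once probabilities are averaged, which is why the two methods disagree here.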

Exercise (wine data set)

Use three classifiers with soft voting and compute the accuracy.

Solution

wine = load_wine()

x_data = wine.data
y_data = wine.target

x_train, x_test, y_train, y_test = train_test_split(wine['data'],
  wine['target'], test_size=0.2,random_state=11,stratify=wine['target'] )

model_logi = LogisticRegression()
model_knn = KNeighborsClassifier()
model_tree = DecisionTreeClassifier()

model_vote = VotingClassifier(estimators = [('LogisticRegression',model_logi),
                              ('KNeighborsClassifier',model_knn),
                              ('DecisionTreeClassifier',model_tree)],
                             voting = 'soft')

model_cross = cross_validate(model_vote,X=x_train,y=y_train,cv=5)
print(model_cross)
print(model_cross['test_score'].mean())

[OUT] :

{'fit_time': array([0.01955009, 0.01894951, 0.01695442, 0.01695466, 0.0169549 ]), 
'score_time': array([0.00099754, 0.00099754, 0.00099754, 0.0009973 , 0.        ]), 
'test_score': array([1.        , 1.        , 0.85714286, 0.96428571, 0.92857143])}
0.95

model_vote.fit(x_train,y_train)
model_vote.predict(x_test)

[OUT] :

array([0, 1, 1, 0, 1, 2, 0, 2, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 2, 2, 0,
       1, 2, 0, 1, 2, 2, 0, 1, 2, 1, 2, 0, 1, 2])

model_logi.fit(x_train,y_train)
model_logi.predict(x_test)

[OUT] :

array([0, 1, 1, 0, 1, 2, 0, 2, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 2, 2, 0,
       1, 1, 0, 1, 2, 2, 0, 1, 2, 1, 1, 0, 1, 2])

model_vote.score(x_test,y_test)

[OUT] :

0.9444444444444444

model_logi.score(x_test,y_test)

[OUT] :

0.8888888888888888

for c in [model_logi,model_knn,model_tree]:
    c.fit(x_train,y_train)
    print(c.__class__.__name__, c.score(x_test,y_test))

[OUT] :

LogisticRegression 0.8888888888888888
KNeighborsClassifier 0.6388888888888888
DecisionTreeClassifier 0.9444444444444444

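Conclusion

- On the test set, soft voting (0.944) beat logistic regression (0.889) and KNN (0.639) and matched the decision tree (0.944)

- Cross-validation score: 0.95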

Bagging

wine = load_wine()

x_data = wine.data
y_data = wine.target

x_train, x_test, y_train, y_test = train_test_split(wine['data'],
  wine['target'], test_size=0.2,random_state=11,stratify=wine['target'] )

model_pipe_knn = make_pipeline(StandardScaler(),KNeighborsClassifier())
model_bagg = BaggingClassifier(model_pipe_knn,n_estimators=10,max_samples=0.5)

model_cross = cross_validate(model_bagg,X=x_train,y=y_train,cv=5)
print(model_cross)
print(model_cross['test_score'].mean())

[OUT] :

{'fit_time': array([0.01203084, 0.0109694 , 0.01097059, 0.01199365, 0.01196766]), 
'score_time': array([0.00498724, 0.00598478, 0.00498676, 0.00498748, 0.00496078]), 
'test_score': array([0.93103448, 1.        , 0.89285714, 1.        , 0.96428571])}
0.9576354679802955

model_bagg.fit(x_train,y_train)
model_bagg.predict(x_test)

[OUT] :

array([0, 0, 1, 0, 0, 2, 0, 2, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 2, 2, 0,
       2, 2, 0, 1, 2, 2, 0, 0, 2, 1, 2, 0, 0, 2])

model_bagg.score(x_test,y_test)

[OUT] :

0.9166666666666666

Conclusion

- Bagging score: 0.9166666666666666

- Cross-validation score: 0.9576354679802955

Random Forest (decision tree + bagging)

forest = RandomForestClassifier()

forest.fit(x_train,y_train)

[OUT] :

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)

forest.predict(x_test)

[OUT] :

array([0, 1, 1, 0, 1, 2, 0, 2, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 2, 2, 0,
       1, 2, 0, 1, 2, 2, 0, 1, 2, 1, 2, 0, 1, 2])

forest.score(x_test,y_test)

[OUT] :

0.9444444444444444

Boosting

tree = DecisionTreeClassifier(max_depth=1,criterion='entropy',
                             random_state=1)
model_ada = AdaBoostClassifier(tree)
model_ada.fit(x_train,y_train)

[OUT] :

AdaBoostClassifier(algorithm='SAMME.R',
                   base_estimator=DecisionTreeClassifier(ccp_alpha=0.0,
                                                         class_weight=None,
                                                         criterion='entropy',
                                                         max_depth=1,
                                                         max_features=None,
                                                         max_leaf_nodes=None,
                                                         min_impurity_decrease=0.0,
                                                         min_impurity_split=None,
                                                         min_samples_leaf=1,
                                                         min_samples_split=2,
                                                         min_weight_fraction_leaf=0.0,
                                                         presort='deprecated',
                                                         random_state=1,
                                                         splitter='best'),
                   learning_rate=1.0, n_estimators=50, random_state=None)

model_ada.predict(x_test)

[OUT] :

array([1, 1, 1, 0, 1, 2, 0, 2, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 2, 2, 0,
       1, 2, 0, 1, 2, 2, 0, 1, 2, 1, 2, 0, 1, 1])

model_ada.score(x_test,y_test)

[OUT] :

0.8888888888888888

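Exercise

Using the cancer data set, apply bagging and boosting to logistic regression, compute the accuracy of each, and plot the ROC curve.

(A plain ROC curve applies only to binary classification.)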
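For reference, an ROC curve is a list of (FPR, TPR) points obtained by sweeping a threshold over the predicted probability of the positive class. The following hand-rolled sketch shows the idea on four made-up points; the function name `roc_points` and the toy data are assumptions, and the function is not a replacement for `metrics.roc_curve`, which may return a slightly different set of points.

```python
def roc_points(y_true, scores):
    """Return (fpr, tpr) pairs, one per distinct threshold, highest first.

    Assumes binary labels (0 or 1) with both classes present.
    """
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = [(0.0, 0.0)]   # threshold above every score: nothing predicted positive
    for threshold in sorted(set(scores), reverse=True):
        # Everything scoring at or above the threshold is predicted positive.
        tp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 1)
        fp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 0)
        points.append((fp / negatives, tp / positives))
    return points

y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]   # predicted probability of class 1

for fpr, tpr in roc_points(y_true, scores):
    print(f"FPR={fpr:.2f}  TPR={tpr:.2f}")
```

The curve needs one positive class to define "true positive" and "false positive", which is why the exercise uses the two-class cancer data rather than the three-class wine data.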

Solution

cancer = load_breast_cancer()

x_data = cancer['data']
y_data = cancer['target']

x_train, x_test, y_train, y_test = train_test_split(x_data, 
                                                    y_data, 
                                                    test_size=0.2,
                                                    stratify=y_data)

#1 Bagging

# Bagging model
model_logi = make_pipeline( StandardScaler(), LogisticRegression() )
bagging_logi = BaggingClassifier( model_logi, max_samples=0.7)

# Cross-validation
cross_val = cross_val_score( bagging_logi, X=x_train, y=y_train )
cross_val.mean()

[OUT] :

0.9736263736263737

# Train and compute accuracy
bagging_logi.fit( x_train , y_train )
bagging_logi.score( x_test, y_test )

[OUT] :

0.9736842105263158

# ROC curve
y_score = bagging_logi.predict_proba( x_test )
fpr, tpr, thresholds = metrics.roc_curve( y_test,         # true y values
                                          y_score[:,1] )  # predicted probability of class 1

plt.plot( fpr, tpr , 'r-')
plt.plot( [0,1],[0,1], 'k--' ) # the further the curve lies above and to the left of this line, the better the classifier

plt.title('[Figure1] bagging ROC Curve')
plt.xlabel('FPR(False Positive Rate)')
plt.ylabel('TPR(True Positive Rate)')
plt.legend( ['bagging_Logistic','Guess'] )

plt.show()

#2 Boosting(Ada)

# Boosting model
model_logi = LogisticRegression()
ada_logi = AdaBoostClassifier(model_logi)

# Cross-validation
cross_val = cross_val_score(ada_logi, X=x_train, y=y_train )
cross_val.mean()

[OUT] :

0.9538461538461538

# Train and compute accuracy
ada_logi.fit(x_train,y_train)
ada_logi.score(x_test,y_test)

[OUT] :

0.9473684210526315

# ROC curve
y_score = ada_logi.predict_proba( x_test )
fpr, tpr, thresholds = metrics.roc_curve( y_test,         # true y values
                                          y_score[:,1] )  # predicted probability of class 1

plt.plot( fpr, tpr , 'r-')
plt.plot( [0,1],[0,1], 'k--' ) # the further the curve lies above and to the left of this line, the better the classifier

plt.title('[Figure2] boosting ROC Curve')
plt.xlabel('FPR(False Positive Rate)')
plt.ylabel('TPR(True Positive Rate)')
plt.legend( ['boosting_Logistic','Guess'] )

plt.show()

# Bonus exercise: Voting

models = [('ada',AdaBoostClassifier()),
          ('bg',BaggingClassifier()),
          ('logi',LogisticRegression()),
          ('tree',DecisionTreeClassifier()),
          ('knn',KNeighborsClassifier())]
model_vote = VotingClassifier(models,voting='soft')
model_vote.fit(x_train,y_train)

[OUT] :

VotingClassifier(estimators=[('ada',
                              AdaBoostClassifier(algorithm='SAMME.R',
                                                 base_estimator=None,
                                                 learning_rate=1.0,
                                                 n_estimators=50,
                                                 random_state=None)),
                             ('bg',
                              BaggingClassifier(base_estimator=None,
                                                bootstrap=True,
                                                bootstrap_features=False,
                                                max_features=1.0,
                                                max_samples=1.0,
                                                n_estimators=10, n_jobs=None,
                                                oob_score=False,
                                                random_state=None, verbose=0,
                                                warm_start=F...
                                                     min_impurity_split=None,
                                                     min_samples_leaf=1,
                                                     min_samples_split=2,
                                                     min_weight_fraction_leaf=0.0,
                                                     presort='deprecated',
                                                     random_state=None,
                                                     splitter='best')),
                             ('knn',
                              KNeighborsClassifier(algorithm='auto',
                                                   leaf_size=30,
                                                   metric='minkowski',
                                                   metric_params=None,
                                                   n_jobs=None, n_neighbors=5,
                                                   p=2, weights='uniform'))],
                 flatten_transform=True, n_jobs=None, voting='soft',
                 weights=None)

model_vote.score(x_test,y_test)

[OUT] :

0.956140350877193
