

[Python] 2. Normalization (Scaling): Correlation and Multicollinearity

import pandas as pd
import numpy as np
import seaborn as sns
from sklearn.datasets import load_boston, load_iris
from sklearn.linear_model import LinearRegression,Ridge, SGDRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import MinMaxScaler, StandardScaler

import mglearn
import matplotlib.pyplot as plt
import matplotlib
matplotlib.rcParams['font.family']='Malgun Gothic'
matplotlib.rcParams['axes.unicode_minus'] = False

import warnings
warnings.simplefilter('ignore')

Linear Models

  • Linear models are machine learning methods that have been widely used and studied for a long time.
  • A linear model makes its predictions with a linear function of the input features.
  • The linear model for regression is defined as shown below.
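In the usual notation, with p input features x_1, …, x_p, learned weights w and an intercept b, the prediction is

$$\hat{y} = w_1 x_1 + w_2 x_2 + \cdots + w_p x_p + b$$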

 

Linear Regression

  • Linear regression, or ordinary least squares (OLS), is the simplest linear model for regression analysis.
  • Linear regression finds the parameters 𝑤 that minimize the mean squared error between the model's predictions and the targets.
  • The mean squared error is defined as shown below.
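For n training samples with targets y_i and predictions ŷ_i:

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2$$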


df = pd.read_csv( 'data4/data-01.csv', header=None)
df.columns =['q1','q2','midterm','final']
df[:5]

 

# Method 1
df.iloc[:,:-1] # feature data -> keep it as a matrix (2-D), since it will be used in matmul

 

# Method 2
df[['midterm','q1','q2']]

 

# Method 3
df[df.columns.difference(['final'])] # every column except 'final'

 

df.iloc[:,-1]
[OUT]:

0     152
1     185
2     180
3     196
4     142
5     101
6     149
7     115
8     175
9     164
10    141
11    141
12    184
13    152
14    148
15    192
16    147
17    183
18    177
19    159
20    177
21    175
22    175
23    149
24    192
Name: final, dtype: int64

 

x_data = df.iloc[:,:-1]
y_data = df.iloc[:,-1]
model_lr = LinearRegression()
model_lr.fit(x_data,y_data)
[OUT]:

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

 

model_lr.coef_ # 3 coefficients (one per feature)
[OUT]:

array([0.35593822, 0.54251876, 1.16744422])

 

model_lr.intercept_ # 1 intercept
[OUT]:

-4.3361024012403675

Exercise

 

What is the predicted final score when q1 = 70, q2 = 75, midterm = 75?

# 1. Compute directly
# w1 * x1 + w2 * x2 + w3 * x3 + b
model_lr.coef_[0] * 70 + model_lr.coef_[1] * 75 + model_lr.coef_[2] * 75 + model_lr.intercept_
[OUT]:

148.8267959476565

 

# 2. Use the model's predict method
model_lr.predict([[70,75,75]])
[OUT]:

array([148.82679595])

 

# 3. Matrix multiplication
np.matmul([[70,75,75]],model_lr.coef_.reshape(-1,1)) + model_lr.intercept_
[OUT]:

array([[148.82679595]])

 

# coefficient of determination (R^2)
model_lr.score(x_data,y_data)
[OUT]:

0.98966157894484
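score() returns the coefficient of determination, R² = 1 − SS_res / SS_tot. A minimal check of this by hand, using the model_lr, x_data and y_data defined above:

# manual R^2: 1 - (residual sum of squares) / (total sum of squares)
y_pred = model_lr.predict(x_data)
ss_res = np.sum((y_data - y_pred) ** 2)           # residual sum of squares
ss_tot = np.sum((y_data - y_data.mean()) ** 2)    # total sum of squares
print(1 - ss_res / ss_tot)                        # should match model_lr.score(x_data, y_data)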

 

plt.plot(y_data)
plt.plot(model_lr.predict(x_data),'r--')
plt.show()


Exercise

 

1. Find the predicted final score when q1 = 60, q2 = 55, midterm = 65.

 

2. Find the predicted final score when q1 = 90, q2 = 85, midterm = 95.


Solution

 

1. Find the predicted final score when q1 = 60, q2 = 55, midterm = 65.

 

model_lr.predict([[60,55,65]])
[OUT]:

array([122.74259645])

 

2. Find the predicted final score when q1 = 90, q2 = 85, midterm = 95.

 

model_lr.predict([[90,85,95]])
[OUT]:

array([184.71963222])

Boston House Price Dataset

  • House price data can be used in many ways, such as analyzing cities, real estate, and economic conditions.
  • The Boston house price data comes from the StatLib library maintained at Carnegie Mellon University.
  • The Boston data was used in the paper by Harrison, D. and Rubinfeld, D. L., 'Hedonic prices and the demand for clean air', J. Environ. Economics & Management.
  • It consists of 506 census tracts of Boston from the 1970 census, with 13 attributes that affect the house price plus the target (MEDV).

 

boston = load_boston()
boston # a Bunch object (behaves like a dict)
[OUT]:

{'data': array([[6.3200e-03, 1.8000e+01, 2.3100e+00, ..., 1.5300e+01, 3.9690e+02,
         4.9800e+00],
        [2.7310e-02, 0.0000e+00, 7.0700e+00, ..., 1.7800e+01, 3.9690e+02,
         9.1400e+00],
        [2.7290e-02, 0.0000e+00, 7.0700e+00, ..., 1.7800e+01, 3.9283e+02,
         4.0300e+00],
        ...,
        [6.0760e-02, 0.0000e+00, 1.1930e+01, ..., 2.1000e+01, 3.9690e+02,
         5.6400e+00],
        [1.0959e-01, 0.0000e+00, 1.1930e+01, ..., 2.1000e+01, 3.9345e+02,
         6.4800e+00],
        [4.7410e-02, 0.0000e+00, 1.1930e+01, ..., 2.1000e+01, 3.9690e+02,
         7.8800e+00]]),
 'target': array([24. , 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, 18.9, 15. ,
        18.9, 21.7, 20.4, 18.2, 19.9, 23.1, 17.5, 20.2, 18.2, 13.6, 19.6,
        15.2, 14.5, 15.6, 13.9, 16.6, 14.8, 18.4, 21. , 12.7, 14.5, 13.2,
        13.1, 13.5, 18.9, 20. , 21. , 24.7, 30.8, 34.9, 26.6, 25.3, 24.7,
        21.2, 19.3, 20. , 16.6, 14.4, 19.4, 19.7, 20.5, 25. , 23.4, 18.9,
        35.4, 24.7, 31.6, 23.3, 19.6, 18.7, 16. , 22.2, 25. , 33. , 23.5,
        19.4, 22. , 17.4, 20.9, 24.2, 21.7, 22.8, 23.4, 24.1, 21.4, 20. ,
        20.8, 21.2, 20.3, 28. , 23.9, 24.8, 22.9, 23.9, 26.6, 22.5, 22.2,
        23.6, 28.7, 22.6, 22. , 22.9, 25. , 20.6, 28.4, 21.4, 38.7, 43.8,
        33.2, 27.5, 26.5, 18.6, 19.3, 20.1, 19.5, 19.5, 20.4, 19.8, 19.4,
        21.7, 22.8, 18.8, 18.7, 18.5, 18.3, 21.2, 19.2, 20.4, 19.3, 22. ,
        20.3, 20.5, 17.3, 18.8, 21.4, 15.7, 16.2, 18. , 14.3, 19.2, 19.6,
        23. , 18.4, 15.6, 18.1, 17.4, 17.1, 13.3, 17.8, 14. , 14.4, 13.4,
        15.6, 11.8, 13.8, 15.6, 14.6, 17.8, 15.4, 21.5, 19.6, 15.3, 19.4,
        17. , 15.6, 13.1, 41.3, 24.3, 23.3, 27. , 50. , 50. , 50. , 22.7,
        25. , 50. , 23.8, 23.8, 22.3, 17.4, 19.1, 23.1, 23.6, 22.6, 29.4,
        23.2, 24.6, 29.9, 37.2, 39.8, 36.2, 37.9, 32.5, 26.4, 29.6, 50. ,
        32. , 29.8, 34.9, 37. , 30.5, 36.4, 31.1, 29.1, 50. , 33.3, 30.3,
        34.6, 34.9, 32.9, 24.1, 42.3, 48.5, 50. , 22.6, 24.4, 22.5, 24.4,
        20. , 21.7, 19.3, 22.4, 28.1, 23.7, 25. , 23.3, 28.7, 21.5, 23. ,
        26.7, 21.7, 27.5, 30.1, 44.8, 50. , 37.6, 31.6, 46.7, 31.5, 24.3,
        31.7, 41.7, 48.3, 29. , 24. , 25.1, 31.5, 23.7, 23.3, 22. , 20.1,
        22.2, 23.7, 17.6, 18.5, 24.3, 20.5, 24.5, 26.2, 24.4, 24.8, 29.6,
        42.8, 21.9, 20.9, 44. , 50. , 36. , 30.1, 33.8, 43.1, 48.8, 31. ,
        36.5, 22.8, 30.7, 50. , 43.5, 20.7, 21.1, 25.2, 24.4, 35.2, 32.4,
        32. , 33.2, 33.1, 29.1, 35.1, 45.4, 35.4, 46. , 50. , 32.2, 22. ,
        20.1, 23.2, 22.3, 24.8, 28.5, 37.3, 27.9, 23.9, 21.7, 28.6, 27.1,
        20.3, 22.5, 29. , 24.8, 22. , 26.4, 33.1, 36.1, 28.4, 33.4, 28.2,
        22.8, 20.3, 16.1, 22.1, 19.4, 21.6, 23.8, 16.2, 17.8, 19.8, 23.1,
        21. , 23.8, 23.1, 20.4, 18.5, 25. , 24.6, 23. , 22.2, 19.3, 22.6,
        19.8, 17.1, 19.4, 22.2, 20.7, 21.1, 19.5, 18.5, 20.6, 19. , 18.7,
# ... (output truncated) ...
        23.1, 19.7, 18.3, 21.2, 17.5, 16.8, 22.4, 20.6, 23.9, 22. , 11.9]),
 'feature_names': array(['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD',
        'TAX', 'PTRATIO', 'B', 'LSTAT'], dtype='<U7'),
 'DESCR': ".. _boston_dataset:\n\nBoston house prices dataset\n---------------------------\n\n**Data Set Characteristics:**  \n\n    :Number of Instances: 506 \n\n    :Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.\n\n    :Attribute Information (in order):\n        - CRIM     per capita crime rate by town\n        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.\n        - INDUS    proportion of non-retail business acres per town\n        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)\n        - NOX      nitric oxides concentration (parts per 10 million)\n        - RM       average number of rooms per dwelling\n        - AGE      proportion of owner-occupied units built prior to 1940\n        - DIS      weighted distances to five Boston employment centres\n        - RAD      index of accessibility to radial highways\n        - TAX      full-value property-tax rate per $10,000\n        - PTRATIO  pupil-teacher ratio by town\n        - B        1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town\n        - LSTAT    % lower status of the population\n        - MEDV     Median value of owner-occupied homes in $1000's\n\n    :Missing Attribute Values: None\n\n    :Creator: Harrison, D. and Rubinfeld, D.L.\n\nThis is a copy of UCI ML housing dataset.\nhttps://archive.ics.uci.edu/ml/machine-learning-databases/housing/\n\n\nThis dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.\n\nThe Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic\nprices and the demand for clean air', J. Environ. Economics & Management,\nvol.5, 81-102, 1978.   Used in Belsley, Kuh & Welsch, 'Regression diagnostics\n...', Wiley, 1980.   N.B. Various transformations are used in the table on\npages 244-261 of the latter.\n\nThe Boston house-price data has been used in many machine learning papers that address regression\nproblems.   \n     \n.. topic:: References\n\n   - Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data and Sources of Collinearity', Wiley, 1980. 244-261.\n   - Quinlan,R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth International Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.\n",
 'filename': 'C:\\anaconda3\\lib\\site-packages\\sklearn\\datasets\\data\\boston_house_prices.csv'}

 

boston.keys()
[OUT]:

dict_keys(['data', 'target', 'feature_names', 'DESCR', 'filename'])

 

boston['data']
[OUT]:

array([[6.3200e-03, 1.8000e+01, 2.3100e+00, ..., 1.5300e+01, 3.9690e+02,
        4.9800e+00],
       [2.7310e-02, 0.0000e+00, 7.0700e+00, ..., 1.7800e+01, 3.9690e+02,
        9.1400e+00],
       [2.7290e-02, 0.0000e+00, 7.0700e+00, ..., 1.7800e+01, 3.9283e+02,
        4.0300e+00],
       ...,
       [6.0760e-02, 0.0000e+00, 1.1930e+01, ..., 2.1000e+01, 3.9690e+02,
        5.6400e+00],
       [1.0959e-01, 0.0000e+00, 1.1930e+01, ..., 2.1000e+01, 3.9345e+02,
        6.4800e+00],
       [4.7410e-02, 0.0000e+00, 1.1930e+01, ..., 2.1000e+01, 3.9690e+02,
        7.8800e+00]])

 

boston['target']
[OUT]:

array([24. , 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, 18.9, 15. ,
       18.9, 21.7, 20.4, 18.2, 19.9, 23.1, 17.5, 20.2, 18.2, 13.6, 19.6,
       15.2, 14.5, 15.6, 13.9, 16.6, 14.8, 18.4, 21. , 12.7, 14.5, 13.2,
       13.1, 13.5, 18.9, 20. , 21. , 24.7, 30.8, 34.9, 26.6, 25.3, 24.7,
       21.2, 19.3, 20. , 16.6, 14.4, 19.4, 19.7, 20.5, 25. , 23.4, 18.9,
       35.4, 24.7, 31.6, 23.3, 19.6, 18.7, 16. , 22.2, 25. , 33. , 23.5,
       19.4, 22. , 17.4, 20.9, 24.2, 21.7, 22.8, 23.4, 24.1, 21.4, 20. ,
       20.8, 21.2, 20.3, 28. , 23.9, 24.8, 22.9, 23.9, 26.6, 22.5, 22.2,
       23.6, 28.7, 22.6, 22. , 22.9, 25. , 20.6, 28.4, 21.4, 38.7, 43.8,
       33.2, 27.5, 26.5, 18.6, 19.3, 20.1, 19.5, 19.5, 20.4, 19.8, 19.4,
       21.7, 22.8, 18.8, 18.7, 18.5, 18.3, 21.2, 19.2, 20.4, 19.3, 22. ,
       20.3, 20.5, 17.3, 18.8, 21.4, 15.7, 16.2, 18. , 14.3, 19.2, 19.6,
       23. , 18.4, 15.6, 18.1, 17.4, 17.1, 13.3, 17.8, 14. , 14.4, 13.4,
       15.6, 11.8, 13.8, 15.6, 14.6, 17.8, 15.4, 21.5, 19.6, 15.3, 19.4,
       17. , 15.6, 13.1, 41.3, 24.3, 23.3, 27. , 50. , 50. , 50. , 22.7,
       25. , 50. , 23.8, 23.8, 22.3, 17.4, 19.1, 23.1, 23.6, 22.6, 29.4,
       23.2, 24.6, 29.9, 37.2, 39.8, 36.2, 37.9, 32.5, 26.4, 29.6, 50. ,
       32. , 29.8, 34.9, 37. , 30.5, 36.4, 31.1, 29.1, 50. , 33.3, 30.3,
       34.6, 34.9, 32.9, 24.1, 42.3, 48.5, 50. , 22.6, 24.4, 22.5, 24.4,
       20. , 21.7, 19.3, 22.4, 28.1, 23.7, 25. , 23.3, 28.7, 21.5, 23. ,
       26.7, 21.7, 27.5, 30.1, 44.8, 50. , 37.6, 31.6, 46.7, 31.5, 24.3,
       31.7, 41.7, 48.3, 29. , 24. , 25.1, 31.5, 23.7, 23.3, 22. , 20.1,
       22.2, 23.7, 17.6, 18.5, 24.3, 20.5, 24.5, 26.2, 24.4, 24.8, 29.6,
       42.8, 21.9, 20.9, 44. , 50. , 36. , 30.1, 33.8, 43.1, 48.8, 31. ,
       36.5, 22.8, 30.7, 50. , 43.5, 20.7, 21.1, 25.2, 24.4, 35.2, 32.4,
       32. , 33.2, 33.1, 29.1, 35.1, 45.4, 35.4, 46. , 50. , 32.2, 22. ,
       20.1, 23.2, 22.3, 24.8, 28.5, 37.3, 27.9, 23.9, 21.7, 28.6, 27.1,
       20.3, 22.5, 29. , 24.8, 22. , 26.4, 33.1, 36.1, 28.4, 33.4, 28.2,
       22.8, 20.3, 16.1, 22.1, 19.4, 21.6, 23.8, 16.2, 17.8, 19.8, 23.1,
       21. , 23.8, 23.1, 20.4, 18.5, 25. , 24.6, 23. , 22.2, 19.3, 22.6,
       19.8, 17.1, 19.4, 22.2, 20.7, 21.1, 19.5, 18.5, 20.6, 19. , 18.7,
       32.7, 16.5, 23.9, 31.2, 17.5, 17.2, 23.1, 24.5, 26.6, 22.9, 24.1,
       18.6, 30.1, 18.2, 20.6, 17.8, 21.7, 22.7, 22.6, 25. , 19.9, 20.8,
       16.8, 21.9, 27.5, 21.9, 23.1, 50. , 50. , 50. , 50. , 50. , 13.8,
       13.8, 15. , 13.9, 13.3, 13.1, 10.2, 10.4, 10.9, 11.3, 12.3,  8.8,
        7.2, 10.5,  7.4, 10.2, 11.5, 15.1, 23.2,  9.7, 13.8, 12.7, 13.1,
       12.5,  8.5,  5. ,  6.3,  5.6,  7.2, 12.1,  8.3,  8.5,  5. , 11.9,
       27.9, 17.2, 27.5, 15. , 17.2, 17.9, 16.3,  7. ,  7.2,  7.5, 10.4,
        8.8,  8.4, 16.7, 14.2, 20.8, 13.4, 11.7,  8.3, 10.2, 10.9, 11. ,
        9.5, 14.5, 14.1, 16.1, 14.3, 11.7, 13.4,  9.6,  8.7,  8.4, 12.8,
       10.5, 17.1, 18.4, 15.4, 10.8, 11.8, 14.9, 12.6, 14.1, 13. , 13.4,
       15.2, 16.1, 17.8, 14.9, 14.1, 12.7, 13.5, 14.9, 20. , 16.4, 17.7,
       19.5, 20.2, 21.4, 19.9, 19. , 19.1, 19.1, 20.1, 19.9, 19.6, 23.2,
       29.8, 13.8, 13.3, 16.7, 12. , 14.6, 21.4, 23. , 23.7, 25. , 21.8,
       20.6, 21.2, 19.1, 20.6, 15.2,  7. ,  8.1, 13.6, 20.1, 21.8, 24.5,
       23.1, 19.7, 18.3, 21.2, 17.5, 16.8, 22.4, 20.6, 23.9, 22. , 11.9])

 

boston['feature_names']
[OUT]:

array(['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD',
       'TAX', 'PTRATIO', 'B', 'LSTAT'], dtype='<U7')

 

boston_df = pd.DataFrame(boston['data'], columns=boston['feature_names'])
boston_df['MEDV'] = boston.target
boston_df

 

boston_df.columns
[OUT]:

Index(['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX',
       'PTRATIO', 'B', 'LSTAT', 'MEDV'],
      dtype='object')

 

len(boston_df.columns) # 13 feature columns + 1 label column (MEDV, the house price)
[OUT]:

14

 

len(boston_df)
[OUT]:

506

 

x_data = boston_df.iloc[:,:-1] # feature data
y_data = boston_df.iloc[:,-1] # label data
x_train, x_test, y_train, y_test =  train_test_split(x_data,y_data,
                                                    test_size=0.2,random_state=1)
x_train[:5]                                                    

 

x_test.shape
[OUT]:

(102, 13)

 

modelBoston = LinearRegression()
modelBoston.fit(x_train,y_train)
print('Train R2 :',modelBoston.score(x_train,y_train))
print('Test R2 :',modelBoston.score(x_test,y_test))
# if the test R2 is much lower than the train R2, the model is overfitting
[OUT]:

Train R2 : 0.7293585058196337
Test R2 : 0.7634174432138463

 

x_test.iloc[0]
[OUT]:

CRIM         0.04932
ZN          33.00000
INDUS        2.18000
CHAS         0.00000
NOX          0.47200
RM           6.84900
AGE         70.30000
DIS          3.18270
RAD          7.00000
TAX        222.00000
PTRATIO     18.40000
B          396.90000
LSTAT        7.53000
Name: 307, dtype: float64

 

y_test.iloc[0]
[OUT]:

28.2

 

modelBoston.predict([x_test.iloc[0]])
[OUT]:

array([32.65503184])

Exercise

 

1. Find the train set R2 and test set R2 using learning (SGDRegressor).

 

2. Find the train set R2 and test set R2 using deep learning (MLPRegressor).


Solution

 

1. Find the train set R2 and test set R2 using learning (SGDRegressor).

 

# learning (SGD)
modelSGD = SGDRegressor(max_iter=10000, alpha=0.0001,
                        early_stopping=True, verbose=1)
modelSGD.fit(x_train,y_train)
[OUT]:

-- Epoch 1
Norm: 1000972354801.43, NNZs: 13, Bias: 4875876688.731638, T: 363, Avg. loss: 175655328978744046681447202816.000000
Total training time: 0.00 seconds.
-- Epoch 2
Norm: 2248035349382.25, NNZs: 13, Bias: -5359221226.635144, T: 726, Avg. loss: 82599512282602847752893235200.000000
Total training time: 0.00 seconds.
-- Epoch 3
Norm: 805374884791.66, NNZs: 13, Bias: 11287484422.079906, T: 1089, Avg. loss: 65193196772991794807832576000.000000
Total training time: 0.00 seconds.
-- Epoch 4
Norm: 1296516696646.59, NNZs: 13, Bias: 5356965431.623692, T: 1452, Avg. loss: 51648732282067209500625993728.000000
Total training time: 0.00 seconds.
-- Epoch 5
Norm: 1149429087801.64, NNZs: 13, Bias: 10021548118.592192, T: 1815, Avg. loss: 48182121091369778135561142272.000000
Total training time: 0.00 seconds.
-- Epoch 6
Norm: 1076366996014.92, NNZs: 13, Bias: 5610700693.707168, T: 2178, Avg. loss: 41605617825969045559515807744.000000
Total training time: 0.00 seconds.
-- Epoch 7
Norm: 1015895432936.18, NNZs: 13, Bias: 7436521956.549241, T: 2541, Avg. loss: 38515237284027625701556355072.000000
Total training time: 0.00 seconds.
-- Epoch 8
Norm: 1430633239000.05, NNZs: 13, Bias: 11638016811.682087, T: 2904, Avg. loss: 35910448101155580703525568512.000000
Total training time: 0.00 seconds.
-- Epoch 9
Norm: 1441775053656.24, NNZs: 13, Bias: 18380039313.811031, T: 3267, Avg. loss: 34942365753719340041976676352.000000
Total training time: 0.00 seconds.
-- Epoch 10
Norm: 981556486355.73, NNZs: 13, Bias: 11145416217.229061, T: 3630, Avg. loss: 32132024253372650615583277056.000000
Total training time: 0.01 seconds.
-- Epoch 11
Norm: 936925396971.25, NNZs: 13, Bias: 17919510018.562450, T: 3993, Avg. loss: 29385536584489097520935337984.000000
Total training time: 0.01 seconds.
-- Epoch 12
Norm: 1102400034251.60, NNZs: 13, Bias: 18426686411.697453, T: 4356, Avg. loss: 29288317556616796956432269312.000000
Total training time: 0.01 seconds.
-- Epoch 13
Norm: 1302258219884.28, NNZs: 13, Bias: 21808507946.957272, T: 4719, Avg. loss: 27717310161958140201163292672.000000
Total training time: 0.01 seconds.
-- Epoch 14
Norm: 1001988151016.18, NNZs: 13, Bias: 16284442595.832541, T: 5082, Avg. loss: 26000865104349300821961211904.000000
Total training time: 0.01 seconds.
-- Epoch 15
Norm: 1405018245984.92, NNZs: 13, Bias: 17968958283.249458, T: 5445, Avg. loss: 26060618444452566532054056960.000000
Total training time: 0.01 seconds.
-- Epoch 16
Norm: 1563056952337.01, NNZs: 13, Bias: 21441820968.204262, T: 5808, Avg. loss: 24784813485539757735838482432.000000
Total training time: 0.01 seconds.
-- Epoch 17
Norm: 1085136714666.82, NNZs: 13, Bias: 20246179829.370113, T: 6171, Avg. loss: 24061365657598261301407121408.000000
Total training time: 0.01 seconds.
-- Epoch 18
Norm: 1588249228405.20, NNZs: 13, Bias: 16587896127.834354, T: 6534, Avg. loss: 23187350012691487262190338048.000000
Total training time: 0.01 seconds.
-- Epoch 19
Norm: 1419668291958.70, NNZs: 13, Bias: 17692683983.104443, T: 6897, Avg. loss: 22718598043919622623522193408.000000
Total training time: 0.01 seconds.
-- Epoch 20
Norm: 1669457402745.95, NNZs: 13, Bias: 14427258883.285353, T: 7260, Avg. loss: 22541103939184081350273531904.000000
Total training time: 0.01 seconds.
-- Epoch 21
Norm: 1456040765071.93, NNZs: 13, Bias: 11194914940.566019, T: 7623, Avg. loss: 21565047387416307686975733760.000000
Total training time: 0.01 seconds.
Convergence after 21 epochs took 0.01 seconds
SGDRegressor(alpha=0.0001, average=False, early_stopping=True, epsilon=0.1,
             eta0=0.01, fit_intercept=True, l1_ratio=0.15,
             learning_rate='invscaling', loss='squared_loss', max_iter=10000,
             n_iter_no_change=5, penalty='l2', power_t=0.25, random_state=None,
             shuffle=True, tol=0.001, validation_fraction=0.1, verbose=1,
             warm_start=False)

 

print('Train R2 :',modelSGD.score(x_train,y_train))
print('Test R2 :',modelSGD.score(x_test,y_test))
[OUT]:

Train R2 : -5.107376028662615e+26
Test R2 : -3.728272722591322e+26

 

2. Find the train set R2 and test set R2 using deep learning (MLPRegressor).

 

# deep learning (MLP)
modelNN = MLPRegressor(max_iter=5000, alpha=0.1,
                       verbose=1 ,hidden_layer_sizes=(100,10))
modelNN.fit(x_train,y_train)
[OUT]:

Iteration 1, loss = 410.58598117
Iteration 2, loss = 155.00128852
Iteration 3, loss = 193.54880162
Iteration 4, loss = 108.50415804
Iteration 5, loss = 51.69378721
Iteration 6, loss = 79.70249061
Iteration 7, loss = 63.60636043
Iteration 8, loss = 36.99753105
Iteration 9, loss = 40.78749131
Iteration 10, loss = 46.73881178
Iteration 11, loss = 36.33579031
Iteration 12, loss = 32.77907696
Iteration 13, loss = 35.16511087
Iteration 14, loss = 30.51158017
Iteration 15, loss = 26.06022453
Iteration 16, loss = 26.37237275
Iteration 17, loss = 27.07081218
Iteration 18, loss = 25.85264687
Iteration 19, loss = 24.68202430
Iteration 20, loss = 24.00486038
Iteration 21, loss = 23.84470286
Iteration 22, loss = 23.68999971
Iteration 23, loss = 25.01745463
Iteration 24, loss = 23.62987014
Iteration 25, loss = 24.48892014
Iteration 26, loss = 24.39330617
Iteration 27, loss = 23.43536905
Iteration 28, loss = 22.48148754
Iteration 29, loss = 22.55614164
Iteration 30, loss = 22.61518132
Iteration 31, loss = 21.70859104
Iteration 32, loss = 22.74354704
Iteration 33, loss = 23.50143281
Iteration 34, loss = 22.25441534
Iteration 35, loss = 21.56804314
Iteration 36, loss = 21.02223706
Iteration 37, loss = 20.74956064
Iteration 38, loss = 20.74706756
Iteration 39, loss = 26.40852987
Iteration 40, loss = 37.90708614
Iteration 41, loss = 23.34691653
Iteration 42, loss = 34.35396913
Iteration 43, loss = 27.48777948
Iteration 44, loss = 24.10178145
Iteration 45, loss = 27.92069125
Iteration 46, loss = 20.85642184
Iteration 47, loss = 25.22598048
Iteration 48, loss = 20.68438471
Iteration 49, loss = 21.88868369
Iteration 50, loss = 20.50823251
Iteration 51, loss = 20.18661269
Iteration 52, loss = 20.98836901
Iteration 53, loss = 19.40491274
Iteration 54, loss = 19.38627756
Iteration 55, loss = 19.51584550
Iteration 56, loss = 19.51943482
Iteration 57, loss = 19.06594189
Iteration 58, loss = 19.21880062
Iteration 59, loss = 19.11134186
Iteration 60, loss = 19.75823485
Iteration 61, loss = 20.34727341
Iteration 62, loss = 19.67969468
Iteration 63, loss = 19.20957985
Iteration 64, loss = 20.63997598
Iteration 65, loss = 20.14667158
Iteration 66, loss = 18.50491137
Iteration 67, loss = 19.28163817
Iteration 68, loss = 19.74873609
Iteration 69, loss = 22.24846819
Iteration 70, loss = 18.57109493
Iteration 71, loss = 18.70337981
Iteration 72, loss = 20.00890660
Iteration 73, loss = 18.71861294
Iteration 74, loss = 18.19869368
Iteration 75, loss = 17.66808398
Iteration 76, loss = 18.76628925
Iteration 77, loss = 18.74217144
Iteration 78, loss = 18.11249078
Iteration 79, loss = 18.24184136
Iteration 80, loss = 18.38733740
Iteration 81, loss = 18.32816054
Iteration 82, loss = 18.15028982
Iteration 83, loss = 18.49161193
Iteration 84, loss = 17.56914362
Iteration 85, loss = 17.45750738
Iteration 86, loss = 17.36106135
Iteration 87, loss = 18.38137755
Iteration 88, loss = 17.38648967
Iteration 89, loss = 17.59238285
Iteration 90, loss = 19.32306476
Iteration 91, loss = 19.19422569
Iteration 92, loss = 25.07658889
Iteration 93, loss = 19.66393661
Iteration 94, loss = 25.56091248
Iteration 95, loss = 17.93740542
Iteration 96, loss = 20.44030334
Iteration 97, loss = 18.91339838
Training loss did not improve more than tol=0.000100 for 10 consecutive epochs. Stopping.
MLPRegressor(activation='relu', alpha=0.1, batch_size='auto', beta_1=0.9,
             beta_2=0.999, early_stopping=False, epsilon=1e-08,
             hidden_layer_sizes=(100, 10), learning_rate='constant',
             learning_rate_init=0.001, max_fun=15000, max_iter=5000,
             momentum=0.9, n_iter_no_change=10, nesterovs_momentum=True,
             power_t=0.5, random_state=None, shuffle=True, solver='adam',
             tol=0.0001, validation_fraction=0.1, verbose=1, warm_start=False)

 

print('Train R2 :',modelNN.score(x_train,y_train))
print('Test R2 :',modelNN.score(x_test,y_test))
[OUT]:

Train R2 : 0.44676336145680695
Test R2 : 0.37978819200914804

 

The results are far from satisfactory. Why?

  • When the data has multiple features, there are points that must be checked.
  • Methods trained by iterative learning require scaling, because the feature scales can differ greatly.

With multiple (multi) features, check the following (see the sketch after this list):


1. Scaling (normalization)
2. Multicollinearity (correlation)
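A quick way to look at both points on the Boston data before scaling: compare each feature's range (the scale differences) and the pairwise correlations (feature pairs with a high absolute correlation hint at multicollinearity). A minimal sketch using only the libraries already imported; the 0.7 cut-off is just an illustrative threshold:

# 1) scale differences: e.g. TAX is in the hundreds while NOX is below 1
x_data.describe().loc[['min','max']]

# 2) correlation between the features: values near +/-1 suggest multicollinearity
corr = x_data.corr()
plt.figure(figsize=(10,8))
sns.heatmap(corr, annot=True, fmt='.2f', cmap='coolwarm')
plt.show()

# feature pairs with |correlation| > 0.7
mask = np.triu(np.ones(corr.shape), k=1).astype(bool)   # upper triangle only
pairs = corr.abs().where(mask).stack().sort_values(ascending=False)
pairs[pairs > 0.7]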

 

scaleX = StandardScaler()
x_dataS = scaleX.fit_transform(x_data)
x_dataS
[OUT]:

array([[-0.41978194,  0.28482986, -1.2879095 , ..., -1.45900038,
         0.44105193, -1.0755623 ],
       [-0.41733926, -0.48772236, -0.59338101, ..., -0.30309415,
         0.44105193, -0.49243937],
       [-0.41734159, -0.48772236, -0.59338101, ..., -0.30309415,
         0.39642699, -1.2087274 ],
       ...,
       [-0.41344658, -0.48772236,  0.11573841, ...,  1.17646583,
         0.44105193, -0.98304761],
       [-0.40776407, -0.48772236,  0.11573841, ...,  1.17646583,
         0.4032249 , -0.86530163],
       [-0.41500016, -0.48772236,  0.11573841, ...,  1.17646583,
         0.44105193, -0.66905833]])

 

x_train, x_test, y_train, y_test =  train_test_split(x_dataS,y_data,
                                                    test_size=0.2,random_state=1)

 

# learning (SGD)
modelSGD = SGDRegressor(max_iter=10000, alpha=0.0001,
                        early_stopping=True)
modelSGD.fit(x_train,y_train)
print('Train R2 :',modelSGD.score(x_train,y_train))
print('Test R2 :',modelSGD.score(x_test,y_test))
[OUT]:

Train R2 : 0.7223317095794387
Test R2 : 0.7619633956242506

 

# deep learning (MLP)
modelNN = MLPRegressor(max_iter=5000, alpha=0.1,
                      hidden_layer_sizes=(100,10))
modelNN.fit(x_train,y_train)
print('Train R2 :',modelNN.score(x_train,y_train))
print('Test R2 :',modelNN.score(x_test,y_test))
[OUT]:

Train R2 : 0.8462545317151478
Test R2 : 0.8704536888301699

Conclusion: after scaling the features the results are much better, and the deep learning model does even better than SGD (deep learning usually performs best).


y_data[0]
[OUT]:

24.0

 

modelSGD.predict([x_test[0]]) # for a numpy array, index with [0] instead of .iloc[0]
[OUT]:

array([30.77279072])

 

modelNN.predict([x_test[0]])
[OUT]:

array([28.40309629])

 

x_data.iloc[0] 
[OUT]:

CRIM         0.00632
ZN          18.00000
INDUS        2.31000
CHAS         0.00000
NOX          0.53800
RM           6.57500
AGE         65.20000
DIS          4.09000
RAD          1.00000
TAX        296.00000
PTRATIO     15.30000
B          396.90000
LSTAT        4.98000
Name: 0, dtype: float64

 

When predicting for x_data.iloc[0], it must also be scaled first, because x_train and x_test were scaled.

 

xx = scaleX.transform([x_data.iloc[0]])
xx
[OUT]:

array([[-0.41978194,  0.28482986, -1.2879095 , -0.27259857, -0.14421743,
         0.41367189, -0.12001342,  0.1402136 , -0.98284286, -0.66660821,
        -1.45900038,  0.44105193, -1.0755623 ]])

 

modelSGD.predict([xx[0]])
[OUT]:

array([30.23437136])

 

modelNN.predict([xx[0]])
[OUT]:

array([29.76331856])

Review
- When the feature scales differ greatly, scaling is essential (a convenient way to bundle this step is sketched below).
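One way to make the scaling step harder to forget (a sketch, not what was done above): put the scaler and the regressor into a single Pipeline, so the scaler is fit on the training split only and is applied automatically whenever predict is called on raw rows.

from sklearn.pipeline import make_pipeline

# scaler + regressor in one object; assumes x_train/x_test hold the *unscaled* split
pipe = make_pipeline(StandardScaler(), SGDRegressor(max_iter=10000, alpha=0.0001,
                                                    early_stopping=True))
pipe.fit(x_train, y_train)
print('Train R2 :', pipe.score(x_train, y_train))
print('Test R2 :', pipe.score(x_test, y_test))
pipe.predict([x_data.iloc[0]])   # raw (unscaled) rows can be passed directly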

 
