
[Python] 3. Multiple Linear Regression

import pandas as pd
import numpy as np
import seaborn as sns
from sklearn.datasets import load_boston
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import LinearRegression, Ridge, SGDRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler


import mglearn
import matplotlib.pyplot as plt
import matplotlib
matplotlib.rcParams['font.family'] = 'Malgun Gothic'  # Korean-capable font for plot labels
matplotlib.rcParams['axes.unicode_minus'] = False

from statsmodels.stats.outliers_influence import variance_inflation_factor

import warnings
warnings.simplefilter('ignore')
boston = load_boston()  # note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2

boston_df = pd.DataFrame(boston['data'], columns=boston['feature_names'])
boston_df['MEDV'] = boston.target  # MEDV: median home value
boston_df

 

x_data = boston_df.iloc[:,:-1]
y_data = boston_df.iloc[:,-1]
x_data.shape
[OUT]:

(506, 13)

 

x_train, x_test, y_train, y_test = train_test_split(x_data, y_data, 
                                test_size=0.2,random_state=1)

make_pipeline()

  • Applies scaling and the model in a single step
  • Used here for multiple linear regression
model = make_pipeline(StandardScaler(), SGDRegressor())  # scale, then feed straight into SGDRegressor()
model.fit(x_train,y_train)
[OUT]:

Pipeline(memory=None,
         steps=[('standardscaler',
                 StandardScaler(copy=True, with_mean=True, with_std=True)),
                ('sgdregressor',
                 SGDRegressor(alpha=0.0001, average=False, early_stopping=False,
                              epsilon=0.1, eta0=0.01, fit_intercept=True,
                              l1_ratio=0.15, learning_rate='invscaling',
                              loss='squared_loss', max_iter=1000,
                              n_iter_no_change=5, penalty='l2', power_t=0.25,
                              random_state=None, shuffle=True, tol=0.001,
                              validation_fraction=0.1, verbose=0,
                              warm_start=False))],
         verbose=False)
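For intuition, fitting the pipeline above is roughly equivalent to running the two steps by hand; a minimal sketch (the intermediate variable names are illustrative):

# what the pipeline does on fit(): scale the training data, then fit the regressor on it
scaler = StandardScaler()
x_train_scaled = scaler.fit_transform(x_train)  # fit the scaler on train data, then transform
sgd = SGDRegressor()
sgd.fit(x_train_scaled, y_train)                # fit the regressor on the scaled features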

 

print(model.score(x_train,y_train))
print(model.score(x_test,y_test))
[OUT]:

0.7282032376879544
0.7613945570359111

 

model.predict([x_test.iloc[0]])  # the pipeline scales the input internally, so no need to rescale it ourselves
[OUT]:

array([32.03212725])
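You can verify this by pulling the fitted steps back out of the pipeline (the step names are the ones shown in the pipeline output above):

scaler = model.named_steps['standardscaler']     # fitted StandardScaler
sgd = model.named_steps['sgdregressor']          # fitted SGDRegressor
row_scaled = scaler.transform([x_test.iloc[0]])  # transform only; the scaler is already fitted
print(sgd.predict(row_scaled))                   # same value as model.predict([x_test.iloc[0]])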

Cross-validation (useful when the dataset is small)

 

r2Score = cross_val_score(model,x_data,y_data,cv=10,scoring='r2',verbose=1)
print(r2Score)
print(r2Score.mean())
[OUT]:

[ 0.74250305  0.47222271 -1.03998803  0.63944922  0.55575059  0.73762767
  0.40141345 -0.11295988 -0.80462517  0.46928872]
0.20606823396915472

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  10 out of  10 | elapsed:    0.0s finished
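The fold scores swing widely and two are even negative: for regression targets, cross_val_score with an integer cv splits using an unshuffled KFold, and the Boston rows are not randomly ordered, so some folds can be unrepresentative. A sketch of the same check with shuffled folds (the random_state is chosen arbitrarily):

from sklearn.model_selection import KFold

kf = KFold(n_splits=10, shuffle=True, random_state=1)  # shuffle rows before splitting
r2Score = cross_val_score(model, x_data, y_data, cv=kf, scoring='r2')
print(r2Score.mean())  # should be far more stable than the unshuffled result above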

Multicollinearity

Multicollinearity is the phenomenon where some explanatory variables in a regression model are strongly correlated with other explanatory variables, which degrades the analysis (the coefficient estimates become unstable).

The standard diagnostic for multicollinearity in a regression model is the VIF (Variance Inflation Factor): for feature i, VIF_i = 1 / (1 - R²_i), where R²_i comes from regressing feature i on the remaining features (see the sketch after the thresholds below).

  • Safe: VIF < 5
  • Caution: 5 < VIF < 10
  • Risky: VIF > 10
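A minimal sketch of that definition, computing each VIF by hand with LinearRegression (illustrative; statsmodels' variance_inflation_factor, used below, computes the same quantity):

# VIF for column i: regress it on the other predictors and take 1 / (1 - R^2)
for col in x_data.columns:
    others = x_data.drop(columns=[col])
    r2 = LinearRegression().fit(others, x_data[col]).score(others, x_data[col])
    print(col, round(1 / (1 - r2), 2))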
plt.figure(figsize=(10,8))
sns.heatmap(boston_df.corr(),
            annot=True, cmap='Reds', vmin=-1, vmax=1)  # annot: write the value in each cell
plt.show()

 

vif = pd.DataFrame()
# caveat: this computes VIF over every column of boston_df, including the target MEDV;
# conventionally VIF is computed on the predictors only, with an intercept (constant) column added
vif['VIF Factor'] = [variance_inflation_factor(boston_df.values, i) for i in range(boston_df.shape[1])]
vif['features'] = boston_df.columns
vif

 

  • When there are many features, candidates to drop are: columns with high multicollinearity, and columns with low correlation to the target (label); a sketch of the second filter follows below
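A minimal sketch of the low-correlation filter, assuming a hypothetical cutoff of 0.2 on the absolute correlation with the target:

# |correlation with MEDV| per predictor; the 0.2 threshold is illustrative, not a rule
target_corr = boston_df.corr()['MEDV'].drop('MEDV').abs()
weak_cols = target_corr[target_corr < 0.2].index.tolist()
print(weak_cols)  # candidate columns to drop before refitting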

 


Exercise

 

Using the California housing data, run cross-validation and check for multicollinearity; then drop the columns with high multicollinearity, retrain, and check the cross-validation score again.

 

cal = fetch_california_housing()
cal
[OUT]:

{'data': array([[   8.3252    ,   41.        ,    6.98412698, ...,    2.55555556,
           37.88      , -122.23      ],
        [   8.3014    ,   21.        ,    6.23813708, ...,    2.10984183,
           37.86      , -122.22      ],
        [   7.2574    ,   52.        ,    8.28813559, ...,    2.80225989,
           37.85      , -122.24      ],
        ...,
        [   1.7       ,   17.        ,    5.20554273, ...,    2.3256351 ,
           39.43      , -121.22      ],
        [   1.8672    ,   18.        ,    5.32951289, ...,    2.12320917,
           39.43      , -121.32      ],
        [   2.3886    ,   16.        ,    5.25471698, ...,    2.61698113,
           39.37      , -121.24      ]]),
 'target': array([4.526, 3.585, 3.521, ..., 0.923, 0.847, 0.894]),
 'feature_names': ['MedInc',
  'HouseAge',
  'AveRooms',
  'AveBedrms',
  'Population',
  'AveOccup',
  'Latitude',
  'Longitude'],
 'DESCR': '.. _california_housing_dataset:\n\nCalifornia Housing dataset\n--------------------------\n\n**Data Set Characteristics:**\n\n    :Number of Instances: 20640\n\n    :Number of Attributes: 8 numeric, predictive attributes and the target\n\n    :Attribute Information:\n        - MedInc        median income in block\n        - HouseAge      median house age in block\n        - AveRooms      average number of rooms\n        - AveBedrms     average number of bedrooms\n        - Population    block population\n        - AveOccup      average house occupancy\n        - Latitude      house block latitude\n        - Longitude     house block longitude\n\n    :Missing Attribute Values: None\n\nThis dataset was obtained from the StatLib repository.\nhttp://lib.stat.cmu.edu/datasets/\n\nThe target variable is the median house value for California districts.\n\nThis dataset was derived from the 1990 U.S. census, using one row per census\nblock group. A block group is the smallest geographical unit for which the U.S.\nCensus Bureau publishes sample data (a block group typically has a population\nof 600 to 3,000 people).\n\nIt can be downloaded/loaded using the\n:func:`sklearn.datasets.fetch_california_housing` function.\n\n.. topic:: References\n\n    - Pace, R. Kelley and Ronald Barry, Sparse Spatial Autoregressions,\n      Statistics and Probability Letters, 33 (1997) 291-297\n'}

 

cal.keys()
[OUT]:

dict_keys(['data', 'target', 'feature_names', 'DESCR'])

 

cal['feature_names']
[OUT]:

['MedInc',
 'HouseAge',
 'AveRooms',
 'AveBedrms',
 'Population',
 'AveOccup',
 'Latitude',
 'Longitude']

 

print(cal['DESCR'])
[OUT]:

.. _california_housing_dataset:

California Housing dataset
--------------------------

**Data Set Characteristics:**

    :Number of Instances: 20640

    :Number of Attributes: 8 numeric, predictive attributes and the target

    :Attribute Information:
        - MedInc        median income in block
        - HouseAge      median house age in block
        - AveRooms      average number of rooms
        - AveBedrms     average number of bedrooms
        - Population    block population
        - AveOccup      average house occupancy
        - Latitude      house block latitude
        - Longitude     house block longitude

    :Missing Attribute Values: None

This dataset was obtained from the StatLib repository.
http://lib.stat.cmu.edu/datasets/

The target variable is the median house value for California districts.

This dataset was derived from the 1990 U.S. census, using one row per census
block group. A block group is the smallest geographical unit for which the U.S.
Census Bureau publishes sample data (a block group typically has a population
of 600 to 3,000 people).

It can be downloaded/loaded using the
:func:`sklearn.datasets.fetch_california_housing` function.

.. topic:: References

    - Pace, R. Kelley and Ronald Barry, Sparse Spatial Autoregressions,
      Statistics and Probability Letters, 33 (1997) 291-297

 

cal_df = pd.DataFrame(cal['data'], columns=cal['feature_names'])
cal_df['MEDV'] = cal.target  # median house value per district
cal_df

 

x_data = cal_df.iloc[:,:-1]
y_data = cal_df.iloc[:,-1]
x_train, x_test, y_train, y_test = train_test_split(x_data, y_data, 
                                test_size=0.2,random_state=1)
model = make_pipeline(StandardScaler(), LinearRegression())  # scale, then feed straight into LinearRegression()
model.fit(x_train,y_train)
[OUT]:

Pipeline(memory=None,
         steps=[('standardscaler',
                 StandardScaler(copy=True, with_mean=True, with_std=True)),
                ('linearregression',
                 LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
                                  normalize=False))],
         verbose=False)

 

print(model.score(x_train,y_train))
print(model.score(x_test,y_test))
[OUT]:

0.6083741964648377
0.5965968374812352

 

r2Score = cross_val_score(model,x_data,y_data,cv=5,scoring='r2',verbose=1)
print(r2Score)
print(r2Score.mean())
[OUT]:

[0.54866323 0.46820691 0.55078434 0.53698703 0.66051406]
0.5530311140279566

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:    0.0s finished

 

Check the correlations between features

plt.figure(figsize=(10,8))
sns.heatmap(cal_df.corr(),
            annot=True, cmap='Reds', vmin=-1, vmax=1)  # annot: write the value in each cell
plt.show()

 

sns.pairplot(cal_df)
plt.show()

 

Check the VIF and drop the two columns with the highest multicollinearity

vif = pd.DataFrame()
# same caveat as above: cal_df still includes the target MEDV here
vif['VIF Factor'] = [variance_inflation_factor(cal_df.values, i) for i in range(cal_df.shape[1])]
vif['features'] = cal_df.columns
vif

 

vifX = vif.nlargest(2, 'VIF Factor')['features'].tolist()  # names of the two features with the largest VIF
vifX
vifX
[OUT]:

['Longitude', 'Latitude']

 

calDF: the data with the two high-multicollinearity columns removed

calDF = cal_df.drop(columns=vifX)
calDF

 

x_data = calDF.iloc[:,:-1]
y_data = calDF.iloc[:,-1]

x_train, x_test, y_train, y_test = train_test_split(x_data, y_data, 
                                test_size=0.2,random_state=1)

model = make_pipeline(StandardScaler(), LinearRegression())  # scale, then feed straight into LinearRegression()
model.fit(x_train,y_train)

print(model.score(x_train,y_train))
print(model.score(x_test,y_test))

r2Score = cross_val_score(model,x_data,y_data,cv=10,scoring='r2',verbose=1)
print(r2Score)
print(r2Score.mean())
[OUT]:

0.5409848489986098
0.5332575128677686
[0.53071132 0.4839093  0.38987981 0.48402382 0.5075654  0.49644267
 0.17024893 0.4105453  0.29452113 0.4514216 ]
0.421926928373032
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  10 out of  10 | elapsed:    0.0s finished

 

We dropped the two columns with the highest VIF, yet the r2 score came out lower. What should we do? Keep in mind that VIF only measures redundancy among the features, not their usefulness for predicting the target: Latitude and Longitude carry real location information, so dropping them throws away signal (see the sketch below for one alternative).
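One option, since Ridge is already imported at the top: keep every feature, Latitude and Longitude included, and let L2 regularization damp the correlated coefficients instead of dropping them. A minimal sketch with the default alpha:

x_full = cal_df.iloc[:, :-1]  # all eight predictors
y_full = cal_df.iloc[:, -1]
ridge_model = make_pipeline(StandardScaler(), Ridge())
r2Score = cross_val_score(ridge_model, x_full, y_full, cv=10, scoring='r2')
print(r2Score.mean())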


Review
- understand correlation and multicollinearity