import pandas as pd
import numpy as np
import seaborn as sns
from sklearn.datasets import load_boston, load_iris
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import LinearRegression,Ridge, SGDRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler
import mglearn
import matplotlib.pyplot as plt
import matplotlib
matplotlib.rcParams['font.family']='Malgun Gothic' # Korean-capable font for plot labels
matplotlib.rcParams['axes.unicode_minus'] = False # keep minus signs rendering correctly with this font
from statsmodels.stats.outliers_influence import variance_inflation_factor
import warnings
warnings.simplefilter('ignore')
boston = load_boston()
boston_df = pd.DataFrame( boston['data'], columns=boston['feature_names'])
boston_df['MEDV'] = boston.target # MEDV: median home value
boston_df
x_data = boston_df.iloc[:,:-1]
y_data = boston_df.iloc[:,-1]
x_data.shape
[OUT]:
(506, 13)
x_train, x_test, y_train, y_test = train_test_split(x_data, y_data,
test_size=0.2,random_state=1)
make_pipeline()
- applies scaling and the model in a single step
- used here for multiple linear regression
model = make_pipeline(StandardScaler(),SGDRegressor()) # scale first, then apply SGDRegressor() right away
model.fit(x_train,y_train)
[OUT]:
Pipeline(memory=None,
steps=[('standardscaler',
StandardScaler(copy=True, with_mean=True, with_std=True)),
('sgdregressor',
SGDRegressor(alpha=0.0001, average=False, early_stopping=False,
epsilon=0.1, eta0=0.01, fit_intercept=True,
l1_ratio=0.15, learning_rate='invscaling',
loss='squared_loss', max_iter=1000,
n_iter_no_change=5, penalty='l2', power_t=0.25,
random_state=None, shuffle=True, tol=0.001,
validation_fraction=0.1, verbose=0,
warm_start=False))],
verbose=False)
print(model.score(x_train,y_train))
print(model.score(x_test,y_test))
[OUT]:
0.7282032376879544
0.7613945570359111
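For regressors, model.score returns R² (the coefficient of determination), so the two numbers above are train and test R². A quick sanity check with sklearn's r2_score:
from sklearn.metrics import r2_score
print(r2_score(y_test, model.predict(x_test))) # same value as model.score(x_test, y_test)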
model.predict([x_test.iloc[0]]) # the pipeline scales internally, so there is no need to scale the input again
[OUT]:
array([32.03212725])
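To confirm that the pipeline really does apply the scaler before the regressor, you can pull the fitted steps out via named_steps (standard Pipeline API; the keys are the lowercased class names visible in the repr above) and reproduce the prediction by hand; a minimal sketch:
# The fitted steps are exposed through named_steps.
scaler = model.named_steps['standardscaler']
sgd = model.named_steps['sgdregressor']
# Scaling manually and then predicting matches model.predict() on the raw row.
print(sgd.predict(scaler.transform([x_test.iloc[0]])))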
Cross-validation (useful when the dataset is small)
r2Score = cross_val_score(model,x_data,y_data,cv=10,scoring='r2',verbose=1)
print(r2Score)
print(r2Score.mean())
[OUT]:
[ 0.74250305 0.47222271 -1.03998803 0.63944922 0.55575059 0.73762767
0.40141345 -0.11295988 -0.80462517 0.46928872]
0.20606823396915472
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 10 out of 10 | elapsed: 0.0s finished
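The negative fold scores deserve a second look: with an integer cv, cross_val_score splits the rows in their stored order, and the Boston rows are not pre-shuffled, so some folds test on segments unlike the training data. A minimal sketch with shuffled folds (KFold is part of sklearn.model_selection):
from sklearn.model_selection import KFold
# Shuffle the rows before making 10 folds; random_state keeps it reproducible.
cv = KFold(n_splits=10, shuffle=True, random_state=1)
r2Shuffled = cross_val_score(model, x_data, y_data, cv=cv, scoring='r2')
print(r2Shuffled.mean()) # typically steadier than the ordered split above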
Multicollinearity
Multicollinearity is the situation where some of the explanatory variables in a regression model are strongly correlated with other explanatory variables, which distorts the analysis.
The standard way to detect multicollinearity in a regression model is the VIF (Variance Inflation Factor):
- safe : VIF < 5
- caution : 5 < VIF < 10
- danger : 10 < VIF
plt.figure(figsize=(10,8))
sns.heatmap(boston_df.corr(),
            annot=True, cmap='Reds', vmin=-1, vmax=1) # annot: print the correlation value in each cell
plt.show()
vif = pd.DataFrame()
vif['VIF Factor'] = [variance_inflation_factor(boston_df.values,i) for i in range(boston_df.shape[1])]
vif['features'] = boston_df.columns
vif
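For intuition about where those numbers come from: VIF_i = 1 / (1 - R_i²), where R_i² is the R² from regressing feature i on all the other features. A minimal sketch reproducing variance_inflation_factor for one column (statsmodels runs this auxiliary regression on the matrix as-is, i.e. with no intercept and an uncentered R²):
# VIF by hand for column i = 0 (an arbitrary example).
i = 0
y_i = boston_df.iloc[:, i].values
X_rest = boston_df.drop(columns=boston_df.columns[i]).values
pred = LinearRegression(fit_intercept=False).fit(X_rest, y_i).predict(X_rest)
r2_i = 1 - ((y_i - pred) ** 2).sum() / (y_i ** 2).sum() # uncentered R², matching statsmodels
print(1 / (1 - r2_i)) # same value as variance_inflation_factor(boston_df.values, 0)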
- when there are many features: drop columns with high multicollinearity, and drop columns with low correlation to the target (label); one common recipe is sketched below
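A minimal sketch of that recipe (my own helper, not from the original code): repeatedly drop the feature with the highest VIF until every remaining VIF is under a cutoff.
# Iterative VIF-based elimination; threshold=10 follows the "danger" rule of thumb above.
def drop_high_vif(df, threshold=10):
    cols = list(df.columns)
    while len(cols) > 1:
        vifs = [variance_inflation_factor(df[cols].values, i) for i in range(len(cols))]
        worst = int(np.argmax(vifs))
        if vifs[worst] < threshold:
            break
        del cols[worst] # drop the worst offender and recompute
    return cols
print(drop_high_vif(x_data)) # x_data here is the Boston feature frame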
Exercise
Using the California housing data, run cross-validation and check for multicollinearity, then drop the columns with high multicollinearity, retrain, and compare the cross-validation scores.
cal = fetch_california_housing()
cal
[OUT]:
{'data': array([[ 8.3252 , 41. , 6.98412698, ..., 2.55555556,
37.88 , -122.23 ],
[ 8.3014 , 21. , 6.23813708, ..., 2.10984183,
37.86 , -122.22 ],
[ 7.2574 , 52. , 8.28813559, ..., 2.80225989,
37.85 , -122.24 ],
...,
[ 1.7 , 17. , 5.20554273, ..., 2.3256351 ,
39.43 , -121.22 ],
[ 1.8672 , 18. , 5.32951289, ..., 2.12320917,
39.43 , -121.32 ],
[ 2.3886 , 16. , 5.25471698, ..., 2.61698113,
39.37 , -121.24 ]]),
'target': array([4.526, 3.585, 3.521, ..., 0.923, 0.847, 0.894]),
'feature_names': ['MedInc',
'HouseAge',
'AveRooms',
'AveBedrms',
'Population',
'AveOccup',
'Latitude',
'Longitude'],
'DESCR': '.. _california_housing_dataset:\n\nCalifornia Housing dataset\n--------------------------\n\n**Data Set Characteristics:**\n\n :Number of Instances: 20640\n\n :Number of Attributes: 8 numeric, predictive attributes and the target\n\n :Attribute Information:\n - MedInc median income in block\n - HouseAge median house age in block\n - AveRooms average number of rooms\n - AveBedrms average number of bedrooms\n - Population block population\n - AveOccup average house occupancy\n - Latitude house block latitude\n - Longitude house block longitude\n\n :Missing Attribute Values: None\n\nThis dataset was obtained from the StatLib repository.\nhttp://lib.stat.cmu.edu/datasets/\n\nThe target variable is the median house value for California districts.\n\nThis dataset was derived from the 1990 U.S. census, using one row per census\nblock group. A block group is the smallest geographical unit for which the U.S.\nCensus Bureau publishes sample data (a block group typically has a population\nof 600 to 3,000 people).\n\nIt can be downloaded/loaded using the\n:func:`sklearn.datasets.fetch_california_housing` function.\n\n.. topic:: References\n\n - Pace, R. Kelley and Ronald Barry, Sparse Spatial Autoregressions,\n Statistics and Probability Letters, 33 (1997) 291-297\n'}
cal.keys()
[OUT]:
dict_keys(['data', 'target', 'feature_names', 'DESCR'])
cal['feature_names']
[OUT]:
['MedInc',
'HouseAge',
'AveRooms',
'AveBedrms',
'Population',
'AveOccup',
'Latitude',
'Longitude']
print(cal['DESCR'])
[OUT]:
.. _california_housing_dataset:
California Housing dataset
--------------------------
**Data Set Characteristics:**
:Number of Instances: 20640
:Number of Attributes: 8 numeric, predictive attributes and the target
:Attribute Information:
- MedInc median income in block
- HouseAge median house age in block
- AveRooms average number of rooms
- AveBedrms average number of bedrooms
- Population block population
- AveOccup average house occupancy
- Latitude house block latitude
- Longitude house block longitude
:Missing Attribute Values: None
This dataset was obtained from the StatLib repository.
http://lib.stat.cmu.edu/datasets/
The target variable is the median house value for California districts.
This dataset was derived from the 1990 U.S. census, using one row per census
block group. A block group is the smallest geographical unit for which the U.S.
Census Bureau publishes sample data (a block group typically has a population
of 600 to 3,000 people).
It can be downloaded/loaded using the
:func:`sklearn.datasets.fetch_california_housing` function.
.. topic:: References
- Pace, R. Kelley and Ronald Barry, Sparse Spatial Autoregressions,
Statistics and Probability Letters, 33 (1997) 291-297
cal_df = pd.DataFrame(cal['data'], columns=cal['feature_names'])
cal_df['MEDV'] = cal.target
cal_df
x_data = cal_df.iloc[:,:-1]
y_data = cal_df.iloc[:,-1]
x_train, x_test, y_train, y_test = train_test_split(x_data, y_data,
test_size=0.2,random_state=1)
model = make_pipeline(StandardScaler(),LinearRegression()) # scale first, then apply LinearRegression() right away
model.fit(x_train,y_train)
[OUT]:
Pipeline(memory=None,
steps=[('standardscaler',
StandardScaler(copy=True, with_mean=True, with_std=True)),
('linearregression',
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
normalize=False))],
verbose=False)
print(model.score(x_train,y_train))
print(model.score(x_test,y_test))
[OUT]:
0.6083741964648377
0.5965968374812352
r2Score = cross_val_score(model,x_data,y_data,cv=5,scoring='r2',verbose=1)
print(r2Score)
print(r2Score.mean())
[OUT]:
[0.54866323 0.46820691 0.55078434 0.53698703 0.66051406]
0.5530311140279566
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 5 out of 5 | elapsed: 0.0s finished
Check the correlations between features
plt.figure(figsize=(10,8))
sns.heatmap(cal_df.corr(),
            annot=True, cmap='Reds', vmin=-1, vmax=1) # annot: print the correlation value in each cell
plt.show()
sns.pairplot(cal_df)
plt.show()
Check the VIF and drop the two columns with the highest multicollinearity
vif = pd.DataFrame()
vif['VIF Factor'] = [variance_inflation_factor(cal_df.values,i) for i in range(cal_df.shape[1])]
vif['features'] = cal_df.columns
vif
vifX = vif.iloc[vif['VIF Factor'].nlargest(2).index].features.values.tolist() # names of the two highest-VIF features
vifX
[OUT]:
['Longitude', 'Latitude']
calDF: the data with the two high-multicollinearity columns removed
calDF = cal_df.drop(columns=vifX)
calDF
x_data = calDF.iloc[:,:-1]
y_data = calDF.iloc[:,-1]
x_train, x_test, y_train, y_test = train_test_split(x_data, y_data,
test_size=0.2,random_state=1)
model = make_pipeline(StandardScaler(),LinearRegression()) # scale first, then apply LinearRegression() right away
model.fit(x_train,y_train)
print(model.score(x_train,y_train))
print(model.score(x_test,y_test))
r2Score = cross_val_score(model,x_data,y_data,cv=10,scoring='r2',verbose=1)
print(r2Score)
print(r2Score.mean())
[OUT]:
0.5409848489986098
0.5332575128677686
[0.53071132 0.4839093 0.38987981 0.48402382 0.5075654 0.49644267
0.17024893 0.4105453 0.29452113 0.4514216 ]
0.421926928373032
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 10 out of 10 | elapsed: 0.0s finished
We removed the two columns with the highest multicollinearity, yet the r2 score came out lower. What now?
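One answer (a sketch anticipating the next post on regularization, not part of the original code): keep all the features and let an L2 penalty shrink the correlated coefficients instead of dropping columns. Ridge is already imported above; alpha=1.0 is just a starting point worth tuning, e.g. with GridSearchCV.
# Ridge keeps every feature and damps the unstable coefficients caused by correlated columns.
ridge_model = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
ridge_r2 = cross_val_score(ridge_model, cal_df.iloc[:, :-1], cal_df.iloc[:, -1],
                           cv=10, scoring='r2')
print(ridge_r2.mean())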
Review
- understand correlation and multicollinearity