[Python] 9. Logistic Regression

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
import sklearn.metrics as metrics
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import math

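Logistic Regression

Logistic regression has "regression" in its name, but it is a model for binary classification, where there are two possible classes.
Prediction function of logistic regression: h(x) = σ(w·x + b)
σ: the sigmoid function, σ(z) = 1 / (1 + e^(-z))
A logistic regression model applies the sigmoid function to the output of a linear regression model.
The training objective is to find the parameters w that minimize the following objective function, shown here for a single sample (y is the true label, hx is the predicted probability h(x)):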

y * -np.log(hx) + (1 - y) * -np.log(1 - hx)
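
To get a feel for how this loss behaves, the following minimal sketch evaluates it for a few hand-picked probabilities. The function name `log_loss_per_sample` and the sample values are assumptions made up for this illustration; they are not part of the original post.

```python
import numpy as np

def log_loss_per_sample(y, hx):
    """Loss for one sample: y is the true label (0 or 1), hx the predicted probability of class 1."""
    return y * -np.log(hx) + (1 - y) * -np.log(1 - hx)

# True label 1, predicted probability 0.9: confident and correct, so the loss is small.
loss_good = log_loss_per_sample(1, 0.9)
# True label 1, predicted probability 0.1: confident and wrong, so the loss is large.
loss_bad = log_loss_per_sample(1, 0.1)
# True label 0, predicted probability 0.1: the mirror image of the first case.
loss_good_neg = log_loss_per_sample(0, 0.1)

print(loss_good, loss_bad, loss_good_neg)
```

Only one of the two terms is active for a given sample: the first when y is 1, the second when y is 0.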

# x_data: [study hours, days attended], y_data: fail (0) / pass (1)
x_data = np.array( [[1,3],[2,2],[3,1],[4,6],[5,5],[6,4]])
y_data = np.array( [0,0,0,1,1,1])
model_logi = LogisticRegression()
model_logi.fit(x_data,y_data)

[OUT]:

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

model_logi.coef_

[OUT]:

array([[0.78341156, 0.78341156]])

model_logi.intercept_

[OUT]:

array([-5.48382067])

# 1. Compute the probability by hand from coef_ and intercept_

z = np.matmul([[6,6]],model_logi.coef_.reshape(-1,1)) + model_logi.intercept_
z

[OUT]:

array([[3.9171181]])

def sigmoid( z ):
    return 1/(1+math.e**-z )
sigmoid(z)

[OUT]:

array([[0.98048986]])

# 2. Get the same probability from predict_proba

model_logi.predict_proba([[6,6]]) # predict_proba: values after the sigmoid, as array([[P(class 0), P(class 1)]])

[OUT]:

array([[0.01951014, 0.98048986]])

model_logi.predict([[6,6]])
# model_logi.predict_proba([[6,6]]).argmax(axis=1)

[OUT]:

array([1])

model_logi.predict([[1,1],[6,6]]) # fail, pass

[OUT]:

array([0, 1])

model_logi.score(x_data,y_data) # accuracy

[OUT]:

1.0

- For logistic regression, score returns the accuracy, not the R² coefficient of determination

x_data

[OUT]:

array([[1, 3],
       [2, 2],
       [3, 1],
       [4, 6],
       [5, 5],
       [6, 4]])

p = model_logi.predict(x_data)
p # predicted values

[OUT]:

array([0, 0, 0, 1, 1, 1])

y_data # actual values

[OUT]:

array([0, 0, 0, 1, 1, 1])

(p == y_data).mean() # accuracy

[OUT]:

1.0

- In classification, a model is not judged on accuracy alone (treat it only as a reference)

- The F1 score and the ROC curve are the usual choices
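
As a rough sketch of why accuracy alone can mislead, the example below computes accuracy, F1, and the ROC AUC with `sklearn.metrics`. The labels and probabilities in `y_true` and `y_prob` are made up for this illustration, and the 0.5 threshold is an assumption.

```python
import numpy as np
import sklearn.metrics as metrics

# Hypothetical imbalanced labels and predicted probabilities (made up for this example).
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1])
y_prob = np.array([0.1, 0.2, 0.3, 0.2, 0.6, 0.4, 0.8, 0.4])

# Turn probabilities into class predictions with a 0.5 threshold.
y_pred = (y_prob >= 0.5).astype(int)

accuracy = metrics.accuracy_score(y_true, y_pred)
f1 = metrics.f1_score(y_true, y_pred)

# The ROC curve and its AUC are computed from the probabilities, not from the thresholded labels.
fpr, tpr, thresholds = metrics.roc_curve(y_true, y_prob)
auc = metrics.roc_auc_score(y_true, y_prob)

print('accuracy:', accuracy)
print('F1      :', f1)
print('ROC AUC :', auc)
```

Here the accuracy is 0.75 because most samples belong to class 0, while the F1 score for class 1 is only 0.5, since one of the two positives is missed and one negative is flagged.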

LogisticRegression parameters

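penalty : str, 'l1', 'l2', 'elasticnet' or 'none', optional (default='l2')

- l1: Manhattan distance, loss = loss + alpha * (|w1| + |w2|). It can drive some weights to exactly zero, which reduces the number of features in use.
- l2: squared Euclidean distance, loss = loss + alpha * (w1^2 + w2^2). It shrinks the weights, which helps prevent overfitting.
- none: no weight regularization (newer scikit-learn releases use None instead of the string 'none')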

C : float, optional (default=1.0). The inverse of the regularization strength in the cost function. The higher C is, the weaker the regularization; the lower C is, the stronger the regularization.
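
The effect of C can be sketched on the same toy study-hours data used earlier. The three C values and `max_iter=1000` are arbitrary choices for this illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Same toy data as above: [study hours, days attended] -> fail (0) / pass (1)
x_data = np.array([[1, 3], [2, 2], [3, 1], [4, 6], [5, 5], [6, 4]])
y_data = np.array([0, 0, 0, 1, 1, 1])

coef_norms = {}
for C in [0.01, 1.0, 100.0]:
    # Small C = strong regularization, large C = weak regularization.
    model = LogisticRegression(C=C, max_iter=1000)
    model.fit(x_data, y_data)
    coef_norms[C] = float(np.linalg.norm(model.coef_))
    print(f"C={C}: coef_={model.coef_.ravel()}, norm={coef_norms[C]:.4f}")
```

The weights grow as C grows, because the penalty on large weights becomes weaker.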

solver

- liblinear: supports both the L1 and L2 penalties and is suited to small data sets
- sag, saga: based on stochastic gradient descent, so they are suited to large data sets. sag supports only the L2 penalty, while saga supports L1, L2, and elasticnet.
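
The solver and penalty have to be chosen together. As a minimal sketch, the example below fits the same data with the l1 and l2 penalties using liblinear, which supports both. The synthetic data from `make_classification` and the value `C=0.05` are assumptions for this illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data (an assumption for this sketch): 20 features, only 3 of them informative.
X, y = make_classification(n_samples=200, n_features=20, n_informative=3,
                           n_redundant=0, random_state=0)

# liblinear supports both penalties, so only the penalty differs between the two models.
model_l1 = LogisticRegression(penalty='l1', solver='liblinear', C=0.05).fit(X, y)
model_l2 = LogisticRegression(penalty='l2', solver='liblinear', C=0.05).fit(X, y)

# Count the weights that are exactly zero under each penalty.
n_zero_l1 = int(np.sum(model_l1.coef_ == 0))
n_zero_l2 = int(np.sum(model_l2.coef_ == 0))
print('zero weights with l1:', n_zero_l1)
print('zero weights with l2:', n_zero_l2)
```

With strong regularization, the l1 penalty sets many of the uninformative weights to exactly zero, while l2 only makes them small.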

class_weight : a hyperparameter that assigns a weight to each class, so that some classes count more than others during training
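
A minimal sketch of class_weight on imbalanced data follows. The synthetic data set (about 90% class 0) and the choice of `class_weight='balanced'` are assumptions for this illustration; recall is measured on the training data only to keep the example short.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
import sklearn.metrics as metrics

# Synthetic imbalanced data (an assumption for this sketch): roughly 90% class 0, 10% class 1.
X, y = make_classification(n_samples=1000, n_features=5, n_informative=3,
                           n_redundant=0, weights=[0.9, 0.1], random_state=0)

# Default: every sample has the same weight.
plain = LogisticRegression(max_iter=1000).fit(X, y)
# 'balanced': classes are weighted inversely to their frequency.
balanced = LogisticRegression(class_weight='balanced', max_iter=1000).fit(X, y)

# Recall for class 1 = share of actual positives that the model finds.
recall_plain = metrics.recall_score(y, plain.predict(X))
recall_balanced = metrics.recall_score(y, balanced.predict(X))
print('recall without class_weight:', recall_plain)
print('recall with balanced weights:', recall_balanced)
```

Giving the rare class more weight usually raises its recall, at the cost of more false positives.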

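alpha : the regularization strength in the penalty formulas above. The larger alpha is, the stronger the weight regularization; the smaller it is, the more closely the model fits the training data (risk of overfitting). LogisticRegression has no alpha parameter; the strength is set through C, which is its inverse.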

Loading the data (pd.read_csv)

Create a data5 folder in the same directory as your Python file and put the file 'pima-indians-diabetes.data.csv' in it

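Pima Indians diabetes data set

Feature 1 : number of past pregnancies (pregnant)

Feature 2 : plasma glucose concentration 2 hours into a glucose tolerance test (plasma)

Feature 3 : diastolic blood pressure (pressure)

Feature 4 : triceps skin fold thickness (thickness)

Feature 5 : serum insulin (insulin)

Feature 6 : body mass index (BMI)

Feature 7 : family history of diabetes (pedigree)

Feature 8 : age (age)

Class : diabetic (1), not diabetic (0)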

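Exercises

1. Split the data into train and test sets, then compute the accuracy on each

2. Use row 0 of the test set to check whether that person has diabetes

Solution

1. Split the data into train and test sets, then compute the accuracy on each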

df = pd.read_csv('data5/pima-indians-diabetes.data.csv')

# Select the features and the label, then split into train and test sets
x_data = df.iloc[:,:-1]
y_data = df.iloc[:,-1]
x_train, x_test, y_train, y_test = train_test_split(x_data, y_data, test_size=0.2,random_state=1,
                                                    stratify=y_data) # always pass stratify=y_data (keeps the 0/1 ratio the same in both sets)

# Set up the standardization and training pipeline
model_logistic = make_pipeline( StandardScaler(), LogisticRegression() )
model_logistic.fit(x_train, y_train)

# Accuracy on each data set
print('Train accuracy: ',model_logistic.score(x_train,y_train))
print('Test accuracy: ',model_logistic.score(x_test,y_test))

[OUT]:

Train accuracy:  0.7866449511400652
Test accuracy:  0.7857142857142857
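
A single train/test split gives only one accuracy estimate. `cross_val_score`, which is imported at the top of this post, averages over several splits. The sketch below uses synthetic data from `make_classification` as a stand-in so that it runs without the CSV file; the data shape and `cv=5` are assumptions for this illustration, and the scores it prints say nothing about the Pima data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption): 768 rows and 8 features, like the Pima data shape.
X, y = make_classification(n_samples=768, n_features=8, n_informative=4,
                           n_redundant=0, n_clusters_per_class=1, random_state=1)

# Same pipeline as in the solution: standardize, then fit logistic regression.
model = make_pipeline(StandardScaler(), LogisticRegression())

# 5-fold cross-validation; the scaler is re-fitted on the training part of each fold.
scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
print('fold accuracies:', scores)
print('mean accuracy  :', scores.mean())
```

Putting the scaler inside the pipeline matters here: it keeps the held-out fold from influencing the scaling statistics.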

2. Use row 0 of the test set to check whether that person has diabetes

# Predict diabetes for row 0 of the test set (the double brackets keep it a one-row DataFrame)
model_logistic.predict(x_test.iloc[[0]])

[OUT]:

array([0], dtype=int64)

Review
- Try solving the exercises without looking at the solution
