[Inflearn - Python Machine Learning Complete Guide] Machine learning classification performance metrics - Precision and Recall

Data Science 2020. 6. 9. 07:10

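Precision = TP/(FP + TP)

Precision is the fraction of the data predicted as Positive for which the prediction and the actual value are both Positive.

Recall = TP/(FN + TP)

Recall is the fraction of the data whose actual value is Positive for which the prediction and the actual value are both Positive.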
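
As a quick check of the two formulas, the sketch below computes precision and recall directly from confusion-matrix counts. The counts (TP=40, FP=10, FN=20) and the helper names are made up for illustration and do not come from any dataset in this post.

```python
def precision(tp: int, fp: int) -> float:
    """Precision = TP / (FP + TP); returns 0.0 if nothing was predicted Positive."""
    predicted_positive = fp + tp
    return tp / predicted_positive if predicted_positive else 0.0


def recall(tp: int, fn: int) -> float:
    """Recall = TP / (FN + TP); returns 0.0 if there are no actual Positives."""
    actual_positive = fn + tp
    return tp / actual_positive if actual_positive else 0.0


# Hypothetical counts, chosen only to make the arithmetic easy to follow
tp, fp, fn = 40, 10, 20

print(f"precision = {precision(tp, fp):.4f}")  # 40 / (10 + 40)
print(f"recall    = {recall(tp, fn):.4f}")     # 40 / (20 + 40)
```

Both metrics share the numerator TP; they differ only in whether the denominator counts the false alarms (FP) or the misses (FN).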

Let's build a classifier called MyFakeClassifier that predicts 0 for every input, and check its precision and recall.

from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.base import BaseEstimator
from sklearn.metrics import accuracy_score
import numpy as np
import pandas as pd

class MyFakeClassifier(BaseEstimator):
    def fit(self, X, y):
        pass

    # Return an array of zeros (all False) with one row per row of the input X
    def predict(self, X):
        return np.zeros((len(X), 1), dtype=bool)

# Load the handwritten digits dataset with scikit-learn's built-in load_digits()
digits = load_digits()

# True where the digit is 7, converted to 1 with astype(int); every other digit is False and becomes 0
y = (digits.target == 7).astype(int)
X_train, X_test, y_train, y_test = train_test_split(digits.data, y, random_state=11)

# Check the imbalanced label distribution
print('Test set label shape:', y_test.shape)
print('Distribution of labels 0 and 1 in the test set')
print(pd.Series(y_test).value_counts())

# Fit the dummy classifier and predict on the test set
fakeclf = MyFakeClassifier()
fakeclf.fit(X_train, y_train)
fakepred = fakeclf.predict(X_test)

from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Print the confusion matrix, precision, and recall for fakepred (the predictions above) against the actual labels y_test
print(confusion_matrix(y_test, fakepred))
print("Precision:", precision_score(y_test, fakepred))
print("Recall:", recall_score(y_test, fakepred))

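As the output shows, the model never predicts Positive, so TP is 0 and both precision and recall come out as 0.

(Precision = TP/(FP + TP), Recall = TP/(FN + TP): the numerator TP is 0 in both. For precision the denominator FP + TP is also 0, so the value is strictly undefined; scikit-learn reports it as 0 and emits an UndefinedMetricWarning.)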
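
To make the all-Negative case concrete, here is a minimal sketch that counts the four confusion-matrix cells by hand. The label list is hypothetical (3 Positives out of 10) rather than the digits data, and `confusion_counts` is an illustrative helper, not a scikit-learn function.

```python
def confusion_counts(y_true, y_pred):
    """Return (tn, fp, fn, tp) for binary labels encoded as 0/1."""
    pairs = list(zip(y_true, y_pred))
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    return tn, fp, fn, tp


# Hypothetical imbalanced labels: 3 Positives out of 10
y_true = [0, 0, 1, 0, 1, 0, 0, 0, 1, 0]
y_pred = [0] * len(y_true)  # a classifier that always predicts Negative

tn, fp, fn, tp = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
recall = tp / (tp + fn)
precision_is_defined = (tp + fp) > 0  # FP + TP is the precision denominator

print("TN, FP, FN, TP:", tn, fp, fn, tp)
print("accuracy:", accuracy)
print("recall:", recall)
print("precision defined?", precision_is_defined)
```

Every actual Positive lands in FN, so recall is 0 even though most samples are labeled correctly.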

Next, let's check precision and recall using the Kaggle Titanic data and the LogisticRegression algorithm.

import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, confusion_matrix


# Print the confusion matrix, precision, and recall.
# Defined here so that it exists before its first call below.
def get_clf_eval(y_test, pred):
    confusion = confusion_matrix(y_test, pred)
    precision = precision_score(y_test, pred)
    recall = recall_score(y_test, pred)
    print('Confusion matrix')
    print(confusion)
    print('Precision: {0:.4f}, Recall: {1:.4f}'.format(precision, recall))


# Reload the original data, preprocess it, and split it into training and test sets.
titanic_df = pd.read_csv('../1장/titanic/train.csv')
y_titanic_df = titanic_df['Survived']
X_titanic_df = titanic_df.drop('Survived', axis=1)

# Preprocess the features.
# transform_features() is not defined in this post; it is assumed to have been defined beforehand.
X_titanic_df = transform_features(X_titanic_df)

X_train, X_test, y_train, y_test = train_test_split(X_titanic_df, y_titanic_df,
                                                    test_size=0.20, random_state=11)

lr_clf = LogisticRegression()

lr_clf.fit(X_train, y_train)
pred = lr_clf.predict(X_test)
get_clf_eval(y_test, pred)

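The relative importance of recall and precision can differ depending on the task.

Recall is the more important metric when misclassifying an actual Positive as Negative has a large impact on the task. Cancer diagnosis is one example.

If a patient who actually has cancer is predicted as Negative, the diagnosis is missed and the consequences can be severe.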

Precision/Recall Trade-off

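When the nature of the classification task calls for particular emphasis on precision or recall, you can raise either one by adjusting the classification decision threshold.

However, precision and recall are complementary metrics, so forcing one up tends to push the other down. This is called the Precision/Recall Trade-off.

To raise precision you must lower FP, and to raise recall you must lower FN.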

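The lower the decision threshold, the more samples are predicted as Positive, so recall goes up (or at least does not go down).

Compare a threshold of 0.5 with a threshold of 0.4: at 0.4 the model predicts Positive for more samples, and in

Recall = TP/(FN + TP)

FN, the number of actual Positives predicted as Negative, shrinks, so recall rises.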
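
The effect of lowering the threshold from 0.5 to 0.4 can be sketched on a handful of hypothetical probabilities. The labels, probabilities, and helper names below are invented for illustration and are unrelated to the Titanic model.

```python
def predict_with_threshold(probs, threshold):
    """Predict 1 when the Positive-class probability is greater than the threshold."""
    return [1 if p > threshold else 0 for p in probs]


def precision_recall(y_true, y_pred):
    """Return (precision, recall) computed from TP, FP, and FN counts."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fp), tp / (tp + fn)


# Hypothetical actual labels and predicted Positive-class probabilities
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
probs = [0.90, 0.62, 0.45, 0.30, 0.48, 0.41, 0.20, 0.10]

results = {}
for threshold in (0.5, 0.4):
    y_pred = predict_with_threshold(probs, threshold)
    results[threshold] = precision_recall(y_true, y_pred)
    print(threshold, results[threshold])
```

In this toy data, moving the threshold to 0.4 recovers one more actual Positive (recall rises) but also admits two false alarms (precision falls), which is the trade-off described above.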

Checking predict_proba( )

# predict_proba() returns the predicted probability of each class
pred_proba = lr_clf.predict_proba(X_test)
pred = lr_clf.predict(X_test)
print('pred_proba() result shape: {0}'.format(pred_proba.shape))
print('First 3 rows of the pred_proba array:\n', pred_proba[:3])

# Concatenate the probability array and the predicted-label array to see probabilities and predictions side by side
pred_proba_result = np.concatenate([pred_proba, pred.reshape(-1, 1)], axis=1)
print('The class with the larger probability is the predicted class\n', pred_proba_result[:3])

Using Binarizer

from sklearn.preprocessing import Binarizer

X = [[ 1, -1,  2],
     [ 2,  0,  0],
     [ 0,  1.1, 1.2]]

# Values less than or equal to the threshold become 0; values greater than it become 1
binarizer = Binarizer(threshold=1.1)
print(binarizer.fit_transform(X))
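
The rule Binarizer applies can be written out in plain Python as a sketch. The `binarize` helper below is an illustrative stand-in and not part of scikit-learn; it returns nested lists of ints, whereas Binarizer returns a float array.

```python
def binarize(rows, threshold):
    """Map each value to 1 if it is strictly greater than threshold, else 0."""
    return [[1 if value > threshold else 0 for value in row] for row in rows]


X = [[1, -1, 2],
     [2, 0, 0],
     [0, 1.1, 1.2]]

result = binarize(X, threshold=1.1)
for row in result:
    print(row)
```

Note that 1.1 itself maps to 0, because only values strictly above the threshold become 1.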

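Converting predictions with Binarizer at a decision threshold of 0.5

from sklearn.preprocessing import Binarizer

# The Binarizer threshold, i.e. the classification decision threshold
custom_threshold = 0.5

# Take only the second column of the predict_proba() output, i.e. the Positive class column, and apply Binarizer to it
pred_proba_1 = pred_proba[:, 1].reshape(-1, 1)

binarizer = Binarizer(threshold=custom_threshold).fit(pred_proba_1)

custom_predict = binarizer.transform(pred_proba_1)

get_clf_eval(y_test, custom_predict)

The confusion matrix, precision, and recall match those of the LogisticRegression model on the Kaggle Titanic data earlier in this post. This is expected: for two classes, predict() picks the class with the larger probability, which amounts to a decision threshold of 0.5 on the Positive class.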

Converting predictions with Binarizer at a decision threshold of 0.4

# Set the Binarizer threshold to 0.4, i.e. lower the decision threshold from 0.5 to 0.4
custom_threshold = 0.4
pred_proba_1 = pred_proba[:, 1].reshape(-1, 1)

binarizer = Binarizer(threshold=custom_threshold).fit(pred_proba_1)
custom_predict = binarizer.transform(pred_proba_1)

get_clf_eval(y_test, custom_predict)

With the lower decision threshold, both FP and TP increased, and recall increased as well.

Converting predictions with Binarizer over several decision thresholds

# Store every threshold to test in a list
thresholds = [0.4, 0.45, 0.50, 0.55, 0.60]

def get_eval_by_threshold(y_test, pred_proba_c1, thresholds):
    # Iterate over the thresholds list and evaluate at each value
    for custom_threshold in thresholds:
        binarizer = Binarizer(threshold=custom_threshold).fit(pred_proba_c1)
        custom_predict = binarizer.transform(pred_proba_c1)
        print('Threshold:', custom_threshold)
        get_clf_eval(y_test, custom_predict)

get_eval_by_threshold(y_test, pred_proba[:, 1].reshape(-1, 1), thresholds)

Extracting precision and recall per threshold with precision_recall_curve( )

from sklearn.metrics import precision_recall_curve

# Extract the predicted probability of label 1
pred_proba_class1 = lr_clf.predict_proba(X_test)[:, 1]

# Pass the actual labels and the predicted probability of label 1 to precision_recall_curve
precisions, recalls, thresholds = precision_recall_curve(y_test, pred_proba_class1)
print('Shape of the returned thresholds array:', thresholds.shape)
print('Shape of the returned precisions array:', precisions.shape)
print('Shape of the returned recalls array:', recalls.shape)

print("thresholds 5 sample:", thresholds[:5])
print("precisions 5 sample:", precisions[:5])
print("recalls 5 sample:", recalls[:5])

# The returned thresholds array has 147 rows, so sample only 10 of them by stepping through it in steps of 15
thr_index = np.arange(0, thresholds.shape[0], 15)
print('10 indices into the thresholds array used for sampling:', thr_index)
print('The 10 sampled thresholds: ', np.round(thresholds[thr_index], 2))

# Precision and recall at the thresholds sampled in steps of 15
print('Precision at the sampled thresholds: ', np.round(precisions[thr_index], 3))
print('Recall at the sampled thresholds: ', np.round(recalls[thr_index], 3))
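
To see where such arrays come from, here is a simplified pure-Python sketch of a threshold sweep. It is not scikit-learn's implementation (which may differ in details such as which thresholds are kept); `pr_curve`, the four labels, and the four scores are assumptions made for illustration. Each distinct score is used as a threshold, a sample counts as Positive when its score is greater than or equal to that threshold, and a closing point (precision 1, recall 0) with no threshold is appended, which is why the precision and recall lists are one element longer than the threshold list.

```python
def pr_curve(y_true, scores):
    """Sweep every distinct score as a threshold (ascending) and collect precision/recall."""
    thresholds = sorted(set(scores))
    total_positive = sum(y_true)  # assumes at least one actual Positive
    precisions, recalls = [], []
    for thr in thresholds:
        # A sample is predicted Positive when its score is >= the threshold
        tp = sum(1 for t, s in zip(y_true, scores) if s >= thr and t == 1)
        fp = sum(1 for t, s in zip(y_true, scores) if s >= thr and t == 0)
        precisions.append(tp / (tp + fp))
        recalls.append(tp / total_positive)
    # Closing point that has no corresponding threshold
    precisions.append(1.0)
    recalls.append(0.0)
    return precisions, recalls, thresholds


# Toy labels and scores, for illustration only
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

precisions, recalls, thresholds = pr_curve(y_true, scores)
print("thresholds:", thresholds)
print("precisions:", [round(p, 3) for p in precisions])
print("recalls:   ", recalls)
```

Recall never increases as the threshold rises, while precision can move in either direction from one threshold to the next.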

Plotting the precision and recall curves as the threshold changes

import matplotlib.pyplot as plt
import matplotlib.ticker as ticker

def precision_recall_curve_plot(y_test, pred_proba_c1):
    # Extract the threshold ndarray and the precision and recall ndarrays for those thresholds
    precisions, recalls, thresholds = precision_recall_curve(y_test, pred_proba_c1)

    # Plot with the threshold on the X axis and precision and recall on the Y axis; precision is drawn as a dashed line.
    # precisions and recalls have one more element than thresholds, so slice them to the same length.
    plt.figure(figsize=(8, 6))
    threshold_boundary = thresholds.shape[0]
    plt.plot(thresholds, precisions[0:threshold_boundary], linestyle='--', label='precision')
    plt.plot(thresholds, recalls[0:threshold_boundary], label='recall')

    # Change the X axis (threshold) tick spacing to 0.1
    start, end = plt.xlim()
    plt.xticks(np.round(np.arange(start, end, 0.1), 2))

    # Set the x and y axis labels, the legend, and the grid
    plt.xlabel('Threshold value'); plt.ylabel('Precision and Recall value')
    plt.legend(); plt.grid()
    plt.show()

precision_recall_curve_plot(y_test, lr_clf.predict_proba(X_test)[:, 1])
