sklearn.model_selection.GridSearchCV summary

GridSearchCV: cross-validation + hyperparameter tuning

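  • A scikit-learn API that runs a grid search for hyperparameter tuning and cross-validation (CV) in a single step
  • It tries each of the specified hyperparameter combinations in turn and reports the best one
  • Training and evaluation run (number of grid combinations) x (number of CV folds) times
  • Finding the best parameters is convenient, but the runtime is relatively long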
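
To make the count in the list above concrete, here is a minimal sketch in plain Python. It assumes a grid shaped like the one used later in this post (5 values of max_depth, 3 values of min_samples_split) and 5 CV folds. It only enumerates the combinations with itertools.product and is not meant to mirror scikit-learn's internals.

```python
from itertools import product

# Hypothetical grid, in the dictionary shape that param_grid expects.
param_grid = {"max_depth": [1, 2, 3, 4, 5], "min_samples_split": [2, 3, 4]}
n_folds = 5  # assumed number of CV folds

# One candidate = one value picked for every hyperparameter.
names = list(param_grid)
candidates = [dict(zip(names, values)) for values in product(*param_grid.values())]

n_candidates = len(candidates)
n_cv_fits = n_candidates * n_folds  # one fit per candidate per fold
n_total_fits = n_cv_fits + 1        # plus one final refit when refit=True

print(n_candidates, n_cv_fits, n_total_fits)
```

Adding one more value to either list adds a whole row of candidates, which is why the runtime grows quickly with the grid size.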

 

GridSearchCV(estimator, param_grid, scoring=None, refit=True, cv=None)

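  • estimator : a classifier, regressor, or pipeline
  • param_grid : the hyperparameters to tune by grid search, given as a dictionary
  • scoring : the evaluation metric (None uses the estimator's own score method)
  • cv : the number of folds the data is split into for cross-validation
  • refit : when True, once the best hyperparameters are found, a clone of the given estimator is retrained with them on the whole training data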
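
The parameters above can be sketched together in one small call. This is an illustration on synthetic data from make_classification (an assumption, not the wine dataset), with a deliberately tiny grid. It also shows that the refit model lives in best_estimator_ while the object passed as estimator stays unfitted.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, used only for illustration.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

base = DecisionTreeClassifier(random_state=0)
grid = GridSearchCV(
    estimator=base,
    param_grid={"max_depth": [1, 2, 3]},
    scoring="accuracy",  # evaluation metric
    cv=3,                # 3 folds
    refit=True,          # retrain the best candidate on all of X, y
)
grid.fit(X, y)

print(grid.best_params_)
# best_estimator_ is a fitted clone; the object passed in is left unfitted.
print(grid.best_estimator_ is base)  # False
print(hasattr(base, "tree_"))        # False
```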

Example using the Dacon wine quality prediction dataset

from IPython.display import display, HTML
display(HTML("<style>.container {width:90% !important;}</style>"))
import os
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import accuracy_score
pwd
'C:\\Users\\Jay\\Desktop\\github'
df_wine = pd.read_csv('C:/Users/Jay/Desktop/python_ML/study/wine_quality/data/train.csv')
df_wine
  index quality fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol type
0 0 5 5.6 0.695 0.06 6.8 0.042 9.0 84.0 0.99432 3.44 0.44 10.2 white
1 1 5 8.8 0.610 0.14 2.4 0.067 10.0 42.0 0.99690 3.19 0.59 9.5 red
2 2 5 7.9 0.210 0.39 2.0 0.057 21.0 138.0 0.99176 3.05 0.52 10.9 white
3 3 6 7.0 0.210 0.31 6.0 0.046 29.0 108.0 0.99390 3.26 0.50 10.8 white
4 4 6 7.8 0.400 0.26 9.5 0.059 32.0 178.0 0.99550 3.04 0.43 10.9 white
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
5492 5492 5 7.7 0.150 0.29 1.3 0.029 10.0 64.0 0.99320 3.35 0.39 10.1 white
5493 5493 6 6.3 0.180 0.36 1.2 0.034 26.0 111.0 0.99074 3.16 0.51 11.0 white
5494 5494 7 7.8 0.150 0.34 1.1 0.035 31.0 93.0 0.99096 3.07 0.72 11.3 white
5495 5495 5 6.6 0.410 0.31 1.6 0.042 18.0 101.0 0.99195 3.13 0.41 10.5 white
5496 5496 6 7.0 0.350 0.17 1.1 0.049 7.0 119.0 0.99297 3.13 0.36 9.7 white

5497 rows × 14 columns

# Encode the type column
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder() # create the encoder object
df_wine['type'] = encoder.fit_transform(df_wine['type'])
df_wine
  index quality fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol type
0 0 5 5.6 0.695 0.06 6.8 0.042 9.0 84.0 0.99432 3.44 0.44 10.2 1
1 1 5 8.8 0.610 0.14 2.4 0.067 10.0 42.0 0.99690 3.19 0.59 9.5 0
2 2 5 7.9 0.210 0.39 2.0 0.057 21.0 138.0 0.99176 3.05 0.52 10.9 1
3 3 6 7.0 0.210 0.31 6.0 0.046 29.0 108.0 0.99390 3.26 0.50 10.8 1
4 4 6 7.8 0.400 0.26 9.5 0.059 32.0 178.0 0.99550 3.04 0.43 10.9 1
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
5492 5492 5 7.7 0.150 0.29 1.3 0.029 10.0 64.0 0.99320 3.35 0.39 10.1 1
5493 5493 6 6.3 0.180 0.36 1.2 0.034 26.0 111.0 0.99074 3.16 0.51 11.0 1
5494 5494 7 7.8 0.150 0.34 1.1 0.035 31.0 93.0 0.99096 3.07 0.72 11.3 1
5495 5495 5 6.6 0.410 0.31 1.6 0.042 18.0 101.0 0.99195 3.13 0.41 10.5 1
5496 5496 6 7.0 0.350 0.17 1.1 0.049 7.0 119.0 0.99297 3.13 0.36 9.7 1

5497 rows × 14 columns
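
The encoding step above turns 'red' into 0 and 'white' into 1 because LabelEncoder stores its classes in sorted order. A minimal sketch on a toy list (the first three type values from the table, used here as stand-in data):

```python
from sklearn.preprocessing import LabelEncoder

# Toy values standing in for the 'type' column.
wine_types = ["white", "red", "white"]

encoder = LabelEncoder()
codes = encoder.fit_transform(wine_types)

print(list(encoder.classes_))                     # classes in sorted order
print(codes.tolist())                             # integer code per row
print(encoder.inverse_transform(codes).tolist())  # back to the original labels
```

Keeping the fitted encoder around makes it possible to map predictions or new rows back to the original labels with inverse_transform.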

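# Convert the type column to the category dtype (intended to keep the encoded values from being read as ordinal;
# with only two classes, 0 and 1, the codes carry no ordering problem either way)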
df_wine['type']=df_wine['type'].astype('category')
df_wine.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5497 entries, 0 to 5496
Data columns (total 14 columns):
 #   Column                Non-Null Count  Dtype   
---  ------                --------------  -----   
 0   index                 5497 non-null   int64   
 1   quality               5497 non-null   int64   
 2   fixed acidity         5497 non-null   float64 
 3   volatile acidity      5497 non-null   float64 
 4   citric acid           5497 non-null   float64 
 5   residual sugar        5497 non-null   float64 
 6   chlorides             5497 non-null   float64 
 7   free sulfur dioxide   5497 non-null   float64 
 8   total sulfur dioxide  5497 non-null   float64 
 9   density               5497 non-null   float64 
 10  pH                    5497 non-null   float64 
 11  sulphates             5497 non-null   float64 
 12  alcohol               5497 non-null   float64 
 13  type                  5497 non-null   category
dtypes: category(1), float64(11), int64(2)
memory usage: 563.9 KB
df_wine.drop('index',axis=1,inplace=True)
df_wine
  quality fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol type
0 5 5.6 0.695 0.06 6.8 0.042 9.0 84.0 0.99432 3.44 0.44 10.2 1
1 5 8.8 0.610 0.14 2.4 0.067 10.0 42.0 0.99690 3.19 0.59 9.5 0
2 5 7.9 0.210 0.39 2.0 0.057 21.0 138.0 0.99176 3.05 0.52 10.9 1
3 6 7.0 0.210 0.31 6.0 0.046 29.0 108.0 0.99390 3.26 0.50 10.8 1
4 6 7.8 0.400 0.26 9.5 0.059 32.0 178.0 0.99550 3.04 0.43 10.9 1
... ... ... ... ... ... ... ... ... ... ... ... ... ...
5492 5 7.7 0.150 0.29 1.3 0.029 10.0 64.0 0.99320 3.35 0.39 10.1 1
5493 6 6.3 0.180 0.36 1.2 0.034 26.0 111.0 0.99074 3.16 0.51 11.0 1
5494 7 7.8 0.150 0.34 1.1 0.035 31.0 93.0 0.99096 3.07 0.72 11.3 1
5495 5 6.6 0.410 0.31 1.6 0.042 18.0 101.0 0.99195 3.13 0.41 10.5 1
5496 6 7.0 0.350 0.17 1.1 0.049 7.0 119.0 0.99297 3.13 0.36 9.7 1

5497 rows × 13 columns

ftr_df = df_wine.iloc[:,1:]
ftr_df
  fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol type
0 5.6 0.695 0.06 6.8 0.042 9.0 84.0 0.99432 3.44 0.44 10.2 1
1 8.8 0.610 0.14 2.4 0.067 10.0 42.0 0.99690 3.19 0.59 9.5 0
2 7.9 0.210 0.39 2.0 0.057 21.0 138.0 0.99176 3.05 0.52 10.9 1
3 7.0 0.210 0.31 6.0 0.046 29.0 108.0 0.99390 3.26 0.50 10.8 1
4 7.8 0.400 0.26 9.5 0.059 32.0 178.0 0.99550 3.04 0.43 10.9 1
... ... ... ... ... ... ... ... ... ... ... ... ...
5492 7.7 0.150 0.29 1.3 0.029 10.0 64.0 0.99320 3.35 0.39 10.1 1
5493 6.3 0.180 0.36 1.2 0.034 26.0 111.0 0.99074 3.16 0.51 11.0 1
5494 7.8 0.150 0.34 1.1 0.035 31.0 93.0 0.99096 3.07 0.72 11.3 1
5495 6.6 0.410 0.31 1.6 0.042 18.0 101.0 0.99195 3.13 0.41 10.5 1
5496 7.0 0.350 0.17 1.1 0.049 7.0 119.0 0.99297 3.13 0.36 9.7 1

5497 rows × 12 columns

tgt_df = df_wine.iloc[:,0]
tgt_df
0       5
1       5
2       5
3       6
4       6
       ..
5492    5
5493    6
5494    7
5495    5
5496    6
Name: quality, Length: 5497, dtype: int64
# train_test_split: split the data into train and test sets
# (these variables are reassigned by the second train_test_split call below)
X_train, X_test, y_train, y_test = train_test_split(ftr_df, tgt_df, 
                                                    test_size=0.3, random_state=121)


# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(ftr_df, tgt_df, 
                                                    test_size=0.2, random_state=0)
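
As a quick check of what test_size=0.2 means for 5497 rows, the sketch below runs the same call on placeholder arrays of that length (the arrays are dummies, not the wine features) and prints the resulting sizes.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 5497  # row count reported for df_wine above
X_dummy = np.arange(n_rows).reshape(-1, 1)  # placeholder features
y_dummy = np.zeros(n_rows, dtype=int)       # placeholder target

X_tr, X_te, y_tr, y_te = train_test_split(
    X_dummy, y_dummy, test_size=0.2, random_state=0
)
print(len(X_tr), len(X_te))
```

The test share is rounded up, so 1100 rows are held out and 4397 remain for the grid search.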

# Define the model
dtree = DecisionTreeClassifier()

### Set the hyperparameters to search as a dictionary
parameters = {'max_depth': [1, 2, 3, 4, 5], 'min_samples_split': [2, 3, 4]}
import pandas as pd

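# Configure the search to evaluate every hyperparameter combination in param_grid on 5 train/validation folds (cv=5).
grid_dtree = GridSearchCV(dtree, param_grid=parameters, cv=5, refit=True, return_train_score=True)
### refit=True is the default: the best parameter setting is retrained on the full training set.

# Fit on the wine training data, training and evaluating each hyperparameter combination in param_grid in turn.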
grid_dtree.fit(X_train, y_train)
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\model_selection\_split.py:666: UserWarning: The least populated class in y has only 4 members, which is less than n_splits=5.
  warnings.warn(("The least populated class in y has only %d"





GridSearchCV(cv=5, estimator=DecisionTreeClassifier(),
             param_grid={'max_depth': [1, 2, 3, 4, 5],
                         'min_samples_split': [2, 3, 4]},
             return_train_score=True)
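
The UserWarning above comes from stratified splitting: for a classifier with an integer cv, scikit-learn stratifies the folds by class, and a class with fewer members than n_splits cannot appear in every fold. A minimal sketch of that check in plain Python, using made-up labels (the counts are hypothetical, chosen only so that one class has 4 members):

```python
from collections import Counter


def smallest_class_size(labels):
    """Return (label, count) for the least frequent class."""
    counts = Counter(labels)
    label = min(counts, key=counts.get)
    return label, counts[label]


# Hypothetical labels: one rare class with 4 members, as in the warning.
y_example = [5] * 50 + [6] * 60 + [9] * 4
n_splits = 5

label, count = smallest_class_size(y_example)
if count < n_splits:
    print(f"class {label} has {count} samples, fewer than n_splits={n_splits}")
```

With the pandas Series from this post, y_train.value_counts() gives the same per-class counts directly.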
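# GridSearchCV stores its results in a dictionary called cv_results_
# Convert it to a DataFrame to inspect it
# (with cv=5, cv_results_ also holds split3_test_score and split4_test_score, which are not selected here)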
scores_df = pd.DataFrame(grid_dtree.cv_results_)
scores_df[['params', 'mean_test_score', 'rank_test_score', 
           'split0_test_score', 'split1_test_score', 'split2_test_score']]
  params mean_test_score rank_test_score split0_test_score split1_test_score split2_test_score
0 {'max_depth': 1, 'min_samples_split': 2} 0.477367 13 0.468182 0.502273 0.478953
1 {'max_depth': 1, 'min_samples_split': 3} 0.477367 13 0.468182 0.502273 0.478953
2 {'max_depth': 1, 'min_samples_split': 4} 0.477367 13 0.468182 0.502273 0.478953
3 {'max_depth': 2, 'min_samples_split': 2} 0.517170 7 0.511364 0.525000 0.510808
4 {'max_depth': 2, 'min_samples_split': 3} 0.517170 7 0.511364 0.525000 0.510808
5 {'max_depth': 2, 'min_samples_split': 4} 0.517170 7 0.511364 0.525000 0.510808
6 {'max_depth': 3, 'min_samples_split': 2} 0.523768 1 0.518182 0.520455 0.532423
7 {'max_depth': 3, 'min_samples_split': 3} 0.523768 1 0.518182 0.520455 0.532423
8 {'max_depth': 3, 'min_samples_split': 4} 0.523768 1 0.518182 0.520455 0.532423
9 {'max_depth': 4, 'min_samples_split': 2} 0.523539 4 0.521591 0.522727 0.532423
10 {'max_depth': 4, 'min_samples_split': 3} 0.523539 4 0.521591 0.522727 0.532423
11 {'max_depth': 4, 'min_samples_split': 4} 0.523539 4 0.521591 0.522727 0.532423
12 {'max_depth': 5, 'min_samples_split': 2} 0.513083 10 0.512500 0.489773 0.534699
13 {'max_depth': 5, 'min_samples_split': 3} 0.512627 11 0.512500 0.489773 0.533561
14 {'max_depth': 5, 'min_samples_split': 4} 0.512627 11 0.512500 0.489773 0.533561
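
Only three of the five fold scores are shown in the table above, so the displayed split columns do not average to mean_test_score. The arithmetic below, using the numbers from row 0 and assuming mean_test_score is the plain average over all 5 folds, shows the gap and what the two undisplayed folds must average to (up to rounding in the printed values).

```python
# Values copied from row 0 of the table above.
mean_test_score = 0.477367
shown_splits = [0.468182, 0.502273, 0.478953]  # split0 .. split2
n_folds = 5                                     # cv=5 in the GridSearchCV call

# Average of the three displayed folds only.
mean_of_shown = sum(shown_splits) / len(shown_splits)

# The two undisplayed folds must make up the rest of the 5-fold total.
hidden_sum = n_folds * mean_test_score - sum(shown_splits)
hidden_mean = hidden_sum / (n_folds - len(shown_splits))

print(round(mean_of_shown, 6))
print(round(hidden_mean, 4))
```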
print('GridSearchCV best parameters:', grid_dtree.best_params_)
print('GridSearchCV best accuracy: {0:.4f}'.format(grid_dtree.best_score_))

# With refit=True, the fitted GridSearchCV object holds the retrained best estimator, so predict() can be called on it directly.
pred = grid_dtree.predict(X_test)
print('Test set accuracy: {0:.4f}'.format(accuracy_score(y_test, pred)))
GridSearchCV best parameters: {'max_depth': 3, 'min_samples_split': 2}
GridSearchCV best accuracy: 0.5238
Test set accuracy: 0.5682
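# best_estimator_ returns the estimator that GridSearchCV already trained during refit
# dtree = DecisionTreeClassifier() was declared above and passed to GridSearchCV,
# so best_estimator_ is a DecisionTreeClassifier (a fitted clone of dtree with the best parameters)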
estimator = grid_dtree.best_estimator_
estimator
DecisionTreeClassifier(max_depth=3)
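
A short sketch of how the search object and best_estimator_ relate: calling predict() on the fitted GridSearchCV gives the same predictions as calling it on best_estimator_. The data here is synthetic (an assumption, not the wine dataset) and the grid is a reduced version of the one used above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data, used only for illustration.
X, y = make_classification(n_samples=300, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

grid = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [1, 2, 3], "min_samples_split": [2, 3]},
    cv=3,
)
grid.fit(X_tr, y_tr)

best = grid.best_estimator_
# Predictions from the search object and from the refit estimator match.
same = np.array_equal(grid.predict(X_te), best.predict(X_te))
print(same)
# The refit estimator carries the winning hyperparameters.
print(best.get_params()["max_depth"] == grid.best_params_["max_depth"])
```

This means best_estimator_ can be saved or reused on its own, without keeping the whole search object.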