sklearn.model_selection.GridSearchCV Summary

GridSearchCV: cross-validation + hyperparameter tuning

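A scikit-learn API that performs grid search for hyperparameter tuning and cross-validation (CV) in a single step.
It tries each combination of the specified hyperparameter values in turn and identifies the best-performing one.
The number of training and evaluation runs is (number of grid combinations) x (number of CV folds).
It is a convenient way to find the best parameters, but it takes relatively long to run.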
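To make the cost concrete, the number of fits can be counted by hand. The sketch below uses only the standard library and assumes the same grid and fold count as the wine example later in this post. The one extra fit at the end is the final refit that happens when refit=True.

```python
from itertools import product

# Same grid and fold count as the wine example later in this post
param_grid = {'max_depth': [1, 2, 3, 4, 5], 'min_samples_split': [2, 3, 4]}
cv = 5  # number of cross-validation folds

# Every combination of the listed values is one candidate
candidates = [dict(zip(param_grid, values))
              for values in product(*param_grid.values())]

n_candidates = len(candidates)   # 5 * 3 combinations
n_cv_fits = n_candidates * cv    # each candidate is fit once per fold
n_total_fits = n_cv_fits + 1     # plus one refit when refit=True

print(n_candidates, n_cv_fits, n_total_fits)
```

Adding one more value to either list, or one more fold, multiplies the work, which is why large grids get slow quickly.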

GridSearchCV(estimator, param_grid, scoring=None, refit=True, cv=None)

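estimator : a classifier, regressor, or pipeline
param_grid : the hyperparameters to tune with grid search, given as a dictionary
scoring : the evaluation metric
cv : the number of folds to split the data into for cross-validation
refit : if True, once the best hyperparameters are found, a clone of the given estimator is refit with those parameters on the whole training data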
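As a rough illustration of how these arguments fit together, here is a minimal sketch on synthetic data. The dataset from make_classification and the small max_depth grid are assumptions chosen only to keep the example short; they are not the wine data used below.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, used here only for illustration (not the wine data)
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

grid = GridSearchCV(
    estimator=DecisionTreeClassifier(random_state=0),
    param_grid={'max_depth': [2, 3, 4]},  # candidate values to try
    scoring='accuracy',                   # evaluation metric
    cv=3,                                 # 3-fold cross-validation
    refit=True,                           # refit the best candidate on all of X, y
)
grid.fit(X, y)

print(grid.best_params_)  # best combination found
print(grid.best_score_)   # mean cross-validated score of that combination
```

With scoring=None the estimator's own score method is used, which for classifiers is accuracy.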

Example using the Dacon wine quality prediction dataset

from IPython.display import display, HTML
display(HTML("<style>.container {width:90% !important;}</style>"))

import os
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import accuracy_score

pwd

'C:\\Users\\Jay\\Desktop\\github'

df_wine = pd.read_csv('C:/Users/Jay/Desktop/python_ML/study/wine_quality/data/train.csv')
df_wine

	index	quality	fixed acidity	volatile acidity	citric acid	residual sugar	chlorides	free sulfur dioxide	total sulfur dioxide	density	pH	sulphates	alcohol	type
0	0	5	5.6	0.695	0.06	6.8	0.042	9.0	84.0	0.99432	3.44	0.44	10.2	white
1	1	5	8.8	0.610	0.14	2.4	0.067	10.0	42.0	0.99690	3.19	0.59	9.5	red
2	2	5	7.9	0.210	0.39	2.0	0.057	21.0	138.0	0.99176	3.05	0.52	10.9	white
3	3	6	7.0	0.210	0.31	6.0	0.046	29.0	108.0	0.99390	3.26	0.50	10.8	white
4	4	6	7.8	0.400	0.26	9.5	0.059	32.0	178.0	0.99550	3.04	0.43	10.9	white
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
5492	5492	5	7.7	0.150	0.29	1.3	0.029	10.0	64.0	0.99320	3.35	0.39	10.1	white
5493	5493	6	6.3	0.180	0.36	1.2	0.034	26.0	111.0	0.99074	3.16	0.51	11.0	white
5494	5494	7	7.8	0.150	0.34	1.1	0.035	31.0	93.0	0.99096	3.07	0.72	11.3	white
5495	5495	5	6.6	0.410	0.31	1.6	0.042	18.0	101.0	0.99195	3.13	0.41	10.5	white
5496	5496	6	7.0	0.350	0.17	1.1	0.049	7.0	119.0	0.99297	3.13	0.36	9.7	white

5497 rows × 14 columns

# Encode the type column
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder() # create the encoder object
df_wine['type'] = encoder.fit_transform(df_wine['type'])
df_wine

	index	quality	fixed acidity	volatile acidity	citric acid	residual sugar	chlorides	free sulfur dioxide	total sulfur dioxide	density	pH	sulphates	alcohol	type
0	0	5	5.6	0.695	0.06	6.8	0.042	9.0	84.0	0.99432	3.44	0.44	10.2	1
1	1	5	8.8	0.610	0.14	2.4	0.067	10.0	42.0	0.99690	3.19	0.59	9.5	0
2	2	5	7.9	0.210	0.39	2.0	0.057	21.0	138.0	0.99176	3.05	0.52	10.9	1
3	3	6	7.0	0.210	0.31	6.0	0.046	29.0	108.0	0.99390	3.26	0.50	10.8	1
4	4	6	7.8	0.400	0.26	9.5	0.059	32.0	178.0	0.99550	3.04	0.43	10.9	1
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
5492	5492	5	7.7	0.150	0.29	1.3	0.029	10.0	64.0	0.99320	3.35	0.39	10.1	1
5493	5493	6	6.3	0.180	0.36	1.2	0.034	26.0	111.0	0.99074	3.16	0.51	11.0	1
5494	5494	7	7.8	0.150	0.34	1.1	0.035	31.0	93.0	0.99096	3.07	0.72	11.3	1
5495	5495	5	6.6	0.410	0.31	1.6	0.042	18.0	101.0	0.99195	3.13	0.41	10.5	1
5496	5496	6	7.0	0.350	0.17	1.1	0.049	7.0	119.0	0.99297	3.13	0.36	9.7	1

5497 rows × 14 columns

# Convert the type column to the category dtype, intended to keep the integer encoding from being treated as ordinal
df_wine['type']=df_wine['type'].astype('category')
df_wine.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5497 entries, 0 to 5496
Data columns (total 14 columns):
 #   Column                Non-Null Count  Dtype   
---  ------                --------------  -----   
 0   index                 5497 non-null   int64   
 1   quality               5497 non-null   int64   
 2   fixed acidity         5497 non-null   float64 
 3   volatile acidity      5497 non-null   float64 
 4   citric acid           5497 non-null   float64 
 5   residual sugar        5497 non-null   float64 
 6   chlorides             5497 non-null   float64 
 7   free sulfur dioxide   5497 non-null   float64 
 8   total sulfur dioxide  5497 non-null   float64 
 9   density               5497 non-null   float64 
 10  pH                    5497 non-null   float64 
 11  sulphates             5497 non-null   float64 
 12  alcohol               5497 non-null   float64 
 13  type                  5497 non-null   category
dtypes: category(1), float64(11), int64(2)
memory usage: 563.9 KB

df_wine.drop('index',axis=1,inplace=True)
df_wine

	quality	fixed acidity	volatile acidity	citric acid	residual sugar	chlorides	free sulfur dioxide	total sulfur dioxide	density	pH	sulphates	alcohol	type
0	5	5.6	0.695	0.06	6.8	0.042	9.0	84.0	0.99432	3.44	0.44	10.2	1
1	5	8.8	0.610	0.14	2.4	0.067	10.0	42.0	0.99690	3.19	0.59	9.5	0
2	5	7.9	0.210	0.39	2.0	0.057	21.0	138.0	0.99176	3.05	0.52	10.9	1
3	6	7.0	0.210	0.31	6.0	0.046	29.0	108.0	0.99390	3.26	0.50	10.8	1
4	6	7.8	0.400	0.26	9.5	0.059	32.0	178.0	0.99550	3.04	0.43	10.9	1
...	...	...	...	...	...	...	...	...	...	...	...	...	...
5492	5	7.7	0.150	0.29	1.3	0.029	10.0	64.0	0.99320	3.35	0.39	10.1	1
5493	6	6.3	0.180	0.36	1.2	0.034	26.0	111.0	0.99074	3.16	0.51	11.0	1
5494	7	7.8	0.150	0.34	1.1	0.035	31.0	93.0	0.99096	3.07	0.72	11.3	1
5495	5	6.6	0.410	0.31	1.6	0.042	18.0	101.0	0.99195	3.13	0.41	10.5	1
5496	6	7.0	0.350	0.17	1.1	0.049	7.0	119.0	0.99297	3.13	0.36	9.7	1

5497 rows × 13 columns

ftr_df = df_wine.iloc[:,1:]
ftr_df

	fixed acidity	volatile acidity	citric acid	residual sugar	chlorides	free sulfur dioxide	total sulfur dioxide	density	pH	sulphates	alcohol	type
0	5.6	0.695	0.06	6.8	0.042	9.0	84.0	0.99432	3.44	0.44	10.2	1
1	8.8	0.610	0.14	2.4	0.067	10.0	42.0	0.99690	3.19	0.59	9.5	0
2	7.9	0.210	0.39	2.0	0.057	21.0	138.0	0.99176	3.05	0.52	10.9	1
3	7.0	0.210	0.31	6.0	0.046	29.0	108.0	0.99390	3.26	0.50	10.8	1
4	7.8	0.400	0.26	9.5	0.059	32.0	178.0	0.99550	3.04	0.43	10.9	1
...	...	...	...	...	...	...	...	...	...	...	...	...
5492	7.7	0.150	0.29	1.3	0.029	10.0	64.0	0.99320	3.35	0.39	10.1	1
5493	6.3	0.180	0.36	1.2	0.034	26.0	111.0	0.99074	3.16	0.51	11.0	1
5494	7.8	0.150	0.34	1.1	0.035	31.0	93.0	0.99096	3.07	0.72	11.3	1
5495	6.6	0.410	0.31	1.6	0.042	18.0	101.0	0.99195	3.13	0.41	10.5	1
5496	7.0	0.350	0.17	1.1	0.049	7.0	119.0	0.99297	3.13	0.36	9.7	1

5497 rows × 12 columns

tgt_df = df_wine.iloc[:,0]
tgt_df

0       5
1       5
2       5
3       6
4       6
       ..
5492    5
5493    6
5494    7
5495    5
5496    6
Name: quality, Length: 5497, dtype: int64

# train_test_split function: split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(ftr_df, tgt_df, 
                                                    test_size=0.3, random_state=121)

from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import accuracy_score


# Train/test split (this replaces the split made above)
X_train, X_test, y_train, y_test = train_test_split(ftr_df, tgt_df, 
                                                    test_size=0.2, random_state=0)

# Define the model
dtree = DecisionTreeClassifier()

### Set the hyperparameters as a dictionary
parameters = {'max_depth': [1, 2, 3, 4, 5], 'min_samples_split': [2, 3, 4]}

import pandas as pd

# Configure the search to evaluate each param_grid combination with 5-fold cross-validation (cv=5).
grid_dtree = GridSearchCV(dtree, param_grid=parameters, cv=5, refit=True, return_train_score=True)
### refit=True is the default: the model is retrained with the best parameter setting.

# Train and evaluate each param_grid combination in turn on the wine training data.
grid_dtree.fit(X_train, y_train)

C:\ProgramData\Anaconda3\lib\site-packages\sklearn\model_selection\_split.py:666: UserWarning: The least populated class in y has only 4 members, which is less than n_splits=5.
  warnings.warn(("The least populated class in y has only %d"





GridSearchCV(cv=5, estimator=DecisionTreeClassifier(),
             param_grid={'max_depth': [1, 2, 3, 4, 5],
                         'min_samples_split': [2, 3, 4]},
             return_train_score=True)

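# GridSearchCV stores its results in a dictionary called cv_results_
# Convert it to a DataFrame to inspect it
# (with cv=5 there are five split columns; only the first three are selected here)
scores_df = pd.DataFrame(grid_dtree.cv_results_)
scores_df[['params', 'mean_test_score', 'rank_test_score', 
           'split0_test_score', 'split1_test_score', 'split2_test_score']]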

	params	mean_test_score	rank_test_score	split0_test_score	split1_test_score	split2_test_score
0	{'max_depth': 1, 'min_samples_split': 2}	0.477367	13	0.468182	0.502273	0.478953
1	{'max_depth': 1, 'min_samples_split': 3}	0.477367	13	0.468182	0.502273	0.478953
2	{'max_depth': 1, 'min_samples_split': 4}	0.477367	13	0.468182	0.502273	0.478953
3	{'max_depth': 2, 'min_samples_split': 2}	0.517170	7	0.511364	0.525000	0.510808
4	{'max_depth': 2, 'min_samples_split': 3}	0.517170	7	0.511364	0.525000	0.510808
5	{'max_depth': 2, 'min_samples_split': 4}	0.517170	7	0.511364	0.525000	0.510808
6	{'max_depth': 3, 'min_samples_split': 2}	0.523768	1	0.518182	0.520455	0.532423
7	{'max_depth': 3, 'min_samples_split': 3}	0.523768	1	0.518182	0.520455	0.532423
8	{'max_depth': 3, 'min_samples_split': 4}	0.523768	1	0.518182	0.520455	0.532423
9	{'max_depth': 4, 'min_samples_split': 2}	0.523539	4	0.521591	0.522727	0.532423
10	{'max_depth': 4, 'min_samples_split': 3}	0.523539	4	0.521591	0.522727	0.532423
11	{'max_depth': 4, 'min_samples_split': 4}	0.523539	4	0.521591	0.522727	0.532423
12	{'max_depth': 5, 'min_samples_split': 2}	0.513083	10	0.512500	0.489773	0.534699
13	{'max_depth': 5, 'min_samples_split': 3}	0.512627	11	0.512500	0.489773	0.533561
14	{'max_depth': 5, 'min_samples_split': 4}	0.512627	11	0.512500	0.489773	0.533561
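The rank_test_score values in the table above are consistent with competition ranking, where tied candidates share the best rank of the tie and the following ranks are skipped. The sketch below uses only the standard library and recomputes the ranks from the mean_test_score values copied by hand from that table. The helper name competition_rank is hypothetical and not part of scikit-learn.

```python
# mean_test_score values copied from the table above (rows 0 to 14)
mean_scores = [
    0.477367, 0.477367, 0.477367,
    0.517170, 0.517170, 0.517170,
    0.523768, 0.523768, 0.523768,
    0.523539, 0.523539, 0.523539,
    0.513083, 0.512627, 0.512627,
]

def competition_rank(scores):
    """Rank 1 = highest score; tied scores share the smallest rank (hypothetical helper)."""
    return [1 + sum(other > s for other in scores) for s in scores]

ranks = competition_rank(mean_scores)
first_best = ranks.index(1)  # position of the first candidate ranked 1

print(ranks)
print(first_best)
```

The first candidate ranked 1 is row 6, {'max_depth': 3, 'min_samples_split': 2}, which is the combination reported as best_params_ in this post even though rows 7 and 8 score the same.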

print('GridSearchCV best parameters:', grid_dtree.best_params_)
print('GridSearchCV best accuracy: {0:.4f}'.format(grid_dtree.best_score_))

# With refit=True, the fitted GridSearchCV object holds a trained estimator, so predict() can be called on it directly.
pred = grid_dtree.predict(X_test)
print('Test set accuracy: {0:.4f}'.format(accuracy_score(y_test, pred)))

GridSearchCV best parameters: {'max_depth': 3, 'min_samples_split': 2}
GridSearchCV best accuracy: 0.5238
Test set accuracy: 0.5682

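# Get the estimator that GridSearchCV has already trained through refit.
# dtree = DecisionTreeClassifier() was declared above and passed to GridSearchCV,
# so best_estimator_ is a DecisionTreeClassifier (a clone of dtree) fitted with the best parameters.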
estimator = grid_dtree.best_estimator_
estimator

DecisionTreeClassifier(max_depth=3)
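The relationship between the estimator passed in, best_estimator_, and the search object's own predict() can be checked with a small sketch. It again assumes synthetic data from make_classification and a tiny grid, chosen only to keep it short; the variable names base and best are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, used here only for illustration (not the wine data)
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

base = DecisionTreeClassifier(random_state=0)
grid = GridSearchCV(base, param_grid={'max_depth': [2, 3]}, cv=3, refit=True)
grid.fit(X, y)

best = grid.best_estimator_

# best_estimator_ is a fitted clone; the object passed in stays unfitted
print(best is base)
print(hasattr(base, 'tree_'))

# predict() on the search object delegates to best_estimator_
same_predictions = np.array_equal(grid.predict(X), best.predict(X))
print(same_predictions)
```

So either grid.predict() or grid.best_estimator_.predict() can be used after fitting, while the original object still needs its own fit() call before it can predict.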
