XGboost

데이터분석

XGboost

안행주의 2020. 7. 9. 00:03

"using xgboost without parameter tuning is like driving a car without changing its gears; you can never up your speed."

* feature scaling 없이 data processing 가능

parameter

n_estimators: number of trees

grid search를 적용해 최적의 hyper-parameter 값 찾기 가능

□ General Parameters : XGBoost의 전반적인 기능을 정의함.
> booster [default=gbtree] >> 일반적으로 gbtree의 성능이 낫다.
- gbtree: tree-based models
- gblinear: linear models
> silent [default=0]
- 1: 동작 메시지를 프린트하지 않음.

□ Booster Parameters (아래는 gbtree booster 기준으로 정리되어있음.)
> eta [default=0.3] => learning_rate
- GBM의 학습 속도와 유사.
- 각 단계에서 가중치를 줄임으로써 모델을 더 강건하게 만든다.
- 일반적으로 0.01-0.2
> min_child_weight [default=1] (Should be tuned using CV)
- child의 관측(?)에서 요구되는 최소 가중치의 합
- over-fitting vs under-fitting을 조정하기 위한 파라미터.
- 너무 큰 값이 주어지면 under-fitting.
> max_depth [default=6] (Should be tuned using CV)
- 트리의 최대 깊이.
- 일반적으로 3-10
> max_leaf_nodes
- 최종 노드의 최대 개수. (max number of terminal nodes)
- 이진 트리가 생성되기 때문에 max_depth가 6이면 max_leaf_nodes는 2^6개가 됨.
> gamma [default=0]
- 분할을 수행하는데 필요한 최소 손실 감소를 지정한다.
- 알고리즘을 보수적으로 만든다. loss function에 따라 조정해야 한다.
> subsample [default=1]
- 각 트리마다의 관측 데이터 샘플링 비율.
- 값을 적게 주면 over-fitting을 방지하지만 값을 너무 작게 주면 under-fitting.
- 일반적으로 0.5-1
> colsample_bytree [default=1]
- 각 트리마다의 feature 샘플링 비율.
- 일반적으로 0.5-1
> lambda [default=1] => reg_lambda
- 가중치에 대한 L2 정규화 용어 (Ridge 회귀 분석과 유사(?))
> alpha [default=0] => reg_alpha
- 가중치에 대한 L1 정규화 용어 (Lasso 회귀 분석과 유사(?))
> scale_pos_weight [default=1]
- 불균형한 경우 더 빠른 수렴(convergence)에 도움되므로 0보다 큰 값을 쓸것.

□ Learning Task Parameters : 각 단계에서 계산할 최적화 목표를 정의하는 데 사용된다.
> objective [default=reg:linear]
- binary:logistic : 이진 분류를 위한 로지스틱 회귀, 예측된 확률을 반환한다. (not class)
- multi:softmax : softmax를 사용한 다중 클래스 분류, 예측된 클래스를 반환한다. (not probabilities)
- multi:softprob : softmax와 같지만 각 클래스에 대한 예상 확률을 반환한다.
> eval_metric [default according to objective]
- 회귀 분석인 경우 'rmse'를, 클래스 분류 문제인 경우 'error'를 default로 사용.
- rmse : root mean square error
- mae : mean absolute error
- logloss : negative log-likelihood
- error : Binary classification error rate (0.5 threshold)
- merror : Multiclass classification error rate
- mlogloss : Multiclass logloss
- auc : Area under the curve
> seed [default = 0]
- 난수 시드
- 재현 가능한 결과를 생성하고 파라미터 튜닝에도 사용할 수 있다.

https://www.analyticsvidhya.com/blog/2016/03/complete-guide-parameter-tuning-xgboost-with-codes-python/

XGBoost Parameters | XGBoost Parameter Tuning

This article explains XGBoost parameters and xgboost parameter tuning in python with example and takes a practice problem to explain the xgboost algorithm.

www.analyticsvidhya.com

+ 용어 정리.
# 부스팅(Boosting)? Bagging과 유사하나 Bagging과 다르게 순차적으로 학습을 시킨다. 학습 결과에 가중치를 부여하게 되는데 오답에 높은 가중치를 부여해 정확도를 높게 가져가는 것이 목적이다. (XGBoost, AdaBoost, GradientBoost)
부스팅은 여러 분류기가 상호보완적 역할을 할 수 있도록 단계적으로 학습을 수행하여 결합함으로써 성능을 올리기 위한 방법이다. (약한 분류기를 세트로 묶는다.)
부스팅이 배깅과 다른 가장 큰 차이점은 분류기들을 순차 학습하도록 하여, 먼저 학습된 분류기의 결과가 다음 분류기의 학습에 정보를 제공하여 이전 분류기의 결점을 보완하는 방향으로 학습이 이루어지도록 한다는 것이다.

# 배깅(Bagging)? 복원 랜덤 샘플링 방법을 이용해 동일한 모델을 여러 번 모델을 학습시키는 방법. (랜덤하게 추출된 데이터가 일종의 표본 집단이 되는셈.) 학습 과정에서 나온 결과를 중간값으로 맞춰주기 때문에 높은 Variance로 인한 Overfitting을 일부 극복할 수 있다. (RandomForest)

import xgboost as xgb
xgb_clf = xgb.XGBClassifier()
print(xgb_clf)

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1, colsample_bynode=1, colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0, max_depth=3, min_child_weight=1, missing=None, n_estimators=100, n_jobs=1, nthread=None, objective='binary:logistic', random_state=0, reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None, silent=None, subsample=1, verbosity=1)

https://statkclee.github.io/model/model-python-xgboost-hyper.html

model-python-xgboost-hyper

3가지 초모수 튜닝¶ 일반적으로 초모수(Hypter Parameter)를 튜닝하는 방식은 격자 탐색(Grid Search), 임의 탐색(Random Search), 베이즈 최적화(Bayesian Optimization) 방식이 많이 사용된다. 가장 기본적인 격자

statkclee.github.io

https://statkclee.github.io/model/model-python-xgboost-hyper.html

model-python-xgboost-hyper

statkclee.github.io