SW정리: xgboost basic example (Python)

This post walks through a basic xgboost Python example to show how the library is used.

0. Before we begin

The source code and data used here come from the links below.
https://xgboost.readthedocs.io/en/latest/python/index.html#

Example source

https://github.com/dmlc/xgboost/tree/master/demo/guide-python

Data used

https://github.com/dmlc/xgboost/blob/master/demo/data/agaricus.txt.test
https://github.com/dmlc/xgboost/blob/master/demo/data/agaricus.txt.train

If the basic example fails with an error like the one below, check that the source data files are valid.
xgboost.core.XGBoostError: b'[17:54:28] d:\\build\\xgboost\\xgboost-0.81.git\\dmlc-core\\src\\data\\strtonum.h:147: Check failed: sign == true (0 vs. 1) '

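Reference

https://github.com/dmlc/xgboost/issues/3355
(The input data was wrong: the file contained HTML code instead of the data, which can happen if the GitHub page itself is saved rather than the raw file.)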
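A quick sanity check along these lines can catch the problem before xgboost sees the file. This is only a sketch: the helper names is_libsvm_line and first_bad_line are made up for this post, and the check assumes the plain "label index:value ..." layout, so it ignores optional LibSVM extras.

```python
def is_libsvm_line(line):
    """Return True if the line looks like 'label index:value index:value ...'."""
    tokens = line.split()
    if not tokens:
        return True  # blank lines are harmless
    try:
        float(tokens[0])  # the label must be numeric
        for token in tokens[1:]:
            index, value = token.split(":")  # raises ValueError without exactly one ':'
            int(index)
            float(value)
    except ValueError:
        return False
    return True


def first_bad_line(lines):
    """Return the 1-based number of the first malformed line, or None if all are fine."""
    for number, line in enumerate(lines, start=1):
        if not is_libsvm_line(line):
            return number
    return None


# A saved HTML page fails on its very first line.
print(first_bad_line(["<!DOCTYPE html>", "<html>"]))
print(first_bad_line(["1 3:1 10:1", "0 1:1 9:1"]))
```

To check a real file, pass the open file object, for example first_bad_line(open('agaricus.txt.train')).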

1. Example

#filename : boost_from_prediction.py

#!/usr/bin/python
import numpy as np
import xgboost as xgb

dtrain = xgb.DMatrix('agaricus.txt.train')
dtest = xgb.DMatrix('agaricus.txt.test')
watchlist = [(dtest, 'eval'), (dtrain, 'train')]
###
# advanced: start from a initial base prediction
#
print ('start running example to start from a initial prediction')
# specify parameters via map, definition are same as c++ version
param = {'max_depth':2, 'eta':1, 'silent':1, 'objective':'binary:logistic'}
# train xgboost for 1 round
bst = xgb.train(param, dtrain, 1, watchlist)
# Note: we need the margin value instead of transformed prediction in set_base_margin
# do predict with output_margin=True, will always give you margin values before logistic transformation
ptrain = bst.predict(dtrain, output_margin=True)
ptest = bst.predict(dtest, output_margin=True)
dtrain.set_base_margin(ptrain)
dtest.set_base_margin(ptest)


print('this is result of running from initial prediction')
bst = xgb.train(param, dtrain, 1, watchlist)

Result

(base) E:\work\ai\xgboost>python boost_from_prediction.py
[19:09:06] 6513x127 matrix with 143286 entries loaded from agaricus.txt.train
[19:09:06] 1611x127 matrix with 35442 entries loaded from agaricus.txt.test
start running example to start from a initial prediction
[0]     eval-error:0.042831     train-error:0.046522
this is result of running from initial prediction
[0]     eval-error:0.021726     train-error:0.022263

2. Input data

Input data is loaded by constructing a DMatrix.
dtrain = xgb.DMatrix('agaricus.txt.train')
dtest = xgb.DMatrix('agaricus.txt.test')

See the link below for the file formats that DMatrix accepts.
https://xgboost.readthedocs.io/en/latest/tutorials/input_format.html
Two text formats are supported: LibSVM and CSV.

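Opened in a text editor, a LibSVM-format training file looks like the following (this is the train.txt example from the documentation linked above).

train.txt

1 101:1.2 102:0.03
0 1:2.1 10001:300 10002:400
0 0:1.3 1:0.3
1 0:0.01 1:0.3
0 0:0.2 1:0.3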

The 0 or 1 at the start of each line is the label, the value usually called y in machine learning. This example is a binary classification, so only 0 and 1 appear. After the label come entries such as 101:1.2, where 101 is the column index (also called the feature index) and 1.2 is the value stored in that column. Indices start at 0.
In agaricus.txt.train the largest index is 126, so the number of columns (features) is 127.
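The layout described above can be sketched in a few lines of plain Python. The helper names parse_libsvm_line and num_columns are made up for illustration, and the two sample lines are illustrative rather than taken from agaricus.txt.train. This is not how xgboost parses files internally.

```python
def parse_libsvm_line(line):
    """Split 'label index:value ...' into (label, {index: value})."""
    tokens = line.split()
    label = float(tokens[0])
    features = {}
    for token in tokens[1:]:
        index, value = token.split(":")
        features[int(index)] = float(value)
    return label, features


def num_columns(rows):
    """Indices start at 0, so the column count is the largest index plus one."""
    return max(max(features) for _, features in rows if features) + 1


sample = ["1 101:1.2 102:0.03", "0 0:1.3 1:0.3"]  # illustrative lines only
rows = [parse_libsvm_line(line) for line in sample]
print(rows[0])            # label and sparse features of the first line
print(num_columns(rows))  # largest index is 102, so 103 columns
```

The same rule gives 127 columns for agaricus.txt.train, whose largest index is 126.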

3. Train parameters

Training is done by the train function, which takes several arguments. The example calls it as follows.

watchlist = [(dtest, 'eval'), (dtrain, 'train')]
###
# advanced: start from a initial base prediction
#
print ('start running example to start from a initial prediction')
# specify parameters via map, definition are same as c++ version
param = {'max_depth':2, 'eta':1, 'silent':1, 'objective':'binary:logistic'}
# train xgboost for 1 round
bst = xgb.train(param, dtrain, 1, watchlist)

The arguments of train are as follows.

xgboost.train(params, dtrain, num_boost_round=10, evals=(), obj=None, feval=None, maximize=False, early_stopping_rounds=None, evals_result=None, verbose_eval=True, xgb_model=None, callbacks=None, learning_rates=None)
The full description is at the following link.
https://xgboost.readthedocs.io/en/latest/python/python_api.html#module-xgboost.training

params (dict) – Booster params.
dtrain (DMatrix) – Data to be trained.
num_boost_round (int) – Number of boosting iterations.
evals (list of pairs (DMatrix, string)) – List of items to be evaluated during training, this allows user to watch performance on the validation set.

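The params are described in detail at the following link.
https://xgboost.readthedocs.io/en/latest/parameter.html
XGBoost supports several kinds of learning tasks, so 'objective' is the most important parameter to set.
binary:logistic selects logistic regression for binary classification.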

binary:logistic: logistic regression for binary classification, output probability

eta is the learning rate. A smaller value means more boosting rounds are needed, so training takes longer; a larger value speeds up learning but makes overfitting more likely.

eta [default=0.3, alias: learning_rate]

Step size shrinkage used in update to prevents overfitting. After each boosting step, we can directly get the weights of new features, and eta shrinks the feature weights to make the boosting process more conservative.
range: [0,1]
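The effect of shrinkage can be seen in a toy loop. This is a deliberately simplified sketch, not xgboost itself: it assumes every boosting round fits the remaining residual perfectly and then adds only eta times that fit. The function name rounds_to_fit and the tolerance are made up for illustration.

```python
def rounds_to_fit(eta, target=1.0, tol=0.01, max_rounds=1000):
    """Count rounds until the shrunken updates get within tol of the target."""
    prediction = 0.0
    for round_no in range(1, max_rounds + 1):
        residual = target - prediction  # what an ideal tree would fit this round
        prediction += eta * residual    # shrinkage: only a fraction is added
        if abs(target - prediction) <= tol:
            return round_no
    return max_rounds


for eta in (1.0, 0.5, 0.3, 0.1):
    print(eta, rounds_to_fit(eta))
```

With eta=1 a single round is enough in this toy setting, while smaller values need progressively more rounds, which is the time cost described above.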

max_depth is the maximum depth of a tree. Increasing it makes the model more complex, but also more prone to overfitting.

max_depth [default=6]

Maximum depth of a tree. Increasing this value will make the model more complex and more likely to overfit. 0 indicates no limit. Note that limit is required when grow_policy is set of depthwise.
range: [0,∞]
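As a rough illustration of why depth drives complexity: a binary tree of depth d has at most 2^d leaves, so the number of distinct outputs a single tree can produce grows exponentially with max_depth. The helper max_leaves below is made up for this post.

```python
def max_leaves(max_depth):
    """Upper bound on the number of leaves of a binary tree with the given depth."""
    return 2 ** max_depth


for depth in (2, 6):  # the value used in the example and the default
    print(depth, max_leaves(depth))
```

The example above uses max_depth=2, so each tree has at most 4 leaves, compared with at most 64 for the default of 6.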

The documentation says to use verbosity instead of silent. It sets the level of log output.
silent [default=0] [Deprecated]

Deprecated. Please use verbosity instead.

verbosity [default=1]

Verbosity of printing messages. Valid values are 0 (silent), 1 (warning), 2 (info), 3 (debug). Sometimes XGBoost tries to change configurations based on heuristics, which is displayed as warning message. If there’s unexpected behaviour, please try to increase value of verbosity.
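A parameter dict that still uses silent could be converted along these lines. This is a sketch: the helper migrate_silent is made up for this post, and mapping silent=1 to verbosity=0 and silent=0 to the default verbosity=1 is an assumption based on the descriptions quoted above.

```python
def migrate_silent(params):
    """Return a copy of params with the deprecated 'silent' replaced by 'verbosity'."""
    migrated = dict(params)
    if "silent" in migrated:
        silent = migrated.pop("silent")
        # Assumed mapping: silent=1 -> verbosity 0 (silent), silent=0 -> verbosity 1 (warning)
        migrated.setdefault("verbosity", 0 if silent else 1)
    return migrated


param = {'max_depth': 2, 'eta': 1, 'silent': 1, 'objective': 'binary:logistic'}
print(migrate_silent(param))
```

The original dict is left unchanged, so the old and new versions can be compared side by side.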

dtrain is the training data; the DMatrix created earlier is passed here.
num_boost_round is set to 1, and the watchlist variable is passed as evals.
The value used was watchlist = [(dtest, 'eval'), (dtrain, 'train')], a list of (DMatrix, string) pairs. It is used to evaluate model performance, and its meaning becomes clear from the output.
The output looks like the line below. It shows how much error the current model makes on dtest and how much on dtrain. Changing the strings changes the names shown in the output.

[0]     eval-error:0.042831     train-error:0.046522
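The error value in that line is a classification error rate: the number of wrong cases divided by the number of all cases, where a predicted probability above 0.5 counts as the positive class. A minimal sketch of that calculation follows; the function name binary_error and the sample numbers are made up for illustration.

```python
def binary_error(labels, probabilities, threshold=0.5):
    """Fraction of cases where the thresholded prediction disagrees with the label."""
    wrong = sum(
        1
        for label, prob in zip(labels, probabilities)
        if (prob > threshold) != (label == 1)
    )
    return wrong / len(labels)


labels = [1, 0, 1, 0]                # made-up labels
probabilities = [0.9, 0.2, 0.4, 0.7]  # made-up predicted probabilities
print(binary_error(labels, probabilities))  # two of the four cases are wrong
```

Read this way, eval-error:0.042831 means that roughly 4.3% of the rows in dtest were misclassified after the first round.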

num_boost_round: specifies how many boosting rounds to run.

4. predict

Prediction is used to obtain the final results.

# Note: we need the margin value instead of transformed prediction in set_base_margin
# do predict with output_margin=True, will always give you margin values before logistic transformation
ptrain = bst.predict(dtrain, output_margin=True)
ptest = bst.predict(dtest, output_margin=True)

Predictions are made with the predict function. Given dtrain as input, it returns a one-dimensional NumPy array. Printing ptrain gives output of the following form.
[ 1.7121772 -1.7004405 -1.7004405 ... -1.9407086 -1.9407086 -1.9407086]
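Because output_margin=True was used, these numbers are raw margins rather than probabilities. For binary:logistic, applying the logistic (sigmoid) function to a margin gives the corresponding probability. A small sketch using three of the values printed above:

```python
import math


def sigmoid(margin):
    """Logistic transformation: maps a raw margin to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-margin))


margins = [1.7121772, -1.7004405, -1.9407086]  # taken from the ptrain output above
probabilities = [sigmoid(m) for m in margins]
for margin, prob in zip(margins, probabilities):
    print(margin, round(prob, 4))
```

Positive margins map to probabilities above 0.5 and negative margins to probabilities below 0.5, so the sign of the margin already gives the predicted class.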

predict(data, output_margin=False, ntree_limit=None, validate_features=True)

Predict with data.

Parameters:
data (DMatrix) – The dmatrix storing the input.
output_margin (bool) – Whether to output the raw untransformed margin value.
ntree_limit (int) – Limit number of trees in the prediction; defaults to best_ntree_limit if defined (i.e. it has been trained with early stopping), otherwise 0 (use all trees).
validate_features (bool) – When this is True, validate that the Booster’s and data’s feature_names are identical. Otherwise, it is assumed that the feature_names are the same.
Returns:	prediction
Return type:	numpy array


Wednesday, January 16, 2019
