SW정리: xgboost


Wednesday, February 6, 2019

xgboost StratifiedKFold, KFold example

1. Before we begin

If you have not installed XGBoost yet, follow the link below to install it first.

https://swlock.blogspot.com/2019/01/install-xgboost-in-python-xgboost.html

lightGBM is a boosting machine learning algorithm similar to XGBoost. The lightGBM example linked below uses the same data, so comparing the two posts shows how their usage differs.

Deep learning example using pytorch
https://swlock.blogspot.com/2019/02/deep-learning-pytorch-kfold-example.html

XGBoost example
https://swlock.blogspot.com/2019/02/xgboost-stratifiedkfold-kfold.html

How to install lightGBM and example
https://swlock.blogspot.com/2019/01/lightgbm-how-to-install-lightgbm-and.html

If you are not familiar with cross-validation, see the link below.

Using the scikit-learn cross-validation iterators StratifiedKFold and KFold
https://swlock.blogspot.com/2019/01/scikit-learn-cross-validation-iterators.html

This example trains XGBoost on XOR data and then looks at the result. The XOR function takes two inputs, X1 and X2, and outputs 1 when the two values differ. This part is the same as in the lightGBM example, so the text is copied from there.

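2. Using XGBoost with XOR data

The XOR function takes two inputs and returns 1 when they differ and 0 when they are the same.
If the example used only the integers 0 and 1, the train and test data would overlap, so the inputs are turned into real numbers with random noise. To make the sample a little more complex than a single Y, there are two targets, y1 and y2, that always hold opposite values: y1 is the XOR of the inputs and y2 is its complement.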

2.1 Creating the XOR data

File name: makexordata.py

import numpy as np
import pandas as pd

np.random.seed(0)

def get_int_rand(min, max):
 value = np.random.randint(min, high=max+1, size=1)
 return int(value[0])

def get_rand(min, max):
 value = np.random.rand(1)*(max-min)+min
 return float(value[0])

#make train set
df = pd.DataFrame([],columns=['x1', 'x2', 'y1', 'y2'])

for i in range(1000):
 x1 = get_int_rand(0,1)
 x2 = get_int_rand(0,1)
 if x1 == x2 :
  y1 = 0
  y2 = 1
 else  :
  y1 = 1
  y2 = 0
 x1 = get_rand(x1-0.3, x1+0.3)
 x2 = get_rand(x2-0.3, x2+0.3)

 df = df.append(pd.DataFrame(np.array([x1,x2,int(y1),int(y2)]).reshape(1,4),columns=['x1', 'x2', 'y1', 'y2']))
 
df.reset_index(inplace=True,drop=True)
print(df.head())

df.to_csv("train.csv", encoding='utf-8', index=False)


# make test set
df = pd.DataFrame([],columns=['x1', 'x2'])

for i in range(100):
 x1 = get_int_rand(0,1)
 x2 = get_int_rand(0,1)
 x1 = get_rand(x1-0.3, x1+0.3)
 x2 = get_rand(x2-0.3, x2+0.3)

 df = df.append(pd.DataFrame(np.array([x1,x2,int(0),int(0)]).reshape(1,4),columns=['x1', 'x2', 'y1', 'y2']))
 
df.reset_index(inplace=True,drop=True)
print(df.head())

df.to_csv("test.csv", encoding='utf-8', index=False)

The output is shown below. The code above only generates data, so a detailed explanation is omitted. It writes train.csv and test.csv in the format shown in the output. In test.csv the y1 and y2 columns are filled with 0; these are the values to predict.

(base) E:\>python makexordata.py
         x1        x2   y1   y2
0  0.129114  1.061658  1.0  0.0
1  0.954193  1.087536  0.0  1.0
2  0.235064  1.278198  1.0  0.0
3  0.175035  1.017337  1.0  0.0
4  0.255358  0.742622  1.0  0.0
e:\ProgramData\Anaconda3\lib\site-packages\pandas\core\frame.py:6211: FutureWarning: Sorting because non-concatenation axis is not aligned. A future version
of pandas will change to not sort by default.

To accept the future behavior, pass 'sort=False'.

To retain the current behavior and silence the warning, pass 'sort=True'.

  sort=sort)
         x1        x2   y1   y2
0  1.077771  0.167151  0.0  0.0
1  0.189848 -0.200354  0.0  0.0
2 -0.264823  0.820102  0.0  0.0
3 -0.231184  0.062009  0.0  0.0
4  0.957680  0.889091  0.0  0.0

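How to read the data:
Treat an input below 0.5 as 0 and an input above 0.5 as 1. When the two inputs are (approximately) equal, 0,0 or 1,1, then y1=0 and y2=1; when they differ, 0,1 or 1,0, then y1=1 and y2=0. (XOR is true when the inputs are exclusive.) The targets are binary values, 0 or 1.
The goal is to train on the data above (train.csv) and then predict y1 and y2 from the x1, x2 values given in test.csv.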
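The reading rule above can be written as a small plain-Python function. This is only a sketch for checking rows by hand; the name xor_labels and the threshold argument are assumptions made here and do not appear in the scripts in this post.

```python
def xor_labels(x1, x2, threshold=0.5):
    """Map two noisy inputs to (y1, y2) using the 0.5 rule (hypothetical helper)."""
    b1 = 1 if x1 > threshold else 0   # treat the input as 0 or 1
    b2 = 1 if x2 > threshold else 0
    y1 = 1 if b1 != b2 else 0         # y1 is XOR: 1 when the inputs differ
    return y1, 1 - y1                 # y2 is always the opposite of y1

# First two rows of the train.csv output shown above
print(xor_labels(0.129114, 1.061658))
print(xor_labels(0.954193, 1.087536))
```

Running it on the rows printed by makexordata.py is a quick way to confirm that the generated labels follow the rule.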

2.2 Writing the XGBoost example code

Now the main part. The code covers StratifiedKFold cross-validation, repeated training with different random seeds, feature selection, MinMaxScaler, and StandardScaler.
The overall flow is: read train.csv, optionally apply MinMaxScaler or StandardScaler, train, and save the predictions to Result.csv (prefixed with the script name).

For StratifiedKFold cross-validation, see the post below.

https://swlock.blogspot.com/2019/01/scikit-learn-cross-validation-iterators.html

StratifiedKFold needs some care: examples based on the source below can occasionally raise an error. The cause, also mentioned in the link above, depends on the y values. In that case, switch to KFold.
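As a rough pre-check, the sketch below tests whether a y column looks suitable for stratified splitting. It assumes the error comes from either continuous (non-class) y values or a class with fewer members than the number of folds; that is my reading, not something the scripts here verify. The helper name can_stratify is hypothetical, and scikit-learn's own checks may differ in detail.

```python
from collections import Counter

def can_stratify(y, n_splits):
    """Hypothetical pre-check: True if y looks usable for stratified folds."""
    # Stratification needs discrete class labels, not continuous targets.
    if any(float(v) != int(v) for v in y):
        return False
    counts = Counter(int(v) for v in y)
    # Each class should have at least n_splits members so every fold gets one.
    return min(counts.values()) >= n_splits

print(can_stratify([0, 1, 0, 1], 2))   # balanced binary labels
print(can_stratify([0.3, 1.2], 2))     # continuous values
```

If the check fails, plain KFold is the fallback suggested in this post.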

xgboost_xor_cv.py

# xgboost example with StratifiedKFold
EPOCHS_TO_TRAIN = 2
FOLD_COUNT = 2

STD_ALL = False
NOR_ALL = False
PREFIX = __file__[:-3]
BASE_PATH = "./"

SELECT_NEED = False
Y1_SELECT_LEVEL = 0
Y2_SELECT_LEVEL = 0

Y1_FEATURE_LIST_FOR_SELECT=[
 ('x1',1),
 ('x2',0)
]

Y2_FEATURE_LIST_FOR_SELECT=[
 ('x1',1),
 ('x2',0)
]

param={
 'objective':'binary:logistic',
 'seed':0,  # XGBoost names this parameter 'seed'; 'random_seed' is a lightGBM alias
 'eta':0.5
}

#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
import time
import pandas as pd
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn import preprocessing
import xgboost as xgb
from sklearn.metrics import mean_squared_error

np.random.seed(1000)

class writeLog():
 def write(self, fileName, text):
  print(text)
  f=open(fileName,'a')
  f.write(text)
  f.write("\n")
  f.close()
 def writeWithoutCR(self, fileName, text):
  f=open(fileName,'a')
  f.write(text)
  f.close()

log = writeLog()
log.write(PREFIX+"log.txt","start")

train_df = pd.read_csv(BASE_PATH+'train.csv', header=0, encoding='utf8')
test_df = pd.read_csv(BASE_PATH+'test.csv', header=0, encoding='utf8')

#drop Y
Y1train_df = train_df.pop('y1')
Y2train_df = train_df.pop('y2')

Y1test_df = test_df.pop('y1')
Y2test_df = test_df.pop('y2')

allX = pd.concat([train_df, test_df], axis=0)
train_size = train_df.shape[0]
test_size = test_df.shape[0]
log.write(PREFIX+"log.txt","train size:"+str(train_df.shape))
log.write(PREFIX+"log.txt","test size:"+str(test_df.shape))
log.write(PREFIX+"log.txt","all size:"+str(allX.shape))
del (train_df, test_df)

allX = pd.get_dummies(allX)
allX = allX.fillna(value=0)

if NOR_ALL == True:
 names = allX.columns
 scaler = preprocessing.MinMaxScaler()
 allX = scaler.fit_transform(allX)
 allX = pd.DataFrame(allX, columns=names)

if STD_ALL == True:
 names = allX.columns
 scaler = preprocessing.StandardScaler()
 allX = scaler.fit_transform(allX)
 allX = pd.DataFrame(allX, columns=names)

if SELECT_NEED :
 Y1sel_feature = []
 for feature, count in Y1_FEATURE_LIST_FOR_SELECT :
  if count>=Y1_SELECT_LEVEL :
   Y1sel_feature.append(feature)
 log.write(PREFIX+"log.txt","Y1 selected:%d"%(len(Y1sel_feature)))
 
 Y2sel_feature = []
 for feature, count in Y2_FEATURE_LIST_FOR_SELECT :
  if count>=Y2_SELECT_LEVEL :
   Y2sel_feature.append(feature)
 log.write(PREFIX+"log.txt","Y2 selected:%d"%(len(Y2sel_feature)))
 
 Ysel_feature = []
 Ysel_feature = Y1sel_feature + Y2sel_feature
 Ysel_feature = list(set(Ysel_feature))
 log.write(PREFIX+"log.txt","Y selected:%d"%(len(Ysel_feature)))
else:
 Ysel_feature = []
 Y1sel_feature = []
 Y2sel_feature = []

# split allX back into the train and predict sets
trainX = allX[0:int(train_size)]
predictX = allX[int(train_size):int(allX.shape[0])]
log.write(PREFIX+"log.txt","train size:"+str(trainX.shape))
log.write(PREFIX+"log.txt","test size:"+str(predictX.shape))
del (allX)

def OutputData(filename,Y1,Y2):
 test_csv = pd.read_csv(BASE_PATH+'test.csv', header=0, encoding='utf8')
 test_csv.pop('y1')
 test_csv.pop('y2')
 predAll = pd.concat([test_csv,pd.DataFrame(Y1, columns=['y1'])], axis=1)
 predAll = pd.concat([predAll,pd.DataFrame(Y2, columns=['y2'])], axis=1)
 predAll.to_csv(path_or_buf=PREFIX+filename, index=False)
 del predAll

def Train(trainX, trainY, Ystr, ysize, EPOCHS, sel_feature, fold_count):
 
 stratifiedkfold = StratifiedKFold(n_splits=fold_count, random_state=0, shuffle=True)
 
 features = trainX.columns.tolist()
 if len(sel_feature)==0:
  sel_feature = features
 
 final_cv_pred = np.zeros(ysize)
 final_cv_rmse = 0
 for seed in range(EPOCHS):
  log.write(PREFIX+"log.txt","seed: %d"%(seed))
  param['seed'] = seed
  
  cv_pred = np.zeros(ysize)
  cv_rmse = 0
  
  trainXnp = trainX.as_matrix(columns = sel_feature)
  trainYnp = trainY.as_matrix()
  i = 0
  for train_index, validate_index in stratifiedkfold.split(trainXnp, trainYnp):
   i = i + 1
   log.write(PREFIX+"log.txt","[%d]Fold"%(i))
   X_train, X_validate = trainXnp[train_index], trainXnp[validate_index]
   y_train, y_validate = trainYnp[train_index], trainYnp[validate_index]
   
   dtrn = xgb.DMatrix(X_train, label=y_train, feature_names=sel_feature)
   dvld = xgb.DMatrix(X_validate, label=y_validate, feature_names=sel_feature)
   
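   # Note: XGBoost applies early stopping to the LAST entry of evals, which is 'train' here.
   # Put (dvld,'eval') last to stop on the validation metric instead.
   watch_list = [(dvld,'eval'),(dtrn,'train')]
   model = xgb.train(param, dtrn, num_boost_round=10000, evals=watch_list, early_stopping_rounds=10)
   y_pred = model.predict(dvld, ntree_limit=model.best_ntree_limit)
   # Note: mean_squared_error returns the MSE; no square root is taken despite the name 'rmse'.
   rmse = mean_squared_error(y_validate, y_pred)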
   
   log.write(PREFIX+"log.txt","[%d]rmse %f"%(i,rmse))
   log.write(PREFIX+"log.txt","[%d]%s Feature importance:split"%(i,Ystr))
   for kv in sorted([(k,v) for k, v in model.get_fscore().items()], key=lambda kv: kv [1], reverse=True):
    log.writeWithoutCR(PREFIX+"log.txt",str(kv))
    log.writeWithoutCR(PREFIX+"log.txt",",\n")
   log.writeWithoutCR(PREFIX+"log.txt",",\n\n")
   
   cv_rmse += rmse
   cv_pred += model.predict(xgb.DMatrix(predictX.as_matrix(columns=sel_feature),feature_names=sel_feature), ntree_limit=model.best_ntree_limit)
  cv_rmse /= fold_count
  cv_pred /= fold_count
  final_cv_rmse += cv_rmse
  final_cv_pred += cv_pred
 final_cv_rmse /= EPOCHS
 final_cv_pred /= EPOCHS
 return (final_cv_pred, final_cv_rmse)

y1_pred, rmse1 = Train(trainX, Y1train_df, "Y1", len(Y1test_df), EPOCHS_TO_TRAIN, Y1sel_feature, FOLD_COUNT)
y2_pred, rmse2 = Train(trainX, Y2train_df, "Y2", len(Y2test_df), EPOCHS_TO_TRAIN, Y2sel_feature, FOLD_COUNT)

log.write(PREFIX+"log.txt","Total rmse %f + %f = %f"%(rmse1,rmse2,(rmse1+rmse2)/2))

OutputData("Result.csv",y1_pred, y2_pred)

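2.3 Code walkthrough

About 90% of the code is shared with the lightGBM example, so the explanation is copied from there as well.
The value below is the number of training repetitions, each with a different seed.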

EPOCHS_TO_TRAIN = 2

The number of folds used for cross-validation.

FOLD_COUNT = 2

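Standardizes each feature using its mean and standard deviation so that it has zero mean and unit variance. (The result is not confined to a fixed range such as -1 to 1.)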
https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html

STD_ALL = False
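What standardization does to a single column can be sketched with the standard library. This is an illustration of the formula (x - mean) / std, not the StandardScaler implementation; the function name standardize is made up for this sketch.

```python
from statistics import mean, pstdev

def standardize(values):
    """Scale values to zero mean and unit variance (population std)."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]

scaled = standardize([1.0, 2.0, 3.0, 4.0])
print(scaled)
```

Note that the largest scaled value here is above 1, which is why the output range should not be described as -1 to 1.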

Rescales the data to the range 0 to 1 using the minimum (Min) and maximum (Max) values.
The formula is as follows.

X_std = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
X_scaled = X_std * (max - min) + min

https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html

NOR_ALL = False
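The same two-step formula, applied to one column in plain Python. This is a sketch of the arithmetic only; min_max_scale and its lo/hi arguments are names chosen here, corresponding to min and max in the formula above.

```python
def min_max_scale(values, lo=0.0, hi=1.0):
    """Rescale values linearly so the smallest maps to lo and the largest to hi."""
    vmin, vmax = min(values), max(values)
    return [(v - vmin) / (vmax - vmin) * (hi - lo) + lo for v in values]

print(min_max_scale([1.0, 2.0, 3.0]))
print(min_max_scale([1.0, 3.0], lo=-1.0, hi=1.0))
```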

A feature for selecting specific x columns out of x1 and x2. With only two x columns here, it is not really needed.

SELECT_NEED = False

Fixes the random seed so that every run produces the same sequence of random values.

np.random.seed(1000)

A class for writing log output.

class writeLog():
 def write(self, fileName, text):
  print(text)
  f=open(fileName,'a')
  f.write(text)
  f.write("\n")
  f.close()
 def writeWithoutCR(self, fileName, text):
  f=open(fileName,'a')
  f.write(text)
  f.close()

Read the train and test CSV files.

train_df = pd.read_csv(BASE_PATH+'train.csv', header=0, encoding='utf8')
test_df = pd.read_csv(BASE_PATH+'test.csv', header=0, encoding='utf8')

The loaded data contains the y columns, so remove them.

#drop Y
Y1train_df = train_df.pop('y1')
Y2train_df = train_df.pop('y2')

Y1test_df = test_df.pop('y1')
Y2test_df = test_df.pop('y2')

Stack the two dataframes vertically to build allX, so that the data transformations are applied to both in the same way.

allX = pd.concat([train_df, test_df], axis=0)

Next, the scalers run according to the options described above. Scaling helps with some machine learning algorithms.

if NOR_ALL == True:
 names = allX.columns
 scaler = preprocessing.MinMaxScaler()
 allX = scaler.fit_transform(allX)
 allX = pd.DataFrame(allX, columns=names)

if STD_ALL == True:
 names = allX.columns
 scaler = preprocessing.StandardScaler()
 allX = scaler.fit_transform(allX)
 allX = pd.DataFrame(allX, columns=names)

Next, only the needed X features are selected. Although the name is Y1sel_feature, it actually lists the x columns used as input when training for that y. This lets the user add or remove features.

if SELECT_NEED :
 Y1sel_feature = []
 for feature, count in Y1_FEATURE_LIST_FOR_SELECT :
  if count>=Y1_SELECT_LEVEL :
   Y1sel_feature.append(feature)
 log.write(PREFIX+"log.txt","Y1 selected:%d"%(len(Y1sel_feature)))

 Y2sel_feature = []
 for feature, count in Y2_FEATURE_LIST_FOR_SELECT :
  if count>=Y2_SELECT_LEVEL :
   Y2sel_feature.append(feature)
 log.write(PREFIX+"log.txt","Y2 selected:%d"%(len(Y2sel_feature)))

 Ysel_feature = []
 Ysel_feature = Y1sel_feature + Y2sel_feature
 Ysel_feature = list(set(Ysel_feature))
 log.write(PREFIX+"log.txt","Y selected:%d"%(len(Ysel_feature)))
else:
 Ysel_feature = []
 Y1sel_feature = []
 Y2sel_feature = []

The X features in allX are now split back into the train and predict sets.

# split allX back into the train and predict sets
trainX = allX[0:int(train_size)]
predictX = allX[int(train_size):int(allX.shape[0])]
log.write(PREFIX+"log.txt","train size:"+str(trainX.shape))
log.write(PREFIX+"log.txt","test size:"+str(predictX.shape))
del (allX)

The function that writes the final output.

def OutputData(filename,Y1,Y2):
 test_csv = pd.read_csv(BASE_PATH+'test.csv', header=0, encoding='utf8')
 test_csv.pop('y1')
 test_csv.pop('y2')
 predAll = pd.concat([test_csv,pd.DataFrame(Y1, columns=['y1'])], axis=1)
 predAll = pd.concat([predAll,pd.DataFrame(Y2, columns=['y2'])], axis=1)
 predAll.to_csv(path_or_buf=PREFIX+filename, index=False)
 del predAll

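Now the most important part, the Train function. trainX is the dataframe of X and trainY holds the y values. Ystr is a string used in the output: when there are several Y targets, it identifies which one a log line refers to. ysize is the length of the prediction array (the number of test rows). EPOCHS is the number of repetitions, each with a different seed. sel_feature is the list of columns selected earlier as Y1sel_feature or Y2sel_feature; an empty list [] selects all columns. fold_count is the number of cross-validation folds.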

def Train(trainX, trainY, Ystr, ysize, EPOCHS, sel_feature, fold_count):

The actual calls are as follows.

y1_pred, rmse1 = Train(trainX, Y1train_df, "Y1", len(Y1test_df), EPOCHS_TO_TRAIN, Y1sel_feature, FOLD_COUNT)
y2_pred, rmse2 = Train(trainX, Y2train_df, "Y2", len(Y2test_df), EPOCHS_TO_TRAIN, Y2sel_feature, FOLD_COUNT)

Inside Train, the following loop runs EPOCHS times, and each seed value is passed in as a parameter. Each completed cross-validation yields cv_rmse and cv_pred, which are accumulated into final_cv_rmse and final_cv_pred and then averaged.

for seed in range(EPOCHS):
 log.write(PREFIX+"log.txt","seed: %d"%(seed))
 param['seed'] = seed
 ....
 final_cv_rmse += cv_rmse
 final_cv_pred += cv_pred
final_cv_rmse /= EPOCHS
final_cv_pred /= EPOCHS
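The two levels of averaging (over folds, then over seeds) can be seen in isolation with a stand-in for the model. In this sketch average_predictions and fake_predict are hypothetical names; fake_predict simply returns numbers that depend on the seed and fold, in place of a trained model's test predictions.

```python
def average_predictions(predict, seeds, folds):
    """Average predict(seed, fold) over all folds, then over all seeds."""
    final = None
    for seed in range(seeds):
        cv = None
        for fold in range(folds):
            pred = predict(seed, fold)
            cv = list(pred) if cv is None else [a + b for a, b in zip(cv, pred)]
        cv = [v / folds for v in cv]          # mean over folds for this seed
        final = cv if final is None else [a + b for a, b in zip(final, cv)]
    return [v / seeds for v in final]         # mean over seeds

def fake_predict(seed, fold):
    # Stand-in for model.predict on the test set (two test rows).
    return [float(seed + fold), 1.0]

print(average_predictions(fake_predict, seeds=2, folds=2))
```

Averaging over several seeds and folds like this is a simple form of ensembling, which is why the final prediction is a probability-like value rather than a hard 0 or 1.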

The inner loop iterates over stratifiedkfold.split(trainXnp, trainYnp), which is specific to StratifiedKFold.
When switching to KFold, the y values are not used, so the split call becomes:

kfold.split(trainXnp):
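To see why plain k-fold needs no y, here is a minimal index splitter in plain Python. It is a sketch without shuffling and is not scikit-learn's KFold; the name kfold_indices is an assumption for this example.

```python
def kfold_indices(n_samples, n_splits):
    """Yield (train_idx, validate_idx) pairs; only the sample count is needed."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // n_splits + (1 if i < n_samples % n_splits else 0)
                  for i in range(n_splits)]
    start = 0
    for size in fold_sizes:
        validate = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validate
        start += size

for train_idx, validate_idx in kfold_indices(5, 2):
    print(train_idx, validate_idx)
```

A stratified splitter, by contrast, has to look at y to keep the class ratio similar in every fold.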

After each fold, the rmse and the test-set prediction, model.predict(xgb.DMatrix(predictX.as_matrix(columns=sel_feature),feature_names=sel_feature), ntree_limit=model.best_ntree_limit), are added to running totals. When all folds are done, both totals are divided by the number of folds, as in: cv_rmse /= fold_count

XGBoost takes its input as an xgb.DMatrix. The code below builds the DMatrix objects; here they are built from numpy arrays.

dtrn = xgb.DMatrix(X_train, label=y_train, feature_names=sel_feature)
dvld = xgb.DMatrix(X_validate, label=y_validate, feature_names=sel_feature)

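The watch list holds the datasets monitored during training; here both the validation and train sets are included. It is for watching whether the model overfits, and it is optional unless early_stopping_rounds is set, which requires at least one entry. 10000 is the maximum number of boosting rounds, but because of early_stopping_rounds, training normally stops well before all 10000 rounds run. This part differs from lightGBM. Note that XGBoost applies early stopping to the last entry in evals, which is the train set in this order (the log in 2.4 confirms that 'train-error' is used); placing (dvld,'eval') last would make early stopping follow the validation metric.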
watch_list = [(dvld,'eval'),(dtrn,'train')]
model = xgb.train(param, dtrn, num_boost_round=10000, evals=watch_list, early_stopping_rounds=10)
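The idea behind early_stopping_rounds can be sketched without XGBoost: stop once the monitored metric has not improved for a given number of rounds, and remember the best round. The function early_stop and the score list are made up for illustration; XGBoost's internal bookkeeping may differ in detail.

```python
def early_stop(scores, patience):
    """Return (best_iteration, stopped_at) for a metric where lower is better."""
    best_iter, best_score = 0, float("inf")
    for i, score in enumerate(scores):
        if score < best_score:
            best_iter, best_score = i, score      # new best round
        elif i - best_iter >= patience:
            return best_iter, i                   # no improvement for `patience` rounds
    return best_iter, len(scores) - 1             # ran to the end without stopping

print(early_stop([0.5, 0.4, 0.3, 0.35, 0.36, 0.2], patience=2))
```

In this made-up sequence training stops before the later improvement to 0.2 is seen, which is the usual trade-off of a small patience value.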

Predict the values. Passing ntree_limit=model.best_ntree_limit (unlike lightGBM) makes the prediction use the best iteration found during training rather than the last one.
y_pred = model.predict(dvld, ntree_limit=model.best_ntree_limit)

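Compute the error on the validation set. Note that mean_squared_error returns the mean squared error (MSE); the variable is named rmse, but no square root is taken.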
rmse = mean_squared_error(y_validate, y_pred)
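The difference between the two metrics is just a square root, as this plain-Python sketch shows. The helper names mse and rmse are defined here for illustration and are not scikit-learn functions.

```python
import math

def mse(y_true, y_pred):
    """Mean squared error: average of squared differences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean squared error: square root of the MSE."""
    return math.sqrt(mse(y_true, y_pred))

print(mse([0, 1], [0, 0]), rmse([0, 1], [0, 0]))
```

So a logged value of 0.004275 from the script above is an MSE; the corresponding RMSE would be its square root, about 0.065.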

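2.4 Execution result

The number of repetitions with different random seeds is set by EPOCHS_TO_TRAIN = 2, and the number of cross-validation folds by FOLD_COUNT = 2.
Each call to Train therefore runs xgb.train EPOCHS_TO_TRAIN * FOLD_COUNT = 2 * 2 = 4 times, and since Train is called once each for Y1 and Y2, the model is trained 8 times in total.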

(base) E:\work\xgboost>python xgboost_xor_cv.py
start                                                                       2
train size:(1000, 2)
test size:(100, 2)
all size:(1100, 2)
train size:(1000, 2)
test size:(100, 2)
seed: 0
xgboost_xor_cv.py:144: FutureWarning: Method .as_matrix will be removed in a future version. Use .values instead.
  trainXnp = trainX.as_matrix(columns = sel_feature)
xgboost_xor_cv.py:145: FutureWarning: Method .as_matrix will be removed in a future version. Use .values instead.
  trainYnp = trainY.as_matrix()
[1]Fold
[21:09:56] d:\build\xgboost\xgboost-0.81.git\src\tree\updater_prune.cc:74: tree pruning end, 1 roots, 14 extra nodes, 0 pruned nodes, max_depth=4
[0]     eval-error:0.007984     train-error:0.002004
Multiple eval metrics have been passed: 'train-error' will be used for early stopping.
.......
[2]rmse 0.006062
[2]Y2 Feature importance:split
xgboost_xor_cv.py:169: FutureWarning: Method .as_matrix will be removed in a future version. Use .values instead.
  cv_pred += model.predict(xgb.DMatrix(predictX.as_matrix(columns=sel_feature),feature_names=sel_feature), ntree_limit=model.best_ntree_limit)
Total rmse 0.004275 + 0.004275 = 0.004275

The final line, Total rmse 0.004275 + 0.004275 = 0.004275, reports an average: the two values are added and divided by 2. The first value is for Y1 and the second is for Y2.

2.5 Checking the result data

xgboost_xor_cvResult.csv

x1,x2,y1,y2
1.077770981623956,0.16715055841696874,0.9538576304912567,0.0461423909291625
0.1898476282284382,-0.2003537468439492,0.0488235168159008,0.9511764645576477
-0.2648225646697367,0.8201024089303411,0.9523830413818359,0.04761694651097059
-0.23118448473185735,0.06200855703372005,0.04614610876888037,0.9538538455963135
0.9576799510927888,0.8890909020672868,0.04623681679368019,0.9537632167339325
0.0004552517531765665,1.2781596666369324,0.9370305836200714,0.06296937074512243

.......

First line:
[x1=1.077770981623956 (approx. 1)] XOR [x2=0.16715055841696874 (approx. 0)] = 1

The result y1=0.9538576304912567 is approximately 1, so the prediction looks correct.
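To turn the predicted probabilities into 0/1 classes, a 0.5 threshold can be applied to each row. The sketch below uses the first two rows of the result file shown above, embedded as a string so that it runs on its own; the function name to_classes and the threshold are choices made for this example.

```python
import csv
import io

# First two data rows of xgboost_xor_cvResult.csv, copied from above
RESULT_ROWS = """x1,x2,y1,y2
1.077770981623956,0.16715055841696874,0.9538576304912567,0.0461423909291625
0.1898476282284382,-0.2003537468439492,0.0488235168159008,0.9511764645576477
"""

def to_classes(csv_text, threshold=0.5):
    """Return a list of (y1, y2) class labels, one pair per CSV row."""
    rows = csv.DictReader(io.StringIO(csv_text))
    return [(int(float(r["y1"]) > threshold), int(float(r["y2"]) > threshold))
            for r in rows]

print(to_classes(RESULT_ROWS))
```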

Compared with the lightGBM result, however, the error is somewhat larger. There can be several reasons: parameters not covered here may need tuning, and the amount of data may simply be too small for binary:logistic to fit well.

param={ 'objective':'binary:logistic', 'seed':0, 'eta':0.5}

For XGBoost parameter settings, see the link below.

https://xgboost.readthedocs.io/en/latest/parameter.html

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

If the as_matrix() calls fail (the method was removed in pandas 1.0), change them to .values as follows.

Change to => trainXnp = trainX[sel_feature].values
Change to => trainYnp = trainY.values

Change to => cv_pred += model.predict(xgb.DMatrix(predictX[sel_feature].values,feature_names=sel_feature), ntree_limit=model.best_ntree_limit)

Sunday, January 20, 2019

Install xgboost in python

Please read to the end before installing.

I started from the guide below.
https://xgboost.readthedocs.io/en/latest/build.html

Environment: python 3.6.4, 32-bit, on Windows.

(py364_env) E:\Program Files (x86)\Python36-32>pip3 install xgboost
Collecting xgboost
  Downloading https://files.pythonhosted.org/packages/4f/4c/4969b10939c4557ae46e5569d07c0c7ce772b3d6b9c1401a6ed07059fdee/xgboost-0.81.tar.gz (636kB)
    100% |████████████████████████████████| 645kB 5.1MB/s
Files/directories not found in C:\Users\xxx\AppData\Local\Temp\pip-install-1eb0m6cr\xgboost\pip-egg-info

Something odd goes wrong here.

After searching, it turns out that a specific file has to be downloaded depending on the environment.

I found a helpful post:

How to install xgboost package in python (windows platform)? (stackoverflow.com)

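1. Download the appropriate file from the link below

https://www.lfd.uci.edu/~gohlke/pythonlibs/
(Python 3.6 on a 32-bit environment, so I downloaded the following file)
xgboost-0.81-cp36-cp36m-win32.whl

2. Install with the following command

pip install [whl file name]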

(py364_env) E:\Program Files (x86)\Python36-32>pip install C:\Users\darts\Downloads\xgboost-0.81-cp36-cp36m-win32.whl
Processing c:\users\darts\downloads\xgboost-0.81-cp36-cp36m-win32.whl
Requirement already satisfied: numpy in e:\program files (x86)\python36-32\py364_env\lib\site-packages (from xgboost==0.81) (1.16.0)
Collecting scipy (from xgboost==0.81)
  Downloading https://files.pythonhosted.org/packages/18/d1/9ed926284d97c12cc4c07b81abce886dd4430c52b8323defc365ed04026e/scipy-1.2.0-cp36-cp36m-win32.whl (26.9MB)
    100% |████████████████████████████████| 27.0MB 734kB/s
Installing collected packages: scipy, xgboost
Successfully installed scipy-1.2.0 xgboost-0.81

(If the installation fails because numpy is outdated, upgrade numpy.)

(py364_env) E:\Program Files (x86)\Python36-32>pip install numpy
Collecting numpy
  Downloading https://files.pythonhosted.org/packages/6e/ef/1402e6016ba0aa19463198be521b265c6bbe4ee892a7f42385d29e8d894d/numpy-1.16.0-cp36-cp36m-win32.whl (10.0MB)
    100% |████████████████████████████████| 10.0MB 1.9MB/s
Installing collected packages: numpy
Successfully installed numpy-1.16.0

4. Verifying that it works

The installation appears to succeed, but the import fails.

(py364_env) E:\Program Files (x86)\Python36-32\py364_env\Scripts>python
Python 3.6.4 (v3.6.4:d48eceb, Dec 19 2017, 06:04:45) [MSC v.1900 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import xgboost
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "E:\Program Files (x86)\Python36-32\py364_env\lib\site-packages\xgboost\__init__.py", line 11, in <module>
    from .core import DMatrix, Booster
  File "E:\Program Files (x86)\Python36-32\py364_env\lib\site-packages\xgboost\core.py", line 150, in <module>
    _LIB = _load_lib()
  File "E:\Program Files (x86)\Python36-32\py364_env\lib\site-packages\xgboost\core.py", line 141, in _load_lib
    'Error message(s): {}\n'.format(os_error_list))
xgboost.core.XGBoostError: XGBoost Library (xgboost.dll) could not be loaded.
Likely causes:
  * OpenMP runtime is not installed (vcomp140.dll or libgomp-1.dll for Windows, libgomp.so for UNIX-like OSes)
  * You are running 32-bit Python on a 64-bit OS
Error message(s): ['[WinError 126] 지정된 모듈을 찾을 수 없습니다']

Googling:

xgboost error not a valid Win32 application Stack Overflow

https://stackoverflow.com/.../xgboost-error-not-a-valid-win32-ap...

2018. 9. 28. - Exception has occurred: xgboost.core.XGBoostError XGBoost Library (xgboost.dll)could not be loaded. Likely causes: * OpenMP runtime is not installed (vcomp140.dll or libgomp-1.dll for Windows, libgomp.so for UNIX-like OSes) * You are running 32-bit Python on a 64-bit OSError message(s): ['[WinError ...

5. Back to square one

It turns out that 64-bit Windows needs 64-bit Python.

This time I tried the 64-bit anaconda python 3.7 environment.

6. Download

Download the file below from https://www.lfd.uci.edu/~gohlke/pythonlibs/
xgboost-0.81-cp37-cp37m-win_amd64.whl

7. pip install

(base) C:\Users\xxx>python
Python 3.7.1 (default, Dec 10 2018, 22:54:23) [MSC v.1915 64 bit (AMD64)] :: Anaconda, Inc. on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> ^Z


(base) C:\Users\xxx>pip install C:\Users\darts\Downloads\xgboost-0.81-cp37-cp37m-win_amd64.whl
Processing c:\users\darts\downloads\xgboost-0.81-cp37-cp37m-win_amd64.whl
Requirement already satisfied: numpy in e:\programdata\anaconda3\lib\site-packages (from xgboost==0.81) (1.15.4)
Requirement already satisfied: scipy in e:\programdata\anaconda3\lib\site-packages (from xgboost==0.81) (1.1.0)
Installing collected packages: xgboost
Successfully installed xgboost-0.81

8. Verify the installation

(base) C:\Users\xxx>python
Python 3.7.1 (default, Dec 10 2018, 22:54:23) [MSC v.1915 64 bit (AMD64)] :: Anaconda, Inc. on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import xgboost
>>>

9. Conclusion

- 64-bit Windows requires 64-bit python
- Download the file matching your OS and python version from https://www.lfd.uci.edu/~gohlke/pythonlibs/
- pip install the whl file
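Before picking a wheel, it helps to check whether the running interpreter is 32-bit or 64-bit. A small standard-library sketch (the function name python_bits is chosen here):

```python
import platform
import struct

def python_bits():
    """Return 32 or 64 depending on the pointer size of this interpreter."""
    return struct.calcsize("P") * 8

print("python:", python_bits(), "bit")
print("machine:", platform.machine())
```

If the interpreter reports 32 while the machine is 64-bit, that matches the situation that failed in step 4.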

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

10. Update, 2020.08.23

On 64-bit Windows with Python 3.8 (the version shown in the log below), xgboost now installs with a plain pip command.

C:\Users\USER\Documents\python\stock>pip3 install xgboost
Collecting xgboost
  Downloading https://files.pythonhosted.org/packages/b8/1c/8384c92f40e9a4739d0b474573d9bfd19b7846b5d28f6c53294e2c5c5af4/xgboost-1.2.0-py3-none-win_amd64.whl (86.5MB)
     |████████████████████████████████| 86.5MB 218kB/s
Collecting scipy (from xgboost)
  Downloading https://files.pythonhosted.org/packages/9e/66/57d6cfa52dacd9a57d0289f8b8a614b2b4f9c401c2ac154d6b31ed8257d6/scipy-1.5.2-cp38-cp38-win_amd64.whl (31.4MB)
     |████████████████████████████████| 31.4MB 595kB/s
Collecting numpy (from xgboost)
  Downloading https://files.pythonhosted.org/packages/c7/7d/ea9e28c3a99f50e77ee9a0e3759adb6537b2bb7a84aef27b8c0ddc431b48/numpy-1.19.1-cp38-cp38-win_amd64.whl (13.0MB)
     |████████████████████████████████| 13.0MB 3.3MB/s
Installing collected packages: numpy, scipy, xgboost
Successfully installed numpy-1.19.1 scipy-1.5.2 xgboost-1.2.0
WARNING: You are using pip version 19.2.3, however version 20.2.2 is available.
You should consider upgrading via the 'python -m pip install --upgrade pip' command.

C:\Users\USER\Documents\python\stock>python
Python 3.8.3 (tags/v3.8.3:6f8c832, May 13 2020, 22:37:02) [MSC v.1924 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import xgboost
>>>
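The import check can also be scripted. The sketch below catches any exception rather than only ImportError, on the assumption that a failed DLL load surfaces as a different exception type, as with the XGBoostError in step 4. The helper name try_import is hypothetical.

```python
import importlib

def try_import(name):
    """Return (version, None) on success or (None, error_text) on failure."""
    try:
        module = importlib.import_module(name)
    except Exception as exc:  # a failed native library load may not be an ImportError
        return None, repr(exc)
    return getattr(module, "__version__", "unknown"), None

print(try_import("xgboost"))
```

On a working install this prints the xgboost version; otherwise it prints the error text instead of a traceback.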

Wednesday, January 16, 2019

xgboost basic example in python

This post walks through the basic xgboost python example to show how the library is used.

0. Before we begin

The source code and data used here come from the link below.
https://xgboost.readthedocs.io/en/latest/python/index.html#

Example source link

https://github.com/dmlc/xgboost/tree/master/demo/guide-python

Data links

https://github.com/dmlc/xgboost/blob/master/demo/data/agaricus.txt.test
https://github.com/dmlc/xgboost/blob/master/demo/data/agaricus.txt.train

If running the basic example fails with an error like the one below, check that the source data files were downloaded correctly.
xgboost.core.XGBoostError: b'[17:54:28] d:\\build\\xgboost\\xgboost-0.81.git\\dmlc-core\\src\\data\\strtonum.h:147: Check failed: sign == true (0 vs. 1) '

Reference link

https://github.com/dmlc/xgboost/issues/3355
(The input data was malformed: the file contained HTML code.)
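A quick way to catch this is to look at the start of the file before handing it to XGBoost. This sketch assumes the bad file is an HTML page saved from the GitHub web view instead of the raw data; looks_like_html is a hypothetical helper and the check is only a heuristic.

```python
def looks_like_html(text):
    """Heuristic: True if the text starts like an HTML page rather than data."""
    head = text.lstrip()[:200].lower()
    return head.startswith("<!doctype") or head.startswith("<html") or "<head" in head

print(looks_like_html("<!DOCTYPE html>\n<html lang='en'>"))
print(looks_like_html("1 3:1 10:1 11:1 21:1"))
```

To use it on a real file, read the first few hundred characters of agaricus.txt.train and pass them in.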

1. Example

#filename : boost_from_prediction.py

#!/usr/bin/python
import numpy as np
import xgboost as xgb

dtrain = xgb.DMatrix('agaricus.txt.train')
dtest = xgb.DMatrix('agaricus.txt.test')
watchlist = [(dtest, 'eval'), (dtrain, 'train')]
###
# advanced: start from a initial base prediction
#
print ('start running example to start from a initial prediction')
# specify parameters via map, definition are same as c++ version
param = {'max_depth':2, 'eta':1, 'silent':1, 'objective':'binary:logistic'}
# train xgboost for 1 round
bst = xgb.train(param, dtrain, 1, watchlist)
# Note: we need the margin value instead of transformed prediction in set_base_margin
# do predict with output_margin=True, will always give you margin values before logistic transformation
ptrain = bst.predict(dtrain, output_margin=True)
ptest = bst.predict(dtest, output_margin=True)
dtrain.set_base_margin(ptrain)
dtest.set_base_margin(ptest)


print('this is result of running from initial prediction')
bst = xgb.train(param, dtrain, 1, watchlist)

Result

(base) E:\work\ai\xgboost>python boost_from_prediction.py
[19:09:06] 6513x127 matrix with 143286 entries loaded from agaricus.txt.train
[19:09:06] 1611x127 matrix with 35442 entries loaded from agaricus.txt.test
start running example to start from a initial prediction
[0]     eval-error:0.042831     train-error:0.046522
this is result of running from initial prediction
[0]     eval-error:0.021726     train-error:0.022263

2. Input data

The input data is loaded with DMatrix.
dtrain = xgb.DMatrix('agaricus.txt.train')
dtest = xgb.DMatrix('agaricus.txt.test')

For the DMatrix file formats, see the link below.
https://xgboost.readthedocs.io/en/latest/tutorials/input_format.html
Two text formats are supported: LibSVM and CSV.

Opening a train data file in a text editor shows content like the following.

train.txt

1 101:1.2 102:0.03
0 1:2.1 10001:300 10002:400
0 0:1.3 1:0.3
1 0:0.01 1:0.3
0 0:0.2 1:0.3

The 0 or 1 at the start of each line is the label, the y value in usual machine learning terms. This example is a binary classification, so only 0 and 1 appear. After the label come entries such as 101:1.2, where 101 is the column index of the array (also called the feature index). Indices start at 0, and 101:1.2 means that column 101 holds the value 1.2.
In agaricus.txt.train the largest index is 126, so the number of columns (features) is 127.
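The format is simple enough to parse by hand, which also shows where the feature count comes from. This is a plain-Python sketch for illustration, not how XGBoost reads the file; parse_libsvm_line and num_features are names made up here.

```python
def parse_libsvm_line(line):
    """Split 'label idx:value idx:value ...' into (label, {index: value})."""
    parts = line.split()
    label = float(parts[0])
    features = {}
    for item in parts[1:]:
        index, value = item.split(":")
        features[int(index)] = float(value)
    return label, features

def num_features(lines):
    """Feature count is the largest 0-based index plus one."""
    max_index = max(max(parse_libsvm_line(line)[1]) for line in lines)
    return max_index + 1

sample = ["1 101:1.2 102:0.03", "0 0:1.3 1:0.3"]
print(parse_libsvm_line(sample[0]))
print(num_features(sample))
```

Columns that are not listed on a line are simply absent, which is what makes the format compact for sparse data.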

3. train parameter

Training is done with the train function, which takes several arguments. This example uses the following.

watchlist = [(dtest, 'eval'), (dtrain, 'train')]
###
# advanced: start from a initial base prediction
#
print ('start running example to start from a initial prediction')
# specify parameters via map, definition are same as c++ version
param = {'max_depth':2, 'eta':1, 'silent':1, 'objective':'binary:logistic'}
# train xgboost for 1 round
bst = xgb.train(param, dtrain, 1, watchlist)

The arguments passed to train are as follows.

xgboost.train(params, dtrain, num_boost_round=10, evals=(), obj=None, feval=None, maximize=False, early_stopping_rounds=None, evals_result=None, verbose_eval=True, xgb_model=None, callbacks=None, learning_rates=None)
The details are available at the following link.
https://xgboost.readthedocs.io/en/latest/python/python_api.html#module-xgboost.training

params (dict) – Booster params.
dtrain (DMatrix) – Data to be trained.
num_boost_round (int) – Number of boosting iterations.
evals (list of pairs (DMatrix, string)) – List of items to be evaluated during training, this allows user to watch performance on the validation set.

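The params are described in detail at the following link.
https://xgboost.readthedocs.io/en/latest/parameter.html
Since there are many kinds of learning tasks, 'objective' is the most important parameter to set.
binary:logistic selects logistic regression for binary classification; the output is a probability.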

binary:logistic: logistic regression for binary classification, output probability

eta is the learning rate. A smaller value makes training take longer because more rounds are needed; a larger value speeds up learning but makes overfitting more likely.

eta [default=0.3, alias: learning_rate]

Step size shrinkage used in update to prevents overfitting. After each boosting step, we can directly get the weights of new features, and eta shrinks the feature weights to make the boosting process more conservative.
range: [0,1]
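The effect of shrinkage can be seen in a toy setting. The sketch assumes an idealized learner that fits the remaining residual exactly in every round, so each round moves the prediction a fraction eta of the way to the target. This is a deliberate simplification and not XGBoost's tree-fitting procedure; boost_toward_target is a hypothetical name.

```python
def boost_toward_target(target, eta, rounds, start=0.0):
    """Toy boosting: each round adds eta times the remaining residual."""
    prediction = start
    history = []
    for _ in range(rounds):
        residual = target - prediction
        prediction += eta * residual    # shrink the step by eta
        history.append(prediction)
    return history

print(boost_toward_target(1.0, eta=0.5, rounds=4))
print(boost_toward_target(1.0, eta=1.0, rounds=4))
```

With eta=1 the toy model jumps to the target in one round, while eta=0.5 approaches it gradually, which is the "more conservative" behavior the documentation describes.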

max_depth is the maximum depth of a tree. Increasing it makes the model more complex and more prone to overfitting.

max_depth [default=6]

Maximum depth of a tree. Increasing this value will make the model more complex and more likely to overfit. 0 indicates no limit. Note that limit is required when grow_policy is set of depthwise.
range: [0,∞]

The documentation says to use verbosity instead of silent. It sets the level of log output.
silent [default=0] [Deprecated]

Deprecated. Please use verbosity instead.

verbosity [default=1]

Verbosity of printing messages. Valid values are 0 (silent), 1 (warning), 2 (info), 3 (debug). Sometimes XGBoost tries to change configurations based on heuristics, which is displayed as warning message. If there’s unexpected behaviour, please try to increase value of verbosity.

dtrain is the training data, the object returned by DMatrix.
num_boost_round is set to 1, and the watchlist variable is passed as evals.
The value used is watchlist = [(dtest, 'eval'), (dtrain, 'train')], a list of (DMatrix, string) pairs. It is used to evaluate the model's performance; the meaning becomes clear from the output.
The output looks like the line below. It shows how much error the current model produces on dtest and how much on dtrain. Changing the strings changes the names in the output.

[0]     eval-error:0.042831     train-error:0.046522
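The error value in that line is a binary classification error rate: the share of rows whose predicted class is wrong when probabilities above 0.5 count as the positive class. A plain-Python sketch with made-up labels and probabilities (error_rate is a name chosen here):

```python
def error_rate(labels, probabilities, threshold=0.5):
    """Fraction of rows where the thresholded prediction differs from the label."""
    wrong = sum(1 for y, p in zip(labels, probabilities)
                if (p > threshold) != bool(y))
    return wrong / len(labels)

# Made-up example: the last two predictions fall on the wrong side of 0.5
print(error_rate([1, 0, 1, 0], [0.9, 0.2, 0.4, 0.6]))
```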

num_boost_round: the number of boosting rounds to run.

4. predict

Prediction is used to obtain the final result.

# Note: we need the margin value instead of transformed prediction in set_base_margin
# do predict with output_margin=True, will always give you margin values before logistic transformation
ptrain = bst.predict(dtrain, output_margin=True)
ptest = bst.predict(dtest, output_margin=True)

Prediction uses the predict function. Given dtrain as input, the result is a one-dimensional numpy array. Printing ptrain gives output like this.
[ 1.7121772 -1.7004405 -1.7004405 ... -1.9407086 -1.9407086 -1.9407086]
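These are margin values, that is, scores before the logistic transformation, which is why they fall outside 0 to 1. For binary:logistic the probability is obtained by applying the sigmoid function to the margin. A small sketch using three of the values printed above (the function name sigmoid is defined here):

```python
import math

def sigmoid(margin):
    """Logistic transformation from a raw margin to a probability."""
    return 1.0 / (1.0 + math.exp(-margin))

margins = [1.7121772, -1.7004405, -1.9407086]
print([round(sigmoid(m), 4) for m in margins])
```

Positive margins map to probabilities above 0.5 and negative margins to probabilities below 0.5.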

predict(data, output_margin=False, ntree_limit=None, validate_features=True)

Predict with data.

Parameters:
data (DMatrix) – The dmatrix storing the input.
output_margin (bool) – Whether to output the raw untransformed margin value.
ntree_limit (int) – Limit number of trees in the prediction; defaults to best_ntree_limit if defined (i.e. it has been trained with early stopping), otherwise 0 (use all trees).
validate_features (bool) – When this is True, validate that the Booster’s and data’s feature_names are identical. Otherwise, it is assumed that the feature_names are the same.
Returns:	prediction
Return type:	numpy array