23

I am trying to run xgboost in scikit-learn, and I am only using pandas to load the data into a DataFrame. How am I supposed to use a pandas DataFrame with xgboost? I am confused by the DMatrix routine required to run the xgboost algorithm.

Ethan
Ghostintheshell

3 Answers

26

You can use the DataFrame's .values attribute to access the raw NumPy data once you have arranged the columns as you need them.

E.g.

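import pandas as pd
import xgboost as xgb

# Load the training data and separate the label column from the features
train = pd.read_csv("train.csv")
target = train['target']
train = train.drop(['ID', 'target'], axis=1)

# The test file has no 'target' column, so only the ID is dropped
test = pd.read_csv("test.csv")
test = test.drop(['ID'], axis=1)

# Pass the underlying NumPy arrays to DMatrix
xgtrain = xgb.DMatrix(train.values, label=target.values)
xgtest = xgb.DMatrix(test.values)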

Obviously you may need to change which columns you drop or use as the training target. The above was for a Kaggle competition, so there was no target data for xgtest (it is held back by the organisers).

Neil Slater
  • When trying this way xgb.DMatrix(X_train.values, y_train.values) I am seeing TypeError: can not initialize DMatrix from dict – WestCoastProjects Sep 30 '18 at 21:15
  • @javadba: It definitely worked in 2016 on my machine! I cannot test this at the moment as I cannot install xgboost. It is possible some library code has changed. More likely there is something different about your situation. I found https://stackoverflow.com/questions/35402461/xgboost-typeerror-can-not-initialize-dmatrix-from-dataframe but that simply advises you to do exactly what this answer does (i.e. use .values) – Neil Slater Oct 01 '18 at 06:58
12

You can now use pandas DataFrames directly with XGBoost. This definitely works with xgboost 0.81.

For example, where X_train, X_val, y_train, and y_val are DataFrames:

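from math import sqrt

import xgboost as xgb
from sklearn.metrics import mean_squared_error

mod = xgb.XGBRegressor(
    gamma=1,
    learning_rate=0.01,
    max_depth=3,
    n_estimators=10000,
    subsample=0.8,
    random_state=34
)

# The scikit-learn wrapper accepts DataFrames directly; no DMatrix is needed
mod.fit(X_train, y_train)
predictions = mod.predict(X_val)
rmse = sqrt(mean_squared_error(y_val, predictions))
print("score: {0:,.0f}".format(rmse))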

jeffhale
8

There is some good news: the pandas_ml library supports XGBoost, which will probably streamline the workflow.

Ethan
user4959