1 year ago
#383304
skklogw7
Tidymodels workflow_map() function not working on Spark cluster with R sparklyr
I am attempting to run a time series cross-validation ML tuning process on a Spark cluster (sparklyr on Databricks), but I'm getting an error. The packages I'm using are tidymodels together with modeltime. The code works perfectly fine on a local machine, but fails at the workflow_map() call when running on Spark. The purpose of that function is to train each model on several time series 'folds', which are defined by time_series_cv(). I can't debug this for the life of me because the Spark error trace is uninformative. Does anyone know why this would work locally but not on Spark? I'm somewhat new to working with clusters, so I could be overlooking something simple.
If it is a package limitation, does anyone know whether there is an alternative way to do the resampling CV on Spark, where each model is trained on several non-overlapping time 'slices' of the series? Thank you in advance.
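For context, the surrounding setup looks roughly like this (the spark_connect() details are simplified, and train_tbl / dv are stand-ins for my actual data and outcome column name):

library(sparklyr)
library(tidymodels)
library(modeltime)

# Connect to the Databricks-managed Spark cluster
# (simplified; actual connection config omitted)
sc <- spark_connect(method = "databricks")

# train_tbl is a local tibble with a `date` column plus predictors;
# dv is a string naming the outcome column
dv <- "outcome"  # placeholder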
# Define CV schema
cv_folds <- time_series_cv(
  data        = train_tbl,
  assess      = "6 months",
  initial     = "4 years",
  skip        = "1 months",
  slice_limit = 20
)
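As a sanity check, the resampling plan can be inspected locally with timetk; this plotting step is just verification, not part of the failing pipeline:

# Verify the fold layout locally (tk_time_series_cv_plan() and
# plot_time_series_cv_plan() are from timetk)
cv_folds %>%
  tk_time_series_cv_plan() %>%
  plot_time_series_cv_plan(date, !!sym(dv), .interactive = FALSE)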
# Create preprocessing recipe (dv holds the name of the outcome column)
recipe_spec_lag <- recipe(formula(paste0(dv, " ~ .")), data = train_tbl) %>%
  step_dummy(all_nominal()) %>%
  step_rm(date) %>%
  step_zv(all_predictors())
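A quick way to confirm the recipe itself is well-formed is to prep and bake it locally:

# Confirm the recipe runs end-to-end on the local training data
recipe_spec_lag %>%
  prep() %>%
  bake(new_data = NULL) %>%  # returns the processed training set
  glimpse()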
# Create hyperparameter grid
grid_tbl_xgb <- grid_regular(
  learn_rate(),
  trees(),
  levels = 3
)
grid_tbl_xgb <- grid_tbl_xgb %>%
  create_model_grid(
    f_model_spec = boost_tree,
    engine_name = "spark",  # also tried engine_name = "xgboost"
    mode = "regression",
    engine_params = list(max_depth = 5)
  )
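create_model_grid() (from modeltime) attaches a .models list-column of parsnip model specs; printing one element shows the engine and parameters being passed down:

# Inspect the first generated spec (engine, mode, tuned parameters)
grid_tbl_xgb$.models[[1]]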
# Define workflow set
model_wfset <- workflow_set(
  preproc = list(recipe_spec_lag),
  models = grid_tbl_xgb$.models,
  cross = TRUE
)
# Train models across the grid and CV folds
# (this is where it fails on Databricks; it works locally)
test <- workflow_map(model_wfset, fn = "fit_resamples", resamples = cv_folds)
I get the following error:
i 1 of 9 resampling: recipe_boost_tree_1
✖ 1 of 9 resampling: recipe_boost_tree_1 failed with: org.apache.spark.SparkException:
Job aborted due to stage failure: Task 6 in stage 211.0 failed 4 times, most recent
failure: Lost task 6.3 in stage 211.0 (TID 1510) (192.18.29.13 executor 0):
java.lang.Exception: sparklyr worker rscript failure with status 255, check worker logs for details.
    at sparklyr.Rscript.init(rscript.scala:83)
    at sparklyr.WorkerApply$$anon$2.run(workerapply.scala:138)
Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2984)
    at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2931)
    at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2925)
    at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
    at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2925)
    at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1345)
    at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1345)
    at scala.Option.foreach(Option.scala:407)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1345)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:3193)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:3134)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:3122)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
Caused by: java.lang.Exception: sparklyr worker rscript failure with status 255, check worker logs for details.
    at sparklyr.Rscript.init(rscript.scala:83)
    at sparklyr.WorkerApply$$anon$2.run(workerapply.scala:138)
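To narrow down whether the problem is the Spark engine itself or the workflow_map()/fit_resamples() layer, a stripped-down fit along these lines should help (train_sdf and the copy_to() call are placeholders for however the data reaches the cluster):

# Isolation test: fit a single spark boost_tree directly,
# bypassing workflow_map() and fit_resamples()
train_sdf <- copy_to(sc, train_tbl, "train_tbl", overwrite = TRUE)

single_fit <- boost_tree(mode = "regression") %>%
  set_engine("spark") %>%
  fit(formula(paste0(dv, " ~ .")), data = train_sdf)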
Tags: r, apache-spark, databricks, sparklyr, tidymodels
0 Answers