Evan Zamir

How to properly define GLM with Tweedie family in PySpark?

I'm trying to adapt the simple GLM example from the docs to use Tweedie:

import logging

from pyspark.ml.linalg import Vectors
from pyspark.ml.regression import GeneralizedLinearRegression
from py4j.protocol import Py4JJavaError

def create_fake_losses_data(self):
    df = self._spark.createDataFrame([
        ("a", 100.0, 12, 1, Vectors.dense(0.0, 0.0)),
        ("b", 0.0, 24, 1, Vectors.dense(1.0, 2.0)),
        ("c", 0.0, 36, 1, Vectors.dense(0.0, 0.0)),
        ("d", 2000.0, 48, 1, Vectors.dense(1.0, 1.0)),
    ], ["user_hashed", "label", "offset", "weight", "features"])
    logging.info(df.collect())
    setattr(self, 'fake_data', df)
    try:
        glr = GeneralizedLinearRegression(
            family="tweedie", variancePower=1.5, offsetCol='offset')
        glr.setRegParam(0.3)
        model = glr.fit(df)
        logging.info(model)  # implicitly calls model.toString()
    except Py4JJavaError as e:
        print(e)
    return self

This gives me the following error:

py4j.protocol.Py4JJavaError: An error occurred while calling o96.toString.
: java.util.NoSuchElementException: Failed to find a default value for link
        at org.apache.spark.ml.param.Params.$anonfun$getOrDefault$2(params.scala:756)
        at scala.Option.getOrElse(Option.scala:189)
        at org.apache.spark.ml.param.Params.getOrDefault(params.scala:756)
        at org.apache.spark.ml.param.Params.getOrDefault$(params.scala:753)
        at org.apache.spark.ml.PipelineStage.getOrDefault(Pipeline.scala:41)
        at org.apache.spark.ml.param.Params.$(params.scala:762)
        at org.apache.spark.ml.param.Params.$$(params.scala:762)
        at org.apache.spark.ml.PipelineStage.$(Pipeline.scala:41)
        at org.apache.spark.ml.regression.GeneralizedLinearRegressionModel.toString(GeneralizedLinearRegression.scala:1117)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
        at py4j.Gateway.invoke(Gateway.java:282)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:238)
        at java.lang.Thread.run(Thread.java:748)
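
Reading the trace, the failure seems to come from my logging.info(model) call, which invokes the model's toString; that in turn looks up the link param, which for the Tweedie family apparently has neither a set value nor a default. A toy reproduction of that lookup (plain Python, my own sketch of the mechanics, not Spark's actual code):

```python
# Toy model of the getOrDefault behavior seen in the trace: the lookup
# succeeds for params that are set or have defaults, and raises for a
# param (like 'link' under family="tweedie") that has neither.
class Params:
    def __init__(self):
        self._set = {}        # params explicitly set by the user
        self._defaults = {}   # params that carry a default value

    def get_or_default(self, name):
        if name in self._set:
            return self._set[name]
        if name in self._defaults:
            return self._defaults[name]
        raise KeyError(f"Failed to find a default value for {name}")

p = Params()
p._set["family"] = "tweedie"  # link deliberately left unset, per the docs
try:
    p.get_or_default("link")  # the lookup toString appears to make
except KeyError as e:
    print(e)  # -> 'Failed to find a default value for link'
```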

According to the docs, however, when using the Tweedie family you are supposed to leave link unset, so I'm very confused here. Has anyone actually done a proper Tweedie regression using PySpark (or any version of Spark, really)? The docs also confuse me about the difference between variancePower and linkPower when using Tweedie. Which one am I supposed to set? Which one is the p of a Tweedie distribution?
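
For what it's worth, my current reading of the docs (an illustrative sketch in plain Python, with helper names of my own, not Spark code): variancePower is the Tweedie p in the variance function V(mu) = mu**p, while linkPower is the exponent of the power link, and when left unset it defaults to 1 - variancePower (with linkPower 0 meaning the log link).

```python
import math

def tweedie_variance(mu, variance_power):
    """Tweedie variance function V(mu) = mu**p, where p is variancePower."""
    return mu ** variance_power

def power_link(mu, link_power):
    """Power link g(mu) = mu**linkPower, with log(mu) when linkPower == 0."""
    return math.log(mu) if link_power == 0 else mu ** link_power

variance_power = 1.5                      # the Tweedie p, as in my GLM above
default_link_power = 1 - variance_power   # what Spark uses when linkPower is unset

print(tweedie_variance(4.0, variance_power))  # 4.0**1.5 = 8.0
print(power_link(4.0, default_link_power))    # 4.0**-0.5 = 0.5
```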

Tags: apache-spark, pyspark, glm, apache-spark-ml, tweedie

0 Answers
