I did some research on the Python side of things. Python developers typically get a very verbose error message (a traceback) describing the Python exception that occurred while executing their UDF; see this example: https://stackoverflow.com/questions/59739846/pyspark-implement-helper-in-rdd-map. The parsing of those exceptions appears to happen in org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException, but our .NET UDFs don't seem to have anything that corresponds to this handler.

If the exception details won't be relayed back to the driver, nor appear in the driver logs, then development and debugging become challenging. But I am wondering what the impact will be in our production environment as well. We don't have log delivery enabled, and we rely only on the default files that are available in the Databricks workspace (driver log, standard output). It seems to me that failures in our UDFs won't produce any meaningful output in those default files, and that will require us to enable and manage log delivery in production. Is that what others are doing?

Please let me know if there are any other techniques I might use - both in development and in production - to capture errors from our UDFs.
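One workaround I've been considering (just a sketch, not a confirmed pattern) is to stop relying on the exception propagating at all, and instead catch it inside the UDF and return the flattened details as data. That way the error text lands in the result DataFrame itself and can be inspected at the driver, without digging through worker stderr. This assumes the Microsoft.Spark package is referenced; DoRealWork, the column names, and the "OK:"/"ERR:" prefixes are all placeholders I made up:

```csharp
// Sketch: wrap the UDF body so any exception is returned as data instead of thrown.
// The error text then appears in the result DataFrame and can be filtered/shown at
// the driver, rather than being buried in a worker's stderr log.
using System;
using Microsoft.Spark.Sql;
using static Microsoft.Spark.Sql.Functions;

class UdfErrorCaptureExample
{
    static void Main()
    {
        SparkSession spark = SparkSession.Builder().GetOrCreate();
        DataFrame df = spark.Sql("SELECT id, CAST(id AS STRING) AS payload FROM range(10)");

        // The UDF returns either "OK:<value>" or "ERR:<exception details>".
        Func<Column, Column> safeUdf = Udf<string, string>(payload =>
        {
            try
            {
                return "OK:" + DoRealWork(payload); // placeholder for the real UDF logic
            }
            catch (Exception ex)
            {
                // ToString() includes the exception type, message, and stack trace.
                return "ERR:" + ex.ToString();
            }
        });

        DataFrame withResult = df.WithColumn("result", safeUdf(df["payload"]));

        // Surface any failures directly at the driver.
        withResult.Filter(withResult["result"].StartsWith("ERR:")).Show(20, 1000, false);
    }

    static string DoRealWork(string payload) => payload.ToUpperInvariant();
}
```

The obvious downside is that the "result" column has to carry both the success value and the error text, so it's more of a development/debugging aid than something I'd want baked into production schemas.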
I may be missing something obvious. Is there any way to get exception information to bubble out of a UDF to the driver?
The following is the most common message I get when failures are encountered in my UDFs.
... unfortunately this message is meaningless; to find the real underlying problem you have to open the browser, locate the worker, and inspect its stderr logs. That is a lot of effort when you need to do it all day long while writing UDF logic. It would be far better if there were some automatic way for the underlying exception to bubble out to the driver.
I set spark.task.maxFailures = 1, so typically there is only one error from each parallel worker.
Today the issue happens to be with my Microsoft.Data.SqlClient libraries. I found this in my stderr logs.
... but there is a wide assortment of exceptions that can come out of a UDF. Every time I bump into a new one I get the same incoherent message and have to start clicking around in the Spark console in the browser to investigate. After losing five minutes, I learn which exception was thrown and can get back to fixing the root cause and continuing with my work.
It would be nice to have a pattern we could follow to send exception details back to the driver, so that they can appear in the Visual Studio IDE. Am I missing something obvious, like perhaps needing to catch exceptions in the UDF and rethrow them in some way that is compatible/serializable?
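For example, something along these lines is what I have in mind. It's only a sketch: it catches inside the UDF and rethrows a plain System.Exception whose Message contains the full ex.ToString(), on the theory that even if the original exception type (e.g. a SqlException) can't be relayed, the flattened text might survive in whatever failure reason the worker reports. I don't know whether that text actually reaches the driver, and the wrapper/udfName bits are made-up names:

```csharp
// Sketch of a catch-and-rethrow wrapper: flatten the original exception (type,
// message, stack trace) into the Message of a plain System.Exception before
// rethrowing. Whether this text actually surfaces in the driver's error message
// depends on how the .NET worker relays task failures; treat it as an experiment.
using System;
using Microsoft.Spark.Sql;
using static Microsoft.Spark.Sql.Functions;

static class UdfRethrowExample
{
    public static Func<Column, Column> Wrap(Func<string, string> body, string udfName)
    {
        return Udf<string, string>(input =>
        {
            try
            {
                return body(input);
            }
            catch (Exception ex)
            {
                // A plain Exception with a string message avoids any provider-specific
                // state (e.g. from Microsoft.Data.SqlClient) that may not serialize.
                throw new Exception($"UDF '{udfName}' failed on input '{input}': {ex}");
            }
        });
    }
}
```

If the flattened message still doesn't surface at the driver, the fallback would be returning the error text as data from the UDF so it at least becomes queryable.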
Below is the full stack from the worker, I believe. This is a JVM stack and, interestingly, there is no mention of .NET for Apache Spark. I'm assuming this stack indicates that the .NET side has already failed, and the Java side is trying to deserialize/reserialize something back to the driver. Is that so? Do Python developers experience the same unfriendly message when their UDFs fail? Or maybe their exceptions/errors are always serializable, unlike with .NET?