When using Spark ML and other libraries like RasterFrames, it is now common to attach metadata to the dataframe schema via StructField. In kernel_init.py#move_to_local_sqlContext (seahorse/remote_notebook/code/pyspark_kernel/kernel_init.py, line 76 at b775f5d), the schema is completely thrown out.
This not only loses any metadata attached to the schema; it also forces a Spark "action" to be invoked (e.g. SparkContext.runJob) in order to grab a sample line and infer the schema from its contents, creating unnecessary and potentially undesirable computation before the user demands it.
While I don't fully understand the purpose of this method, my guess would be that it needs to be changed to something like the following. But without knowing how to test it, I'm not comfortable submitting a PR.