
[Documentation] Instructions on how to take your application to production #345 (Closed)

docs/take-to-prod.md (108 additions, 0 deletions)
Taking your .NET for Apache Spark Application to Production
===

This how-to provides general instructions on how to take your .NET for Apache Spark application to production.
> **Member:** What does it mean to take an app to production? Perhaps add a couple of words or a sentence defining that (does it just mean running on-prem? Deploying to the cloud? Building and running spark-submit? CI/CD?)

> **Contributor Author:** Great point! @rapoth, could you please help elaborate on this a little more?

In this documentation, we summarize the most common scenarios you will encounter when running a .NET for Apache Spark application.
You will also learn how to package your application and submit it with [spark-submit](https://spark.apache.org/docs/latest/submitting-applications.html) and [Apache Livy](https://livy.incubator.apache.org/).

# Table of Contents
- [How to deploy your application when you have a single dependency](#how-to-deploy-your-application-when-you-have-a-single-dependency)
- [Scenarios](#scenarios---single-dependency)
- [Package your application](#package-your-application---single-dependency)
- [Launch your application](#launch-your-application---single-dependency)
- [How to deploy your application when you have multiple dependencies](#how-to-deploy-your-application-when-you-have-multiple-dependencies)
- [Scenarios](#scenarios---multiple-dependencies)
- [Package your application](#package-your-application---multiple-dependencies)
- [Launch your application](#launch-your-application---multiple-dependencies)

## How to deploy your application when you have a single dependency
### Scenarios - single dependency
> **Member:** What does single dependency mean? I think it could help users to include a short explanation, here or at the top of the document, of what a dependency means in the .NET for Spark context.

> **Contributor Author:** Actually, I am not so sure we should use "single dependency" and "multiple dependencies" to define and separate these scenarios. @rapoth and @imback82, any suggestions? Thanks.

#### Scenario 1. SparkSession code and business logic in the same Program.cs file
This is the simplest use case, where you have `SparkSession` code and business logic (UDFs) in the same Program.cs file, within a single project (e.g. mySparkApp.csproj). A minimal sketch is shown below.
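For illustration, here is a minimal sketch of what this might look like; the input file, column name, and UDF body are illustrative assumptions, not part of this doc, and the API calls follow the Microsoft.Spark package:
```csharp
// Program.cs (mySparkApp.csproj): SparkSession code and the UDF in one file.
using System;
using Microsoft.Spark.Sql;
using static Microsoft.Spark.Sql.Functions;

namespace MySparkApp
{
    class Program
    {
        static void Main(string[] args)
        {
            // Create (or reuse) the Spark session.
            SparkSession spark = SparkSession
                .Builder()
                .AppName("mySparkApp")
                .GetOrCreate();

            // Illustrative input; any DataFrame source works here.
            DataFrame df = spark.Read().Json("people.json");

            // Business logic (UDF) defined inline in the same file.
            Func<Column, Column> addGreeting =
                Udf<string, string>(name => $"Hello, {name}!");

            df.Select(addGreeting(df["name"])).Show();
            spark.Stop();
        }
    }
}
```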
#### Scenario 2. SparkSession code and business logic in the same project, but different .cs files
This would be the use case when you have `SparkSession` code and business logic (UDFs) in different .cs files within the same project (e.g. `SparkSession` code in Program.cs, business logic in BusinessLogic.cs, and both in mySparkApp.csproj), as sketched below.
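A hedged sketch of the split (file and member names are illustrative): the UDF body moves into BusinessLogic.cs, and Program.cs wraps it exactly as in Scenario 1.
```csharp
// BusinessLogic.cs (same project, mySparkApp.csproj).
namespace MySparkApp
{
    public static class BusinessLogic
    {
        // Plain C# method used as the UDF body.
        public static string AddGreeting(string name) => $"Hello, {name}!";
    }
}

// Program.cs (same project) wraps it as in Scenario 1:
//     Func<Column, Column> addGreeting =
//         Udf<string, string>(BusinessLogic.AddGreeting);
```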

### Package your application - single dependency
Please follow [Get Started](https://github.com/dotnet/spark/#get-started) to build your application for Scenario 1 and Scenario 2.

### Launch your application - single dependency
#### 1. Using spark-submit
Please see below an example of running your app with `spark-submit` for Scenario 1 and Scenario 2 (run from the Windows command prompt, where `^` continues a line).
```shell
%SPARK_HOME%\bin\spark-submit ^
--class org.apache.spark.deploy.dotnet.DotnetRunner ^
--master local ^
--files bin\Debug\netcoreapp3.0\mySparkApp.dll ^
bin\Debug\<dotnet version>\microsoft-spark-<spark_majorversion.spark_minorversion.x>-<spark_dotnet_version>.jar ^
dotnet bin\Debug\netcoreapp3.0\mySparkApp.dll <app arg 1> <app arg 2> ... <app arg n>
```
#### 2. Using Apache Livy
Please see below an example of a Livy request payload for running your app in Scenario 1 and Scenario 2; the payload is submitted to your cluster's Livy endpoint (e.g. via an HTTP POST to `/batches`).
```json
{
    "file": "adl://<cluster name>.azuredatalakestore.net/<some dir>/microsoft-spark-<spark_majorversion.spark_minorversion.x>-<spark_dotnet_version>.jar",
    "className": "org.apache.spark.deploy.dotnet.DotnetRunner",
    "files": ["adl://<cluster name>.azuredatalakestore.net/<some dir>/mySparkApp.dll"],
    "args": ["dotnet", "adl://<cluster name>.azuredatalakestore.net/<some dir>/mySparkApp.dll", "<app arg 1>", "<app arg 2>", "...", "<app arg n>"]
}
```

## How to deploy your application when you have multiple dependencies
### Scenarios - multiple dependencies
#### Scenario 3. SparkSession code in one project that references another project including the business logic
This would be the use case when you have `SparkSession` code in one project (e.g. mySparkApp.csproj) and business logic (UDFs) in another project (e.g. businessLogic.csproj).
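As an illustrative sketch (project layout and names are assumptions), the business logic compiles into its own assembly and mySparkApp takes a project reference to it:
```csharp
// BusinessLogic.cs in businessLogic.csproj (a separate project).
namespace BusinessLogic
{
    public static class Udfs
    {
        public static string AddGreeting(string name) => $"Hello, {name}!";
    }
}

// mySparkApp.csproj references the project, e.g.:
//     <ProjectReference Include="..\businessLogic\businessLogic.csproj" />
//
// Program.cs (mySparkApp.csproj) then wraps the shared method as a UDF:
//     Func<Column, Column> addGreeting =
//         Udf<string, string>(BusinessLogic.Udfs.AddGreeting);
```
Note that businessLogic.dll must also reach the workers, which is why the launch examples below pass it via `--files`.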
#### Scenario 4. SparkSession code references a function from a NuGet package that has been installed in the csproj
This would be the use case when `SparkSession` code references a function from a NuGet package in the same project (e.g. mySparkApp.csproj).
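To make this concrete, here is a hedged sketch; Newtonsoft.Json stands in for whatever NuGet package your app actually uses (the package choice is an assumption, not from this doc):
```csharp
// Program.cs (mySparkApp.csproj). Assumes the csproj contains, e.g.:
//     <PackageReference Include="Newtonsoft.Json" Version="12.0.3" />
using System;
using Microsoft.Spark.Sql;
using Newtonsoft.Json;                       // function from the NuGet package
using static Microsoft.Spark.Sql.Functions;

namespace MySparkApp
{
    class Program
    {
        static void Main(string[] args)
        {
            SparkSession spark = SparkSession.Builder().AppName("mySparkApp").GetOrCreate();
            DataFrame df = spark.Read().Json("people.json");

            // The UDF body calls into the NuGet package, so that package's DLL
            // must also be shipped to the workers (see the launch examples
            // below, which pass nugetLibrary.dll via --files / "files").
            Func<Column, Column> toJson = Udf<string, string>(
                name => JsonConvert.SerializeObject(new { greeting = "Hello", name }));

            df.Select(toJson(df["name"])).Show();
            spark.Stop();
        }
    }
}
```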
#### Scenario 5. SparkSession code references a function from a DLL on the user's machine
This would be the use case when `SparkSession` code references business logic (UDFs) from a pre-built DLL on the user's machine (e.g. `SparkSession` code in mySparkApp.csproj and businessLogic.dll from a different machine); a sketch follows the review comment below.
> **Member:** Why would businessLogic.dll be from a different machine?
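As a sketch of Scenario 5 (the path and names are illustrative assumptions), the pre-built DLL is referenced directly from the csproj rather than through a project or package reference:
```csharp
// mySparkApp.csproj references the pre-built assembly directly, e.g.:
//     <Reference Include="businessLogic">
//         <HintPath>..\libs\businessLogic.dll</HintPath>
//     </Reference>
//
// Program.cs then wraps the imported logic as a UDF, as in Scenario 3:
//     Func<Column, Column> addGreeting =
//         Udf<string, string>(BusinessLogic.Udfs.AddGreeting);
```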

#### Scenario 6. SparkSession code references functions and business logic from multiple projects/solutions that themselves depend on multiple NuGet packages
This would be a more complex use case, where `SparkSession` code references business logic (UDFs) and functions from NuGet packages across multiple projects and/or solutions.

### Package your application - multiple dependencies
- Please follow [Get Started](https://github.com/dotnet/spark/#get-started) to build your mySparkApp.csproj for Scenario 4 and Scenario 5 (and businessLogic.csproj for Scenario 3).
- Please see detailed steps [here](https://github.com/dotnet/spark/tree/master/deployment#preparing-your-spark-net-app) on how to build, publish, and zip your application for Scenario 6. After packaging your .NET for Apache Spark application, you will have a zip file (e.g. mySparkApp.zip) that contains all the dependencies.

### Launch your application - multiple dependencies
#### 1. Using spark-submit
- Please see below an example of running your app with `spark-submit` for Scenario 3 and Scenario 5 (run from the Windows command prompt, where `^` continues a line).
In Scenario 4, you should additionally use `--files bin\Debug\netcoreapp3.0\nugetLibrary.dll`.
```shell
%SPARK_HOME%\bin\spark-submit ^
--class org.apache.spark.deploy.dotnet.DotnetRunner ^
--master local ^
--files bin\Debug\netcoreapp3.0\businessLogic.dll ^
bin\Debug\<dotnet version>\microsoft-spark-<spark_majorversion.spark_minorversion.x>-<spark_dotnet_version>.jar ^
dotnet bin\Debug\netcoreapp3.0\mySparkApp.dll <app arg 1> <app arg 2> ... <app arg n>
```
- Please see below an example of running your app with `spark-submit` for Scenario 6. Note that both assembly search paths are supplied through a single `DOTNET_ASSEMBLY_SEARCH_PATHS` value (matching the Livy example below), since only one value of a given `--conf` key takes effect.
```shell
spark-submit \
--class org.apache.spark.deploy.dotnet.DotnetRunner \
--master yarn \
--deploy-mode cluster \
--conf spark.yarn.appMasterEnv.DOTNET_ASSEMBLY_SEARCH_PATHS=./udfs,./myLibraries.zip \
--archives hdfs://<path to your files>/businessLogics.zip#udfs,hdfs://<path to your files>/myLibraries.zip \
hdfs://<path to jar file>/microsoft-spark-<spark_majorversion.spark_minorversion.x>-<spark_dotnet_version>.jar \
hdfs://<path to your files>/mySparkApp.zip mySparkApp <app arg 1> <app arg 2> ... <app arg n>
```
#### 2. Using Apache Livy
- Please see below an example of a Livy request payload for running your app in Scenario 3 and Scenario 5.
In Scenario 4, you should additionally use `"files": ["adl://<cluster name>.azuredatalakestore.net/<some dir>/nugetLibrary.dll"]`.
```json
{
    "file": "adl://<cluster name>.azuredatalakestore.net/<some dir>/microsoft-spark-<spark_majorversion.spark_minorversion.x>-<spark_dotnet_version>.jar",
    "className": "org.apache.spark.deploy.dotnet.DotnetRunner",
    "files": ["adl://<cluster name>.azuredatalakestore.net/<some dir>/businessLogic.dll"],
    "args": ["dotnet", "adl://<cluster name>.azuredatalakestore.net/<some dir>/mySparkApp.dll", "<app arg 1>", "<app arg 2>", "...", "<app arg n>"]
}
```
> **Member** (on lines +91 to +98): Should just provide the zip example.

> **Contributor Author:** Thanks for your comments! I have resolved all of them in PR #349 (I opened a new PR, #349, because I could not edit this one; I will close this one soon). Let's discuss and review in the new PR. Thanks for your understanding, and sorry for the inconvenience.

- Please see below an example of a Livy request payload for running your app in Scenario 6.
```json
{
    "file": "adl://<cluster name>.azuredatalakestore.net/<some dir>/microsoft-spark-<spark_majorversion.spark_minorversion.x>-<spark_dotnet_version>.jar",
    "className": "org.apache.spark.deploy.dotnet.DotnetRunner",
    "conf": {"spark.yarn.appMasterEnv.DOTNET_ASSEMBLY_SEARCH_PATHS": "./udfs,./myLibraries.zip"},
    "archives": ["adl://<cluster name>.azuredatalakestore.net/<some dir>/businessLogics.zip#udfs", "adl://<cluster name>.azuredatalakestore.net/<some dir>/myLibraries.zip"],
    "args": ["adl://<cluster name>.azuredatalakestore.net/<some dir>/mySparkApp.zip", "mySparkApp", "<app arg 1>", "<app arg 2>", "...", "<app arg n>"]
}
```