Commit d8d4290

licensed_install emr part is updated (#1130)
1 parent 8d9daac commit d8d4290

File tree

7 files changed: +94 -35 lines changed

docs/en/image-1.png (95.9 KB)
docs/en/image-2.png (83.8 KB)
docs/en/image-3.png (62.1 KB)
docs/en/image-4.png (30 KB)
docs/en/image-5.png (30.7 KB)
docs/en/image.png (90.9 KB)

docs/en/licensed_install.md

Lines changed: 94 additions & 35 deletions
@@ -843,46 +843,105 @@ In this page we explain how to setup Spark-NLP + Spark-NLP Healthcare in AWS EMR
 </div><div class="h3-box" markdown="1">
 
 ### Steps
-1. You must go to the blue button "Create Cluster" on the UI. By doing that you will get directed to the "Create Cluster - Quick Options" page. Don't use the quick options, click on "Go to advanced options" instead.
-2. Now in Advanced Options, on Step 1, "Software and Steps", please pick the following selection in the checkboxes,
-![software config](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/platforms/emr/software_configs.png?raw=true)
-Also in the "Edit Software Settings" page, enter the following,
 
-```
-[{
-  "Classification": "spark-env",
-  "Configurations": [{
-    "Classification": "export",
-    "Properties": {
-      "PYSPARK_python": "/usr/bin/python3",
-      "AWS_ACCESS_KEY_ID": "XYXYXYXYXYXYXYXYXYXY",
-      "AWS_SECRET_ACCESS_KEY": "XYXYXYXYXYXYXYXYXYXY",
-      "SPARK_NLP_LICENSE": "XYXYXYXYXYXYXYXYXYXYXYXYXYXY"
-    }
-  }]
-},
-{
-  "Classification": "spark-defaults",
+1. Go to AWS services and select EMR.
+
+2. Press Create Cluster and start configuring (a scripted equivalent of these settings is sketched after this list):
+- Name your cluster
+- Select the EMR version
+- Select the required applications
+
+![cluster name, EMR release and applications](image.png)
+
+- Specify the EC2 instances for the cluster, as primary/master node and cores/workers
+- Specify the storage/EBS volume
+
+![EC2 instances and EBS storage](image-1.png)
+
+- Choose cluster scaling and provisioning
+- Choose networking/VPC
+
+![cluster scaling and networking](image-2.png)
+
+- Choose security groups/firewall for the primary/master node and the cores/workers/slaves
+
+![security groups](image-3.png)
+
+- If you add steps, they will be executed after the cluster is provisioned
+- Specify the S3 location for logs
+- Under the **Tags** section, add a `KEY: VALUE` pair with `for-use-with-amazon-emr-managed-policies` set to `true`
+
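The console choices listed above can also be scripted. The block below is a minimal, illustrative boto3 `run_job_flow` sketch, not the official setup: the cluster name, release label, instance types, subnet ID, bucket, and volume size are placeholder assumptions, and the bootstrap action and Spark configuration described in the next steps would be supplied through the `BootstrapActions` and `Configurations` parameters of the same call.

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")

# Minimal sketch of the console steps above; all names, types and sizes are placeholders.
response = emr.run_job_flow(
    Name="spark-nlp-healthcare-cluster",           # "Name your cluster"
    ReleaseLabel="emr-6.15.0",                     # "Select the EMR version" (example label)
    Applications=[{"Name": "Spark"}, {"Name": "Hadoop"}, {"Name": "JupyterEnterpriseGateway"}],
    LogUri="s3://my-bucket/emr-logs/",             # "Specify the S3 location for logs"
    Instances={
        "InstanceGroups": [
            {   # primary/master node
                "Name": "Primary",
                "InstanceRole": "MASTER",
                "InstanceType": "m5.xlarge",
                "InstanceCount": 1,
                "EbsConfiguration": {              # "Specify the storage/EBS volume"
                    "EbsBlockDeviceConfigs": [
                        {"VolumeSpecification": {"VolumeType": "gp2", "SizeInGB": 100},
                         "VolumesPerInstance": 1}
                    ]
                },
            },
            {   # cores/workers
                "Name": "Core",
                "InstanceRole": "CORE",
                "InstanceType": "m5.xlarge",
                "InstanceCount": 2,
            },
        ],
        "Ec2SubnetId": "subnet-0123456789abcdef0",  # "Choose networking/VPC"
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
    Tags=[{"Key": "for-use-with-amazon-emr-managed-policies", "Value": "true"}],
    # BootstrapActions=[...] and Configurations=[...] come from the next two steps.
)
print(response["JobFlowId"])
```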
+**Important**
+- Specify the Bootstrap Action
+
+[jsl_emr_bootstrap.sh](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/platforms/emr/jsl_emr_bootstrap.sh)
+
+Put this sample shell script in an S3 location and specify that location in the form:
+The bootstrap action installs spark-nlp, spark-nlp-jsl, and spark-ocr; this file is executed during cluster provisioning. The library versions and other credentials provided by John Snow Labs go in this file.
+
+![add bootstrap action](image-5.png)
+
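The bootstrap script has to live in S3 before the cluster is created. Below is a minimal boto3 upload sketch, assuming an illustrative bucket name and key.

```python
import boto3

s3 = boto3.client("s3")

# Upload the John Snow Labs bootstrap script to your own bucket (names are placeholders).
s3.upload_file(
    Filename="jsl_emr_bootstrap.sh",
    Bucket="my-bucket",
    Key="bootstrap/jsl_emr_bootstrap.sh",
)

# The resulting s3://my-bucket/bootstrap/jsl_emr_bootstrap.sh path is what you enter in the
# Bootstrap Actions form, or pass programmatically as
# BootstrapActions=[{"Name": "jsl", "ScriptBootstrapAction":
#                    {"Path": "s3://my-bucket/bootstrap/jsl_emr_bootstrap.sh"}}]
# in the run_job_flow sketch above.
```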
+**Important**
+- Specify the Configuration for Spark:
+Here is a sample configuration; you can copy/paste it into the Software settings tab or load it from S3.
+You can change the Spark configuration according to your needs.
+
+```
+[
+  {
+    "Classification": "spark-env",
+    "Configurations": [
+      {
+        "Classification": "export",
+        "Properties": {
+          "JSL_EMR": "1",
+          "PYSPARK_PYTHON": "/usr/bin/python3",
+          "SPARK_NLP_LICENSE": "XYXYXYXYXY"
+        }
+      }
+    ],
+    "Properties": {}
+  },
+  {
+    "Classification": "yarn-env",
+    "Configurations": [
+      {
+        "Classification": "export",
+        "Properties": {
+          "JSL_EMR": "1",
+          "SPARK_NLP_LICENSE": "XYXYXYXYXY"
+        }
+      }
+    ],
+    "Properties": {}
+  },
+  {
+    "Classification": "spark-defaults",
     "Properties": {
-      "spark.yarn.stagingDir": "hdfs:///tmp",
-      "spark.yarn.preserve.staging.files": "true",
+      "spark.driver.maxResultSize": "0",
+      "spark.driver.memory": "64G",
+      "spark.dynamicAllocation.enabled": "true",
+      "spark.executor.memory": "64G",
+      "spark.executorEnv.SPARK_NLP_LICENSE": "XYXYXYXYXY",
+      "spark.jsl.settings.aws.credentials.access_key_id": "XYXYXYXYXY",
+      "spark.jsl.settings.aws.credentials.secret_access_key": "XYXYXYXYXY",
+      "spark.jsl.settings.aws.region": "us-east-1",
+      "spark.jsl.settings.pretrained.credentials.access_key_id": "XYXYXYXYXY",
+      "spark.jsl.settings.pretrained.credentials.secret_access_key": "XYXYXYXYXY",
       "spark.kryoserializer.buffer.max": "2000M",
+      "spark.rpc.message.maxSize": "1024",
       "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
-      "spark.driver.maxResultSize": "0",
-      "spark.driver.memory": "32G"
+      "spark.yarn.appMasterEnv.SPARK_NLP_LICENSE": "XYXYXYXYXY",
+      "spark.yarn.preserve.staging.files": "true",
+      "spark.yarn.stagingDir": "hdfs:///tmp"
     }
-}]
-```
-Make sure that you replace all the secret information(marked here as XYXYXYXYXY) by the appropriate values that you received with your license.<br/>
-3. In "Step 2" choose the hardware and networking configuration you prefer, or just pick the defaults. Move to next step by clocking the "Next" blue button.<br/>
-4. Now you are in "Step 3", in which you assign a name to your cluster, and you can change the location of the cluster logs. If the location of the logs is OK for you, take note of the path so you can debug potential problems by using the logs.<br/>
-5. Still on "Step 3", go to the bottom of the page, and expand the "Bootstrap Actions" tab. We're gonna add an action to execute during bootstrap of the cluster. Select "Custom Action", then press on "Configure and add".<br/>
-You need to provide a path to a script on S3. The path needs to be public. Keep this in mind, no secret information can be contained there.<br/>
-The script we'll used for this setup is [emr_bootstrap.sh](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/platforms/emr/emr_bootstrap.sh) .
-<br/>
-This script will install Spark-NLP 3.1.0, and Spark-NLP Healthcare 3.1.1. You'll have to edit the script if you need different versions.<br/>
-After you entered the route to S3 in which you place the `emr_bootstrap.sh` file, and before clicking "add" in the dialog box, you must pass an additional parameter containing the SECRET value you received with your license. Just paste the secret on the "Optional arguments" field in that dialog box.<br/>
-6. There's not much additional setup you need to perform. So just start a notebook server, connect it to the cluster you just created(be patient, it takes a while), and test with the [NLP_EMR_Setup.ipynb](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/platforms/emr/NLP_EMR_Setup.ipynb) test notebook.<br/>
+  }
+]
+```
+
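As the earlier version of this page noted, every `XYXYXYXYXY` placeholder must be replaced with the values you received with your license. The sketch below is an optional, illustrative check, assuming the configuration above is saved locally as `emr_configurations.json` (a hypothetical file name) and that you created the cluster programmatically: it scans for leftover placeholders and then waits for the cluster to come up before you attach a notebook.

```python
import json

import boto3

# Load the Spark configuration shown above (the file name is an assumption).
with open("emr_configurations.json") as f:
    configurations = json.load(f)

# Fail early if any XYXYXYXYXY placeholder is still in place of real license values.
if "XYXYXYXYXY" in json.dumps(configurations):
    raise ValueError("Replace the XYXYXYXYXY placeholders with your John Snow Labs credentials.")

# `configurations` is what you would pass as Configurations=... in the run_job_flow sketch above.
# Then wait until the cluster is ready before attaching a notebook (it takes a while).
emr = boto3.client("emr", region_name="us-east-1")
cluster_id = "j-XXXXXXXXXXXXX"  # placeholder: the JobFlowId returned by run_job_flow
emr.get_waiter("cluster_running").wait(ClusterId=cluster_id)
```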
+- There is not much additional setup to perform. Start a notebook server, connect it to the cluster you just created (be patient, it takes a while), and test with the [jsl_test_notebook_for_emr.ipynb](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/platforms/emr/NLP_EMR_Setup.ipynb) test notebook.<br/>
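Once a notebook is attached to the running cluster, a quick sanity check along the following lines (assuming the bootstrap action completed and the notebook provides the `spark` session) confirms that the libraries installed by the bootstrap script are importable; the test notebook linked above covers a fuller workflow.

```python
# Run inside a notebook attached to the EMR cluster.
import sparknlp
import sparknlp_jsl

print("Spark NLP version:", sparknlp.version())
print("Spark NLP for Healthcare version:", sparknlp_jsl.version())
print("Spark version:", spark.version)  # `spark` is the session the EMR notebook provides
```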
 
 </div><div class="h3-box" markdown="1">
 