
Commit 09804f0

More refinements
1 parent a83fbff commit 09804f0

4 files changed: +50 -4 lines changed


.github/workflows/docker-publish.yml

Lines changed: 1 addition & 1 deletion

```diff
@@ -1,4 +1,4 @@
-name: Publish Docker image
+name: Publish Docker Image
 
 on:
   release:
```

Dockerfile

Lines changed: 3 additions & 3 deletions

```diff
@@ -4,18 +4,18 @@ ENV SPARK_VERSION=3.1.2
 ENV HADOOP_VERSION=2.7
 ENV JAVA_HOME /usr/lib/jvm/java-8-openjdk-amd64/
 
+WORKDIR /app
+
 COPY requirements.txt .
 RUN apt-get update \
     && apt-get install -y python3 python3-pip wget software-properties-common openjdk-8-jdk \
     && export JAVA_HOME \
     && pip3 install --upgrade pip \
-    && pip3 install -r requirements.txt \
+    && pip3 install --no-cache-dir -r requirements.txt \
     && wget --no-verbose http://apache.mirror.iphh.net/spark/spark-${SPARK_VERSION}/spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz \
     && tar -xvzf spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz \
     && mv spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION} spark \
     && rm spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz \
     && apt-get remove -y curl bzip2 \
     && apt-get autoremove -y \
     && apt-get clean
-
-
 ENTRYPOINT ["spark-submit"]
```

README.md

Lines changed: 27 additions & 0 deletions

```diff
@@ -2,5 +2,32 @@
 A single node PySpark3 docker container based on OpenJDK.
 Using Python 3, PySpark 3.0.3 with Spark 3.1.2 and Hadoop 2.7.
 
+The image is set up to support Python 3 development with Spark for
+testing, local development and pipelines.
+
 [![](https://img.shields.io/docker/image-size/dirkscgm/pyspark3/latest)](https://hub.docker.com/r/dirkscgm/pyspark3)
 [![](https://img.shields.io/docker/v/dirkscgm/pyspark3?sort=semver)](https://hub.docker.com/r/dirkscgm/pyspark3)
+[![Publish to Docker](https://github.com/DirksCGM/pyspark3-docker/actions/workflows/docker-publish.yml/badge.svg)](https://github.com/DirksCGM/pyspark3-docker/actions/workflows/docker-publish.yml)
+
+
+The image includes AWS tools for Python:
+- AWS CLI: https://pypi.org/project/awscli/
+- Boto3: https://pypi.org/project/boto3/
+
+## Running the Docker Image for AWS Development
+
+The docker image can use local AWS configuration and credentials at run time for specific Python scripts:
+```shell
+docker run --rm=true -v ~/.aws:/root/.aws <etc...>
+```
+
+This image can be extended to run any PySpark .py script using python3 or spark-submit.
+
+```docker
+FROM dirkscgm/pyspark3:latest
+
+WORKDIR /app
+COPY scripts/* scripts/
+
+ENTRYPOINT ["python3", "scripts/main.py"]
+```
```
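For reference, building and running such an extension image might look like the following. This is a minimal sketch; the tag `my-pyspark-job` is a hypothetical name for illustration, not something defined by the commit.

```shell
# Build an extension image from a Dockerfile like the one above
# ("my-pyspark-job" is a hypothetical tag for illustration).
docker build -t my-pyspark-job .

# Run it with local AWS credentials mounted, as described above.
docker run --rm -v ~/.aws:/root/.aws my-pyspark-job
```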

scripts/main.py

Lines changed: 19 additions & 0 deletions

```diff
@@ -0,0 +1,19 @@
+from pyspark.context import SparkContext
+from pyspark.sql import SparkSession, DataFrame
+
+
+def main():
+    """
+    Short and simple PySpark3 script to test the docker image.
+    """
+    sc: SparkContext = SparkContext.getOrCreate()
+    spark: SparkSession = SparkSession(sc)
+
+    mock_data_values: list = [(4454, "Alex", 28), (8776, "Lee", 29)]
+    mock_data_columns: list = ["id", "name", "age"]
+    data_frame: DataFrame = spark.createDataFrame(mock_data_values, mock_data_columns)
+    data_frame.show()
+
+
+if __name__ == "__main__":
+    main()
```
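As a quick sanity check, the shape of the mock data used in scripts/main.py can be verified with plain Python before involving a Spark runtime. This is a hypothetical stand-alone sketch, not part of the commit; it only mirrors the literals from the script.

```python
# Mirror of the mock data from scripts/main.py, checked without pyspark.
mock_data_values = [(4454, "Alex", 28), (8776, "Lee", 29)]
mock_data_columns = ["id", "name", "age"]

# spark.createDataFrame(values, columns) requires each row to match the
# declared column count; the same invariant is checked here in plain Python.
rows = [dict(zip(mock_data_columns, row)) for row in mock_data_values]
for row in rows:
    assert len(row) == len(mock_data_columns)

print(rows[0]["name"])  # prints "Alex"
```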
