
Commit 09804f0

More refinements
1 parent a83fbff commit 09804f0

4 files changed: +50 -4 lines changed


.github/workflows/docker-publish.yml

Lines changed: 1 addition & 1 deletion

```diff
@@ -1,4 +1,4 @@
-name: Publish Docker image
+name: Publish Docker Image
 
 on:
   release:
```

Dockerfile

Lines changed: 3 additions & 3 deletions

```diff
@@ -4,18 +4,18 @@ ENV SPARK_VERSION=3.1.2
 ENV HADOOP_VERSION=2.7
 ENV JAVA_HOME /usr/lib/jvm/java-8-openjdk-amd64/
 
+WORKDIR /app
+
 COPY requirements.txt .
 RUN apt-get update \
     && apt-get install -y python3 python3-pip wget software-properties-common openjdk-8-jdk \
     && export JAVA_HOME \
     && pip3 install --upgrade pip \
-    && pip3 install -r requirements.txt \
+    && pip3 install --no-cache-dir -r requirements.txt \
     && wget --no-verbose http://apache.mirror.iphh.net/spark/spark-${SPARK_VERSION}/spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz \
     && tar -xvzf spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz \
     && mv spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION} spark \
     && rm spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz \
     && apt-get remove -y curl bzip2 \
     && apt-get autoremove -y \
     && apt-get clean
-
-
 ENTRYPOINT ["spark-submit"]
```

README.md

Lines changed: 27 additions & 0 deletions

```diff
@@ -2,5 +2,32 @@
 A single node PySpark3 docker container based on OpenJDK.
 Using Python 3, PySpark 3.0.3 with Spark 3.1.2 and Hadoop 2.7.
 
+The image is set up to support Python 3 development with Spark for
+testing, local development and pipelines.
+
 [![](https://img.shields.io/docker/image-size/dirkscgm/pyspark3/latest)](https://hub.docker.com/r/dirkscgm/pyspark3)
 [![](https://img.shields.io/docker/v/dirkscgm/pyspark3?sort=semver)](https://hub.docker.com/r/dirkscgm/pyspark3)
+[![Publish to Docker](https://github.com/DirksCGM/pyspark3-docker/actions/workflows/docker-publish.yml/badge.svg)](https://github.com/DirksCGM/pyspark3-docker/actions/workflows/docker-publish.yml)
+
+
+The image includes AWS tools for Python:
+- AWS CLI: https://pypi.org/project/awscli/
+- Boto3: https://pypi.org/project/boto3/
+
+## Running the Docker Image for AWS Development
+
+The docker image can use local AWS configuration and credentials at run time for specific Python scripts:
+```shell
+docker run --rm=true -v ~/.aws:/root/.aws <etc...>
+```
+
+This image can be extended to run any PySpark .py script using python3 or spark-submit.
+
+```docker
+FROM dirkscgm/pyspark3:latest
+
+WORKDIR /app
+COPY scripts/* scripts/
+
+ENTRYPOINT ["python3", "scripts/main.py"]
+```
```
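For reference, building and running such an extension image might look like the following. This is a minimal sketch; the tag `my-pyspark-job` is a hypothetical name for illustration, not something defined by the commit.

```shell
# Build an extension image from a Dockerfile like the one above
# ("my-pyspark-job" is a hypothetical tag for illustration).
docker build -t my-pyspark-job .

# Run it with local AWS credentials mounted, as described above.
docker run --rm -v ~/.aws:/root/.aws my-pyspark-job
```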

scripts/main.py

Lines changed: 19 additions & 0 deletions

```diff
@@ -0,0 +1,19 @@
+from pyspark.context import SparkContext
+from pyspark.sql import SparkSession, DataFrame
+
+
+def main():
+    """
+    Short and simple PySpark3 script to test the docker image.
+    """
+    sc: SparkContext = SparkContext.getOrCreate()
+    spark: SparkSession = SparkSession(sc)
+
+    mock_data_values: list = [(4454, "Alex", 28), (8776, "Lee", 29)]
+    mock_data_columns: list = ["id", "name", "age"]
+    data_frame: DataFrame = spark.createDataFrame(mock_data_values, mock_data_columns)
+    data_frame.show()
+
+
+if __name__ == "__main__":
+    main()
```
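As a quick sanity check, the shape of the mock data used in scripts/main.py can be verified with plain Python before involving a Spark runtime. This is a hypothetical stand-alone sketch, not part of the commit; it only mirrors the literals from the script.

```python
# Mirror of the mock data from scripts/main.py, checked without pyspark.
mock_data_values = [(4454, "Alex", 28), (8776, "Lee", 29)]
mock_data_columns = ["id", "name", "age"]

# spark.createDataFrame(values, columns) requires each row to match the
# declared column count; the same invariant is checked here in plain Python.
rows = [dict(zip(mock_data_columns, row)) for row in mock_data_values]
for row in rows:
    assert len(row) == len(mock_data_columns)

print(rows[0]["name"])  # prints "Alex"
```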
