Databricks tests #83

Open · wants to merge 113 commits into base: 3.14.0-release-candidate
Commits
113 commits
cdb1b09
Added jenkinsfile
mykolamelnykml May 23, 2022
f063637
Updated jenkinsfile
mykolamelnykml May 23, 2022
8f0b3ee
Updated jenkinsfile
mykolamelnykml May 23, 2022
918de40
Updated jenkinsfile
mykolamelnykml May 23, 2022
e4dd7e0
Updated jenkinsfile
mykolamelnykml May 23, 2022
09a7171
Updated jenkinsfile
mykolamelnykml May 23, 2022
b3d93a1
Updated jenkinsfile
mykolamelnykml May 24, 2022
917788c
Updated jenkinsfile
mykolamelnykml May 24, 2022
9998301
Updated jenkinsfile
mykolamelnykml May 24, 2022
3535cdb
Updated jenkinsfile
mykolamelnykml May 24, 2022
9953f9a
Updated jenkinsfile
mykolamelnykml May 24, 2022
9e15fba
Updated jenkinsfile
mykolamelnykml May 24, 2022
fd5cb03
Updated jenkinsfile
mykolamelnykml May 24, 2022
3c970d2
Updated jenkinsfile
mykolamelnykml May 24, 2022
2e101c6
Updated jenkinsfile
mykolamelnykml May 24, 2022
8793722
Updated jenkinsfile
mykolamelnykml May 24, 2022
8a0932d
Updated jenkinsfile
mykolamelnykml May 24, 2022
792c604
Updated jenkinsfile
mykolamelnykml May 24, 2022
971ea63
Updated jenkinsfile
mykolamelnykml May 24, 2022
7675238
Updated jenkinsfile
mykolamelnykml May 24, 2022
8e407bb
Updated jenkinsfile
mykolamelnykml May 24, 2022
6c70af8
Updated jenkinsfile
mykolamelnykml May 25, 2022
e9bfb60
Updated jenkinsfile
mykolamelnykml May 25, 2022
6e322a6
Updated jenkinsfile
mykolamelnykml May 25, 2022
0ab14dc
Updated jenkinsfile
mykolamelnykml May 25, 2022
5a91bb2
Updated jenkinsfile
mykolamelnykml May 25, 2022
09fd219
Updated Jenkinsfile
mykolamelnykml May 26, 2022
c1e440a
Updated Jenkinsfile
mykolamelnykml May 26, 2022
b3bc04c
Updated Jenkinsfile
mykolamelnykml May 26, 2022
4cf1321
Updated Jenkinsfile
mykolamelnykml May 26, 2022
494c5c4
Updated Jenkinsfile
mykolamelnykml May 26, 2022
724a17a
Updated Jenkinsfile
mykolamelnykml May 26, 2022
57ac9ff
Updated Jenkinsfile
mykolamelnykml May 26, 2022
219c7cc
Updated Jenkinsfile
mykolamelnykml May 26, 2022
3c6203e
Updated Jenkinsfile
mykolamelnykml May 26, 2022
541e194
Updated Jenkinsfile
mykolamelnykml May 26, 2022
4670a6f
Updated Jenkinsfile
mykolamelnykml May 26, 2022
f33194d
Updated Jenkinsfile
mykolamelnykml May 26, 2022
abdd369
Updated Jenkinsfile
mykolamelnykml May 26, 2022
c375813
Updated Jenkinsfile
mykolamelnykml May 26, 2022
8a86470
Updated Jenkinsfile
mykolamelnykml May 27, 2022
6ae741c
Updated Jenkinsfile
mykolamelnykml May 27, 2022
13a108f
Updated Jenkinsfile
mykolamelnykml May 27, 2022
b81c853
Updated Jenkinsfile
mykolamelnykml May 27, 2022
482e8c6
Updated Jenkinsfile
mykolamelnykml May 27, 2022
bffb2fc
Updated Jenkinsfile
mykolamelnykml May 27, 2022
b94cfeb
Updated Jenkinsfile
mykolamelnykml May 27, 2022
003953a
Updated Jenkinsfile
mykolamelnykml May 27, 2022
eb79641
Updated Jenkinsfile
mykolamelnykml May 27, 2022
dae45d7
Updated Jenkinsfile
mykolamelnykml May 28, 2022
5e3dacc
Updated Jenkinsfile
mykolamelnykml May 28, 2022
3fd5ff5
Updated Jenkinsfile
mykolamelnykml May 28, 2022
d3b2667
Updated jenkinsfile
mykolamelnykml May 31, 2022
05e8700
Updated jenkinsfile
mykolamelnykml May 31, 2022
2ccd7c0
Updated jenkinsfile
mykolamelnykml May 31, 2022
772cbcf
Updated jenkinsfile
mykolamelnykml May 31, 2022
06e069a
Updated jenkinsfile
mykolamelnykml May 31, 2022
44f8eba
Updated jenkinsfile
mykolamelnykml May 31, 2022
7e03afd
Updated jenkinsfile
mykolamelnykml May 31, 2022
bc97e15
Updated jenkinsfile
mykolamelnykml May 31, 2022
da861ef
Updated jenkinsfile
mykolamelnykml May 31, 2022
0b213a3
Updated jenkinsfile
mykolamelnykml May 31, 2022
c0c3059
Updated jenkinsfile
mykolamelnykml May 31, 2022
6d4258d
Updated jenkinsfile
mykolamelnykml May 31, 2022
ea41871
Updated jenkinsfile
mykolamelnykml Jun 1, 2022
cd957ff
Updated jenkinsfile
mykolamelnykml Jun 1, 2022
21c5815
Updated jenkinsfile
mykolamelnykml Jun 1, 2022
c53b92b
Updated jenkinsfile
mykolamelnykml Jun 1, 2022
4d62a3f
Updated jenkinsfile
mykolamelnykml Jun 1, 2022
b1b376b
Updated jenkinsfile
mykolamelnykml Jun 1, 2022
dcc1719
Updated jenkinsfile
mykolamelnykml Jun 1, 2022
b534247
Updated jenkinsfile
mykolamelnykml Jun 1, 2022
d88234e
Updated jenkinsfile
mykolamelnykml Jun 1, 2022
f4db299
Updated jenkinsfile
mykolamelnykml Jun 1, 2022
8f52b6e
Updated jenkinsfile
mykolamelnykml Jun 1, 2022
2066289
Updated jenkinsfile
mykolamelnykml Jun 1, 2022
1900ad2
Updated jenkinsfile
mykolamelnykml Jun 1, 2022
f99d57e
Updated jenkinsfile
mykolamelnykml Jun 1, 2022
3735bce
Updated jenkinsfile
mykolamelnykml Jun 1, 2022
cb93a6e
Updated jenkinsfile
mykolamelnykml Jun 1, 2022
b472d9d
Updated jenkinsfile
mykolamelnykml Jun 1, 2022
776a83a
Updated jenkinsfile
mykolamelnykml Jun 1, 2022
2a1e5b4
Updated jenkinsfile
mykolamelnykml Jun 1, 2022
4eac3ac
Updated jenkinsfile
mykolamelnykml Jun 1, 2022
576922f
Updated jenkinsfile
mykolamelnykml Jun 1, 2022
efde4e3
Updated jenkinsfile
mykolamelnykml Jun 1, 2022
8cd879a
Updated jenkinsfile
mykolamelnykml Jun 2, 2022
c5f802a
Updated jenkinsfile
mykolamelnykml Jun 2, 2022
ecd1dab
Updated jenkinsfile
mykolamelnykml Jun 2, 2022
bc18891
Updated jenkinsfile
mykolamelnykml Jun 2, 2022
19f80e2
Updated jenkinsfile
mykolamelnykml Jun 2, 2022
4975e8d
Updated jenkinsfile
mykolamelnykml Jun 2, 2022
117b151
Updated jenkinsfile
mykolamelnykml Jun 2, 2022
29d1bf0
Updated jenkinsfile
mykolamelnykml Jun 2, 2022
c066e76
Updated jenkinsfile
mykolamelnykml Jun 2, 2022
6c1491e
Updated jenkinsfile
mykolamelnykml Jun 2, 2022
b31f407
Updated jenkinsfile
mykolamelnykml Jun 2, 2022
6b258e5
Updated jenkinsfile
mykolamelnykml Jun 2, 2022
c1b2fb4
Updated jenkinsfile
mykolamelnykml Jun 2, 2022
a72afce
Updated jenkinsfile
mykolamelnykml Jun 2, 2022
3f54d87
Updated jenkinsfile
mykolamelnykml Jun 2, 2022
61a3114
Updated jenkinsfile
mykolamelnykml Jun 2, 2022
348f021
Updated jenkinsfile
mykolamelnykml Jun 2, 2022
0136be3
Updated jenkinsfile
mykolamelnykml Jun 2, 2022
b1dd392
Updated jenkinsfile
mykolamelnykml Jun 2, 2022
6f2e931
Updated jenkinsfile
mykolamelnykml Jun 3, 2022
e55cca0
Updated jenkinsfile
mykolamelnykml Jun 3, 2022
4e0bb93
Updated jenkinsfile
mykolamelnykml Jun 3, 2022
7b9b0b1
Updated jenkinsfile
mykolamelnykml Jun 3, 2022
b68cee2
Updated jenkinsfile
mykolamelnykml Jun 3, 2022
ee9a3a8
Updated jenkinsfile
mykolamelnykml Jun 3, 2022
0a9004b
Updated jenkinsfile
mykolamelnykml Jun 3, 2022
7fa5bd1
Updated jenkinsfile
mykolamelnykml Jun 3, 2022
6 changes: 6 additions & 0 deletions .ci/Dockerfile.build
@@ -0,0 +1,6 @@
# Build set_ambient
FROM python:3.7-alpine

ENV LC_ALL=C

RUN pip install databricks-cli requests pytest
50 changes: 50 additions & 0 deletions .ci/evaluatenotebookruns.py
@@ -0,0 +1,50 @@
# evaluatenotebookruns.py
import unittest
import json
import glob
import os
import logging


class TestJobOutput(unittest.TestCase):

    # Placeholder rewritten by the pipeline (see the sed call in the Jenkinsfile)
    test_output_path = '#ENV#'

    # def test_performance(self):
    #     path = self.test_output_path
    #     statuses = []
    #
    #     for filename in glob.glob(os.path.join(path, '*.json')):
    #         print('Evaluating: ' + filename)
    #         with open(filename) as f:
    #             data = json.load(f)
    #         duration = data['execution_duration']
    #         if duration > 100000:
    #             status = 'FAILED'
    #         else:
    #             status = 'SUCCESS'
    #
    #         statuses.append(status)
    #
    #     self.assertFalse('FAILED' in statuses)

    def test_job_run(self):
        path = self.test_output_path
        statuses = []

        for filename in glob.glob(os.path.join(path, '*.json')):
            logging.info('Evaluating: %s', filename)
            print('Evaluating: ' + filename)
            with open(filename) as f:
                data = json.load(f)
            print(data)
            if data['state']['life_cycle_state'] == 'RUNNING':
                statuses.append('NOT_COMPLETED')
            else:
                statuses.append(data['state']['result_state'])

        self.assertNotIn('FAILED', statuses)
        self.assertNotIn('NOT_COMPLETED', statuses)


if __name__ == '__main__':
    unittest.main()
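For context, a minimal sketch of the run-status JSON this test consumes and how a run gets classified. The field names match the Databricks Runs API payloads that executenotebook.py writes out; the `classify` helper below is hypothetical, added only to illustrate the branching in `test_job_run`:

```python
import json


def classify(run: dict) -> str:
    # Mirrors test_job_run: a run still in RUNNING state counts as
    # NOT_COMPLETED; otherwise the terminal result_state is used.
    if run['state']['life_cycle_state'] == 'RUNNING':
        return 'NOT_COMPLETED'
    return run['state']['result_state']


finished = json.loads(
    '{"run_id": 42, "state": {"life_cycle_state": "TERMINATED", "result_state": "SUCCESS"}}')
running = json.loads(
    '{"run_id": 43, "state": {"life_cycle_state": "RUNNING"}}')

print(classify(finished))  # SUCCESS
print(classify(running))   # NOT_COMPLETED
```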
116 changes: 116 additions & 0 deletions .ci/executenotebook.py
@@ -0,0 +1,116 @@
#!/usr/bin/python3
# executenotebook.py
import json
import requests
import os
import sys
import getopt
import time
import logging


def main():
    workspace = ''
    token = ''
    clusterid = ''
    localpath = ''
    workspacepath = ''
    outfilepath = ''
    ignore = ''

    try:
        # Every value-taking short option needs a trailing ':' in the optstring.
        opts, args = getopt.getopt(sys.argv[1:], 'hs:t:c:l:w:o:i:',
                                   ['workspace=', 'token=', 'clusterid=', 'localpath=',
                                    'workspacepath=', 'outfilepath=', 'ignore='])
    except getopt.GetoptError:
        print('executenotebook.py -s <workspace> -t <token> -c <clusterid> '
              '-l <localpath> -w <workspacepath> -o <outfilepath>')
        sys.exit(2)

    for opt, arg in opts:
        if opt == '-h':
            print('executenotebook.py -s <workspace> -t <token> -c <clusterid> '
                  '-l <localpath> -w <workspacepath> -o <outfilepath>')
            sys.exit()
        elif opt in ('-s', '--workspace'):
            workspace = arg
        elif opt in ('-t', '--token'):
            token = arg
        elif opt in ('-c', '--clusterid'):
            clusterid = arg
        elif opt in ('-l', '--localpath'):
            localpath = arg
        elif opt in ('-w', '--workspacepath'):
            workspacepath = arg
        elif opt in ('-o', '--outfilepath'):
            outfilepath = arg
        elif opt in ('-i', '--ignore'):
            ignore = arg

    # deliberately not echoing the token into build logs
    print('-s is ' + workspace)
    print('-c is ' + clusterid)
    print('-l is ' + localpath)
    print('-w is ' + workspacepath)
    print('-o is ' + outfilepath)
    print('-i is ' + ignore)

    ignore = ignore.split(',')

    # Generate the list of notebooks by walking the local path
    notebooks = []
    for path, subdirs, files in os.walk(localpath):
        for name in files:
            if name in ignore:
                logging.warning('Ignoring %s', name)
                continue
            fullpath = path + '/' + name
            # strip the local repo prefix but keep the workspace path
            fullworkspacepath = workspacepath + path.replace(localpath, '')

            name, file_extension = os.path.splitext(fullpath)
            if file_extension.lower() == '.ipynb':
                notebooks.append([fullpath, fullworkspacepath, 1])

    # submit a run for each notebook and poll it until completion
    for notebook in notebooks:
        nameonly = os.path.basename(notebook[0])
        workspacepath = notebook[1]

        name, file_extension = os.path.splitext(nameonly)

        # the workspace path drops the file extension
        fullworkspacepath = workspacepath + '/' + name

        print('Running job for: ' + fullworkspacepath)
        values = {'run_name': name,
                  'existing_cluster_id': clusterid,
                  'timeout_seconds': 3600,
                  'notebook_task': {'notebook_path': fullworkspacepath}}

        resp = requests.post(workspace + '/api/2.0/jobs/runs/submit',
                             data=json.dumps(values), auth=('token', token))
        runjson = resp.text
        print('runjson: ' + runjson)
        d = json.loads(runjson)
        runid = d['run_id']

        # poll every 20 seconds, for at most ~8 minutes per notebook
        i = 0
        while True:
            time.sleep(20)
            jobresp = requests.get(workspace + '/api/2.0/jobs/runs/get?run_id=' + str(runid),
                                   auth=('token', token))
            jobjson = jobresp.text
            print('jobjson: ' + jobjson)
            j = json.loads(jobjson)
            current_state = j['state']['life_cycle_state']
            runid = j['run_id']
            if current_state in ['TERMINATED', 'INTERNAL_ERROR', 'SKIPPED'] or i >= 24:
                break
            i = i + 1

        if outfilepath != '':
            with open(outfilepath + '/' + str(runid) + '.json', 'w') as file:
                file.write(json.dumps(j))


if __name__ == '__main__':
    main()
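The path mapping above (drop the local repo prefix and the `.ipynb` extension, keep the workspace prefix) can be sketched in isolation. The `submit_payload` helper and the example paths below are hypothetical, added only to illustrate how a local notebook becomes a `runs/submit` payload:

```python
import os


def submit_payload(fullpath, localpath, workspacepath, clusterid):
    # Replicate the mapping in main(): strip the local prefix from the
    # directory, strip the extension from the name, join under the
    # workspace prefix.
    relative_dir = os.path.dirname(fullpath).replace(localpath, '')
    name, _ = os.path.splitext(os.path.basename(fullpath))
    notebook_path = workspacepath + relative_dir + '/' + name
    return {'run_name': name,
            'existing_cluster_id': clusterid,
            'timeout_seconds': 3600,
            'notebook_task': {'notebook_path': notebook_path}}


payload = submit_payload('/repo/databricks/python/ocr/1. Demo.ipynb',
                         '/repo/databricks/python',
                         '/Shared/Spark OCR/tests',
                         '0523-abc123')
print(payload['notebook_task']['notebook_path'])
# /Shared/Spark OCR/tests/ocr/1. Demo
```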
213 changes: 213 additions & 0 deletions Jenkinsfile
@@ -0,0 +1,213 @@
@Library('jenkinslib')_

cluster_id = ""
ocr_versions = ""
nlp_versions = ""
nlp_healthcare_versions = ""
databricks_versions = ""
nlp_version_prefix = ""

def DBTOKEN = "DATABRICKS_TOKEN"
def DBURL = "https://dbc-6ca13d9d-74bb.cloud.databricks.com"
def SCRIPTPATH = "./.ci"
def NOTEBOOKPATH = "./databricks/python"
def WORKSPACEPATH = "/Shared/Spark OCR/tests"
def OUTFILEPATH = "."
def TESTRESULTPATH = "./reports/junit"
def IGNORE = "3. Compare CPU and GPU image processing with Spark OCR.ipynb"

def SPARK_NLP_VERSION = params.nlp_version
def SPARK_NLP_HEALTHCARE_VERSION = params.nlp_healthcare_version
def SPARK_OCR_VERSION = params.ocr_version

def PYPI_REPO_HEALTHCARE_SECRET = sparknlp_helpers.spark_nlp_healthcare_secret(SPARK_NLP_HEALTHCARE_VERSION)
def PYPI_REPO_OCR_SECRET = sparknlp_helpers.spark_ocr_secret(SPARK_OCR_VERSION)

def DATABRICKS_RUNTIME_VERSION = params.databricks_runtime == null ? '7.3.x-scala2.12' : params.databricks_runtime.tokenize('|')[1]
def SPARK_VERSION = params.spark_version == null ? 'spark30' : params.spark_version

switch (SPARK_VERSION) {
    case 'spark24':
        nlp_version_prefix = "-spark24"
        break
    case 'spark23':
        nlp_version_prefix = "-spark23"
        break
    case 'spark30':
        nlp_version_prefix = ""
        break
    case 'spark32':
        nlp_version_prefix = "-spark32"
}

def String get_releases(repo) {
    def versionsString = sh(returnStdout: true, script: """gh api --paginate -H "Accept: application/vnd.github.v3+json" /repos/${repo}/releases""")
    def versionsJson = readJSON text: versionsString
    return versionsJson.collect { it['tag_name'] }.join("\n")
}

node {
    withCredentials([usernamePassword(credentialsId: '55e7e818-4ccf-4d23-b54c-fd97c21081ba',
                                      usernameVariable: 'GITHUB_USER',
                                      passwordVariable: 'GITHUB_TOKEN')]) {
        ocr_versions = get_releases("johnsnowlabs/spark-ocr")
        nlp_versions = get_releases("johnsnowlabs/spark-nlp")
        nlp_healthcare_versions = get_releases("johnsnowlabs/spark-nlp-internal")
    }
    withCredentials([string(credentialsId: DBTOKEN, variable: 'TOKEN')]) {
        def databricksVersionsString = sh(returnStdout: true, script: 'curl --header "Authorization: Bearer $TOKEN" -X GET https://dbc-6ca13d9d-74bb.cloud.databricks.com/api/2.0/clusters/spark-versions')
        def databricksVersionsJson = readJSON text: databricksVersionsString
        databricks_versions = databricksVersionsJson['versions'].collect { it['name'] + " |" + it['key'] }.sort().join("\n")
    }
}

pipeline {
    agent {
        dockerfile {
            filename '.ci/Dockerfile.build'
        }
    }
    environment {
        DATABRICKS_CONFIG_FILE = ".databricks.cfg"
        GITHUB_CREDS = credentials('55e7e818-4ccf-4d23-b54c-fd97c21081ba')
    }
    parameters {
        choice(
            name: 'databricks_runtime',
            choices: '7.3 LTS Spark 3.0.1 |7.3.x-scala2.12\n' + databricks_versions,
            description: 'Databricks runtime version'
        )
        choice(
            name: 'ocr_version',
            choices: ocr_versions,
            description: 'Spark OCR version'
        )
        choice(
            name: 'spark_version',
            choices: 'spark30\nspark32\nspark24\nspark23',
            description: 'Spark version'
        )
        choice(
            name: 'nlp_version',
            choices: nlp_versions,
            description: 'Spark NLP version'
        )
        choice(
            name: 'nlp_healthcare_version',
            choices: nlp_healthcare_versions,
            description: 'Spark NLP for Healthcare version'
        )
    }
    stages {
        stage('Setup') {
            steps {
                script {
                    withCredentials([string(credentialsId: DBTOKEN, variable: 'TOKEN')]) {
                        sh('echo "${TOKEN}" > secret.txt')
                        sh("databricks configure --token-file secret.txt --host ${DBURL}")
                    }
                }
            }
        }
        stage('Copy notebooks to Databricks') {
            steps {
                script {
                    sh("databricks workspace import_dir -o '${NOTEBOOKPATH}' '${WORKSPACEPATH}'")
                }
            }
        }
        stage('Create Cluster') {
            steps {
                script {
                    withCredentials([string(credentialsId: 'TEST_SPARK_NLP_LICENSE', variable: 'SPARK_OCR_LICENSE'), [
                        $class: 'AmazonWebServicesCredentialsBinding',
                        credentialsId: 'a4362e3b-808e-45e0-b7d2-1c62b0572df4',
                        accessKeyVariable: 'AWS_ACCESS_KEY_ID',
                        secretKeyVariable: 'AWS_SECRET_ACCESS_KEY']]) {
                        def jsonCluster = """
                        {
                            "num_workers": 1,
                            "cluster_name": "Spark Ocr Notebook Test",
                            "spark_version": "${DATABRICKS_RUNTIME_VERSION}",
                            "spark_conf": {
                                "spark.sql.legacy.allowUntypedScalaUDF": "true"
                            },
                            "aws_attributes": {
                                "first_on_demand": 1,
                                "availability": "SPOT_WITH_FALLBACK",
                                "zone_id": "us-west-2a",
                                "spot_bid_price_percent": 100,
                                "ebs_volume_count": 0
                            },
                            "node_type_id": "i3.xlarge",
                            "driver_node_type_id": "i3.xlarge",
                            "spark_env_vars": {
                                "JSL_OCR_LICENSE": "${SPARK_OCR_LICENSE}",
                                "AWS_ACCESS_KEY_ID": "${AWS_ACCESS_KEY_ID}",
                                "AWS_SECRET_ACCESS_KEY": "${AWS_SECRET_ACCESS_KEY}"
                            },
                            "autotermination_minutes": 20
                        }
                        """
                        writeFile file: 'cluster.json', text: jsonCluster
                        def clusterRespString = sh(returnStdout: true, script: "databricks clusters create --json-file cluster.json")
                        def clusterRespJson = readJSON text: clusterRespString
                        cluster_id = clusterRespJson['cluster_id']
                        sh "rm cluster.json"
                    }
                }
            }
        }
        stage('Install deps to Cluster') {
            steps {
                script {
                    sh("databricks libraries install --cluster-id ${cluster_id} --jar s3://pypi.johnsnowlabs.com/${PYPI_REPO_OCR_SECRET}/jars/spark-ocr-assembly-${SPARK_OCR_VERSION}-${SPARK_VERSION}.jar")
                    sh("databricks libraries install --cluster-id ${cluster_id} --jar s3://pypi.johnsnowlabs.com/${PYPI_REPO_HEALTHCARE_SECRET}/spark-nlp-jsl-${SPARK_NLP_HEALTHCARE_VERSION}${nlp_version_prefix}.jar")
                    sh("databricks libraries install --cluster-id ${cluster_id} --maven-coordinates com.johnsnowlabs.nlp:spark-nlp${nlp_version_prefix}_2.12:${SPARK_NLP_VERSION}")
                    sh("databricks libraries install --cluster-id ${cluster_id} --whl s3://pypi.johnsnowlabs.com/${PYPI_REPO_OCR_SECRET}/spark-ocr/spark_ocr-${SPARK_OCR_VERSION}+${SPARK_VERSION}-py3-none-any.whl")
                    sh("databricks libraries install --cluster-id ${cluster_id} --whl s3://pypi.johnsnowlabs.com/${PYPI_REPO_HEALTHCARE_SECRET}/spark-nlp-jsl/spark_nlp_jsl-${SPARK_NLP_VERSION}-py3-none-any.whl")
                    sh("databricks libraries install --cluster-id ${cluster_id} --pypi-package spark-nlp==${SPARK_NLP_VERSION}")
                    timeout(10) {
                        waitUntil {
                            script {
                                def respStringWaitLib = sh script: "databricks libraries cluster-status --cluster-id ${cluster_id}", returnStdout: true
                                def respJsonWaitLib = readJSON text: respStringWaitLib
                                return respJsonWaitLib['library_statuses'].every { it['status'] == 'INSTALLED' }
                            }
                        }
                    }
                }
            }
        }
        stage('Run Notebook Tests') {
            steps {
                script {
                    withCredentials([string(credentialsId: DBTOKEN, variable: 'TOKEN')]) {
                        sh """python3 $SCRIPTPATH/executenotebook.py --workspace=$DBURL \
                            --token=$TOKEN \
                            --clusterid=${cluster_id} \
                            --localpath=${NOTEBOOKPATH} \
                            --workspacepath='${WORKSPACEPATH}' \
                            --outfilepath='${OUTFILEPATH}' \
                            --ignore='${IGNORE}'
                        """
                        sh """sed -i -e 's #ENV# ${OUTFILEPATH} g' ${SCRIPTPATH}/evaluatenotebookruns.py
                            python3 -m pytest -s --junit-xml=${TESTRESULTPATH}/TEST-notebookout.xml ${SCRIPTPATH}/evaluatenotebookruns.py
                        """
                    }
                }
            }
        }
    }
    post {
        always {
            sh "databricks clusters permanent-delete --cluster-id ${cluster_id}"
            sh "find ${OUTFILEPATH} -name '*.json' -exec rm {} +"
            junit allowEmptyResults: true, testResults: "**/reports/junit/*.xml"
        }
    }
}
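The 'Install deps to Cluster' stage blocks inside `waitUntil` until every library on the cluster reports INSTALLED. That predicate can be sketched in Python; the sample payloads below are illustrative, shaped like the JSON that `databricks libraries cluster-status` returns:

```python
import json


def all_installed(status_json: str) -> bool:
    # True once every library on the cluster reports status INSTALLED,
    # matching the waitUntil predicate in the Jenkinsfile.
    statuses = json.loads(status_json)['library_statuses']
    return all(lib['status'] == 'INSTALLED' for lib in statuses)


pending = '{"library_statuses": [{"status": "INSTALLED"}, {"status": "PENDING"}]}'
done = '{"library_statuses": [{"status": "INSTALLED"}, {"status": "INSTALLED"}]}'

print(all_installed(pending))  # False
print(all_installed(done))     # True
```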