Commit ac50542

Merge pull request #591 from michael-balint/20.06
20.06.1
2 parents 69f5e47 + 16028c3 commit ac50542

8 files changed, with 679 additions and 339 deletions

config.example/gpu-dashboard.json

Lines changed: 349 additions & 97 deletions
Large diffs are not rendered by default.

docs/kubeflow.md

Lines changed: 72 additions & 0 deletions
@@ -10,6 +10,8 @@ Additionally Kubeflow offers [hyper-parameter tuning](https://github.com/kubeflo

 Kubeflow is an [open source project](https://github.com/kubeflow/kubeflow) and is regularly evolving and adding [new features](https://github.com/kubeflow/kubeflow/blob/master/ROADMAP.md).

+As part of the Kubeflow installation, the MPI Operator will also be installed. This will add the `MPIJob` CustomResourceDefinition to the cluster, enabling multi-pod or multi-node workloads. See [here](https://github.com/kubeflow/mpi-operator/tree/master/) for details and examples.
+
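A quick way to confirm that the CRD was registered is to query for it by name. The sketch below assumes `kubectl` is already configured for the cluster and that the CRD is named `mpijobs.kubeflow.org` (the name used by the delete script in this commit). The function name `mpijob_crd_installed` is hypothetical.

```sh
# Sketch: succeed only if the MPIJob CRD is registered with the API server.
# Assumes kubectl already points at the target cluster.
mpijob_crd_installed() {
    kubectl get crd mpijobs.kubeflow.org >/dev/null 2>&1
}
```

For example, `mpijob_crd_installed && echo "MPIJob CRD found"` prints a message only when the lookup succeeds.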
 ## Installation

 Deploy Kubernetes by following the [DeepOps Kubernetes Deployment Guide](kubernetes-cluster.md)
@@ -38,8 +40,78 @@ Kubeflow configuration files will be saved to `./config/kubeflow-install`.

 The kfctl binary will be saved to `./config/kfctl`. For easier management this file can be copied to `/usr/local/bin` or added to the `PATH`.

+The services can be reached from the following address:
+* Kubeflow: http://mgmt:31380
+
 ## Login information

 The default username is `admin@kubeflow.org` and the default password is `12341234`.

 These can be modified at startup time following the steps outlined [here](https://www.kubeflow.org/docs/started/k8s/kfctl-existing-arrikto/).
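The deploy script in this commit exports `KUBEFLOW_USER_EMAIL` and `KUBEFLOW_PASSWORD` with the defaults above, so values already present in the environment take precedence over them. The sketch below shows only that default-or-override pattern. The helper name `show_kubeflow_login` is hypothetical, and whether a given Kubeflow config actually consumes these variables is not confirmed here.

```sh
# Sketch of the default-or-override pattern used for the login variables.
show_kubeflow_login() {
    # ${VAR:-default} keeps a value from the environment and falls back otherwise.
    local email="${KUBEFLOW_USER_EMAIL:-admin@kubeflow.org}"
    local password="${KUBEFLOW_PASSWORD:-12341234}"
    echo "user=${email} password=${password}"
}
```

Under that assumption, setting `KUBEFLOW_PASSWORD` in the environment before running `./scripts/k8s_deploy_kubeflow.sh -x` would be one way to avoid the default password.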
+
+## Other usage
+
+For the most up-to-date usage information run `./scripts/k8s_deploy_kubeflow.sh -h`.
+
+```sh
+./scripts/k8s_deploy_kubeflow.sh -h
+Usage:
+-h This message.
+-p Print out the connection info for Kubeflow.
+-d Delete Kubeflow from your system (skipping the CRDs and istio-system namespace that may have been installed with Kubeflow).
+-D Deprecated, same as -d. Previously 'Fully Delete Kubeflow from your system along with all Kubeflow CRDs the istio-system namespace. WARNING, do not use this option if other components depend on istio.'
+-x Install Kubeflow with multi-user auth (this utilizes Dex, the default is no multi-user auth).
+-c Specify a different Kubeflow config to install with (this option is deprecated).
+-w Wait for Kubeflow homepage to respond (also polls for various Kubeflow Deployments to have an available status).
+```
+
+## Kubeflow Admin
+
+### Uninstalling
+
+To uninstall and re-install Kubeflow run:
+
+```sh
+./scripts/k8s_deploy_kubeflow.sh -d
+./scripts/k8s_deploy_kubeflow.sh
+```
+
+### Modifying Kubeflow configuration
+
+To modify the Kubeflow configuration, modify the downloaded `CONFIG` YAML file in `config/kubeflow-install/` or one of the many overlay YAML files in `config/kubeflow-install/kustomize`.
+
+After modifying the configuration, apply the changes to the cluster using `kfctl`:
+
+```sh
+cd config/kubeflow-install
+../kfctl apply -f kfctl_k8s_istio.yaml
+```
+
+## Debugging common issues
+
+### No DefaultStorageClass defined or ready
+
+A common issue with Kubeflow installation is that no DefaultStorageClass has been defined or that Ceph has not been deployed correctly.
+
+This can be identified when most of the Kubeflow Pods are running but the MySQL pod and several others remain in a Pending state. The GUI may also load and throw a "Profile Error". Run the following to debug further:
+
+```sh
+kubectl get pods -n kubeflow
+```
+> NOTE: Everything should be in a running state.
+
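To narrow the listing to the stuck pods, a field selector can filter on pod phase. This is a minimal sketch, assuming `kubectl` access to the cluster; the function name `list_pending_kubeflow_pods` is hypothetical.

```sh
# Sketch: list only the pods in the kubeflow namespace that are still Pending.
list_pending_kubeflow_pods() {
    kubectl get pods -n kubeflow --field-selector=status.phase=Pending
}
```

Running `kubectl describe pod <name> -n kubeflow` on one of the listed pods then shows its scheduling events, which is where an unbound PersistentVolumeClaim would normally be reported.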
+Verify Ceph is running and/or a DefaultStorageClass is defined:
+
+```sh
+kubectl get storageclass | grep default
+./scripts/ceph_poll.sh
+```
+> NOTE: If Ceph is being used, `ceph_poll.sh` should exit after several seconds and Ceph should be the default StorageClass.
+
+
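The StorageClass check can also be wrapped so that a script can branch on the result. The sketch below relies on `kubectl get storageclass` printing `(default)` after the name of the default class; the function name `has_default_storageclass` is hypothetical.

```sh
# Sketch: succeed if any StorageClass is marked as the cluster default.
# kubectl prints "(default)" after the name of the default StorageClass.
has_default_storageclass() {
    kubectl get storageclass --no-headers 2>/dev/null | grep -q "(default)"
}
```

For example, `has_default_storageclass || echo "No default StorageClass found"` prints a warning before any Kubeflow install is attempted.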
+To correct this issue:
+1. Uninstall Rook/Ceph: `./scripts/rmrook.sh`
+2. Uninstall Kubeflow: `./scripts/k8s_deploy_kubeflow.sh -D`
+3. Re-install Rook/Ceph: `./scripts/k8s_deploy_rook.sh`
+4. Poll for Ceph to initialize (wait for this script to exit): `./scripts/ceph_poll.sh`
+5. Re-install Kubeflow: `./scripts/k8s_deploy_kubeflow.sh`
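The five recovery steps can be chained so that a failure in one step stops the rest. The sketch below is one way to do that. `reinstall_storage_and_kubeflow` and the `SCRIPTS_DIR` variable are hypothetical, and it passes `-d` because the help text describes `-D` as a deprecated alias for it.

```sh
# Sketch: run the five recovery steps in order, stopping at the first failure.
# SCRIPTS_DIR is an assumed variable pointing at the DeepOps scripts directory.
reinstall_storage_and_kubeflow() {
    local scripts_dir="${SCRIPTS_DIR:-./scripts}"
    "${scripts_dir}/rmrook.sh" &&
        "${scripts_dir}/k8s_deploy_kubeflow.sh" -d &&
        "${scripts_dir}/k8s_deploy_rook.sh" &&
        "${scripts_dir}/ceph_poll.sh" &&
        "${scripts_dir}/k8s_deploy_kubeflow.sh"
}
```

Called from the repository root with no `SCRIPTS_DIR` set, the function uses `./scripts` and returns the exit status of the last step that ran.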

scripts/ceph_poll.sh

Lines changed: 3 additions & 2 deletions
@@ -1,10 +1,11 @@
 #!/usr/bin/env bash
+# See https://rook.io/docs/rook/v1.1/ceph-quickstart.html
 echo "Beginning to poll for Ceph and Rook setup completion."
 echo "This may throw several errors and take up to 10 minutes. This behavior is expected."
-
-rook_tools_pod=$(kubectl -n rook-ceph get pod -l app=rook-ceph-tools -o name | cut -d \/ -f2 | sed -e 's/\\r$//g')
+echo "The script will stop polling when Ceph setup is completed and in a healthy state."

 while true; do
+    rook_tools_pod=$(kubectl -n rook-ceph get pod -l app=rook-ceph-tools -o name | cut -d \/ -f2 | sed -e 's/\\r$//g')
     kubectl -n rook-ceph exec -ti $rook_tools_pod ceph status # Run once to print output
     kubectl -n rook-ceph exec -ti $rook_tools_pod ceph status | grep "mds: cephfs" | grep "up:active" | grep "standby-replay" # Run again to check for completion
     if [ "${?}" == "0" ]; then
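The visible part of this loop runs under `while true`, while the banner mentions a wait of up to 10 minutes. If a hard upper bound is wanted, a generic helper along the following lines could wrap the health check. `poll_until` is a hypothetical helper and not part of the script.

```sh
# Sketch: poll_until TIMEOUT INTERVAL CMD [ARGS...]
# Re-runs CMD until it succeeds, giving up once TIMEOUT seconds have been waited.
poll_until() {
    local timeout="$1" interval="$2"
    shift 2
    local waited=0
    until "$@"; do
        if [ "${waited}" -ge "${timeout}" ]; then
            return 1
        fi
        sleep "${interval}"
        waited=$((waited + interval))
    done
    return 0
}
```

With the `ceph status` pipeline wrapped in a function (say a hypothetical `ceph_is_healthy`), `poll_until 600 10 ceph_is_healthy` would check every 10 seconds and give up after roughly 10 minutes.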

scripts/k8s_deploy_kubeflow.sh

Lines changed: 69 additions & 25 deletions
@@ -10,17 +10,24 @@ CONFIG_DIR="${ROOT_DIR}/config"
 export KUBEFLOW_USER_EMAIL="${KUBEFLOW_USER_EMAIL:-admin@kubeflow.org}"
 export KUBEFLOW_PASSWORD="${KUBEFLOW_PASSWORD:-12341234}"

+# Poll for these to be available with the -w flag
+KUBEFLOW_POLL_DEPLOYMENTS="${KUBEFLOW_DEPLOYMENTS:-profiles-deployment notebook-controller-deployment centraldashboard ml-pipeline minio mysql metadata-db jupyter-web-app-deployment katib-mysql}"
+
 # Specify how long to poll for Kubeflow to start
 export KUBEFLOW_TIMEOUT="${KUBEFLOW_TIMEOUT:-600}"

 # Local files/directories to create and place scripts
 export KF_DIR="${KF_DIR:-${CONFIG_DIR}/kubeflow-install}"
 export KFCTL="${KFCTL:-${CONFIG_DIR}/kfctl}"
+export KUSTOMIZE="${KUSTOMIZE:-${CONFIG_DIR}/kustomize}"
 export KUBEFLOW_DEL_SCRIPT="${KF_DIR}/deepops-delete-kubeflow.sh"

-# Download URLs and versions # XXX: kfctl introduced a version mismatch, this is naming only
-export KFCTL_FILE=kfctl_v1.0.2-0-ga476281_linux.tar.gz # https://github.com/kubeflow/kfctl/releases/tag/v1.0.2
-export KFCTL_URL="https://github.com/kubeflow/kfctl/releases/download/v1.0.2/${KFCTL_FILE}"
+export KUBEFLOW_MPI_DIR="${KUBEFLOW_MPI_DIR:-${KF_DIR}/mpi}"
+export KUBEFLOW_MPI_MANIFESTS_REPO="${KUBEFLOW_MPI_MANIFESTS_REPO:-https://github.com/kubeflow/manifests}"
+
+# Download URLs and versions, note the kfctl version does not always match the manifest/config version, but best-effort should be made to keep their versions close
+export KFCTL_FILE=kfctl_v1.1.0-0-g9a3621e_linux.tar.gz # https://github.com/kubeflow/kfctl/releases/tag/v1.1.0
+export KFCTL_URL="https://github.com/kubeflow/kfctl/releases/download/v1.1.0/${KFCTL_FILE}"

 # Config 1: https://www.kubeflow.org/docs/started/k8s/kfctl-existing-arrikto/
 export AUTH_CONFIG_URI="https://raw.githubusercontent.com/kubeflow/manifests/55d1a9c84ca796f9a098bbeec406acbdcfa6aebe/kfdef/kfctl_istio_dex.v1.0.2.yaml"
@@ -31,15 +38,16 @@ export CONFIG_URI="https://raw.githubusercontent.com/kubeflow/manifests/928cf483
 export CONFIG_FILE="${KF_DIR}/kfctl_k8s_istio.yaml" # Not v1.0.2 due to https://github.com/kubeflow/manifests/issues/991


+
 function help_me() {
     echo "Usage:"
     echo "-h This message."
-    echo "-p Print out the connection info for Kubeflow"
-    echo "-d Delete Kubeflow from your system (skipping the CRDs and istio-system namespace that may have been installed with Kubeflow"
-    echo "-D Full Delete Kubeflow from your system along with all Kubeflow CRDs the istio-system namespace. WARNING, do not use this option if other components depend on istio."
+    echo "-p Print out the connection info for Kubeflow."
+    echo "-d Delete Kubeflow from your system (skipping the CRDs and istio-system namespace that may have been installed with Kubeflow)."
+    echo "-D Deprecated, same as -d. Previously 'Fully Delete Kubeflow from your system along with all Kubeflow CRDs the istio-system namespace. WARNING, do not use this option if other components depend on istio.'"
     echo "-x Install Kubeflow with multi-user auth (this utilizes Dex, the default is no multi-user auth)."
-    echo "-c Specify a different Kubeflow config to install with (this option is deprecated)"
-    echo "-w Wait for Kubeflow homepage to respond"
+    echo "-c Specify a different Kubeflow config to install with (this option is deprecated)."
+    echo "-w Wait for Kubeflow homepage to respond (also polls for various Kubeflow Deployments to have an available status)."
 }


@@ -65,6 +73,7 @@ function get_opts() {
         D)
             KUBEFLOW_DELETE=true
             KUBEFLOW_FULL_DELETE=true
+            echo "The -D flag is deprecated, use -d instead"
             ;;
         Z)
             # This is a dangerous command and is not included in the help
@@ -125,6 +134,27 @@ function install_dependencies() {
 }


+function install_mpi_operator() {
+    # Download kustomize, as required by mpi
+    cd ${CONFIG_DIR}
+    curl -s https://api.github.com/repos/kubernetes-sigs/kustomize/releases |\
+        grep browser_download |\
+        grep linux |\
+        cut -d '"' -f 4 |\
+        grep /kustomize/v |\
+        sort | tail -n 1 |\
+        xargs curl -s -O -L
+    tar xzf ./kustomize_v*_linux_amd64.tar.gz
+    mv kustomize ${KUSTOMIZE}
+
+    mkdir -p ${KUBEFLOW_MPI_DIR}
+    cd ${KUBEFLOW_MPI_DIR}
+    git clone ${KUBEFLOW_MPI_MANIFESTS_REPO}
+    cd manifests/mpi-job/mpi-operator
+    ${KUSTOMIZE} build base | kubectl apply -f -
+}
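One caveat in the download pipeline of `install_mpi_operator`: plain `sort` orders strings lexically, so a URL containing `v3.10.0` sorts before one containing `v3.9.0`, and `tail -n 1` would then pick the older release. A possible alternative, assuming GNU `sort` with the `-V` (version sort) option is available, is sketched below. `pick_latest_url` is a hypothetical helper that reads candidate URLs on standard input.

```sh
# Sketch: choose the highest-versioned URL from a list on stdin.
# Assumes GNU sort, whose -V option compares embedded version numbers numerically.
pick_latest_url() {
    sort -V | tail -n 1
}
```

In the pipeline, this would take the place of the `sort | tail -n 1` stage.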
+
+
 function stand_up() {
     # Download the kfctl binary and move it to the default location
     pushd .
@@ -140,16 +170,26 @@ function stand_up() {
     mkdir ${KF_DIR}

     # Make cleanup scripts first in case deployment fails
-    # TODO: This kfctl delete seems to be failing due to a Kubeflow config bug
-    echo "cd ${KF_DIR} && ${KFCTL} delete -V -f ${CONFIG_FILE} --delete_storage; cd && sudo rm -rf ${KF_DIR}" > ${KUBEFLOW_DEL_SCRIPT}
+    # TODO: This kfctl delete seems to be failing occasionally with the cert-manager ns (due to a Kubeflow config bug)
+    # XXX: We manually delete the mpijobs crd because this is currently installed outside of the kfctl apply
+    echo "kubectl delete crd mpijobs.kubeflow.org; cd ${KF_DIR} && ${KFCTL} delete -V -f ${CONFIG_FILE} --force-deletion --delete_storage; cd && sudo rm -rf ${KF_DIR}" > ${KUBEFLOW_DEL_SCRIPT}
     chmod +x ${KUBEFLOW_DEL_SCRIPT}

     # Initialize and apply the Kubeflow project using the specified config. We do this in two steps to allow a chance to customize the config
     cd ${KF_DIR}
     ${KFCTL} build -V -f ${CONFIG_URI}

+    # Occasionally kfctl will fail; if this occurs, halt all installation
+    if [ $? != 0 ]; then
+        echo -e "\nDeepOps ERROR: Failure building Kubeflow Manifest at ${CONFIG_URI} in ${KF_DIR}"
+        exit 1
+    fi
+
+    sed -i '/metadata:.*/a\ ClusterName: cluster.local' ${CONFIG_FILE} # BUGFIX: Need to add the ClusterName for proper deletion:https://github.com/kubeflow/kubeflow/issues/4815
+
     # Update Kubeflow with the NGC containers and NVIDIA configurations
-    ${SCRIPT_DIR}/update_kubeflow_config.py
+    # BUG: Commented out until NGC containers add Kubeflow support, see https://github.com/NVIDIA/deepops/tree/master/containers/ngc
+    # ${SCRIPT_DIR}/update_kubeflow_config.py

     # XXX: Add potential CONFIG customizations here before applying
     ${KFCTL} apply -V -f ${CONFIG_FILE}
@@ -170,17 +210,20 @@ function tear_down() {
     # Kubeflow use leads to some user created namespaces that are not torn down during kfctl delete
     namespaces="kubeflow"

-    # Delete other NS that were installed. These might be part of other apps and is slightly dangerous
-    if [ "${KUBEFLOW_FULL_DELETE}" == "true" ]; then
-        namespaces=" ${namespaces} admin auth cert-manager istio-system knative-serving ${KUBEFLOW_EXTRA_NS}"
-    fi
-
     # This runs kfctl delete pointing to the CONFIG that was used at install
     bash ${KUBEFLOW_DEL_SCRIPT} && sleep 5 # There seems to be a timing issue here in kfctl, so we sleep a bit.

-    # delete all namespaces, including namespaces that "should" already have been deleted by kfctl delete
-    echo "Re-deleting namespaces ${namespaces} for a full cleanup"
-    kubectl delete ns ${namespaces}
+    # Delete other NS that were installed. These might be part of other apps and is slightly dangerous
+    # LEGACY: This code was implemented to workaround https://github.com/kubeflow/kubeflow/issues/3767, this is supposedly fixed
+    #if [ "${KUBEFLOW_FULL_DELETE}" == "true" ]; then
+    # namespaces=" ${namespaces} admin auth cert-manager istio-system knative-serving ${KUBEFLOW_EXTRA_NS}"
+    # # delete all namespaces, including namespaces that "should" already have been deleted by kfctl delete
+    # echo "Re-deleting namespaces ${namespaces} for a full cleanup"
+    # kubectl delete ns ${namespaces}
+    # # These should probably be deleted by kfctl, but they are not
+    # kubectl delete crd -l app.kubernetes.io/part-of=kubeflow -o name
+    # kubectl delete all -l app.kubernetes.io/part-of=kubeflow --all-namespaces
+    #fi

     # There is an issue in the kfctl delete command that does not properly clean up and leaves NSs in a terminating state, this is a bit hacky but resolves it
     if [ "${KUBEFLOW_EXTRA_FULL_DELETE}" == "true" ]; then
@@ -189,17 +232,17 @@ function tear_down() {
         fix_terminating_ns ${namespaces}
     fi

-    if [ "${KUBEFLOW_FULL_DELETE}" == "true" ]; then
-        # These should probably be deleted by kfctl, but they are not
-        kubectl delete crd -l app.kubernetes.io/part-of=kubeflow -o name
-        kubectl delete all -l app.kubernetes.io/part-of=kubeflow --all-namespaces
-    fi
-
     rm ${KFCTL}
 }


 function poll_url() {
+    kubectl wait --for=condition=available --timeout=${KUBEFLOW_TIMEOUT}s -n kubeflow deployments ${KUBEFLOW_POLL_DEPLOYMENTS}
+    if [ "${?}" != "0" ]; then
+        echo "Kubeflow did not complete deployment within ${KUBEFLOW_TIMEOUT} seconds"
+        exit 1
+    fi
+
     # It typically takes ~5 minutes for all pods and services to start, so we poll for ten minutes here
     time=0
     while [ ${time} -lt ${KUBEFLOW_TIMEOUT} ]; do
@@ -300,6 +343,7 @@ elif [ ${KUBEFLOW_WAIT} ]; then
 else
     install_dependencies
     stand_up
+    install_mpi_operator
     get_url
     print_info
 fi
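The `kubectl wait` call added to `poll_url` reports a single timeout for the whole list of Deployments. If it is useful to know which Deployment is lagging, one option is to wait on each name in turn, as in the sketch below. `wait_for_deployments` is a hypothetical helper and not part of the script. Because the waits run one after another, the total wait can exceed a single timeout.

```sh
# Sketch: wait_for_deployments TIMEOUT_SECONDS NAME [NAME...]
# Waits for each Deployment in the kubeflow namespace and names the ones that fail.
wait_for_deployments() {
    local timeout="$1"
    shift
    local failed=0 name
    for name in "$@"; do
        if ! kubectl wait --for=condition=available --timeout="${timeout}s" -n kubeflow "deployment/${name}"; then
            echo "Deployment ${name} not available within ${timeout} seconds" >&2
            failed=1
        fi
    done
    return "${failed}"
}
```

For example, `wait_for_deployments 600 minio mysql katib-mysql` returns non-zero and names each Deployment that did not become available.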
