Kubeflow is an [open source project](https://github.com/kubeflow/kubeflow) and is regularly evolving and adding [new features](https://github.com/kubeflow/kubeflow/blob/master/ROADMAP.md).
12
12
13
+
As part of the Kubeflow installation, the MPI Operator will also be installed. This will add the `MPIJob` CustomResourceDefinition to the cluster, enabling multi-pod or multi-node workloads. See [here](https://github.com/kubeflow/mpi-operator/tree/master/) for details and examples.
14
+
13
15
## Installation
14
16
15
17
Deploy Kubernetes by following the [DeepOps Kubernetes Deployment Guide](kubernetes-cluster.md)
@@ -38,8 +40,78 @@ Kubeflow configuration files will be saved to `./config/kubeflow-install`.
38
40
39
41
The kfctl binary will be saved to `./config/kfctl`. For easier management this file can be copied to `/usr/local/bin` or added to the `PATH`.
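
As a minimal sketch of the `PATH` option, the helper below prepends a directory only when it is not already present. The function name `add_to_path` is hypothetical, and the example assumes the current directory is the root of the DeepOps repository.

```sh
#!/usr/bin/env bash
# Hypothetical helper: prepend a directory to PATH unless it is already listed.
add_to_path() {
    case ":${PATH}:" in
        *":$1:"*) ;;                # already present, nothing to do
        *) PATH="$1:${PATH}" ;;     # otherwise put it at the front
    esac
}

# Assumption: run from the DeepOps repository root, where ./config/kfctl lives.
add_to_path "$(pwd)/config"
export PATH
```

Adding this to a shell startup file would make the change persistent; copying the binary to `/usr/local/bin` avoids the need for it entirely.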
40
42
43
+
The services can be reached from the following address:
44
+
* Kubeflow: http://mgmt:31380
45
+
41
46
## Login information
42
47
43
48
The default username is `admin@kubeflow.org` and the default password is `12341234`.
44
49
45
50
These can be modified at startup time following the steps outlined [here](https://www.kubeflow.org/docs/started/k8s/kfctl-existing-arrikto/).
51
+
52
+
## Other usage
53
+
54
+
For the most up-to-date usage information run `./scripts/k8s_deploy_kubeflow.sh -h`.
55
+
56
+
```sh
57
+
./scripts/k8s_deploy_kubeflow.sh -h
58
+
Usage:
59
+
-h This message.
60
+
-p Print out the connection info for Kubeflow.
61
+
-d Delete Kubeflow from your system (skipping the CRDs and istio-system namespace that may have been installed with Kubeflow).
62
+
-D Deprecated, same as -d. Previously 'Fully Delete Kubeflow from your system along with all Kubeflow CRDs the istio-system namespace. WARNING, do not use this option if other components depend on istio.'
63
+
-x Install Kubeflow with multi-user auth (this utilizes Dex, the default is no multi-user auth).
64
+
-c Specify a different Kubeflow config to install with (this option is deprecated).
65
+
-w Wait for Kubeflow homepage to respond (also polls for various Kubeflow Deployments to have an available status).
66
+
```
67
+
68
+
## Kubeflow Admin
69
+
70
+
### Uninstalling
71
+
72
+
To uninstall and re-install Kubeflow run:
73
+
74
+
```sh
75
+
./scripts/k8s_deploy_kubeflow.sh -d
76
+
./scripts/k8s_deploy_kubeflow.sh
77
+
```
78
+
79
+
### Modifying Kubeflow configuration
80
+
81
+
To modify the Kubeflow configuration, edit the downloaded `CONFIG` YAML file in `config/kubeflow-install/` or one of the overlay YAML files in `config/kubeflow-install/kustomize`.
82
+
83
+
After modifying the configuration, apply the changes to the cluster using `kfctl`:
84
+
85
+
```sh
86
+
cd config/kubeflow-install
87
+
../kfctl apply -f kfctl_k8s_istio.yaml
88
+
```
89
+
90
+
## Debugging common issues
91
+
92
+
### No DefaultStorageClass defined or ready
93
+
94
+
A common issue with Kubeflow installation is that no DefaultStorageClass has been defined or that Ceph has not been deployed correctly.
95
+
96
+
This can be identified when most of the Kubeflow Pods are running but the MySQL pod and several others remain in a Pending state. The GUI may also load and throw a "Profile Error". Run the following to debug further:
97
+
98
+
```sh
99
+
kubectl get pods -n kubeflow
100
+
```
101
+
> NOTE: Everything should be in a running state.
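
To pick out the problem pods quickly, the pod listing can be filtered by its STATUS column. The following is a sketch only: the function name `not_ready_pods` and the pod names in the sample are hypothetical, and it assumes the default `kubectl get pods` column layout (NAME, READY, STATUS, RESTARTS, AGE).

```sh
#!/usr/bin/env bash
# Hypothetical helper: print pods whose STATUS is neither Running nor Completed.
# Reads `kubectl get pods` output on stdin so it can be tried without a cluster.
not_ready_pods() {
    awk 'NR > 1 && $3 != "Running" && $3 != "Completed" { print $1 " (" $3 ")" }'
}

# Sample listing with made-up pod names; on a live cluster use:
#   kubectl get pods -n kubeflow | not_ready_pods
sample='NAME                     READY   STATUS    RESTARTS   AGE
centraldashboard-abc     1/1     Running   0          10m
mysql-def                0/1     Pending   0          10m'

printf '%s\n' "$sample" | not_ready_pods
```

For any pod this reports, `kubectl describe pod <name> -n kubeflow` shows the scheduling events, which typically mention an unbound PersistentVolumeClaim when storage is the cause.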
102
+
103
+
Verify Ceph is running and/or a DefaultStorageClass is defined:
104
+
105
+
```sh
106
+
kubectl get storageclass | grep default
107
+
./scripts/ceph_poll.sh
108
+
```
109
+
> NOTE: If Ceph is being used, `ceph_poll.sh` should exit after several seconds and Ceph should be the default StorageClass.
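
The StorageClass check above can also be scripted. This is a minimal sketch, not part of DeepOps: the function name `default_storageclass` is hypothetical, and the StorageClass and provisioner names in the sample are illustrative. It relies on `kubectl get storageclass` printing `(default)` after the name of the default class.

```sh
#!/usr/bin/env bash
# Hypothetical helper: print the name of the default StorageClass, if any.
# Reads `kubectl get storageclass` output on stdin; exits non-zero when none is found.
default_storageclass() {
    awk 'NR > 1 && $2 == "(default)" { print $1; found = 1 } END { exit !found }'
}

# Illustrative sample output; on a live cluster use:
#   kubectl get storageclass | default_storageclass
sample='NAME                        PROVISIONER          AGE
rook-ceph-block (default)   ceph.rook.io/block   5m
local-storage               example.com/local    5m'

if name=$(printf '%s\n' "$sample" | default_storageclass); then
    echo "Default StorageClass: ${name}"
else
    echo "No default StorageClass defined" >&2
fi
```

The non-zero exit status makes the helper usable as a pre-flight check before running the Kubeflow deployment script.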
scripts/ceph_poll.sh (3 additions & 2 deletions)
@@ -1,10 +1,11 @@
1
1
#!/usr/bin/env bash
2
+
# See https://rook.io/docs/rook/v1.1/ceph-quickstart.html
2
3
echo "Beginning to poll for Ceph and Rook setup completion."
3
4
echo "This may throw several errors and take up to 10 minutes. This behavior is expected."
4
-
5
-
rook_tools_pod=$(kubectl -n rook-ceph get pod -l app=rook-ceph-tools -o name | cut -d \/ -f2 | sed -e 's/\\r$//g')
5
+
echo "The script will stop polling when Ceph setup is completed and in a healthy state."
6
6
7
7
while true; do
8
+
rook_tools_pod=$(kubectl -n rook-ceph get pod -l app=rook-ceph-tools -o name | cut -d \/ -f2 | sed -e 's/\\r$//g')
8
9
kubectl -n rook-ceph exec -ti $rook_tools_pod ceph status # Run once to print output
9
10
kubectl -n rook-ceph exec -ti $rook_tools_pod ceph status | grep "mds: cephfs" | grep "up:active" | grep "standby-replay" # Run again to check for completion
# Download URLs and versions, note the kfctl version does not always match the manifest/config version, but best-effort should be made to keep their versions close
export CONFIG_FILE="${KF_DIR}/kfctl_k8s_istio.yaml" # Not v1.0.2 due to https://github.com/kubeflow/manifests/issues/991
32
39
33
40
41
+
34
42
function help_me() {
35
43
echo "Usage:"
36
44
echo "-h This message."
37
-
echo "-p Print out the connection info for Kubeflow"
38
-
echo "-d Delete Kubeflow from your system (skipping the CRDs and istio-system namespace that may have been installed with Kubeflow"
39
-
echo "-D Full Delete Kubeflow from your system along with all Kubeflow CRDs the istio-system namespace. WARNING, do not use this option if other components depend on istio."
45
+
echo "-p Print out the connection info for Kubeflow."
46
+
echo "-d Delete Kubeflow from your system (skipping the CRDs and istio-system namespace that may have been installed with Kubeflow)."
47
+
echo "-D Deprecated, same as -d. Previously 'Fully Delete Kubeflow from your system along with all Kubeflow CRDs the istio-system namespace. WARNING, do not use this option if other components depend on istio.'"
40
48
echo "-x Install Kubeflow with multi-user auth (this utilizes Dex, the default is no multi-user auth)."
41
-
echo "-c Specify a different Kubeflow config to install with (this option is deprecated)"
42
-
echo "-w Wait for Kubeflow homepage to respond"
49
+
echo "-c Specify a different Kubeflow config to install with (this option is deprecated)."
50
+
echo "-w Wait for Kubeflow homepage to respond (also polls for various Kubeflow Deployments to have an available status)."
43
51
}
44
52
45
53
@@ -65,6 +73,7 @@ function get_opts() {
65
73
D)
66
74
KUBEFLOW_DELETE=true
67
75
KUBEFLOW_FULL_DELETE=true
76
+
echo "The -D flag is deprecated, use -d instead"
68
77
;;
69
78
Z)
70
79
# This is a dangerous command and is not included in the help
@@ -125,6 +134,27 @@ function install_dependencies() {
# TODO: This kfctl delete seems to be failing occasionally with the cert-manager ns (due to a Kubeflow config bug)
174
+
# XXX: We manually delete the mpijobs crd because this is currently installed outside of the kfctl apply
175
+
echo "kubectl delete crd mpijobs.kubeflow.org; cd ${KF_DIR} && ${KFCTL} delete -V -f ${CONFIG_FILE} --force-deletion --delete_storage; cd && sudo rm -rf ${KF_DIR}" > ${KUBEFLOW_DEL_SCRIPT}
145
176
chmod +x ${KUBEFLOW_DEL_SCRIPT}
146
177
147
178
# Initialize and apply the Kubeflow project using the specified config. We do this in two steps to allow a chance to customize the config
148
179
cd ${KF_DIR}
149
180
${KFCTL} build -V -f ${CONFIG_URI}
150
181
182
+
# Occasionally kfctl will fail; if this occurs, halt the installation
183
+
if [ $? != 0 ]; then
184
+
echo -e "\nDeepOps ERROR: Failure building Kubeflow Manifest at ${CONFIG_URI} in ${KF_DIR}"
185
+
exit 1
186
+
fi
187
+
188
+
sed -i '/metadata:.*/a\ ClusterName: cluster.local' ${CONFIG_FILE} # BUGFIX: Need to add the ClusterName for proper deletion: https://github.com/kubeflow/kubeflow/issues/4815
189
+
151
190
# Update Kubeflow with the NGC containers and NVIDIA configurations
152
-
${SCRIPT_DIR}/update_kubeflow_config.py
191
+
# BUG: Commented out until NGC containers add Kubeflow support, see https://github.com/NVIDIA/deepops/tree/master/containers/ngc
192
+
# ${SCRIPT_DIR}/update_kubeflow_config.py
153
193
154
194
# XXX: Add potential CONFIG customizations here before applying
155
195
${KFCTL} apply -V -f ${CONFIG_FILE}
@@ -170,17 +210,20 @@ function tear_down() {
170
210
# Kubeflow use leads to some user created namespaces that are not torn down during kfctl delete
171
211
namespaces="kubeflow"
172
212
173
-
# Delete other NS that were installed. These might be part of other apps and is slightly dangerous
# # delete all namespaces, including namespaces that "should" already have been deleted by kfctl delete
221
+
# echo "Re-deleting namespaces ${namespaces} for a full cleanup"
222
+
# kubectl delete ns ${namespaces}
223
+
# # These should probably be deleted by kfctl, but they are not
224
+
# kubectl delete crd -l app.kubernetes.io/part-of=kubeflow -o name
225
+
# kubectl delete all -l app.kubernetes.io/part-of=kubeflow --all-namespaces
226
+
#fi
184
227
185
228
# There is an issue in the kfctl delete command: it does not properly clean up and leaves namespaces in a terminating state. This is a bit hacky but resolves it
186
229
if [ "${KUBEFLOW_EXTRA_FULL_DELETE}" == "true" ]; then
@@ -189,17 +232,17 @@ function tear_down() {
189
232
fix_terminating_ns ${namespaces}
190
233
fi
191
234
192
-
if [ "${KUBEFLOW_FULL_DELETE}" == "true" ]; then
193
-
# These should probably be deleted by kfctl, but they are not
194
-
kubectl delete crd -l app.kubernetes.io/part-of=kubeflow -o name
195
-
kubectl delete all -l app.kubernetes.io/part-of=kubeflow --all-namespaces