README

This repo can be used as a starter kit to set up a fully Git-integrated Machine Learning Operations (MLOps) environment using Cloud Pak for Data and (in the future) watsonx. As an example, it uses a simple "credit score prediction" use case split across 4 Jupyter notebooks, which can easily be adapted to your business problem.

It tries to be as simple as possible while showing the basic concepts of MLOps using IBM tools. The intended use is that, after you have set everything up and familiarized yourself with the concepts, you throw out all the "credit score prediction" code and replace it with whatever problem you are trying to solve.

[Figure: high-level overview using three stages]

Setup Instructions

These instructions will guide you through the setup of a simple MLOps environment that uses just two stages ("dev" and "prod"). The setup can be easily extended to more stages if needed.

It is assumed that you have a "Cloud Pak for Data" instance available and that you have admin rights to it. (This will not work with the cloud-based "as a Service" offering.)

[Figure: detailed view using two stages]

1. Fork this repo

Click the "Fork" button in the upper right corner of this repo. IMPORTANT: uncheck the "only fork the master branch" checkbox. This will create a copy of this repo in your own GitHub account. We will be using this copy in the following steps.

2. Create one git-enabled project called "00-datascience-playground"

Overview

[Figure: the project that we are creating in this step]

Step by step

Navigate to "All projects" and create a project that is "integrated with git". In the next window we will need to provide the GitHub repo address and a private access token, so let's create that token first.

Navigate to https://github.com/settings/tokens and choose "Generate new token". Give it a name and select the "repo" scope as shown in the next image.

Copy the generated token to your clipboard. You will not be able to see it again after you close the window.

Make this token available within your CP4D by creating a "New Token" and using the token you just created. Once you have created it, use the dropdown to select it.

Add the Repo URL (don't forget the .git at the end ;-) and choose the main branch. Then hit "Create".

Use the GitHub repo address and your private access token.

You can alter the notebooks to fit your needs if you want to. It is important that you keep the names of the notebooks unchanged.
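The reminder about the trailing .git can be captured in a small helper. This is only a sketch in Python (the language of the notebooks); the function name ensure_git_suffix is made up for illustration and is not part of this repo:

```python
def ensure_git_suffix(repo_url):
    """Return the repo URL with surrounding whitespace and trailing slashes removed, ending in '.git'."""
    url = repo_url.strip().rstrip("/")
    # Only append the suffix when it is missing, so the helper is safe to call twice.
    return url if url.endswith(".git") else url + ".git"


if __name__ == "__main__":
    # Placeholder address; replace it with the URL of your own fork.
    print(ensure_git_suffix("https://github.com/YOUR_USER/YOUR_FORK"))
```

The result is the value to paste into the "Repo URL" field when creating the project.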

3. Create one git-enabled project called "01-staging-area"

Overview

[Figure: the project that we are creating in this step]

Step by step

Navigate to all projects: in your CP4D instance, you reach the project overview by clicking the "Projects" icon in the upper left corner. Then click "New Project" and select "Create a project integrated with a Git repository". Give it the name "01-staging-area" and select "create".

Use the same GitHub repo address and private access token as in step 2.

4. Configure custom environment in "01-staging-area"

TODO: Add description here! (use custom_env.yaml)

5. Configure Jobs in "01-staging-area"

Overview

[Figure: the project that we are creating in this step]

Step by step

Navigate to "view local branch".

Click "New code job".

Choose the first notebook "00-git-pull.ipynb" and click "configure job".

Give it the same name as the notebook and click "next".

TODO: choose correct environment for every job

Accept all the defaults and click "next" until you can click "create job".

TODO: add the "was_successful" output to every job

Repeat those steps for all six notebooks.

Once you are done it should look like this.

We also need to create a .env file within the "01-staging-area" project. This file will contain the credentials that the pipeline uses to pull the code from GitHub.

Click "Launch IDE" and then "JupyterLab" to get access to the JupyterLab environment.

You will be greeted by a tab called "Terminal 1". Paste the following commands there and press Enter:

echo "repo_adresse=PUT_YOUR_REPO_ADDRESS_HERE" > .env
echo "personal_access_token=PUT_YOUR_TOKEN_HERE" >> .env
echo "project_id=PUT_YOUR_PROJECT_ID_HERE" >> .env
echo "branch_name=main" >> .env
echo "cpd_technical_user=PUT_USERNAME_HERE" >> .env
echo "cpd_technical_user_password=PUT_PASSWORD_HERE">> .env
echo "cpd_url=PUT_URL_HERE">> .env

cpd_technical_user is a user that was created solely to act as a proxy in those scripts. If no such user is available, you can also use a personal user (i.e. the credentials you use to log in), even though this is not best practice.

You can check if everything worked by typing

cat .env

If that command displays the content of the .env file you are good to go.
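If you want a stricter check than cat, the file can also be inspected from a notebook or the terminal. The following is a minimal sketch, not code from this repo: the helper names parse_env and missing_or_placeholder are made up for illustration, and it assumes the .env file contains exactly the seven keys written by the commands above.

```python
from pathlib import Path

# Keys written by the echo commands above (spelling kept exactly as in the file).
REQUIRED_KEYS = [
    "repo_adresse",
    "personal_access_token",
    "project_id",
    "branch_name",
    "cpd_technical_user",
    "cpd_technical_user_password",
    "cpd_url",
]


def parse_env(text):
    """Parse KEY=VALUE lines into a dict, skipping blank lines and comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        # Split on the first "=" only, so values such as URLs may contain "=".
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def missing_or_placeholder(values):
    """Return keys that are absent, empty, or still hold a PUT_..._HERE placeholder."""
    problems = []
    for key in REQUIRED_KEYS:
        value = values.get(key, "")
        if not value or (value.startswith("PUT_") and value.endswith("_HERE")):
            problems.append(key)
    return problems


if __name__ == "__main__":
    env_path = Path(".env")
    if env_path.exists():
        problems = missing_or_placeholder(parse_env(env_path.read_text()))
        print("all keys set" if not problems else f"check these keys: {problems}")
    else:
        print(".env not found in the current directory")
```

Run it from the directory that contains the .env file. It only reports key names and never prints the secret values.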

5. Create a NON-git-enabled project called "02-automation-area"

Overview

[Figure: the project that we are creating in this step]

Step by step

Repeat the same steps as in 2 and 3, but choose "create an empty project" to create a NON-git-enabled project. Name it "02-automation-area".

6. Configure pipeline in "02-automation-area"

Overview

[Figure: the pieces that we are creating in this step]

Step by step

TODO: add global parameters

Click "New Asset" and choose "Pipeline". Name the pipeline "mlops_pipeline".

Go to "Run">"Run Notebook Job" and drag it onto the canvas. Then double-click this newly created node and click "select Job".

Choose "01-staging-area", then the first notebook "00-git-pull.ipynb", click "choose" and then "save".

TODO: choose environment

TODO: add pipeline params

Repeat those steps for all notebooks until you end up with something that looks like this.

Click "Run Pipeline" and then "create job". Give it a name like "mlops_pipeline_job". IMPORTANT: The GitHub Action assumes that you only have ONE job in this project. If you have more than one job you will need to change the GitHub Action accordingly.
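To illustrate the ONE-job assumption: a script that starts the pipeline could refuse to guess when the project holds several jobs. This is a minimal sketch, assuming the jobs have already been fetched into a list of dicts. The function name select_single_job and the "name" and "id" fields are hypothetical; they do not mirror the actual CP4D API response or the GitHub Action in this repo.

```python
def select_single_job(jobs):
    """Return the only job in `jobs`, or raise if the project has zero or several."""
    if len(jobs) != 1:
        # Listing the names makes it obvious which extra jobs need to be removed.
        names = [job.get("name", "<unnamed>") for job in jobs]
        raise ValueError(f"expected exactly one job, found {len(jobs)}: {names}")
    return jobs[0]


if __name__ == "__main__":
    # Hypothetical job record, for illustration only.
    jobs = [{"name": "mlops_pipeline_job", "id": "hypothetical-job-id"}]
    print(select_single_job(jobs)["name"])
```

Failing loudly here is a deliberate choice: silently picking the first of several jobs could trigger the wrong pipeline.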

7. Set up GitHub Actions

Overview

[Figure: the piece that we are creating in this step]

Step by step

We need a set of secrets to be able to run the GitHub Actions. Those secrets are:

API_KEY
USER_NAME
CLUSTER_URL
PROJECT_ID
PERSONAL_ACCESS_TOKEN_GITHUB

We will now go through all of them step by step.

For each secret, navigate to your fork of the GitHub repo, then "Settings">"Secrets and variables">"Actions">"New repository secret".

7.1. Retrieving your CP4D API_KEY and USER_NAME

Go to the "profile and settings" tab in your CP4D instance.

Copy the API key to your clipboard and write it down somewhere. You will not be able to see it again after you close the window.

Go back to GitHub and create a new repository secret called "API_KEY".

Also create the repository secret "USER_NAME" using the username that you use to log in to your CP4D instance.

7.2. Retrieving your CP4D CLUSTER_URL

This one is simple :-)

Just take the URL of the cluster that you have been working on and use it to create a secret called "CLUSTER_URL".

7.3. Retrieving your CP4D PROJECT_ID

7.4. Retrieving your GitHub PERSONAL_ACCESS_TOKEN_GITHUB

You can use the same token you used in step 2. If you don't have it anymore, you can create a new one by following the steps in 2.
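Once the five secrets from step 7 are in place, a workflow step could fail fast when one of them is missing. This is a minimal sketch, assuming the workflow exposes each secret as an environment variable of the same name. That is an assumption, since the workflow file is not shown here, and the function name missing_secrets is made up for illustration.

```python
import os

# Secret names as listed at the start of step 7.
REQUIRED_SECRETS = [
    "API_KEY",
    "USER_NAME",
    "CLUSTER_URL",
    "PROJECT_ID",
    "PERSONAL_ACCESS_TOKEN_GITHUB",
]


def missing_secrets(environ=None):
    """Return the names of required secrets that are unset or empty."""
    # Default to the real process environment; a dict can be passed for testing.
    environ = os.environ if environ is None else environ
    return [name for name in REQUIRED_SECRETS if not environ.get(name)]


if __name__ == "__main__":
    missing = missing_secrets()
    if missing:
        # Only names are printed, never the secret values themselves.
        raise SystemExit(f"missing secrets: {', '.join(missing)}")
    print("all secrets present")
```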

8. Create deployment space

TODO: describe how to create deployment space

9. Set up monitoring using OpenScale

TODO: describe how to set up open scale

10. Try it out :-)

11. Future Work and known issues

Future Work:
- Put AI Factsheets back into the "03-train_model" notebook
- Figure out what is wrong with the deployments and fix it
- Figure out what is wrong with monitoring (probably an issue with the cluster we use)
- Finish documentation of steps 8 (deployment space) and 9 (monitoring with OpenScale)
- Delete all projects and set everything up again according to the documentation to find what is missing (~ one day of work)
- Describe how good user management can work (e.g. normal users can only see the "00-datascience-playground" project)
- Integrate Model Inventory / model versioning

Known Issues
