This repo can be used as a starter kit to set up a fully Git-integrated Machine Learning Operations (MLOps) environment using Cloud Pak for Data and (in the future) watsonx. As an example, it uses a simple "credit score prediction" use case that is split up into 4 Jupyter notebooks, which can easily be adapted to your business problem.
It tries to be as simple as possible while showing the basic concepts of MLOps using IBM tools. The intended use is that, after you have set everything up and familiarized yourself with the concepts, you throw out all the "credit score prediction" code and replace it with whatever problem you are trying to solve.
high level overview using three stages
These instructions will guide you through the setup of a simple MLOps environment that uses just two stages ("dev" and "prod"). The setup can be easily extended to more stages if needed.
It is assumed that you have a "Cloud Pak for Data" (CP4D) instance available and that you have admin rights to it (this will not work with the cloud-based "as a Service" offering).
detailed view using two stages
Click the "Fork" button in the upper right corner of this repo. IMPORTANT: uncheck the "only fork the master branch" checkbox. This will create a copy of this repo in your own GitHub account. We will be using this copy in the following steps.
this is the project that we are creating in this step
navigate to all projects
Create a project that is "integrated with git". In the next window we will need to provide the GitHub repo address and a personal access token, so let's create that token first.
Navigate to https://github.com/settings/tokens and choose "Generate new token". Give it a name and select the "repo" scope as shown in the next image.
Copy the generated token to your clipboard. You will not be able to see it again after you close the window.
Make this token available within your CP4D instance by creating a "New Token" and pasting the token you just generated. Once you have created it, use the dropdown to select it.
Add the repo URL (don't forget the .git at the end ;-) and choose the main branch. Then hit "Create".
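Because a missing `.git` suffix is easy to overlook, a small helper can normalize the address before you paste it. This is only a minimal sketch: the function name `ensure_git_suffix` and the example address are hypothetical and not part of this repo.

```python
def ensure_git_suffix(repo_url: str) -> str:
    """Return the repo URL with surrounding whitespace and trailing
    slashes removed and a '.git' suffix appended if it is missing."""
    cleaned = repo_url.strip().rstrip("/")
    if not cleaned.endswith(".git"):
        cleaned += ".git"
    return cleaned


# Hypothetical fork address, used only for illustration.
print(ensure_git_suffix("https://github.com/your-user/your-fork"))
```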
Use the GitHub repo address and your personal access token. You can alter the notebooks to suit your needs, but it is important that you keep the names of the notebooks unchanged.
this is the project that we are creating in this step
navigate to all projects
In your CP4D instance, you access the project overview by clicking on the "Projects" icon in the upper left corner. Then click on "New Project" and select "Create a project integrated with a Git repository". Give it the name "01-staging-area" and select "create".
Use the same GitHub repo address and personal access token as in step 2.
TODO: Add description here! (use custom_env.yaml)
this is the project that we are creating in this step
Navigate to "view local branch".
Choose the first notebook "00-git-pull.ipynb" and click "configure job".
Give it the same name as the notebook and click "next".
TODO: choose correct environment for every job
Accept all the defaults and click "next" until you can click "create job".
TODO: add the "was_successful" output to every job
Repeat those steps for all six notebooks.
Once you are done, it should look like this.
We also need to create a .env file within the "01-staging-area" project. This file will contain the credentials that the pipeline will use to pull the code from GitHub.
Click "Launch IDE" and then "JupyterLab" to get access to the JupyterLab environment.
You will be greeted by a tab called "Terminal 1". Copy the following commands into it and hit enter:
echo "repo_adresse=PUT_YOUR_REPO_ADDRESS_HERE" > .env
echo "personal_access_token=PUT_YOUR_TOKEN_HERE" >> .env
echo "project_id=PUT_YOUR_PROJECT_ID_HERE" >> .env
echo "branch_name=main" >> .env
echo "cpd_technical_user=PUT_USERNAME_HERE" >> .env
echo "cpd_technical_user_password=PUT_PASSWORD_HERE" >> .env
echo "cpd_url=PUT_URL_HERE" >> .env
cpd_technical_user is a user that was created only to be used as a proxy in those scripts. If no such user is available, you can also use a personal user (i.e. the credentials you use to log in), even though this is not best practice.
You can check if everything worked by typing
cat .env
If that command displays the content of the .env file you are good to go.
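Beyond eyeballing the output of `cat`, you could also check programmatically that every key is present and that no `PUT_..._HERE` placeholder was left in. The following is a minimal sketch, assuming the simple `key=value` layout written by the `echo` commands above; the helper names `parse_env` and `missing_or_placeholder` are hypothetical, and the notebooks in this repo may read the file differently.

```python
from pathlib import Path

# Keys written by the echo commands in this step (spelling kept as-is).
REQUIRED_KEYS = (
    "repo_adresse",
    "personal_access_token",
    "project_id",
    "branch_name",
    "cpd_technical_user",
    "cpd_technical_user_password",
    "cpd_url",
)


def parse_env(text: str) -> dict:
    """Parse simple key=value lines; blank lines and '#' comments are skipped."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")  # split on the first '=' only
        values[key.strip()] = value.strip()
    return values


def missing_or_placeholder(values: dict) -> list:
    """Return required keys that are absent, empty, or still a PUT_... placeholder."""
    return [
        key
        for key in REQUIRED_KEYS
        if not values.get(key) or values[key].startswith("PUT_")
    ]


if __name__ == "__main__":
    env_file = Path(".env")
    if env_file.exists():
        problems = missing_or_placeholder(parse_env(env_file.read_text()))
        print("All keys set." if not problems else f"Fix these keys: {problems}")
    else:
        print(".env not found in the current directory.")
```

Run it from the same directory in which you created the .env file. It never prints the values themselves, so no credentials end up in the terminal output.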
this is the project that we are creating in this step
Repeat the same steps as in 2 and 3, but choose "create an empty project" to create a NON-git-enabled project. Name it "02-automation-area".
those are the pieces we are creating in this step
TODO: add global parameters
Click "New Asset" and choose "Pipeline". Name the pipeline "mlops_pipeline"
Go to "Run" > "Run Notebook Job" and drag it onto the canvas. Then double-click this newly created node and click "select Job".
Choose "01-staging-area", then the first notebook "00-git-pull.ipynb", and click "choose" and then "save".
TODO: choose environment. TODO: add pipeline params
Repeat those steps for all notebooks until you end up with something that looks like this.
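Conceptually, the pipeline runs the notebook jobs one after another. The sketch below illustrates that control flow in plain Python under one loud assumption: each step reports a boolean `was_successful` (the TODO in the job setup mentions such an output, but this document does not define it), and the chain stops at the first failure. It does not call Cloud Pak for Data; `run_in_order` and the step names are hypothetical.

```python
from typing import Callable, List, Tuple


def run_in_order(steps: List[Tuple[str, Callable[[], bool]]]) -> List[str]:
    """Run steps sequentially and stop after the first one that reports failure.

    Returns the names of the steps that were executed (including a failed one).
    """
    executed = []
    for name, step in steps:
        executed.append(name)
        was_successful = step()  # assumed boolean outcome of the step
        if not was_successful:
            break
    return executed


# Placeholder steps standing in for notebook jobs.
demo_steps = [
    ("first-step", lambda: True),
    ("second-step", lambda: False),  # simulated failure
    ("third-step", lambda: True),    # never reached
]
print(run_in_order(demo_steps))
```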
Click "Run Pipeline" and then "create job". Give it a name like "mlops_pipeline_job". IMPORTANT: The GitHub action assumes that you only have ONE job in this project. If you have more than one job, you will need to change the GitHub action accordingly.
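If you adapt the action, it may help to make the "exactly one job" assumption explicit so that it fails with a clear message instead of silently picking the wrong job. This is a minimal sketch, not the code used by this repo's GitHub action: `select_only_job` is a hypothetical helper, and the list-of-dicts shape with `name` and `id` keys is an assumption about how you hold the job list in memory.

```python
def select_only_job(jobs: list) -> dict:
    """Return the single job in the list, or raise if there is not exactly one."""
    if len(jobs) != 1:
        names = [job.get("name", "<unnamed>") for job in jobs]
        raise ValueError(
            f"Expected exactly one job in the project, found {len(jobs)}: {names}"
        )
    return jobs[0]


# Illustrative data only; the id is made up.
job = select_only_job([{"name": "mlops_pipeline_job", "id": "example-id"}])
print(job["name"])
```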
this is the piece that we are creating in this step
We need a set of secrets to be able to run the GitHub actions. Those secrets are:
- API_KEY
- USER_NAME
- CLUSTER_URL
- PROJECT_ID
- PERSONAL_ACCESS_TOKEN_GITHUB
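A script started by the workflow could check up front that all five secrets arrived. The sketch below assumes the workflow passes each secret to the script as an environment variable of the same name (GitHub Actions only exposes secrets that the workflow maps explicitly, and the workflow file is not shown here); `missing_secrets` is a hypothetical helper.

```python
import os

# Secret names as listed above.
REQUIRED_SECRETS = (
    "API_KEY",
    "USER_NAME",
    "CLUSTER_URL",
    "PROJECT_ID",
    "PERSONAL_ACCESS_TOKEN_GITHUB",
)


def missing_secrets(environ=None) -> list:
    """Return the names of required secrets that are unset or empty."""
    environ = os.environ if environ is None else environ
    return [name for name in REQUIRED_SECRETS if not environ.get(name)]


if __name__ == "__main__":
    missing = missing_secrets()
    # Only names are printed, never the secret values.
    print("All secrets present." if not missing else f"Missing secrets: {missing}")
```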
We will now go through all those step by step:
Navigate to your fork of the GitHub repo, then "Settings" > "Secrets and variables" > "actions" > "new repository secret".
Go to the "profile and settings" tab in your CP4D instance.
Copy the API key to your clipboard (and write it down somewhere; you will not be able to see it again after you close the window).
Go back to GitHub and create a new repository secret called "API_KEY".
Also create the repository secret USER_NAME using the username that you use to log in to your CP4D instance.
Take the URL of the cluster that you have been working on
and use it to create a secret called "CLUSTER_URL".
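To see how API_KEY, USER_NAME and CLUSTER_URL fit together, the sketch below builds (but does not send) a token request. It assumes the CP4D platform endpoint `/icp4d-api/v1/authorize` with a JSON body containing `username` and `api_key`; please verify this against the API documentation of your CP4D version, since this repo's action may authenticate differently. `build_authorize_request` and the example values are hypothetical.

```python
import json
import urllib.request


def build_authorize_request(cluster_url: str, user_name: str, api_key: str):
    """Build a POST request for a bearer token without sending it.

    Endpoint path and payload keys are assumptions, see the note above.
    """
    url = cluster_url.rstrip("/") + "/icp4d-api/v1/authorize"
    payload = json.dumps({"username": user_name, "api_key": api_key}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Made-up values for illustration; real ones come from the repository secrets.
request = build_authorize_request("https://cpd.example.com/", "some-user", "some-key")
print(request.get_method(), request.full_url)
```

Sending the request (for example with `urllib.request.urlopen`) is left out on purpose, so the snippet can be run without access to a cluster.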
For PERSONAL_ACCESS_TOKEN_GITHUB, you can use the same token you used in step 2. If you don't have it anymore, you can create a new one by following the steps in 2.
TODO: describe how to create deployment space
TODO: describe how to set up open scale
Future Work:
- Put AI Factsheets back into the "03-train_model" notebook
- Figure out what is wrong with the deployments and fix it
- Figure out what is wrong with monitoring (probably an issue with the cluster we use)
- Finish documentation of 8. Create deployment space and 9. Setup monitoring using open scale
- Delete all projects and set everything up again according to the documentation to find what is missing (~ one day of work)
- Describe how good user management can work (e.g. normal users can only see the "01_data_science_playground" project)
- Integrate model inventory / model versioning