HMPPS Integration API Runbook

This runbook documents how this service is supported, as described in MOJ Runbooks.

Last review date: 15/12/2023

About the service

This service is composed of a set of long-lived API interfaces to share person-related MOJ data with external consumers. Prison (Nomis, DPS) and probation (Delius) data is combined from upstream APIs into one cohesive response while masking the source.

Tech stack

Containerised Kotlin Spring Boot application running on Cloud Platform’s Kubernetes cluster (eu-west-2).

AWS API Gateway sits in front of this service with mutual TLS authentication. The service does not persist any data and is purely a facade API.

Service URLs

Development: https://dev.integration-api.hmpps.service.justice.gov.uk
Pre-Production: https://preprod.integration-api.hmpps.service.justice.gov.uk
Production: https://integration-api.hmpps.service.justice.gov.uk

Incident response hours

Office hours, usually 9am-6pm on working days.

Incident contact details

hmpps-integration-api@digital.justice.gov.uk

Service team contact

The service team can be reached on MOJ Slack: #ask-hmpps-integration-api

Other URLs

Application source code

https://github.com/ministryofjustice/hmpps-integration-api

Documentation

Source: https://github.com/ministryofjustice/hmpps-integration-api-docs

OpenAPI Specification: https://ministryofjustice.github.io/hmpps-integration-api-docs/documentation/api/index.html

Cloud platform infrastructure as code

Development
Pre-production
Production

Monitoring and alerting dashboards

Development
Pre-production
Production

Expected speed and frequency of releases

Trunk-based development and continuous integration are practised on this service. If changes pass all automated tests, they are deployed all the way to production. There is no change request process and the delivery team pushes to production regularly (multiple times a day on average).

Automatic alerts

A number of automatic alerts are delivered to the Slack channel #hmpps-integration-api-alerts.

These include:

Documentation up for review
Security scan results (Trivy, OWASP, Vera)
Application exceptions from Sentry
Failed CircleCI automated tests, image builds, deployments and system heartbeat

Impact of an outage

Since we have a variety of consumers, the impact will be different for each of them. In all cases it would prevent civil servants from doing their work and the impact would be quite significant.

Restrictions on access

Consumers need to be onboarded and must authenticate using mutual TLS. They also need to send a pre-shared key (AWS API Gateway API Key) as a header for identification before being allowed to access the service.

Once authenticated, there is an authorisation step at the application level to ensure the consumers are allowed to access the requested resources. There are no IP restrictions in place.
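The two credentials described above can be illustrated with a minimal Python sketch. It assumes the API key is sent in the `x-api-key` header, which is where AWS API Gateway reads API keys from. The helper names, file paths and key value are hypothetical placeholders, not the service's actual client code.

```python
import ssl
import urllib.request


def build_request(url: str, api_key: str) -> urllib.request.Request:
    """Attach the pre-shared API key as the x-api-key header."""
    return urllib.request.Request(url, headers={"x-api-key": api_key})


def build_mtls_context(cert_file: str, key_file: str) -> ssl.SSLContext:
    """Create a TLS context that presents a client certificate (mutual TLS)."""
    context = ssl.create_default_context()
    # Hypothetical paths: the consumer's own certificate and private key.
    context.load_cert_chain(certfile=cert_file, keyfile=key_file)
    return context


def call_api(url: str, api_key: str, cert_file: str, key_file: str) -> int:
    """Send the request and return the HTTP status code.

    urllib raises urllib.error.HTTPError for 4xx and 5xx responses,
    so a rejected request surfaces as an exception rather than a return value.
    """
    request = build_request(url, api_key)
    context = build_mtls_context(cert_file, key_file)
    with urllib.request.urlopen(request, context=context, timeout=10) as response:
        return response.status
```

A request that fails either check is rejected before it reaches the application, so reproducing a consumer's call this way can help separate a credential problem from an application-level authorisation problem.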

How to resolve specific issues

Errors reported to Slack from our monitoring dashboards

There are a number of errors that can be raised on Slack, too many to capture in this document. Below are some of the more common errors that could be raised, along with some basic guidance to troubleshoot.

API Gateway errors

The API Gateway logs contain more details about each error. To see them, log into AWS and follow the link to the API Gateway CloudWatch logs, which is stored in Kubernetes secrets under aws-services.

API GW 4xx Error

If it's a 403 Forbidden error, it indicates that the request was not authenticated or not authorised to access the requested resource.

Establish who is trying to access the service, then check:

Check mutual TLS authentication and ensure the certificates are valid.
Check that the API key is correct by comparing it against the value for the consumer in Kubernetes secrets under consumer-api-keys.
Check that the client certificate contains the correct Common Name (CN), and that the CN has been added to the Spring Boot application properties, which list the allowed paths.
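The certificate validity check above can be partly scripted. The following is a minimal sketch that assumes the expiry date has already been read from the certificate, for example with `openssl x509 -noout -enddate`, which prints a line such as `notAfter=Jan  5 09:34:43 2018 GMT`. The function name is hypothetical.

```python
import ssl
import time
from typing import Optional


def days_until_expiry(not_after: str, now: Optional[float] = None) -> float:
    """Return the number of days left before a certificate expires.

    Accepts either the bare date or the full "notAfter=..." line.
    A negative result means the certificate has already expired.
    """
    date_text = not_after.split("=", 1)[-1].strip()
    # ssl.cert_time_to_seconds parses the "Mon DD HH:MM:SS YYYY GMT" format.
    expires_at = ssl.cert_time_to_seconds(date_text)
    current = time.time() if now is None else now
    return (expires_at - current) / 86400
```

A negative or very small result points to certificate expiry as the likely cause of the rejection; a healthy result means the API key and Common Name checks are the next things to look at.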

If it's a 404 error, check what path was requested and why the API was unable to serve the request. It could be a mistyped URL on the client side.

API GW 5xx Errors

Establish what sort of 5xx error has occurred by checking the logs. For general debugging of 5xx errors, see the AWS guidance on troubleshooting 5xx errors for API Gateway.

API GW Client Error, API GW ExecutionError, API GW integrationError

When any of these errors occur, and no obvious recent changes have been made that could cause this, it is recommended to re-deploy the API Gateway and application. Get in touch with Cloud Platform as they will have to approve any pull requests to run the infrastructure pipeline.

It could also be an issue on the AWS side. The AWS service status page should be checked as well.

API GW integrationLatency and latency

This alert may not indicate downtime of the service, but it should be closely monitored because sustained latency can become disruptive. If no obvious recent changes have been made that could cause this, re-deploy the application. It could also be an issue on the AWS side, so check the AWS service status page as well.
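When comparing the two metrics, it helps to remember that API Gateway's latency figure includes the integration latency plus the gateway's own overhead. The sketch below splits one pair of readings on that basis. The function and key names are hypothetical, and the readings are assumed to be in milliseconds for the same time window.

```python
def latency_breakdown(latency_ms: float, integration_latency_ms: float) -> dict:
    """Split total API Gateway latency into backend time and gateway overhead."""
    gateway_overhead_ms = latency_ms - integration_latency_ms
    # Share of the total time spent waiting for the backend integration.
    backend_share = integration_latency_ms / latency_ms if latency_ms > 0 else 0.0
    return {
        "gateway_overhead_ms": gateway_overhead_ms,
        "backend_share": backend_share,
    }
```

A backend share close to 1 suggests the time is being spent in the application or its upstream APIs, while a large gateway overhead points towards the AWS side.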

Latency

This alert may not indicate downtime of the service, but it should be closely monitored because sustained latency can become disruptive. If no obvious recent changes have been made that could cause this, re-deploy the application. This API integrates with a number of upstream APIs, so the latency could be coming from one of them.

Blocked request

If a large number of blocked requests are observed, it may indicate that someone is trying to gain unauthorised entry. The service is only consumed in the UK, so any requests from IP addresses known to be in other countries should be investigated.

CPU > 80%

When the CPU is under strain, it may indicate a number of issues. It may be that there is a high volume of legitimate requests being served, or it can indicate a bug in the service. First trigger a deployment through the build pipeline to rebuild the application. If this doesn't solve the problem, consider adding more pods to ease the load while the problem is investigated.

memory > 80%

When the memory is under strain, it may indicate a number of issues. It may be that there is a high volume of legitimate requests being served, or it can indicate a bug or memory leak in the service. First trigger a deployment through the build pipeline to rebuild the application. If this doesn't solve the problem, consider adding more pods to ease the load while the problem is investigated.
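For the CPU and memory alerts above, a rough way to estimate how many pods to add is the proportional rule used by the Kubernetes Horizontal Pod Autoscaler: scale the current replica count by the ratio of observed to target utilisation and round up. The sketch below is an estimate only. The function name and the target value are hypothetical, and the minimum of 2 reflects the high-availability requirement in this runbook rather than any configured autoscaling.

```python
import math


def suggested_replicas(
    current_replicas: int,
    current_utilisation: float,
    target_utilisation: float,
    minimum: int = 2,
) -> int:
    """Estimate a replica count that brings utilisation back to the target.

    Utilisation values are percentages, for example 90 for 90%.
    """
    desired = math.ceil(current_replicas * current_utilisation / target_utilisation)
    # Never suggest fewer pods than the high-availability minimum.
    return max(minimum, desired)
```

For example, 2 pods at 90% with a hypothetical target of 60% gives 3 pods. This only eases load from legitimate traffic; a bug or memory leak will eventually saturate the new pods as well.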

less than 2 pods running

At any given time (for any environment), there should be at least 2 pods running for high availability. The service also has a zero downtime deployment strategy in place, which means that traffic is redirected to new pods before old pods are gracefully terminated. Check whether any obvious recent changes have been made that could have caused this regression. Roll back or forward with a fix if necessary and trigger another deployment through the build pipeline.
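To confirm the alert, the number of healthy pods can be counted from the JSON that `kubectl get pods -o json` prints. The sketch below is a minimal example that treats a pod as healthy when its phase is `Running` and its `Ready` condition is `True`. The function name is hypothetical, and the namespace to query is not specified here.

```python
import json


def count_ready_pods(pod_list_json: str) -> int:
    """Count pods that are Running and Ready in `kubectl get pods -o json` output."""
    pods = json.loads(pod_list_json).get("items", [])
    ready = 0
    for pod in pods:
        status = pod.get("status", {})
        conditions = status.get("conditions", [])
        is_ready = any(
            c.get("type") == "Ready" and c.get("status") == "True"
            for c in conditions
        )
        if status.get("phase") == "Running" and is_ready:
            ready += 1
    return ready
```

A result below 2 confirms the alert. Pods that are Running but not Ready usually indicate failing readiness checks rather than missing replicas.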
