EWC backups and restore & token renewal cron job (#10)
* install etcdctl, add api6 snapshot script, move common logic to its own file
* add apisix backup job, use common namespace for shared secret and use role and role binding to read the values
* vault backup changes, local value for etcd cluster host address
* add vault, and apisix related variables
* use general image meant for all cron jobs
* fix vault service account reference
* revert to namespace-specific secrets, common one with roles and bindings did not work as expected
* change backup bucket paths
* change kubernetes host to match cluster DNS search paths
* add postgresql-client
* add keycloak snapshot script
* add keycloak backup cron job to dev-portal module
* pass new vars to dev-portal module
* mv cron-jobs to jobs, add action to build & push jobs image to registry
* rename cron_jobs to jobs since restore jobs are not cron jobs
* update job scripts
* update restore jobs
* add initial disaster recovery part
* update vault keys needed when recreating for data restore
* update
* renaming resources, cleaning up
* parameterize replicaCounts, apisix helm chart etc.
* run fmt
* parameterize keycloak stuff
* use single s3 backup base path var and construct the rest in job def
* add renewal job, change role name to reflect its more general purpose
* parameterize more
* use parameterized apisix and vault values
* parameterize vault helm release name
* fix api6 helm release name
* use parameterized helm release name
* use release name from param
* use apisix helm release name
* suppress default stdout (quite noisy), check init status for each pod
* compress or decompress snapshot files, improve logging and error handling
* update
* remove hard-coded role and its reference from the terraform resource, separate policies and gather them in a role
* Pass renewable token to job as an array
* fix token var
* Store secret as array
---------
Co-authored-by: Joona Halonen <joona.halonen@cgi.com>
README.md (146 additions, 1 deletion)
```
vault_unseal_keys = <sensitive>
```

> [!IMPORTANT]
> Make sure to store `vault_root_token`, `vault_unseal_keys` and `dev-portal_keycloak_secret` somewhere safe.
>
> If the Vault is recreated for a data restore operation, do not delete the previous `vault_unseal_keys`. Continue using the old unseal keys and ignore the new ones. The new `vault_root_token` is needed. For more details, see the [Vault Restore](#vault-restore) section.
APISIX and the Dev Portal use service tokens to communicate with Vault. These tokens have a maximum TTL of 768 hours (32 days). To prevent token revocation, a cron job is scheduled to run on the 1st and 15th of each month to reset the token period.
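As an illustration of what the renewal job does, the sketch below renews each service token so its period restarts; the schedule, Vault address, and the way tokens are supplied to the job are assumptions, not the repository's actual configuration.

```sh
#!/bin/sh
# Hypothetical renewal sketch, assuming a CronJob schedule such as "0 3 1,15 * *"
# (03:00 on the 1st and 15th) and the service tokens passed to the job as a list.
# VAULT_ADDR and VAULT_TOKEN are assumed to be provided via the job's environment.
set -eu

# Renewing a periodic service token resets its period before the 768h TTL is hit.
for token in $RENEWABLE_TOKENS; do
  vault token renew "$token" > /dev/null   # suppress the noisy default output
done
```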
## Disaster Recovery
The disaster recovery plan includes backing up application databases and logical data, and restoring them from snapshot files. The backup and restore processes are performed using database-specific tools like `pg_dump` and `pg_restore`.
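For the PostgreSQL case, the commands inside a backup and restore job boil down to something like the sketch below; the database name, connection settings, and S3 paths are placeholders, not the actual job definitions.

```sh
# Backup sketch: dump in pg_dump's custom format, compress, and upload to S3
SNAPSHOT="keycloak_$(date +%Y%m%d%H%M%S).dump"
pg_dump -Fc -h "$PGHOST" -U "$PGUSER" -d keycloak -f "$SNAPSHOT"
gzip "$SNAPSHOT"
aws s3 cp "${SNAPSHOT}.gz" "s3://example-backup-bucket/keycloak/"

# Restore sketch: fetch a snapshot, decompress, and load it back with pg_restore
aws s3 cp "s3://example-backup-bucket/keycloak/${SNAPSHOT}.gz" .
gunzip "${SNAPSHOT}.gz"
pg_restore --clean --if-exists -h "$PGHOST" -U "$PGUSER" -d keycloak "$SNAPSHOT"
```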
### Backups
Each application's database (Keycloak PostgreSQL, APISIX etcd, Vault raft) has a dedicated cron job for backups. The backup schedule can be adjusted through Terraform if needed. Currently, backups are saved to an AWS S3 bucket. If snapshots older than a certain number of days do not need to be kept, S3 bucket retention (lifecycle) policies can be used to expire them.
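To run a backup outside the regular schedule, a one-off job can be created from the corresponding CronJob; the CronJob name and namespace below are assumptions.

```sh
# List the backup CronJobs across namespaces to confirm names and schedules
kubectl get cronjobs -A

# Trigger an ad-hoc backup run from an existing CronJob (name/namespace assumed)
kubectl create job keycloak-backup-manual --from=cronjob/keycloak-backup -n keycloak
```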
### Restore
Each application has dedicated job(s) to restore data from snapshots. These jobs are invoked with independent commands, but the job templates are managed within Terraform.
> [!IMPORTANT]
> Ensure that the Terraform state and the actual cluster state are aligned before running restore jobs to avoid potential issues (see the check below).
>
> You can also take a manual snapshot of the desired database(s) before attempting the restore operation(s).
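A quick, generic way to check for drift between the Terraform state and the cluster before restoring (not a project-specific command):

```sh
# Exit code 0 means no changes (state and cluster aligned); exit code 2 means
# drift was detected and should be reconciled before running any restore job.
terraform plan -detailed-exitcode
```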
#### Keycloak Restore
```sh
export KUBECONFIG="$HOME/.kube/config" # Replace with the path to your kubeconfig file

# ... (lines collapsed in the diff view)

POD_NAME=$(kubectl get pods -n keycloak -l job-name=$JOB_NAME -o jsonpath='{.items[0].metadata.name}')
# Optionally, tail the logs
kubectl logs -f $POD_NAME -n keycloak
# Optionally, delete the job and its resources after completion
kubectl delete job $JOB_NAME -n keycloak

###########################################
# Restore
###########################################

export SNAPSHOT_NAME="specific_snapshot.db.gz" # Optionally provide a specific snapshot name if you need to restore a snapshot other than the latest one

# ... (lines collapsed in the diff view)

# Optionally, delete the job and its resources after completion
kubectl delete job $JOB_NAME -n keycloak
```
#### Vault Restore
**Note:** Vault restore requires the `UNSEAL_KEYS` that were in use when the backup snapshot was taken, while `VAULT_TOKEN` is the most recently created root token. If the Vault cluster becomes unresponsive or is completely wiped out, the existing cluster might need to be removed and a new one initialized. The new cluster will have new tokens and unseal keys: accessing it requires the new root token, but unsealing it requires the unseal keys that match the data in the snapshot.
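For the unsealing step specifically, a minimal sketch might look like this; the pod name and namespace are assumptions based on a default Vault Helm install.

```sh
# After the snapshot data has been restored, unseal Vault with the unseal keys
# that were in use when the snapshot was taken (pod/namespace names assumed).
kubectl exec -n vault vault-0 -- vault operator unseal "$OLD_UNSEAL_KEY_1"
kubectl exec -n vault vault-0 -- vault operator unseal "$OLD_UNSEAL_KEY_2"
kubectl exec -n vault vault-0 -- vault operator unseal "$OLD_UNSEAL_KEY_3"
kubectl exec -n vault vault-0 -- vault status   # confirm "Sealed: false"
```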
```sh
export KUBECONFIG="$HOME/.kube/config" # Replace with the path to your kubeconfig file

# ... (lines collapsed in the diff view)

POD_NAME=$(kubectl get pods -n apisix -l job-name=$JOB_NAME -o jsonpath='{.items[0].metadata.name}')
# Optionally, tail the logs
kubectl logs -f $POD_NAME -n apisix
# Optionally, delete the job and its resources after completion
kubectl delete job $JOB_NAME -n apisix

###########################################
# Restore
###########################################

export SNAPSHOT_NAME="specific_snapshot.snap.gz" # Optionally provide a specific snapshot name if you need to restore a snapshot other than the latest one

# Create the pre-restore job and capture the job name
# ... (remaining lines truncated in the diff view)
```