Restore MAS Applications¤
Overview¤
This role supports restoring MAS application resources and data from backups created by the suite_app_backup role. Currently supported applications:
manage: Restores Manage namespace resources (CRs, secrets, subscriptions) and persistent volume data
Future support planned for: iot, monitor, health, optimizer, visualinspection
The restore process follows these phases: 1. Phase 1: Restore Kubernetes resources like Project, Secrets, Configmaps, Subscription, Certificates and ManageApp CR (not ManageWorkspace CR) 2. Phase 2: Check if ManageWorkspace CR is already available 3. Phase 3: Restore persistent volume data - If ManageWorkspace exists: Scale it down, restore data, then continue - If ManageWorkspace doesn't exist: Create PVCs, create dummy pod, restore data, delete dummy pod 4. Phase 4: Restore ManageWorkspace CR 5. Phase 5: Wait for Manage deployment to be activated
Important
- An application backup can only be restored to an instance with the same MAS instance ID
- This role restores application resources and PV data only. Database restores must be performed separately using the appropriate database restore role
- For Manage, see the db2 role for database restore
- The restore process is designed to be idempotent and can handle both fresh installations and existing deployments
Role Variables - General¤
mas_app_id¤
Defines the MAS application ID for the restore action.
- Required
- Environment Variable:
MAS_APP_ID - Default: None
- Valid Values:
manage(currently supported)
mas_instance_id¤
Defines the MAS instance ID for the restore action. Must match the instance ID from the backup.
- Required
- Environment Variable:
MAS_INSTANCE_ID - Default: None
mas_workspace_id¤
Defines the MAS workspace ID for the restore action. Must match the workspace ID from the backup.
- Required
- Environment Variable:
MAS_WORKSPACE_ID - Default: None
mas_backup_dir¤
Defines the directory where backups are stored. The role will look for the backup version subdirectory within this location.
- Required
- Environment Variable:
MAS_BACKUP_DIR - Default: None
- Example:
/backup/mas
mas_app_backup_version¤
Specifies which backup version to restore. This should match the version identifier used during backup.
- Required
- Environment Variable:
MAS_APP_BACKUP_VERSION - Default: None
- Example:
20240315-143022orv1.0-prod
mas_app_restore_wait_retries¤
Maximum time in seconds to wait for ManageWorkspace to become ready after restore.
- Optional
- Environment Variable:
MAS_APP_RESTORE_WAIT_RETRIES - Default:
120
mas_app_restore_wait_delay¤
Delay in seconds between status checks when waiting for ManageWorkspace to become ready.
- Optional
- Environment Variable:
MAS_APP_RESTORE_WAIT_DELAY - Default:
360
helper_pod_image¤
Name of the image to use by the PVC-restore-helper pod. This pod will be deployed on the app's namespace to mount PVC volumes and extract the pvc backup to the mounted path.
Image must have tar installed for oc cp to work.
- Optional
- Environment Variable:
HELPER_POD_IMAGE - Default:
registry.redhat.io/ubi9/ubi:latest
override_storageclass¤
Enable or disable storage class override during restore. When enabled, the restore process will use custom storage classes instead of the storage classes from the backup.
- Optional
- Environment Variable:
OVERRIDE_STORAGECLASS - Default:
false - Valid Values:
true,false
mas_app_custom_storage_class_rwx¤
Custom storage class to use for PVCs with ReadWriteMany (RWX) access mode when override_storageclass is enabled. If not provided and override is enabled, the default storage class will be used.
- Optional
- Environment Variable:
MAS_APP_CUSTOM_STORAGE_CLASS_RWX - Default: Empty (uses default storage class)
- Example:
ocs-storagecluster-cephfs
mas_app_custom_storage_class_rwo¤
Custom storage class to use for PVCs with ReadWriteOnce (RWO) access mode when override_storageclass is enabled. If not provided and override is enabled, the default storage class will be used.
- Optional
- Environment Variable:
MAS_APP_CUSTOM_STORAGE_CLASS_RWO - Default: Empty (uses default storage class)
- Example:
ocs-storagecluster-ceph-rbd
What Gets Restored¤
Manage Application¤
When restoring the Manage application, the following resources are restored:
Namespace Resources (Phase 1 & 4):
- Project (namespace)
- Encryption secrets
- Certificates with mas.ibm.com/instanceId label
- IBM entitlement secret
- All referenced secrets
- Subscription and OperatorGroup
- ManageApp CR (Phase 1)
- ManageWorkspace CR (Phase 4)
Persistent Volume Data (Phase 3):
- All persistent volumes defined in spec.settings.deployment.persistentVolumes
- Data is restored from compressed tar.gz archives
- Each PVC's mount path is restored separately
- Archives are read from the data subdirectory
NOT Restored (must be restored separately): - Manage database (Db2) - use the db2 role - Suite-level resources - use the suite_restore role
How Persistent Volume Restore Works¤
The role intelligently handles PV restoration based on whether ManageWorkspace CR is already deployed:
Scenario 1: ManageWorkspace CR Does Not Exist (Fresh Restore)¤
- Read Configuration: Extracts PVC configuration from ManageWorkspace CR backup
- Create PVCs: Creates all PVCs defined in the backup
- Create Dummy Pod: Creates a temporary pod that mounts all PVCs
- Restore Data: Copies tar.gz archives to the pod and extracts them to mount paths
- Cleanup: Deletes the dummy pod
- Deploy CR: Restores the ManageWorkspace CR
- Wait: Waits for Manage deployment to be activated
Scenario 2: ManageWorkspace CR Already Exists (Re-restore/Update)¤
- Scale Down: Sets ManageWorkspace
spec.settings.deployment.modetodown - Wait: Waits for workspace to scale down
- Find Pod: Locates UI, ALL, or maxinst pod for data access
- Restore Data: Copies tar.gz archives to the pod and extracts them to mount paths
- Update CR: Restores the ManageWorkspace CR (which will scale back up)
- Wait: Waits for Manage deployment to be activated
Dummy Pod Specification¤
When a dummy pod is created for restoration, it uses:
- Image: registry.redhat.io/ubi8/ubi-minimal:latest
- Command: sleep infinity (keeps pod running)
- Volumes: All PVCs from ManageWorkspace CR configuration
- Labels: Tagged with instance ID and workspace ID for easy identification
Restore Process Phases¤
Phase 1: Restore Resources Until ManageApp CR¤
- Restores all namespace resources except ManageWorkspace CR
- Includes: ManageApp, secrets, certificates, subscriptions, operator groups
- Waits for ManageApp CR to become ready
- Auto-discovers and restores referenced secrets
Phase 2: Check ManageWorkspace Status¤
- Checks if ManageWorkspace CR already exists
- If exists: Sets
spec.settings.deployment.modetodownand waits for scale down - If not exists: Proceeds to create PVCs and dummy pod
Phase 3: Restore Persistent Volume Data¤
- Reads PVC configuration from ManageWorkspace CR backup
- Creates PVCs if needed (when ManageWorkspace doesn't exist)
- Creates dummy pod or uses existing server bundle pod
- Restores data from tar.gz archives to each PVC
- Cleans up dummy pod if created
Phase 4: Restore ManageWorkspace CR¤
- Restores the ManageWorkspace CR
- This triggers the Manage deployment to start/restart
Phase 5: Wait for Deployment Activation¤
- Monitors ManageWorkspace CR status
- Waits for Ready condition to be True
- Configurable timeout and delay between checks
Expected Backup Directory Structure¤
The role expects the backup directory to have the following structure:
<mas_backup_dir>/
└── backup-<version>-app-manage/
├── resources/
│ ├── projects
│ │ └── mas-<instance_id>-manage.yaml
| ├── secrets
│ │ └── <secret1_name>.yaml
│ │ └── <secret2_name>.yaml
| ├── configmaps
│ │ └── <configmap1_name>.yaml
│ │ └── <configmap2_name>.yaml
| ├── subscriptions
│ │ └── <subscription_name>.yaml
│ └── ... (other resources)
└── data/
├── <pvc-name-1>.tar.gz
├── <pvc-name-2>.tar.gz
Example Playbooks¤
Basic Restore¤
Restore Manage namespace resources and persistent volumes from a backup:
- hosts: localhost
any_errors_fatal: true
vars:
mas_instance_id: inst1
mas_workspace_id: ws1
mas_app_id: manage
mas_backup_dir: /backup/mas
mas_app_backup_version: "20240315-143022"
roles:
- ibm.mas_devops.suite_app_restore
Restore with Custom Timeout¤
Restore with a custom wait timeout for large deployments:
- hosts: localhost
any_errors_fatal: true
vars:
mas_instance_id: inst1
mas_workspace_id: ws1
mas_app_id: manage
mas_backup_dir: /backup/mas
mas_app_backup_version: "prod-backup-20240315"
mas_app_restore_wait_retries: 180
mas_app_restore_wait_delay: 360 # Check 6 minutes
roles:
- ibm.mas_devops.suite_app_restore
Complete Restore Workflow¤
Complete workflow including database restore:
- hosts: localhost
any_errors_fatal: true
vars:
mas_instance_id: inst1
mas_workspace_id: ws1
mas_backup_dir: /backup/mas
backup_version: "20240315-143022"
tasks:
# 1. Restore Db2 database first
- name: "Restore Manage database"
include_role:
name: ibm.mas_devops.db2
vars:
db2_action: restore
db2_backup_version: "{{ backup_version }}"
# 2. Restore Manage application
- name: "Restore Manage application"
include_role:
name: ibm.mas_devops.suite_app_restore
vars:
mas_app_id: manage
mas_app_backup_version: "{{ backup_version }}"
Restore with Storage Class Override¤
Restore to a different cluster with different storage classes:
- hosts: localhost
any_errors_fatal: true
vars:
mas_instance_id: inst1
mas_workspace_id: ws1
mas_app_id: manage
mas_backup_dir: /backup/mas
mas_app_backup_version: "20240315-143022"
# Enable storage class override
override_storageclass: true
# Specify custom storage classes
mas_app_custom_storage_class_rwo: "ocs-storagecluster-cephfs"
mas_app_custom_storage_class_rwx: "ocs-storagecluster-ceph-rbd"
roles:
- ibm.mas_devops.suite_app_restore
Restore with Storage Class Override Using Default Classes¤
Restore with override enabled but using cluster's default storage classes:
- hosts: localhost
any_errors_fatal: true
vars:
mas_instance_id: inst1
mas_workspace_id: ws1
mas_app_id: manage
mas_backup_dir: /backup/mas
mas_app_backup_version: "20240315-143022"
# Enable storage class override without specifying custom classes
# Will automatically use the cluster's default storage class
override_storageclass: true
roles:
- ibm.mas_devops.suite_app_restore
Troubleshooting¤
Restore Fails at Phase 1¤
Issue: Resources fail to restore in Phase 1
Solution:
- Check that the backup directory exists and contains the expected files
- Verify namespace exists: oc get namespace mas-<instance_id>-manage
- Check for conflicting resources: oc get manageapp,subscription -n mas-<instance_id>-manage
Dummy Pod Fails to Start¤
Issue: Dummy pod remains in Pending state
Solution:
- Check PVC status: oc get pvc -n mas-<instance_id>-manage
- Verify storage class exists and can provision volumes
- Check pod events: oc describe pod <pod-name> -n mas-<instance_id>-manage
PV Data Restore Fails¤
Issue: Data extraction fails in the pod Solution: - Verify tar.gz archives are not corrupted - Check pod has sufficient disk space - Verify mount paths are accessible in the pod
ManageWorkspace Never Becomes Ready¤
Issue: Phase 5 times out waiting for ManageWorkspace
Solution:
- Check ManageWorkspace status: oc describe manageworkspace -n mas-<instance_id>-manage
- Verify database is accessible and restored
- Check pod logs: oc logs -l mas.ibm.com/appType=serverBundle -n mas-<instance_id>-manage
- Increase mas_app_restore_wait_retries and mas_app_restore_wait_delay values` if deployment is slow
Notes¤
- Database Restore: This role does NOT restore the Manage database. Use the db2 role to restore Db2 databases separately, and do this BEFORE running the application restore
- Suite Resources: This role restores application-specific resources only. For suite-level resources (Suite CR, workspace CRs, etc.), use the suite_restore role
- Instance ID Match: The restore must be performed on a cluster with the same MAS instance ID as the backup
- Idempotent: The restore process is idempotent and can be run multiple times
- Dummy Pod: When ManageWorkspace doesn't exist, a temporary pod is created for data restoration and automatically cleaned up
- Scale Down: When ManageWorkspace exists, it's automatically scaled down before data restoration
- Automatic Detection: Persistent volumes are automatically detected from the ManageWorkspace CR backup
- Compression: All PV data is stored as compressed tar.gz archives
- Wait Time: The default wait timeout is 1 hour, but this can be adjusted based on deployment size
- Prerequisites: Ensure the MAS operator and Manage operator are installed before running restore
- Storage Class Override: When restoring to a different cluster with different storage classes, enable
override_storageclassto automatically map PVCs to appropriate storage classes based on access modes (RWX/RWO) - Default Storage Classes: If
override_storageclassis enabled but custom storage classes are not specified, the cluster's default storage class will be used automatically - Access Mode Mapping: The role intelligently assigns storage classes based on PVC access modes - RWX (ReadWriteMany) uses
mas_app_custom_storage_class_rwxand RWO (ReadWriteOnce) usesmas_app_custom_storage_class_rwo