Today's incident was caused by a combination of factors, as follows:
1. An upgrade to our API backend service has failed due to a large DB migration. That caused instances to no longer be recognized via the API. This happened despite our automatic Unit Tests and local testing procedures that are supposed to prevent this from happening.
2. While this deployment error happened, the old DB tables for the API were becoming unavailable. Because of that, the API was returning the default data for the docker image that was supposed to run on the GKE cluster for all sites.
3. The API returning an empty docker image caused a full network deployment for all sites on all clusters over GKE, causing wp-admin areas to restart with the "new" image.
4. The default image from wordpress-operator (https://github.com/presslabs/wordpress-operator
) was used instead because none was specified by the API, which in turn was not working properly on the latest version of our infrastructure, throwing a Not Found error for wp-admin areas.
5. Since our Backend API service controls the entire GKE deployment process, being down didn't allow us to quickly revert the changes. We had to manually disable the deployment service and manually update the images for each site. Hence the delays in the recovery process.
6. At this point, the automatic deployment workflow is turned off for GKE clusters and we're still working on fully repairing the API backend servers. Most of them are now partially working in normal conditions.
Going forward we'll be taking extra steps in our development/deployment process to prevent such issues from occurring again.