A few thoughts, since I (and I suspect many others here) have some experience with these kinds of production upgrades.
Typically, these types of upgrades involve a rehearsal of the entire process in a pre-production environment. This is to validate the steps and testing required (upgrade, regression tests, functional tests, performance tests, and rollback if needed) and also the expected duration of each step when running the change in the production environment. If something is failing in production - as seems to be the case now - then there will be an allotted period of time to try to solve the issue and stay on the new release. If this fix period expires, then typically a rollback to the old version will be needed (along with all the testing to make sure the rollback was successful). In my experience doing this kind of activity, failures in production upgrades often come down to a mismatch between the pre-production and production environments, which leads to failure conditions in production that either weren't or couldn't be tested in pre-prod.
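To make the mismatch point concrete, here is a rough sketch of the kind of drift check that can be run before a rehearsal is trusted. Everything in it is hypothetical: the setting names (`db_version`, `app_nodes`, `legacy_feed`) and their values are made up for illustration, and a real comparison would cover far more than a flat dictionary of settings.

```python
def config_drift(pre_prod: dict, prod: dict) -> dict:
    """Return settings that differ between environments, or exist in only one."""
    missing = object()  # sentinel so a real value of None is not mistaken for "absent"
    drift = {}
    for key in sorted(pre_prod.keys() | prod.keys()):
        pre_value = pre_prod.get(key, missing)
        prod_value = prod.get(key, missing)
        if pre_value != prod_value:
            drift[key] = (
                "<absent>" if pre_value is missing else pre_value,
                "<absent>" if prod_value is missing else prod_value,
            )
    return drift


# Hypothetical inventories; real ones would come from whatever tooling describes each environment.
pre_prod = {"db_version": "12.4", "app_nodes": 2, "batch_window_hours": 4}
prod = {"db_version": "12.1", "app_nodes": 16, "batch_window_hours": 4, "legacy_feed": "enabled"}

for key, (pre_value, prod_value) in config_drift(pre_prod, prod).items():
    print(f"{key}: pre-prod={pre_value!r} prod={prod_value!r}")
```

Anything this reports is a condition the rehearsal did not exercise, so it is a candidate for the "couldn't be tested in pre-prod" category.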
My 2 pence