People need to complain all the time.
And it is often those DEMANDING accountability from others the most who fail to take accountability for their own role in the situation.
This was circa 2015.
Massively parallel Windows virtual machines running Java applications, with a Linux backend running parallel Oracle RAC systems. At the network level, there were dual firewalls in a failover configuration as well as dual F5 load balancers, also in a failover configuration.
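Purely to illustrate the failover principle, here is a minimal sketch in Java of a client that tries a primary host and then a standby. The host names and port are made up, and in the setup described above this kind of failover happened at the firewall/F5 layer, not in application code:

```java
import java.io.IOException;
import java.net.Socket;
import java.util.List;

// Illustration of the failover idea only: the host names are hypothetical and
// the real system handled failover in the network layer, not in the app.
public class FailoverConnector {

    private static final List<String> HOSTS =
            List.of("db-primary.internal", "db-standby.internal");
    private static final int PORT = 1521; // default Oracle listener port

    public static Socket connect() throws IOException {
        IOException lastFailure = null;
        for (String host : HOSTS) {
            try {
                // First reachable host wins; the standby is only tried
                // if the primary is unreachable.
                return new Socket(host, PORT);
            } catch (IOException e) {
                lastFailure = e;
            }
        }
        throw lastFailure;
    }
}
```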
Besides fault-tolerant software, you need a fault-tolerant infrastructure. Plus, an extensive QA process. Before a single line of code went live, it had gone through four levels of testing, both by manual test plans and automated regression testing.
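For illustration, a minimal sketch of what a single automated regression check in that kind of pipeline might look like, assuming a JUnit 5 style test. The class and method names here are hypothetical, not from the actual system:

```java
import org.junit.jupiter.api.Test;
import static org.junit.jupiter.api.Assertions.assertEquals;

// Everything here is illustrative: OrderService is a stand-in for whatever
// business component the real regression suite exercised.
class OrderTotalRegressionTest {

    // Minimal stub so the example compiles on its own.
    static class OrderService {
        double calculateTotal(String orderId) {
            return 149.99; // a real implementation would price the order
        }
    }

    @Test
    void totalIsUnchangedForAKnownHistoricalOrder() {
        OrderService service = new OrderService();

        // A fixed input that produced a known-good result in a prior release.
        double total = service.calculateTotal("ORDER-12345");

        // If a new build changes this value, the regression run fails
        // before the change ever reaches production.
        assertEquals(149.99, total, 0.001);
    }
}
```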
It can be done, but you have to commit to the software and hardware infrastructure and a rigorous approach to testing.
Expensive up front, but the payback is short. You get what you pay for.
First, you don't clarify your load/transaction levels, which has a major impact.
Second, you don't clarify whether you were on an on-prem version of the RAC (which in 2015 is likely), a private/closed cloud, or a standard cloud deployment.
Third, there is no mention of you having to run multiple APIs (again, far more rare in 2015) vs. full-blown integrations with other systems that took years to roll out and greatly limited your reach.
In any case, it is possible you weren't on a SaaS platform using API configurations to pull together multiple systems across multiple companies, which is standard protocol for a large global organization and for systems like those being used in 2020 vs. 2015.
In this specific case, we are isolating this down further, to an upset for a partner, where the failure was likely an API issue combining the core solution with an API feed from the local POS at the San Angel Inn (privately owned and not Disney).
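To make that failure mode concrete, here is a minimal sketch, in plain Java, of how a core system might call an external POS feed defensively with a timeout and a cached fallback. The endpoint URL, class names, and fallback behavior are assumptions for illustration, not details of the actual integration:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

// Illustrative only: the URL and the cached-fallback idea are assumptions,
// not how the actual core-solution / POS integration was built.
public class PosFeedClient {

    private final HttpClient client = HttpClient.newBuilder()
            .connectTimeout(Duration.ofSeconds(2))
            .build();

    private String lastKnownGoodPayload = "{}"; // stale-but-usable fallback

    public String fetchAvailability() {
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://pos.example.com/availability")) // hypothetical endpoint
                .timeout(Duration.ofSeconds(3))
                .GET()
                .build();
        try {
            HttpResponse<String> response =
                    client.send(request, HttpResponse.BodyHandlers.ofString());
            if (response.statusCode() == 200) {
                lastKnownGoodPayload = response.body();
            }
        } catch (Exception e) {
            // Timeout or connection failure from the external POS feed:
            // fall back to the last good data instead of failing the core flow.
        }
        return lastKnownGoodPayload;
    }
}
```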
Specific to the Oracle RAC system, the uptime is often based on a maximum tolerable length of a single unplanned outage: "If the event lasts less than 30 seconds, then it may cause very little impact and may be barely perceptible to users. As the length of the outage grows, the effect may grow exponentially and negatively affect the business."
Basically, the impact is at 100% once an outage runs past 30 seconds. For 99.999% of companies that is fine. As it likely was for yours.
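For context on why a 30-second window matters, the standard availability arithmetic converts an uptime percentage into a yearly downtime budget. A quick sketch (ordinary back-of-the-envelope math, not figures from the quoted Oracle guidance):

```java
// Converts an availability percentage into the downtime it allows per year.
// Standard back-of-the-envelope arithmetic, not figures from the Oracle docs.
public class DowntimeBudget {
    public static void main(String[] args) {
        double minutesPerYear = 365.0 * 24 * 60;
        double[] targets = {99.9, 99.99, 99.999};
        for (double pct : targets) {
            double allowedMinutes = minutesPerYear * (1 - pct / 100.0);
            System.out.printf("%.3f%% uptime allows about %.1f minutes of downtime per year%n",
                    pct, allowedMinutes);
        }
        // 99.999% works out to roughly 5.3 minutes a year, so a handful of
        // sub-30-second blips can still fit inside a five-nines budget.
    }
}
```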
That isn't a knock on your company; it is just a fact. And assuming that a company that ever faces any issue with an upgrade or product enhancement release is being cheap and effectively not doing its testing is something that would only make sense for a product with a smaller breadth. A SaaS solution today, like an ERP or even an HCM solution, may have an effectively infinite number of transactions that could result from a single change, and there are protocols (even the strictest) for how that software testing gets done. Via tech or human, there may always be a one-off situation where it can happen.
But again, the issue here is behavior. And the entitlement that perfection is the only baseline and that no one has room to fix an error unless it is EXACTLY what someone wants NOW, which in any technology isn't always possible.