This session will include the following subject(s):
scaling & robustness for heat:
Let's make Heat more robust and scalable when dealing with real-world clouds.
TripleO has spent a year working with Heat and learning about common failure modes and glitches that make production use by non-experts hard at best and impossible at worst.
The issues we've encountered:
- Scaling of single large stacks (e.g. I have a 10K node cluster; why is that constrained to run in a single heat engine?).
- Dealing with the real world: backend APIs can and do fail, in myriad ways. Manual intervention to fix these is pointless; given a desired cluster definition, it is Heat's job to keep pushing to converge on that state.
- Fast, graceful failover of failed heat engines (e.g. look more like something like Galera to clients). A failed heat engine is a fact of life in production environments (e.g. due to deployments), and having that cause user-visible issues is a significant confidence issue.
- Stacks mid-update cannot have their templates/parameters updated until the update completes.
- Heat doesn't notice that resources have failed or stopped behaving correctly.
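To make the "keep pushing to converge" idea concrete, here is a minimal sketch of a reconciliation loop. All names (Resource, observe, converge) are illustrative, not Heat's actual API; a real engine would poll backend APIs (Nova, Neutron, ...) and tolerate transient failures rather than assume each action succeeds.

```python
from dataclasses import dataclass

@dataclass
class Resource:
    name: str
    desired: str          # e.g. "ACTIVE"
    observed: str = "MISSING"

def observe(resource):
    # Stand-in for polling the backend API for the resource's real state.
    return resource.observed

def converge(resources, max_passes=5):
    """Repeatedly reconcile observed state with desired state,
    rather than failing the whole stack on the first error."""
    for _ in range(max_passes):
        pending = [r for r in resources if observe(r) != r.desired]
        if not pending:
            return True   # cluster matches the desired definition
        for r in pending:
            # A real engine would issue a create/update call here and
            # retry on backend failure; this sketch just records success.
            r.observed = r.desired
    return False          # give up after max_passes (surface for operators)

nodes = [Resource("node%d" % i, desired="ACTIVE") for i in range(3)]
print(converge(nodes))
```

The key design point is that the loop is driven by the gap between observed and desired state, so a crashed engine (or a failed backend call) only delays convergence instead of leaving the stack wedged until a human intervenes.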
Clint and I will give a quick walkthrough of a possible underlying architecture to address scale and robustness from the ground up, and then the rest of the session can be a mix of poking holes in that approach and coming up with alternative designs.