Wednesday, May 14 • 11:00am - 11:40am
Heat Scaling, Robustness and Convergence

This session will include the following subject(s):

scaling & robustness for heat:

Let's make Heat more robust and scalable when dealing with real-world clouds.

TripleO has spent a year working with Heat and learning about common failure modes and glitches that make production use by non-experts hard at best and impossible at worst.

The issues we've encountered:
- scaling of single large stacks (e.g. I have a 10K-node cluster; why is it constrained to run in a single Heat engine?)
- dealing with the real world: backend APIs can and do fail, in myriad ways. Manual intervention to fix these failures is pointless; given a desired cluster definition, it is Heat's job to keep pushing to converge on that state.
- fast, graceful failover of failed Heat engines (e.g. looking more like Galera to clients): a failed Heat engine is a fact of life in production environments (e.g. due to deployments), and having that cause user-visible issues is a significant confidence issue.
- a stack that is mid-update cannot have its template or parameters updated until the update completes.
- Heat doesn't notice that resources have failed or stopped behaving correctly.
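
To make the "keep pushing to converge" idea in the list above concrete, here is a minimal sketch in Python. It is not Heat's code or the architecture proposed in this session; `FlakyBackend`, `converge_once` and `converge` are hypothetical names invented for illustration, and the backend is a stand-in that fails at random. The loop re-observes the real state on every pass, so it retries failed API calls and also picks up resources that have drifted or disappeared.

```python
import random


class FlakyBackend:
    """Hypothetical stand-in for a cloud API that fails intermittently."""

    def __init__(self, failure_rate, seed=0):
        self._rng = random.Random(seed)
        self._failure_rate = failure_rate
        self.resources = {}  # name -> spec, the "real world" state

    def _maybe_fail(self, action, name):
        if self._rng.random() < self._failure_rate:
            raise RuntimeError(f"backend error during {action} of {name}")

    def create(self, name, spec):
        self._maybe_fail("create", name)
        self.resources[name] = spec

    def delete(self, name):
        self._maybe_fail("delete", name)
        del self.resources[name]


def converge_once(desired, backend):
    """Run one pass over the differences; return how many actions failed."""
    failures = 0
    observed = dict(backend.resources)  # re-observe instead of trusting a cache
    for name, spec in desired.items():
        if observed.get(name) != spec:  # missing or drifted resource
            try:
                backend.create(name, spec)
            except RuntimeError:
                failures += 1  # no manual fix; the next pass retries
    for name in observed:
        if name not in desired:  # resource that should no longer exist
            try:
                backend.delete(name)
            except RuntimeError:
                failures += 1
    return failures


def converge(desired, backend, max_passes=50):
    """Repeat passes until observed state equals desired state."""
    for passes in range(1, max_passes + 1):
        converge_once(desired, backend)
        if backend.resources == desired:
            return passes
    raise RuntimeError("did not converge within max_passes")
```

For example, calling `converge(desired, FlakyBackend(failure_rate=0.3))` with a dictionary of 20 node definitions keeps retrying until the backend holds exactly those 20 resources. A real engine would also need backoff, persistence of progress and a way to split the work across engines, none of which this sketch attempts.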

Clint and I will give a quick walkthrough of a possible underlying architecture to address scale and robustness from the ground up; the rest of the session can then be a mix of poking holes in that approach and coming up with alternative designs.

(Session proposed by Robert Collins)

Wednesday May 14, 2014 11:00am - 11:40am

Attendees (98)