Juno Design Summit has ended
Thursday, May 15 • 2:20pm - 3:00pm
Monitoring in Trove: Of DBAdmins and Buses

Today, Trove's monitoring is limited to agent heartbeats.
Going forward, this will not be sufficient if we want to achieve goals like active slave promotion, failover, and agent remediation.

Specifically we should discuss improving monitoring along these fronts:

- Better agent monitoring and remediation:
What should we do about "Lost Agents"?

- Upgrade Monitoring:
How do we ensure that all agents have been correctly upgraded to a "baseline" version?
How do we deal with agents that haven't?

- Connectivity monitoring:
The datastore agent might be up and running, but how do we monitor instances to ensure that a customer is actually able to connect to them?

- Replication Monitoring:
How do we monitor the master / slave, and achieve quick and active failover from the master to the slave in case the master goes down?
How do we provision a new slave to replace the old one?

- Self healing clusters:
How should we monitor cluster nodes and what is our remediation strategy in case a cluster node goes down?
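
As a starting point for the "Lost Agents" and upgrade questions above, here is a minimal sketch of how heartbeat records could be classified. It is only an illustration, not Trove's implementation: the `AgentHeartbeat` record, its field names, the 60-second timeout, and the dotted-integer version format are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# Hypothetical threshold for this sketch; not a Trove default.
HEARTBEAT_TIMEOUT = timedelta(seconds=60)


@dataclass
class AgentHeartbeat:
    """Hypothetical heartbeat record; field names are assumptions."""
    instance_id: str
    guest_agent_version: str
    updated_at: datetime


def parse_version(version):
    """Turn a dotted version such as '1.2.0' into a comparable tuple."""
    return tuple(int(part) for part in version.split("."))


def classify_agents(heartbeats, baseline_version, now=None):
    """Return (lost, outdated) lists of instance ids.

    An agent is "lost" if its last heartbeat is older than the timeout.
    A live agent is "outdated" if its version is below the baseline.
    """
    now = now or datetime.now(timezone.utc)
    baseline = parse_version(baseline_version)
    lost, outdated = [], []
    for hb in heartbeats:
        if now - hb.updated_at > HEARTBEAT_TIMEOUT:
            lost.append(hb.instance_id)
        elif parse_version(hb.guest_agent_version) < baseline:
            outdated.append(hb.instance_id)
    return lost, outdated
```

What to do with each list (restart the agent, rebuild the instance, alert an operator) is exactly the remediation policy this session is meant to discuss.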

Etherpad at: https://etherpad.openstack.org/p/TroveMonitoring

(Session proposed by Nikhil)

Thursday May 15, 2014 2:20pm - 3:00pm EDT
