Trove's monitoring today is limited to agent heartbeats. This will not be sufficient if we want to achieve goals like active slave promotion, failover, and agent remediation.
Specifically, we should discuss improving monitoring along these fronts:
- Better agent monitoring and remediation: What should we do about "Lost Agents"?
- Upgrade monitoring: How do we ensure that all agents have been correctly upgraded to a "baseline" version? How do we deal with agents that haven't?
- Connectivity monitoring: The datastore agent might be up and running, but how do we monitor an instance to ensure that a customer is actually able to connect to it?
- Replication monitoring: How do we monitor the master and slave, and achieve quick, active failover from the master to the slave if the master goes down? How do we provision a new slave to replace the old one?
- Self healing clusters: How should we monitor cluster nodes and what is our remediation strategy in case a cluster node goes down?
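As a starting point for the "Lost Agents" discussion, here is a minimal sketch of how stale heartbeats could be flagged. It assumes each agent's last heartbeat time is available as a timezone-aware datetime. The function name find_lost_agents, the 60-second timeout, and the agent IDs are hypothetical and are not part of Trove's API.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical threshold: an agent is considered "lost" once its last
# heartbeat is older than this. The right value is open for discussion.
HEARTBEAT_TIMEOUT = timedelta(seconds=60)


def find_lost_agents(last_heartbeats, now=None, timeout=HEARTBEAT_TIMEOUT):
    """Return the sorted IDs of agents whose last heartbeat is older than `timeout`.

    `last_heartbeats` maps an agent ID to the timezone-aware datetime of the
    most recent heartbeat received from that agent (an assumed data shape).
    """
    if now is None:
        now = datetime.now(timezone.utc)
    return sorted(
        agent_id
        for agent_id, last_seen in last_heartbeats.items()
        if now - last_seen > timeout
    )


# Usage example with a fixed clock so the result is deterministic.
now = datetime(2014, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
heartbeats = {
    "agent-a": now - timedelta(seconds=10),   # recent heartbeat
    "agent-b": now - timedelta(seconds=300),  # stale heartbeat
}
lost = find_lost_agents(heartbeats, now=now)
```

A check like this only detects a lost agent. What remediation to apply once one is found is the open question for this session.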