Turbot Monitoring Strategy Guide

Turbot’s Enterprise solution is a single-tenant platform that deploys into the customer’s own
VPC. It was designed as a cloud-native application to deliver the scale and availability
necessary to serve its role as a security control plane for hundreds of accounts. Where possible,
we leverage cloud-native platform capabilities to reduce the operational burden and increase
reliability. Considerable effort has gone into making Turbot scale and self-correct through the
use of platform-level services, dynamic provisioning, auto-scaling, and unhealthy-node detection.

Monitoring Objectives

Monitoring plays a critical role in ensuring that Turbot can perform its role in a customer’s cloud operations. While monitoring has many uses, this strategy guide focuses on incident detection and remediation.

Monitoring Data Sources

Overall Turbot health can be determined from the following sources:

  • Turbot Health API
  • Turbot CloudWatch Dashboards
  • Turbot-Database
  • Turbot-Events
  • Turbot-Servers

Each of these sources will be covered below. These sources were intentionally designed to be
simple and easily accessible.

Turbot Health API

Turbot offers a heartbeat check in the form of the Turbot Health API. It can be found at

<turbot-host-name>/api/latest/turbot/health


Additional information about the Health API can be found at

https://poc.turbot.com/help/api#op-GetTurbotHealth


A simple HTTP GET to the Health URL provides a short JSON response like this (formatted for
readability):

{
  "instanceId": "i-0a2a3705ef11df1cf",
  "ec2Metadata": "OK",
  "cache": "OK",
  "dynamoDB": "OK",
  "internet (non-AWS) https://google.com": "OK",
  "connectivityCheckRunTime": 77,
  "workersCheck": {"status": "LOCKED"},
  "totalHealthCheckRunTime": 78
}


The fields in this response indicate the health of major Turbot subsystems, specifically the
EC2 instances, ElastiCache, and DynamoDB. If all of these systems are operational, a status
of “OK” appears next to each subsystem and the response returns an HTTP 200 status code.
Failing responses return either an HTTP 400 or an HTTP 503.

The connectivityCheckRunTime and totalHealthCheckRunTime values are the time taken, in
milliseconds, for the checks to complete. These numbers will vary with network conditions.
A note on workersCheck: this field does not always appear in the Health API response and
should be considered optional. Its presence or absence does not indicate an error condition.
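As a minimal sketch of how an external monitor might evaluate this endpoint, the function below parses the response body and checks that each required subsystem reports “OK”. The field names come from the sample response above; everything else (which fields to require, how the body is fetched) is an assumption for illustration, not an official client.

```python
import json

# Subsystems the sample Health API response reports on. workersCheck is
# optional and informational, so it is deliberately not required here.
REQUIRED_CHECKS = ("ec2Metadata", "cache", "dynamoDB")

def is_healthy(body: str) -> bool:
    """Return True if every required subsystem in the Health API
    response body reports the status "OK"."""
    health = json.loads(body)
    return all(health.get(check) == "OK" for check in REQUIRED_CHECKS)

# Sample body matching the response shown above.
sample = """{
  "instanceId": "i-0a2a3705ef11df1cf",
  "ec2Metadata": "OK",
  "cache": "OK",
  "dynamoDB": "OK",
  "connectivityCheckRunTime": 77,
  "totalHealthCheckRunTime": 78
}"""
```

In a real monitor you would fetch the body with an HTTP GET against `<turbot-host-name>/api/latest/turbot/health` and additionally treat any non-200 status code as unhealthy, per the status-code behavior described above.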

Turbot CloudWatch Dashboards

While the Turbot Health API is sufficient for most monitoring situations, sometimes more
detailed information is required. Turbot offers the following three dashboards to help customers
diagnose Turbot performance problems. Each dashboard will be covered in turn.

Turbot_CloudWatch_Dashboards.png

CloudWatch retains trending data for up to 15 months. Use this historical
information to establish normal operating conditions and to identify short-term performance
problems. Customers are encouraged to examine these dashboards to become familiar with their
workloads and usage patterns.

Access to these dashboards requires AWS/CloudWatch/Metadata permissions on the Turbot
Master account.

Turbot-Database

DynamoDB (DDB) serves as persistent storage for all Turbot data. Prior to the availability of on-demand DDB tables in late November 2018, a common cause of Web and Worker slowdowns was throttling of DDB queries. With all Turbot DDB tables set to on-demand, however, throttling incidents should be greatly reduced.

The Turbot-Database dashboard provides information on DDB throttling. With on-demand DDB
tables, or sufficient manually allotted capacity, this dashboard should be empty. Any data on
this dashboard indicates that DynamoDB throttling has occurred, and the affected tables/indexes
need to be scaled up to meet that demand.
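To act on this dashboard programmatically, one approach is to total the DynamoDB `ThrottledRequests` CloudWatch metric per table and flag any table with a nonzero count. The helper below is a sketch of that triage step; the table names in the usage are hypothetical, and gathering the per-table counts from CloudWatch is assumed to happen elsewhere.

```python
def tables_to_scale(throttle_counts: dict, threshold: int = 0) -> list:
    """Given a mapping of table name -> summed ThrottledRequests count
    over some window, return the tables that need capacity attention.

    With on-demand tables the expectation is that this returns an
    empty list; any entries mirror "data on the dashboard" above.
    """
    return sorted(
        table for table, count in throttle_counts.items()
        if count > threshold
    )
```

For example, `tables_to_scale({"example-resources": 12, "example-policies": 0})` flags only the table that was actually throttled.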

Turbot_Cloudwatch_DynamoDB_Dashboard.png

Turbot-Events

The Turbot-Events dashboard covers the two main sources of traffic in and out of Turbot: API calls and events.

High-traffic events, such as an upgrade of the Turbot cluster or importing new cloud accounts, will show as spikes in this dashboard. Turbot is designed to automatically scale the cluster size up and down depending on CPU load.

Turbot_CloudWatch_Events_Dashboard.png

In some instances, throttling of cloud provider services can cause large numbers of events to queue without increasing CPU load; this can prevent workers from scaling out to meet increased demand. If the number of visible messages grows continuously and the auto-scaling group does not respond in kind, increase the “desired” value of the Worker Autoscaling group by a factor of 2x and monitor the response.

If the queue does not begin to recover, or if this type of activity recurs, please contact Turbot Support.
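The manual guidance above can be sketched as a simple scaling rule: double the desired capacity when the queue keeps growing and the group has not already responded, capped at the group’s maximum size. The function and its inputs (how “growing” and “already responded” are detected) are assumptions for illustration, not part of Turbot itself.

```python
def next_desired_capacity(current_desired: int,
                          queue_growing: bool,
                          asg_responded: bool,
                          max_size: int) -> int:
    """Return the new "desired" value for the Worker autoscaling group.

    If visible messages keep growing and the group has not scaled out
    on its own, double the desired capacity (never exceeding the
    group's configured maximum); otherwise leave it unchanged.
    """
    if queue_growing and not asg_responded:
        return min(current_desired * 2, max_size)
    return current_desired
```

The cap at `max_size` reflects that an autoscaling group will reject a desired value above its maximum; if doubling hits the cap and the queue still does not recover, that is the point to contact Turbot Support.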

Turbot-Servers

The Turbot-Servers dashboard covers the health and utilization of the Turbot EC2 instances. Turbot uses two classifications of servers: Web and Workers. Turbot Web instances work with Elastic Load Balancing to serve the Turbot web front-end and also handle the endpoints for API calls.

Turbot_Cloudwatch_Servers_Dashboard.png

Turbot Worker instances continuously pull work from the SQS queues and run guardrails. If the
CPU utilization stays unusually high or unusually low for an extended period then further
examination is required.

Turbot CloudWatch Event Logs

Customers may require additional information not summarized in the provided dashboards. In
these cases, customers can create their own dashboards using the underlying CloudWatch
metrics and alarms. This data can be found in the CloudWatch page in the AWS console for
Turbot Master. Many monitoring tools are able to ingest the CloudWatch logs directly from AWS,
thus widening the scope of how this performance data can be used.

CloudWatch_Logs.png

Turbot Event Queue Depth

The strongest measure of Turbot Cluster health is the queue depth for the SQS queue called
Turbot-Q-Events-[long ID]. The queue depth is available using the

ApproximateNumberOfMessagesVisible 

metric on the Turbot Events queue. Turbot Worker servers pull messages from that queue to determine what operations to perform. If the queue depth drops to zero and the in-flight message count stays low, events are being processed as they arrive. An upward trend in queue depth over multiple hours may indicate a need to tune cluster performance. Short-lived spikes in queue depth are often correlated with account import activity, significant policy changes, or high activity in the managed cloud accounts.
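The three patterns described above (drained queue, short-lived spike, sustained growth) can be distinguished from a window of ApproximateNumberOfMessagesVisible samples. The classifier below is a deliberately rough sketch; fetching the samples from CloudWatch and choosing the window length are left as assumptions.

```python
def classify_queue_depth(samples: list) -> str:
    """Classify a window of ApproximateNumberOfMessagesVisible samples
    (oldest first) for the Turbot-Q-Events queue.

    - "healthy": queue has drained back to zero
    - "sustained-growth": depth rose across the whole window; may
      indicate a need to tune cluster performance
    - "spike": a short-lived burst that is already subsiding
    """
    if not samples:
        return "no-data"
    if samples[-1] == 0:
        return "healthy"
    if all(later > earlier for earlier, later in zip(samples, samples[1:])):
        return "sustained-growth"
    return "spike"
```

In practice you would sample the metric over several hours (matching the “multiple hours” guidance above) rather than minutes, so that a burst from an account import is not mistaken for sustained growth.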

Resolving Performance Problems

If you have questions or concerns about the performance of your cluster, then contact Turbot
Support on Slack or by sending an email to help@turbot.com.
