Traffic Monitor

Traffic Monitor is an HTTP service that monitors the cache servers in a CDN for a variety of metrics. These metrics are for use in determining the overall “health” of a given cache server and the related Delivery Services. A given CDN can operate a number of Traffic Monitors, from a number of geographically diverse locations, to prevent false positives caused by network problems at a given site. Traffic Monitors operate independently, but use the state of other Traffic Monitors in conjunction with their own state to provide a consistent view of CDN cache server health to downstream applications such as Traffic Router. Health Protocol governs the cache server and Delivery Service availability. Traffic Monitor provides a view into CDN health using several RESTful JSON endpoints, which are consumed by other Traffic Monitors and downstream components such as Traffic Router. Traffic Monitor is also responsible for serving the overall CDN configuration to Traffic Router, which ensures that the configuration of these two critical components remain synchronized as operational and health related changes propagate through the CDN.

Cache Monitoring

Traffic Monitor polls all cache servers configured with a status of REPORTED or ADMIN_DOWN at an interval specified as a configuration parameter in Traffic Ops. If the cache server is set to ADMIN_DOWN it is marked as unavailable but still polled for availability and statistics. If the cache server is explicitly configured with a status of ONLINE or OFFLINE, it is not polled by Traffic Monitor and presented to Traffic Router as configured, regardless of actual availability. Traffic Monitor makes HTTP requests at regular intervals to a special URL on each Edge-tier cache server and consumes the JSON output. The special URL is served by a plugin running on the ATS cache servers called “astats”, which is restricted to Traffic Monitor only. The astats plugin provides insight into application and system performance, such as:

Throughput (e.g. bytes in, bytes out, etc).
Transactions (e.g. number of 2xx, 3xx, 4xx responses, etc).
Connections (e.g. from clients, to parents, origins, etc).
Cache performance (e.g.: hits, misses, refreshes, etc).
Storage performance (e.g.: writes, reads, frags, directories, etc).
System performance (e.g: load average, network interface throughput, etc).

Many of the application-level statistics are available at the global or aggregate level, some at the Delivery Service level. Traffic Monitor uses the system-level performance to determine the overall health of the cache server by evaluating network throughput and load against values configured in Traffic Ops. Traffic Monitor also uses throughput and transaction statistics at the Delivery Service level to determine Delivery Service health. If astats is unavailable due to a network-related issue or the system statistics have exceeded the configured thresholds, Traffic Monitor will mark the cache server as unavailable. If the Delivery Service statistics exceed the configured thresholds, the Delivery Service is marked as unavailable, and Traffic Router will start sending clients to the overflow destinations for that Delivery Service, but the cache server remains available to serve other content.

Health Protocol

Optimistic Health Protocol

Redundant Traffic Monitor servers operate independently from each other but take the state of other Traffic Monitors into account when asked for health state information. In Cache Monitoring, the behavior of a single Traffic Monitor instance is described. The Health Protocol adds another dimension to the health state of the CDN by merging the states of all Traffic Monitors into one, and then taking the optimistic approach when dealing with a cache server or Delivery Service that might have been marked as unavailable by this particular instance or a peer instance of Traffic Monitor. Upon startup or configuration change in Traffic Ops, in addition to cache servers, Traffic Monitor begins polling its peer Traffic Monitors whose state is set to ONLINE. Each ONLINE Traffic Monitor polls all of its peers at a configurable interval and saves the peer’s state for later use. When polling its peers, Traffic Monitor asks for the raw health state from each respective peer, which is strictly that instance’s view of the CDN’s health. When any ONLINE Traffic Monitor is asked for CDN health by a downstream component, such as Traffic Router, the component gets the Health Protocol-influenced version of CDN health (non-raw view). In operation of the Health Protocol, Traffic Monitor takes all health states from all peers, including the locally known health state, and serves an optimistic outlook to the requesting client. This means that, for example, if three of the four Traffic Monitors see a given cache server or Delivery Service as exceeding its thresholds and unavailable, it is still considered available. Only if all Traffic Monitors agree that the given object is unavailable is that state propagated to downstream components. This optimistic approach to the Health Protocol is counter to the “fail fast” philosophy, but serves well for large networks with complicated geography and/or routing. The optimistic Health Protocol allows network failures or latency to occur without affecting overall traffic routing, as Traffic Monitors can and do have a different view of the network when deployed in geographically diverse locations.

Optimistic Quorum

In order to prevent split-brain monitoring scenarios, a minimum of three Traffic Monitors are required to properly monitor a given CDN and the optimistic quorum feature should be enabled. If three or more Traffic Monitors are set to ONLINE, the optimistic quorum can be employed by setting the peer_optimistic_quorum_min property in traffic_monitor.cfg to a value greater than zero. This value represents the minimum number of peers that must be available in order to participate in the Optimistic Health Protocol. If Traffic Monitor detects that the number of available peers is less than this number, Traffic Monitor withdraws itself from participation in the health protocol by serving 503s for cache health state calls until connectivity is restored.

The optimistic quorum prevents invalid state propagation caused by a Traffic Monitor losing connectivity to the network and consequently marking all peers and caches as unavailable. When connectivity is restored, a race between peering recovery and polling from Traffic Routers begins. If Traffic Router were to poll a Traffic Monitor that has no available peers and optimistic quorum is not enabled or cannot be used (i.e.: too few Traffic Monitors), the Traffic Monitor will serve its local state only until peer connectivity is restored. If Traffic Router polls the Traffic Monitor when in this state, that is, prior to regaining peering, negative cache states caused by the lack of connectivity would be consumed and directly impact which caches are available for consideration for routing, until the Traffic Router polls a Traffic Monitor that has good state, or peering is restored. For this reason, it is recommended to run a minimum of three Traffic Monitors, with peer_optimistic_quorum_min set to a value of 1 or greater. Note that this value cannot exceed the number of peers of any given Traffic Monitor; that is, a value of 2 is the maximum value that can be used when three Traffic Monitors are in use. If this number exceeds the number of peers, the Traffic Monitor will always serve 503s and an error will be logged.

Protocol Engagement

Short polling intervals of both the cache servers and Traffic Monitor peers help to reduce customer impact of outages. It is not uncommon for a cache server to be marked unavailable by Traffic Monitor - in fact, it is business as usual for many CDNs. Should a widely requested video asset cause a single cache server to get close to its interface capacity, the Health Protocol will “kick in,” and Traffic Monitor marks the cache server as unavailable. New clients want to see the same asset, and now Traffic Router will send these customers to another cache server in the same Cache Group. The load is now shared between the two cache servers. As clients finish watching the asset on the overloaded cache server, it will drop below the threshold and gets marked available again, and new clients will begin to be directed to it once more. It is less common for a Delivery Service to be marked unavailable by Traffic Monitor. The Delivery Service thresholds are usually used for overflow situations at extreme peaks to protect other Delivery Services in the CDN from being impacted.