Traffic Monitor is an HTTP service that monitors the cache servers in a CDN for a variety of metrics. These metrics are used to determine the overall “health” of a given cache server and the related Delivery Services. A given CDN can operate a number of Traffic Monitors, from a number of geographically diverse locations, to prevent false positives caused by network problems at a given site. Traffic Monitors operate independently, but use the state of other Traffic Monitors in conjunction with their own state to provide a consistent view of CDN cache server health to upstream applications such as Traffic Router. The Health Protocol governs cache server and Delivery Service availability. Traffic Monitor provides a view into CDN health using several RESTful JSON endpoints, which are consumed by other Traffic Monitors and upstream components such as Traffic Router. Traffic Monitor is also responsible for serving the overall CDN configuration to Traffic Router, which ensures that the configuration of these two critical components remains synchronized as operational and health-related changes propagate through the CDN.
Traffic Monitor polls all cache servers configured with a status of REPORTED or ADMIN_DOWN at an interval specified as a configuration parameter in Traffic Ops. If the cache server is set to ADMIN_DOWN, it is marked as unavailable but still polled for availability and statistics. If the cache server is explicitly configured with a status of ONLINE or OFFLINE, it is not polled by Traffic Monitor and is presented to Traffic Router as configured, regardless of actual availability.
Traffic Monitor makes HTTP requests at regular intervals to a special URL on each Edge-tier cache server and consumes the JSON output. The special URL is served by a plugin called “astats” that runs on the ATS cache servers; access to it is restricted to Traffic Monitor only. The astats plugin provides insight into application and system performance, such as:
- Throughput (e.g. bytes in, bytes out).
- Transactions (e.g. number of 2xx, 3xx, 4xx responses).
- Connections (e.g. from clients, to parents, to origins).
- Cache performance (e.g. hits, misses, refreshes).
- Storage performance (e.g. writes, reads, frags, directories).
- System performance (e.g. load average, network interface throughput).
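The status rules that precede the list of statistics can be summarized in a few lines of code. The following is only a minimal sketch, not Traffic Monitor's actual implementation: the names `Status`, `is_polled`, and `availability_for_router` are hypothetical, and the status values REPORTED, ADMIN_DOWN, ONLINE, and OFFLINE are assumed to be the ones configured in Traffic Ops.

```python
from enum import Enum
from typing import Optional


class Status(Enum):
    """Cache server statuses, assumed to match those set in Traffic Ops."""
    REPORTED = "REPORTED"
    ADMIN_DOWN = "ADMIN_DOWN"
    ONLINE = "ONLINE"
    OFFLINE = "OFFLINE"


def is_polled(status: Status) -> bool:
    """Only REPORTED and ADMIN_DOWN cache servers are polled."""
    return status in (Status.REPORTED, Status.ADMIN_DOWN)


def availability_for_router(status: Status, healthy: Optional[bool]) -> bool:
    """Availability presented to Traffic Router (hypothetical helper).

    `healthy` is the result of the most recent poll, or None when the
    cache server is not polled at all.
    """
    if status is Status.ONLINE:
        return True    # presented as configured, never polled
    if status is Status.OFFLINE:
        return False   # presented as configured, never polled
    if status is Status.ADMIN_DOWN:
        return False   # still polled, but always marked unavailable
    return bool(healthy)  # REPORTED: follows the measured health
```

For example, `availability_for_router(Status.ADMIN_DOWN, True)` is `False` even though the last poll succeeded, while a REPORTED cache server simply follows its measured health.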
Many of the application-level statistics are available at the global or aggregate level, and some at the Delivery Service level. Traffic Monitor uses the system-level performance to determine the overall health of the cache server by evaluating network throughput and load against values configured in Traffic Ops. Traffic Monitor also uses throughput and transaction statistics at the Delivery Service level to determine Delivery Service health. If astats is unavailable due to a network-related issue, or the system statistics have exceeded the configured thresholds, Traffic Monitor will mark the cache server as unavailable. If the Delivery Service statistics exceed the configured thresholds, the Delivery Service is marked as unavailable and Traffic Router will start sending clients to the overflow destinations for that Delivery Service, but the cache server remains available to serve other content.
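The threshold checks described above amount to comparing polled values against configured limits. The sketch below illustrates that idea under stated assumptions: the types `SystemStats` and `CacheThresholds`, their field names, and the units (kilobits per second) are hypothetical and do not reflect the actual astats JSON layout or the parameter names used in Traffic Ops.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CacheThresholds:
    """Limits assumed to be configured in Traffic Ops (hypothetical names)."""
    max_load_avg: float
    max_bandwidth_kbps: float


@dataclass
class SystemStats:
    """System-level values taken from one successful poll (hypothetical names)."""
    load_avg: float
    bandwidth_kbps: float


def cache_server_available(stats: Optional[SystemStats],
                           limits: CacheThresholds) -> bool:
    """A cache server is unavailable if the poll failed or a limit is exceeded."""
    if stats is None:  # astats could not be reached
        return False
    return (stats.load_avg <= limits.max_load_avg
            and stats.bandwidth_kbps <= limits.max_bandwidth_kbps)


def delivery_service_available(total_kbps: float, max_kbps: float) -> bool:
    """A Delivery Service is unavailable once its own threshold is exceeded.

    This is independent of the health of the cache servers that serve it.
    """
    return total_kbps <= max_kbps
```

Keeping the two checks separate mirrors the text: a Delivery Service can overflow while the cache servers behind it stay available for other content.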
For more information on ATS statistics, see the ATS documentation.
Redundant Traffic Monitor servers operate independently from each other but take the state of other Traffic Monitors into account when asked for health state information. The cache monitoring behavior described above is that of a single Traffic Monitor instance. The Health Protocol adds another dimension to the health state of the CDN by merging the states of all Traffic Monitors into one, and then taking the optimistic approach when dealing with a cache server or Delivery Service that might have been marked as unavailable by this particular instance or a peer instance of Traffic Monitor.
Upon startup or a configuration change in Traffic Ops, Traffic Monitor begins polling, in addition to the cache servers, its peer Traffic Monitors whose state is set to ONLINE. Traffic Monitor polls all of its peers at a configurable interval and saves each peer’s state for later use. When polling its peers, Traffic Monitor asks for the raw health state from each respective peer, which is strictly that instance’s view of the CDN’s health. When any ONLINE Traffic Monitor is asked for CDN health by an upstream component, such as Traffic Router, the component gets the Health Protocol-influenced version of CDN health (the non-raw view).
In operation of the Health Protocol, Traffic Monitor takes all health states from all peers, including the locally known health state, and serves an optimistic outlook to the requesting client. This means that, for example, if three of the four Traffic Monitors see a given cache server or Delivery Service as exceeding its thresholds and unavailable, it is still considered available. Only if all Traffic Monitors agree that the given object is unavailable is that state propagated to upstream components. This optimistic approach to the Health Protocol is counter to the “fail fast” philosophy, but serves well for large networks with complicated geography and/or routing. The optimistic Health Protocol allows network failures or latency to occur without affecting overall traffic routing, as Traffic Monitors can and do have a different view of the network when deployed in geographically diverse locations. Short polling intervals of both the cache servers and Traffic Monitor peers help to reduce the customer impact of outages.
It is not uncommon for a cache server to be marked unavailable by Traffic Monitor; in fact, it is business as usual for many CDNs. Should a widely requested video asset cause a single cache server to get close to its interface capacity, the Health Protocol will “kick in”, and Traffic Monitor marks the cache server as unavailable. New clients that request the same asset are now sent by Traffic Router to another cache server in the same Cache Group, so the load is shared between the two cache servers. As clients finish watching the asset on the overloaded cache server, it drops below the threshold and is marked available again, and new clients begin to be directed to it once more.
It is less common for a Delivery Service to be marked unavailable by Traffic Monitor. The Delivery Service thresholds are usually used for overflow situations at extreme peaks, to protect other Delivery Services in the CDN from being impacted.
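The optimistic merge performed by the Health Protocol can be illustrated with a short sketch. It assumes each Traffic Monitor's raw view is a simple mapping from an object name (a cache server or Delivery Service) to a boolean availability flag; the function name `merge_optimistic` and this data shape are hypothetical and are not the format of the real JSON endpoints.

```python
from typing import Dict, List


def merge_optimistic(local: Dict[str, bool],
                     peers: List[Dict[str, bool]]) -> Dict[str, bool]:
    """Combine the local raw view with the raw views of all peers.

    An object is reported as available if at least one Traffic Monitor
    sees it as available; it is unavailable only when every view that
    mentions it agrees. An object missing from a view contributes no vote.
    """
    views = [local, *peers]
    names = set().union(*(view.keys() for view in views))
    return {
        name: any(view.get(name, False) for view in views)
        for name in sorted(names)
    }


# Three of four monitors consider "edge-01" unavailable: still available.
local_view = {"edge-01": False, "edge-02": False}
peer_views = [
    {"edge-01": False, "edge-02": False},
    {"edge-01": False, "edge-02": False},
    {"edge-01": True, "edge-02": False},
]
merged = merge_optimistic(local_view, peer_views)
```

Here `merged["edge-01"]` is `True` because one peer still reaches that cache server, while `merged["edge-02"]` is `False` because all four views agree that it is unavailable.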