Traffic Monitor Administration

Installing Traffic Monitor

The following are hard requirements for Traffic Monitor to operate:

  • CentOS 7 or later

  • Successful install of Traffic Ops (usually on a separate machine)

  • Administrative access to Traffic Ops (usually on a separate machine)

These are the recommended hardware specifications for a production deployment of Traffic Monitor:

  • 8 CPUs

  • 16GB of RAM

  • It is also recommended that, for optimal performance, you know the geographic coordinates and/or mailing address of the site where the Traffic Monitor machine lives

  1. Enter the Traffic Monitor server into Traffic Portal

    Note

    For legacy compatibility reasons, the ‘Type’ field of a new Traffic Monitor server must be ‘RASCAL’.

  2. Make sure the FQDN of the Traffic Monitor is resolvable in DNS.

  3. Install Traffic Monitor, either from source or by using yum(8) or rpm(8) to install a traffic_monitor-<version string>.rpm package generated by the instructions in Building Traffic Control

  4. Configure Traffic Monitor according to Configuring Traffic Monitor

  5. Start Traffic Monitor, usually by starting its systemd(1) service

  6. Verify Traffic Monitor is running by e.g. opening your preferred web browser to port 80 on the Traffic Monitor host.
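The final verification step can also be scripted instead of done in a browser. The following is a minimal sketch, not part of Traffic Monitor itself: it only checks that something accepts TCP connections on the given port. The function name is_listening is hypothetical, and port 80 is taken from the step above.

```python
import socket


def is_listening(host: str, port: int = 80, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened.

    This confirms only that a listener is present; it does not prove
    that the listener is Traffic Monitor or that it is healthy.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False
```

Calling is_listening with the FQDN of the Traffic Monitor host returns True once the service is accepting connections; a False result right after starting the service may simply mean it has not finished starting.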

Configuring Traffic Monitor

Configuration Overview

Traffic Monitor is configured via two JSON configuration files, traffic_ops.cfg and traffic_monitor.cfg, by default located in the conf directory in the install location. traffic_ops.cfg contains Traffic Ops connection information. Specify the URL, username, and password for the instance of Traffic Ops of which this Traffic Monitor is a member. traffic_monitor.cfg contains log file locations, as well as detailed application configuration variables such as processing flush times, initial poll intervals, and the polling protocols. Once started with the correct configuration, Traffic Monitor downloads its configuration from Traffic Ops and begins polling cache servers. Once every cache server has been polled, Health Protocol state is available via RESTful JSON endpoints and a web browser UI.

Polling protocol can be set for peers (peer_polling_protocol in traffic_monitor.cfg) and caches (cache_polling_protocol in traffic_monitor.cfg) and has 3 options:

ipv4only

Traffic Monitor will communicate with the peers or caches only over IPv4

ipv6only

Traffic Monitor will communicate with the peers or caches only over IPv6 (for peers, this is useful when the other Traffic Monitors are only available over IPv6)

both (the default)

Traffic Monitor will alternate its communication between IPv4 and IPv6 (note: this does not change the polling frequency, so if the polling interval is 1 second, IPv4 will be polled every 2 seconds)

Note

both will poll IPv4 and IPv6 and report on availability based on whether the respective IP addresses are defined on the server. So if only an IPv4 address is defined and the protocol is set to both, it will only show the availability over IPv4, but if both addresses are defined, it will show availability based on IPv4 and IPv6.
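The effect of the both option on per-family polling can be illustrated with a small model. This is a sketch of the behavior described above, not Traffic Monitor's actual scheduler: the function poll_schedule is hypothetical, and the choice to start with IPv4 is an assumption made only for the illustration.

```python
from typing import List, Tuple

VALID_PROTOCOLS = ("ipv4only", "ipv6only", "both")


def poll_schedule(protocol: str, interval_s: float, polls: int) -> List[Tuple[float, str]]:
    """Return (time, address family) pairs for the first `polls` polls of one target."""
    if protocol not in VALID_PROTOCOLS:
        raise ValueError(f"unknown polling protocol: {protocol!r}")
    schedule = []
    for i in range(polls):
        if protocol == "ipv4only":
            family = "IPv4"
        elif protocol == "ipv6only":
            family = "IPv6"
        else:
            # "both": alternate families; the poll interval itself is unchanged.
            # Starting with IPv4 is an assumption for this sketch.
            family = "IPv4" if i % 2 == 0 else "IPv6"
        schedule.append((i * interval_s, family))
    return schedule


schedule = poll_schedule("both", 1.0, 4)
ipv4_times = [t for t, family in schedule if family == "IPv4"]
print(ipv4_times)  # [0.0, 2.0]
```

With a 1 second interval, each address family is polled every 2 seconds, which matches the note for the both option.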

Optional Stat Polling

Traffic Monitor has the option to disable stat polling via the stat_polling (default: true) option in traffic_monitor.cfg. If set to false, Traffic Monitor will not poll caches for stats; it will only poll caches for health. This can be useful in lowering the amount of resources (CPU, bandwidth) used by Traffic Monitor while still allowing it to retain its core functionality (determining cache availability) via health polling alone. However, disabling stat polling also prevents some other ATC features from working properly (basically anything that requires stats data from caches, e.g. Traffic Stats data), so it should only be disabled when absolutely necessary.

Peering and Optimistic Quorum

As mentioned in the Health Protocol section of the Traffic Monitor overview, peering a Traffic Monitor with one or more other Traffic Monitors enables the optimistic health protocol. In order to leverage the optimistic quorum feature along with the optimistic health protocol, a minimum of three Traffic Monitors are required. The optimistic quorum feature allows a Traffic Monitor to withdraw itself from the optimistic health protocol when it loses connectivity to a number of its peers.

To enable the optimistic quorum feature, the peer_optimistic_quorum_min property in traffic_monitor.cfg should be configured with a value greater than zero that specifies the minimum number of peers that must be available in order to participate in the optimistic health protocol. If at any time the number of available peers falls below this threshold, the local Traffic Monitor will serve 503s whenever the aggregated, optimistic health protocol enabled view of the CDN’s health is requested. Traffic Monitor will continue serving 503s and logging errors in traffic_monitor.log until the minimum number of peers is available. Once the minimum number of peers is available, the local Traffic Monitor can resume participation in the optimistic health protocol. This prevents negative states caused by network isolation of a Traffic Monitor from propagating to downstream components such as Traffic Router.
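The quorum rule can be summarized as a simple decision function. This is an illustrative sketch under stated assumptions, not Traffic Monitor's implementation: the function name aggregated_view_status is hypothetical, and treating a value of zero or less as "feature disabled" is inferred from the requirement that the property be greater than zero to enable the feature.

```python
def aggregated_view_status(available_peers: int, peer_optimistic_quorum_min: int) -> int:
    """Return the HTTP status served for the aggregated, optimistic view of CDN health."""
    if peer_optimistic_quorum_min <= 0:
        # Assumption: a non-positive value leaves the quorum feature off.
        return 200
    if available_peers < peer_optimistic_quorum_min:
        # Below quorum: withdraw from the optimistic health protocol.
        return 503
    return 200


# Hypothetical deployment of three Traffic Monitors, so each has two peers.
for peers in (2, 1, 0):
    print(peers, aggregated_view_status(peers, peer_optimistic_quorum_min=2))
```

In this example, losing a single peer is enough for the local Traffic Monitor to start serving 503s, which is the intended behavior when a Traffic Monitor may be the isolated one.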

Stat and Health Flush Configuration

Traffic Monitor has a health flush interval, a stat flush interval, and a stat buffer interval. Recall that Traffic Monitor polls both stats and health. The health poll is so small and fast that a buffer is largely unnecessary. However, in a large CDN, the stat poll may involve thousands of cache servers with thousands of stats each, or more, and CPU may be a bottleneck.

The flush intervals, health_flush_interval_ms and stat_flush_interval_ms, indicate how often to flush stats or health, if results are continuously coming in with no break. This prevents starvation. Ideally, if there is enough CPU, the flushes should never occur. The default flush times are 200 milliseconds, which is suggested as a reasonable starting point; operators may adjust them higher or lower depending on the need to get health data and stop directing client traffic to unhealthy cache servers as quickly as possible, balanced by the need to reduce CPU usage.

The stat buffer interval, stat_buffer_interval_ms, also provides a temporal buffer for stat processing. Stats will not be processed except after this interval, whereupon all pending stats will be processed, unless the flush interval occurs as a starvation safety. The stat buffer and flush intervals may be thought of as a state machine with two states: the “buffer state” accepts results until the buffer interval has elapsed, whereupon the “flush state” is entered, and results are accepted while outstanding, and processed either when no results are outstanding or the flush interval has elapsed.

Note that this means the stat buffer interval acts as “bufferbloat,” increasing the average and maximum time a cache server may be down before it is processed and marked as unhealthy. If the stat buffer interval is non-zero, the average time a cache server may be down before being marked unavailable is half the poll time plus half the stat buffer interval, and the maximum time is the poll time plus the stat buffer interval. For example, if the stat poll time is 6 seconds, and the stat buffer interval is 4 seconds, the average time a cache server may be unhealthy before being marked is \(\frac{6}{2} + \frac{4}{2} = 5\) seconds, and the maximum time is \(6+4=10\) seconds. For this reason, if operators feel the need to add a stat buffer interval, it is recommended to start with a very low duration, such as 5 milliseconds, and increase as necessary.
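The rule above (average is half the poll time plus half the stat buffer interval, maximum is the poll time plus the stat buffer interval) is easy to evaluate for candidate settings. The helper below is a hypothetical convenience for that arithmetic only; it does not model CPU load or the flush intervals. The 10 second poll time is an assumed value, and the 5 millisecond buffer is the suggested starting point.

```python
def detection_delay_ms(poll_ms: float, stat_buffer_ms: float) -> tuple:
    """Return (average, maximum) time in ms a cache server may be down before it is marked."""
    average = poll_ms / 2 + stat_buffer_ms / 2
    maximum = poll_ms + stat_buffer_ms
    return average, maximum


# A 10 second stat poll with the suggested 5 millisecond starting buffer.
avg_ms, max_ms = detection_delay_ms(poll_ms=10_000, stat_buffer_ms=5)
print(avg_ms, max_ms)  # 5002.5 10005
```

A small buffer adds only a few milliseconds to detection time, whereas a buffer comparable to the poll time increases the average delay substantially.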

It is not recommended to set either flush interval to 0, regardless of the stat buffer interval. This will cause new results to be immediately processed, with little to no processing of multiple results concurrently. Result processing does not scale linearly. For example, processing 100 results at once does not cost significantly more CPU usage or time than processing 10 results at once. Thus, a flush interval which is too low will cause increased CPU usage, and potentially increased overall poll times, with little or no benefit. The default value of 200 milliseconds is recommended as a starting point for configuration tuning.

HTTP Accept Header Configuration

The Accept header sent to caches for stat retrieval can be modified with the http_polling_format option. This is a string that will be inserted into the Accept header of any requests. The default value is text/json, which is the value the astats plugin currently uses by default.

However, newer versions of astats also support CSV output, which can save some CPU. To enable that format, set http_polling_format: "text/csv" in traffic_monitor.cfg; the Accept header will then be set accordingly.
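The options discussed in the configuration sections so far can be gathered into one fragment. The sketch below builds that fragment and prints it as JSON. It assumes the options are top-level keys of traffic_monitor.cfg, which this document does not state explicitly, and a real traffic_monitor.cfg contains other settings (such as log file locations) that are not shown. Values marked illustrative are examples rather than documented defaults.

```python
import json

# Only the options named in this document; not a complete traffic_monitor.cfg.
traffic_monitor_cfg = {
    "peer_polling_protocol": "both",    # documented default
    "cache_polling_protocol": "both",   # documented default
    "stat_polling": True,               # documented default
    "peer_optimistic_quorum_min": 2,    # illustrative; any value > 0 enables the feature
    "health_flush_interval_ms": 200,    # documented default
    "stat_flush_interval_ms": 200,      # documented default
    "stat_buffer_interval_ms": 0,       # illustrative value
    "http_polling_format": "text/json", # documented default; "text/csv" selects CSV
}

print(json.dumps(traffic_monitor_cfg, indent=4))
```

Printing the fragment rather than writing it to disk keeps the sketch from overwriting an existing configuration; merge the relevant keys into the installed file by hand.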

Troubleshooting and Log Files

Traffic Monitor log files are in /opt/traffic_monitor/var/log/.

Extensions

Traffic Monitor allows extensions to its parsers for the statistics returned by cache servers and/or their plugins. The formats supported by Traffic Monitor by default are astats, astats-dsnames (which is an odd variant of astats that probably shouldn’t be used), and stats_over_http. The format of a cache server’s health and statistics reporting payloads must be declared on its Profile as the health.polling.format Parameter, or the default format (astats) will be assumed.

For instructions on how to develop a parsing extension, refer to the github.com/apache/trafficcontrol/traffic_monitor/cache package’s documentation.

Importantly, though, a statistics provider must respond to HTTP GET requests over either plain HTTP or HTTPS (which is controlled by the health.polling.url Parameter), and it must provide the following statistics, or enough information to calculate them:

  • System “loadavg” (only requires the one-minute value)

    See also

    For more information on what “loadavg” is, refer to the proc(5) manual page.

  • Input bytes, output bytes, and speeds for all monitored network interfaces

When using the stats_over_http extension, this can be provided by the system_stats plugin, which injects that information into the ATS stats that stats_over_http then returns. The system_stats plugin can also be used with custom implementations, as it is already included and built with ATS when building with experimental-plugins enabled.

There are other optional and/or Delivery Service-related statistics that may cause Traffic Stats to not have the right information if not provided, but the above are essential for implementing Health Protocol.
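For a custom statistics provider, the one-minute "loadavg" value can be taken from the first field of /proc/loadavg as described in proc(5). The following is a minimal sketch of that extraction; the function name one_minute_loadavg is hypothetical, the sample line is made up for illustration, and how the value is then placed into the provider's payload depends on the format being implemented.

```python
def one_minute_loadavg(loadavg_line: str) -> float:
    """Extract the one-minute load average from a /proc/loadavg style line."""
    fields = loadavg_line.split()
    if not fields:
        raise ValueError("empty loadavg line")
    # proc(5): the first three fields are the 1, 5, and 15 minute averages.
    return float(fields[0])


sample = "0.20 0.18 0.12 1/80 11206"  # made-up example line
print(one_minute_loadavg(sample))  # 0.2
```

On a Linux cache server, the same function can be applied to the contents of /proc/loadavg read at poll time.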