Traffic Monitor Administration

Installing Traffic Monitor

The following are hard requirements for Traffic Monitor to operate:

CentOS 7 or later
Successful install of Traffic Ops (usually on a separate machine)
Administrative access to the Traffic Ops instance (usually on a separate machine)

These are the recommended hardware specifications for a production deployment of Traffic Monitor:

8 CPUs
16GB of RAM
It is also recommended that you know the geographic coordinates and/or mailing address of the site where the Traffic Monitor machine lives for optimal performance

Enter the Traffic Monitor server into Traffic Portal

Note

For legacy compatibility reasons, the ‘Type’ field of a new Traffic Monitor server must be ‘RASCAL’.
Make sure the FQDN of the Traffic Monitor is resolvable in DNS.
Install Traffic Monitor, either from source or by using yum(8) or rpm(8) to install a traffic_monitor-<version string>.rpm package generated by following the instructions in Building Traffic Control
Configure Traffic Monitor according to Configuring Traffic Monitor
Start Traffic Monitor, usually by starting its systemd(1) service
Verify Traffic Monitor is running by e.g. opening your preferred web browser to port 80 on the Traffic Monitor host.

Configuring Traffic Monitor

Configuration Files

Traffic Monitor is configured via two JSON configuration files, traffic_ops.cfg and traffic_monitor.cfg, by default located in the conf directory in the install location.

traffic_ops.cfg

traffic_ops.cfg contains Traffic Ops connection information. Specify the URL, username, and password for the instance of Traffic Ops of which this Traffic Monitor is a member. This file also holds some settings for the Traffic Monitor API server.

cdnName:: The name of the CDN to which this Traffic Monitor belongs. Used to fetch configuration and to determine which cache servers to monitor.
certFile:: The path to an SSL certificate file that corresponds to keyFile which will be used for Traffic Monitor’s HTTPS API server.
httpListener:: Sets the address and port on which Traffic Monitor will listen for HTTP requests in the format address:port. If address is omitted, Traffic Monitor will listen on all available addresses.
httpsListener:: Sets the address and port on which Traffic Monitor will listen for HTTPS requests in the format address:port. If address is omitted, Traffic Monitor will listen on all available addresses. If not provided, null, or the empty string, Traffic Monitor will only serve HTTP, and keyFile and certFile are not used. If this is provided, the httpListener address will be used only to redirect clients to use HTTPS.
insecure:: A boolean that controls whether to validate the HTTPS certificate presented by the Traffic Ops server.
keyFile:: The path to an SSL key file that corresponds to certFile which will be used for Traffic Monitor’s HTTPS API server.
password:: The password of the user identified by username.
url:: The URL at which Traffic Ops may be reached e.g. "https://trafficops.infra.ciab.test".
username:: The username of the user as whom to authenticate with Traffic Ops.
usingDummyTO:: A boolean with no real effect. This value is used internally within the runtime of Traffic Monitor, and should never be set manually in its configuration file.

Deprecated since version ATCv7: The dependency on this field being valid will be removed in the future. It already has no effect.
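
The relationship between httpsListener, certFile, and keyFile described above can be checked before starting Traffic Monitor. The following is a minimal sketch, not part of Traffic Monitor itself: the function name check_traffic_ops_cfg is hypothetical, the example values (CDN name, credentials, listener address) are placeholders, and treating a missing certFile or keyFile as an error when httpsListener is set is an assumption based on the field descriptions.

```python
import json


def check_traffic_ops_cfg(cfg):
    """Return a list of problems found in a traffic_ops.cfg-style mapping.

    Assumption: when httpsListener is set, both certFile and keyFile
    should be present, since they are used by the HTTPS API server.
    """
    problems = []
    # An absent, null, or empty httpsListener means HTTP only, in which
    # case keyFile and certFile are not used and need not be checked.
    if cfg.get("httpsListener"):
        for field in ("certFile", "keyFile"):
            if not cfg.get(field):
                problems.append("httpsListener is set but %s is missing" % field)
    return problems


# Placeholder values for illustration only.
example = json.loads("""
{
  "cdnName": "example-cdn",
  "url": "https://trafficops.infra.ciab.test",
  "username": "example-user",
  "password": "example-password",
  "insecure": false,
  "httpListener": ":80",
  "httpsListener": ""
}
""")

print(check_traffic_ops_cfg(example))
```

Returning a list of problems rather than raising on the first one lets an operator see every inconsistency in a single pass.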

traffic_monitor.cfg

traffic_monitor.cfg contains log file locations, as well as detailed application configuration variables such as processing flush times, initial poll intervals, and the polling protocols. Once started with the correct configuration, Traffic Monitor downloads its configuration from Traffic Ops, and any Parameters set on the Monitor’s Profile that configure the same thing as a field in this configuration file will take precedence over said fields. The Parameters known to override configuration here are:

tm.polling.interval
health.polling.interval
peers.polling.interval
heartbeat.polling.interval

Upon receiving this configuration, Traffic Monitor begins polling cache servers. Once every cache server has been polled, Health Protocol state is available via RESTful JSON endpoints and a web browser UI.

cache_polling_protocol:: Defines the internet protocol used to communicate with cache servers. This can be “ipv4only” to only allow IPv4 communication, “ipv6only” to only allow IPv6 communication, or “both” to alternate between each version. Default is “both”.

Note

When set to both, Traffic Monitor polls over IPv4 and IPv6 and reports availability according to which IP addresses are defined on the server. If only an IPv4 address is defined, it will only show availability over IPv4; if both addresses are defined, it will show availability based on both IPv4 and IPv6.
crconfig_backup_file:: The path to a file within which a backup of the most recently fetched CDN Snapshot will be stored. Default is /opt/traffic_monitor/crconfig.backup.
crconfig_history_count:: The number of historical CDN Snapshots to store, which can then be retrieved through the Traffic Monitor APIs. Default is 100.
distributed_polling:: A boolean that controls whether Distributed Polling is enabled. Default is false.

See also

The Distributed Polling section has more information on this setting.
health_flush_interval_ms:: Defines an interval as a number of milliseconds on which Traffic Monitor will flush its collected health data such that it is made available through the Traffic Monitor APIs. Default is 200.

See also

The Stat and Health Flush Configuration section has more information on this setting.
http_polling_format:: A MIME-Type that will be sent in the Accept HTTP header in requests to cache servers for health and stats data. Default is text/json (not application/json).

See also

The HTTP Accept Header Configuration section has more information on this setting.
http_timeout_ms:: Sets the timeout duration - in milliseconds - for all HTTP operations (both peer-polling and stat/health data polling). Default is 2000.
log_location_access:: A logfile location to which access logs will be written, or null to not log access events.[1] Default is null.
log_location_debug:: A logfile location to which debug logs will be written, or null to not log debug messages.[1] Default is null.
log_location_error:: A logfile location to which error logs will be written, or null to not log error messages.[1] Default is “stderr”.
log_location_event:: A logfile location to which event logs will be written, or null to not log events.[1] Default is “stdout”.
log_location_info:: A logfile location to which informational logs will be written, or null to not log informational messages.[1] Default is null.
log_location_warning:: A logfile location to which warning logs will be written, or null to not log warning messages.[1] Default is “stdout”.
max_events:: The maximum number of changes to stored aggregate data that should be retained at any one time. Default is 200.
monitor_config_polling_interval_ms:: The interval - in milliseconds - on which to poll Traffic Ops for this Traffic Monitor’s “monitoring configuration” as returned by cdns/{{name}}/configs/monitoring.
peer_optimistic_quorum_min:: Specifies the minimum number of peers that must be available in order to participate in the optimistic health protocol. Default is zero.

See also

The Peering and Optimistic Quorum section has more information on this setting.
serve_read_timeout_ms:: Sets the timeout - in milliseconds - of the Traffic Monitor API server for reading incoming requests. Default is 10,000.
serve_write_timeout_ms:: Sets the timeout - in milliseconds - of the Traffic Monitor API server for writing responses. Default is 10,000.
short_hostname_override:: Sets a hostname for the Traffic Monitor. It will behave as though this were its hostname, rather than the hostname actually reported by the operating system. If not provided, null, or the empty string, the Traffic Monitor will use the hostname provided by its host operating system. Default is the empty string.
stat_buffer_interval_ms:: An interval - in milliseconds - for which to buffer collected stats before processing them. If this is not provided, null, or zero, then all stats will be processed immediately. Default is zero.

See also

The Stat and Health Flush Configuration section has more information on this setting.
stat_flush_interval_ms:: Defines an interval as a number of milliseconds on which Traffic Monitor will flush its collected stats data such that it is made available through the Traffic Monitor APIs. Default is 200.

See also

The Stat and Health Flush Configuration section has more information on this setting.
stat_polling:: A boolean that controls whether cache servers are polled for stats data. Default is true.

See also

The Optional Stat Polling section has more information on this setting.
static_file_dir:: The directory within which Traffic Monitor will look for its web interface’s static files. Default is /opt/traffic_monitor/static.
tmconfig_backup_file:: A file location to which a backup of the “monitoring configuration” as returned by cdns/{{name}}/configs/monitoring currently in use by Traffic Monitor will be written. Default is /opt/traffic_monitor/tmconfig.backup.
traffic_ops_disk_retry_max:: The number of times Traffic Monitor should attempt to log in to Traffic Ops before using its backup monitoring configuration and CDN Snapshot (if those exist). Default is 2.
traffic_ops_max_retry_interval_ms:: Traffic Monitor will exponentially increase the amount of time it waits between attempts to log in to Traffic Ops each time it fails (up to a maximum number of times set by traffic_ops_disk_retry_max). This controls the maximum amount of time - in milliseconds - that this waiting duration will be. Default is 60,000.
traffic_ops_min_retry_interval_ms:: Traffic Monitor will exponentially increase the amount of time it waits between attempts to log in to Traffic Ops each time it fails (up to a maximum number of times set by traffic_ops_disk_retry_max). This controls the minimum amount of time - in milliseconds - that this waiting duration will be. Default is 100.
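
The interaction of the three traffic_ops_* retry settings can be illustrated with a short calculation. This is a sketch under stated assumptions, not Traffic Monitor's actual implementation: the function name retry_waits is hypothetical, and the doubling factor of 2 is assumed (the text above only says the wait grows exponentially between the minimum and maximum).

```python
def retry_waits(min_ms=100, max_ms=60_000, attempts=2, factor=2):
    """Return the wait, in milliseconds, before each login retry.

    min_ms   -> traffic_ops_min_retry_interval_ms (default 100)
    max_ms   -> traffic_ops_max_retry_interval_ms (default 60,000)
    attempts -> traffic_ops_disk_retry_max (default 2)
    factor   -> ASSUMED growth factor; not documented above
    """
    waits = []
    wait = min_ms
    for _ in range(attempts):
        # The wait never exceeds the configured maximum.
        waits.append(min(wait, max_ms))
        wait *= factor
    return waits


print(retry_waits())             # waits with the documented defaults
print(retry_waits(attempts=12))  # shows the cap at max_ms taking effect
```

Under these assumptions the default settings give two short waits before Traffic Monitor falls back to its on-disk backups, and the maximum only matters when traffic_ops_disk_retry_max is raised substantially.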

Optional Stat Polling

Traffic Monitor has the option to disable stat polling via the stat_polling (default: true) option in traffic_monitor.cfg. If set to false, Traffic Monitor will not poll caches for stats; it will only poll caches for health. This can be useful in lowering the amount of resources (CPU, bandwidth) used by Traffic Monitor while still allowing it to retain its core functionality (determining cache availability) via health polling alone. However, disabling stat polling also prevents some other ATC features from working properly (basically anything that requires stats data from caches, e.g. Traffic Stats data), so it should only be disabled when absolutely necessary.

Distributed Polling

Traffic Monitor has the option to enable distributed polling via the distributed_polling (default: false) option in traffic_monitor.cfg. If set to true, Traffic Monitor groups will each poll their own disjoint subsets of the CDN. In order to enable this option, stat_polling must be disabled. In order to function properly, all Traffic Monitors in a CDN must have distributed_polling enabled; otherwise, the results are undefined.
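
The constraint between distributed_polling and stat_polling can be checked on a parsed traffic_monitor.cfg before deployment. The sketch below is illustrative only; the function name check_polling_options is hypothetical, and the defaults it falls back to are the ones documented above.

```python
def check_polling_options(cfg):
    """Raise ValueError if distributed polling is enabled with stat polling on.

    cfg is a dict parsed from traffic_monitor.cfg; missing keys fall back
    to the documented defaults (distributed_polling false, stat_polling true).
    """
    distributed = cfg.get("distributed_polling", False)
    stat = cfg.get("stat_polling", True)
    if distributed and stat:
        raise ValueError(
            "distributed_polling requires stat_polling to be false"
        )


# A valid distributed-polling configuration fragment.
check_polling_options({"distributed_polling": True, "stat_polling": False})
```

Note that enabling distributed_polling without explicitly setting stat_polling to false fails this check, because stat_polling defaults to true.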

Note

Traffic Monitors are said to be in the same “Traffic Monitor group” if they are in the same Cache Group.

Each Traffic Monitor in the same Traffic Monitor group (referred to as local peers) polls the same disjoint subset of the CDN and combines availability states with its local peers via the Health Protocol. This is similar to how Traffic Monitor behaves in its legacy, non-distributed mode except Traffic Monitor is not polling the entire CDN. In order to get availability data for the rest of the CDN, each Traffic Monitor also polls every other Traffic Monitor group in parallel (these are referred to as distributed peers). It does this by selecting one distributed peer per group at a time, cycling through each distributed peer in the group for subsequent polls in a round-robin manner.
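
The round-robin selection of distributed peers can be sketched as follows. This is a simplified illustration rather than Traffic Monitor's code: the function names and the group and host names are hypothetical, and every group is assumed to contain at least one peer.

```python
from itertools import cycle


def make_peer_selectors(groups):
    """Build one round-robin iterator per remote Traffic Monitor group.

    groups maps a group name to a non-empty list of peer hostnames.
    """
    return {name: cycle(peers) for name, peers in groups.items()}


def next_poll_targets(selectors):
    """Pick one distributed peer per group for the next poll."""
    return {name: next(sel) for name, sel in selectors.items()}


# Hypothetical groups and hostnames.
groups = {"east": ["tm-e1", "tm-e2"], "west": ["tm-w1"]}
selectors = make_peer_selectors(groups)

for _ in range(3):
    print(next_poll_targets(selectors))
```

Each call yields exactly one peer per group, and successive calls rotate through the peers of each group independently.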

Upon startup, Traffic Monitor will retrieve its config (either from TO or on-disk backup file), then begin polling the Cache Groups for which its Traffic Monitor group is responsible. Once it has polled the Cache Groups, it will start serving requests for /publish/CrStates?raw (the raw, uncombined health states of its local caches) and /publish/CrStates?local (the combined health states of its local caches derived from all Traffic Monitors in its group). Once Traffic Monitor has received /publish/CrStates?local responses from all other Traffic Monitor groups, it will start serving requests for /publish/CrStates (the combined health states of all caches in the CDN).

Peering and Optimistic Quorum

As mentioned in the Health Protocol section of the Traffic Monitor overview, peering a Traffic Monitor with one or more other Traffic Monitors enables the optimistic health protocol. In order to leverage the optimistic quorum feature along with the optimistic health protocol, a minimum of three Traffic Monitors are required. The optimistic quorum feature allows a Traffic Monitor to withdraw itself from the optimistic health protocol when it loses connectivity to a number of its peers.

To enable the optimistic quorum feature, the peer_optimistic_quorum_min property in traffic_monitor.cfg should be configured with a value greater than zero that specifies the minimum number of peers that must be available in order to participate in the optimistic health protocol. If at any time the number of available peers falls below this threshold, the local Traffic Monitor will serve 503s whenever the aggregated, optimistic health protocol enabled view of the CDN’s health is requested. Traffic Monitor will continue serving 503s and logging errors in traffic_monitor.log until the minimum number of peers are available. Once the minimum number of peers are available, the local Traffic Monitor can resume participation in the optimistic health protocol. This prevents negative states caused by network isolation of a Traffic Monitor from propagating to downstream components such as Traffic Router.
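
The quorum rule described above reduces to a single comparison. The following sketch shows the decision only; the function name combined_state_status is hypothetical, and the use of 200 for the successful case is an assumption made for illustration.

```python
def combined_state_status(available_peers, peer_optimistic_quorum_min=0):
    """Return the HTTP status for a request for the combined CDN health view.

    A peer_optimistic_quorum_min of zero (the default) leaves the
    optimistic quorum feature disabled, so the check never fails.
    """
    quorum_enabled = peer_optimistic_quorum_min > 0
    if quorum_enabled and available_peers < peer_optimistic_quorum_min:
        return 503  # too few peers reachable: withdraw from the protocol
    return 200      # ASSUMED success status for illustration


print(combined_state_status(available_peers=1, peer_optimistic_quorum_min=2))
print(combined_state_status(available_peers=2, peer_optimistic_quorum_min=2))
```

With three Traffic Monitors and a minimum of two, a monitor that loses contact with one peer is below the threshold and serves 503s until the second peer becomes reachable again.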

Stat and Health Flush Configuration

The Monitor has a health flush interval, a stat flush interval, and a stat buffer interval. Recall that the Monitor polls both stats and health. The health poll is so small and fast that a buffer is largely unnecessary. However, in a large CDN, the stat poll may involve thousands of cache servers with thousands of stats each, or more, and CPU may be a bottleneck.

The flush intervals, health_flush_interval_ms and stat_flush_interval_ms, indicate how often to flush stats or health, if results are continuously coming in with no break. This prevents starvation. Ideally, if there is enough CPU, the flushes should never occur. The default flush times are 200 milliseconds, which is suggested as a reasonable starting point; operators may adjust them higher or lower depending on the need to get health data and stop directing client traffic to unhealthy cache servers as quickly as possible, balanced by the need to reduce CPU usage.

The stat buffer interval, stat_buffer_interval_ms, also provides a temporal buffer for stat processing. Stats will not be processed except after this interval, whereupon all pending stats will be processed, unless the flush interval occurs as a starvation safety. The stat buffer and flush intervals may be thought of as a state machine with two states: the “buffer state” accepts results until the buffer interval has elapsed, whereupon the “flush state” is entered, and results are accepted while outstanding, and processed either when no results are outstanding or the flush interval has elapsed.

Note that this means the stat buffer interval acts as “bufferbloat,” increasing the average and maximum time a cache server may be down before it is processed and marked as unhealthy. If the stat buffer interval is non-zero, the average time a cache server may be down before being marked unavailable is half the poll time plus half the stat buffer interval, and the maximum time is the poll time plus the stat buffer interval. For example, if the stat poll time is 6 seconds, and the stat buffer interval is 4 seconds, the average time a cache server may be unhealthy before being marked is \(\frac{6}{2} + \frac{4}{2} = 5\) seconds, and the maximum time is \(6+4=10\) seconds. For this reason, if operators feel the need to add a stat buffer interval, it is recommended to start with a very low duration, such as 5 milliseconds, and increase as necessary.
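
The average and maximum delay formulas can be evaluated directly when choosing a stat buffer interval. The helper below is a small sketch; the function name detection_delay is hypothetical, and both arguments must use the same time unit.

```python
def detection_delay(poll_time, stat_buffer_interval):
    """Return (average, maximum) time a cache server may be down unnoticed.

    average = poll_time / 2 + stat_buffer_interval / 2
    maximum = poll_time + stat_buffer_interval
    """
    average = poll_time / 2 + stat_buffer_interval / 2
    maximum = poll_time + stat_buffer_interval
    return average, maximum


# A 6 second stat poll time with a 4 second stat buffer interval.
print(detection_delay(6, 4))  # (5.0, 10)
```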

It is not recommended to set either flush interval to 0, regardless of the stat buffer interval. This will cause new results to be immediately processed, with little to no processing of multiple results concurrently. Result processing does not scale linearly. For example, processing 100 results at once does not cost significantly more CPU usage or time than processing 10 results at once. Thus, a flush interval which is too low will cause increased CPU usage, and potentially increased overall poll times, with little or no benefit. The default value of 200 milliseconds is recommended as a starting point for configuration tuning.

HTTP Accept Header Configuration

The Accept header sent to caches for stat retrieval can be modified with the http_polling_format option. This is a string that will be inserted into the Accept header of any requests. The default value is text/json, which is the default currently used by the astats plugin.

However, newer versions of astats also support CSV output, which can save some CPU. To enable that format, set http_polling_format to "text/csv" in traffic_monitor.cfg; this sets the Accept header accordingly.
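
To see what the option changes on the wire, the sketch below builds (but does not send) a polling request with the configured value in its Accept header. It uses Python's standard urllib purely for illustration; the function name build_poll_request and the cache server URL are hypothetical and do not reflect how Traffic Monitor constructs its requests internally.

```python
import urllib.request


def build_poll_request(url, http_polling_format="text/json"):
    """Build a GET request whose Accept header is the configured format."""
    return urllib.request.Request(
        url, headers={"Accept": http_polling_format}
    )


# Hypothetical cache server URL; no network traffic is generated here.
req = build_poll_request("http://cache.example.test/_astats", "text/csv")
print(req.get_header("Accept"))
```

A cache server whose stats plugin does not understand the requested type may still answer in its default format, so it is worth confirming plugin support before changing this setting across a CDN.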

Troubleshooting and Log Files

Traffic Monitor log files are in /opt/traffic_monitor/var/log/.

Extensions

Traffic Monitor allows extensions to its parsers for the statistics returned by cache servers and/or their plugins. The formats supported by Traffic Monitor by default are astats, astats-dsnames (which is an odd variant of astats that probably shouldn’t be used), and stats_over_http. The format of a cache server’s health and statistics reporting payloads must be declared on its Profile as the health.polling.format Parameter, or the default format (astats) will be assumed.

For instructions on how to develop a parsing extension, refer to the github.com/apache/trafficcontrol/traffic_monitor/cache package’s documentation.

Importantly, though, a statistics provider must respond to HTTP GET requests over either plain HTTP or HTTPS (which is controlled by the health.polling.url Parameter), and it must provide the following statistics, or enough information to calculate them:

System “loadavg” (only requires the one-minute value)

See also

For more information on what “loadavg” is, refer to the proc(5) manual page.
Input bytes, output bytes, and speeds for all monitored network interfaces

When using the stats_over_http extension, this can be provided by the system_stats plugin, which injects that information into the ATS stats that are then returned by stats_over_http. The system_stats plugin can be used with any custom implementations, as it is already included and built with ATS when building with experimental-plugins enabled.

There are other optional and/or Delivery Service-related statistics that may cause Traffic Stats to not have the right information if not provided, but the above are essential for implementing Health Protocol.
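
For authors of a custom statistics provider, the one-minute “loadavg” value mentioned above is the first field of /proc/loadavg on Linux, as described in proc(5). The following sketch parses that format; the function name one_minute_loadavg is hypothetical and the sample line is made up for illustration.

```python
def one_minute_loadavg(text):
    """Extract the one-minute load average from /proc/loadavg content.

    The file holds the 1, 5, and 15 minute averages first, followed by
    scheduling-entity counts and the most recently created PID.
    """
    return float(text.split()[0])


# Made-up sample in the /proc/loadavg format; on a Linux host the text
# could instead be read from the file itself.
sample = "0.20 0.18 0.12 1/80 11206\n"
print(one_minute_loadavg(sample))
```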