The Need For Continual Monitoring

2.4GHz interference advances with the determination of General George S. Patton, perhaps with General “Chesty” Puller as a teammate. BYOD, IoT, and STUFF fuel the onslaught. Because so many devices (Wi-Fi, other types of communication devices, and unintentional interference sources) are being haphazardly thrown into the 2.4GHz band, both now and in the immediate future, average utilization on 2.4GHz ISM channels continues to rise very quickly. The rise in 2.4GHz “noise” requires higher signal levels from Wi-Fi devices in order to achieve a usable Signal-to-Noise Ratio (SNR), which in turn greatly exacerbates co-channel contention (CCC) and adjacent channel interference (ACI). To make matters worse, there are only 3 non-overlapping channels in the 2.4GHz ISM band, with no more on the way. When it comes to using 2.4GHz for Wi-Fi, it’s a death spiral with no hope of recovery.
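To put rough numbers on the noise-versus-signal point, here is a minimal sketch of the arithmetic. SNR in dB is simply the received signal level minus the noise floor (both in dBm), so every dB the noise floor rises is a dB more signal you need. The 25 dB target and the noise floor values are assumptions chosen for illustration, not measurements or vendor recommendations.

```python
def snr_db(signal_dbm: float, noise_dbm: float) -> float:
    """SNR in dB is the received signal level minus the noise floor."""
    return signal_dbm - noise_dbm


def required_signal_dbm(noise_dbm: float, target_snr_db: float) -> float:
    """Signal level needed to reach a target SNR at a given noise floor."""
    return noise_dbm + target_snr_db


# Assumed target SNR, purely for illustration.
TARGET_SNR_DB = 25.0

# Hypothetical noise floors: a quiet band versus increasingly noisy ones.
for noise_dbm in (-95.0, -90.0, -85.0):
    needed = required_signal_dbm(noise_dbm, TARGET_SNR_DB)
    print(f"noise floor {noise_dbm} dBm -> need {needed} dBm for {TARGET_SNR_DB} dB SNR")
```

A louder signal is heard over a wider area, which is how a rising noise floor feeds CCC and ACI.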

Interference Hinders Performance

This ever-growing tidal wave of interference impedes performance and reliable connectivity in many ways. The only reasonable way to understand your RF environment, and the problems your Wi-Fi is experiencing in real time and over time, is to continually monitor the L1 and L2 environments and reference the data against baselines. I foresee cloud-managed infrastructure being an avenue to collect, store, and analyze such data in the very near future.
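To make “reference the data against baselines” a bit more concrete, here is a minimal sketch of one way it could be done: keep a history of a metric (channel utilization, in this case) and flag a new sample that sits too far from the historical mean. The function name, the sample values, and the two-standard-deviation threshold are all assumptions for illustration; a real monitoring product will have its own (and likely smarter) method.

```python
from statistics import mean, stdev


def deviates_from_baseline(history, current, k=2.0):
    """Return True if `current` is more than k standard deviations
    away from the mean of `history` (needs at least two samples)."""
    mu = mean(history)
    sigma = stdev(history)
    return abs(current - mu) > k * sigma


# Hypothetical channel utilization samples (percent) from a normal week.
baseline_utilization = [22, 25, 24, 27, 23, 26, 25]

print(deviates_from_baseline(baseline_utilization, 26))  # within the normal range
print(deviates_from_baseline(baseline_utilization, 48))  # well outside it
```

The same comparison works for any of the metrics discussed in this post, as long as you have been collecting them long enough to know what “normal” looks like.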

Most analyst firms agree that by CY2020, we’ll have 30+ billion always-connected devices and 200+ billion intermittently-connected devices on the Internet. Wi-Fi has now surpassed Ethernet as the way devices connect to networks. With IoT and BYOD both causing a sharp increase in Wi-Fi and Bluetooth connectivity, and with many (or most) of these devices supporting 2.4GHz, we’re bound to see sharply and continually rising support costs for networks supporting 2.4GHz. It’s a losing outcome that we can absolutely avoid if we start now.

What Do We Need To Monitor?

There is a large variety of useful statistics gathered by performance diagnostic systems from vendors like 7signal Solutions and Fluke Networks (AirMagnet Enterprise), and while you may say to yourself, “Dude, my infrastructure vendor has this covered!”, I challenge you to consider:

APs can only hear half of a Wi-Fi conversation
Non-dedicated APs performing background scanning capture very little useful performance-related information
Due to limited development resources, infrastructure vendors do not focus much time on performance measurement, metrics, alerting, and reporting

Some of the useful Key Performance Indicators (KPIs) and statistics available from these systems include:

L2-L7 infrastructure connectivity validation
- This is useful when “sensors” simulate client behavior, connecting to APs and running L2-L7 operational tests
Spectrum analysis with source identification & effect determination
The number and type of frames (total, per-AP, per-STA, and per connection)
Protection mechanism use (RTS/CTS and CTS-to-Self)
- Use of protection mechanisms for backwards compatibility can quickly lose you 40%+ of a cell’s capacity.
Frame capture and protocol analysis
Channel utilization by media type, noise, and throughput
- This is very important to understanding how much usable airtime is available.
TCP Throughput (uplink & downlink)
SNR for particular APs & STAs
- SNR is a very important factor to drivers (in both APs and Clients) in deciding which data rates to use.
Ping Round Trip Time (RTT)
- This shows the general latency of the network.
VoIP MOS scoring (bi-directionally), including delay, jitter, & packet loss.
- Even when the network isn’t used for VoIP, these can be good indicators of performance and user experience.
L2 Retransmissions (per-AP, per-STA, and aggregate)
- Rules of thumb: <10% is good, 10-50% is acceptable, and >50% is bad. It’s worth noting that when Auto RRM powers APs down too far, it can cause high AP retries while client device retries remain OK.
Beacon Availability (beacons received/expected).
- As my friend Mike Graham at 7signal puts it, “APs were born to beacon.” If an AP radio stops beaconing or if enough beacons aren’t received by client devices (because of a saturated channel or interference sources), clients will be disconnected.
Attach Success Rate
- “Attach” is often defined as “Authentication + Association”, and if this parameter is low, then you may need to dig into whether Authentication or Association is the problem.
Authentication Time
- If your Wi-Fi network doesn’t support a fast/secure roaming algorithm (e.g. Voice-Enterprise, OKC, PreAuthentication, etc.), then authentication time becomes critical to both user experience and application performance. This is an important metric to watch when running 802.1X/EAP over a WAN.
IP Address Retrieval Success Rate and Time
- These metrics will show you whether your DHCP server is responding properly and how fast. While DHCP servers are usually fast enough (there are some exceptions), sometimes client devices don’t get a response at all. Clients then must re-request an IP address, and this makes the overall process very slow. A broken Layer-3 process is the performance equivalent of a broken Layer-2 process because you must have both L2 and L3 connectivity in order to communicate.
AP Signal Level
- This metric should be fairly steady on most APs, even if Auto RRM is enabled. With most RRM-based deployments, there should be high and low boundaries placed on AP power output, and AP signal level shouldn’t be changing constantly (flapping).
Channel Changes
- This metric can indicate a misconfigured or malfunctioning RRM algorithm, where changes are rippling back and forth across the network. This can wreck both performance and user experience.
Airtime Utilization
- This metric is vital to understanding how much airtime is consumed by particular APs or client devices, and what types of transmissions (e.g. frame types) are consuming the airtime. It is useful for validating an infrastructure manufacturer’s Airtime Fairness algorithm or for spotting misconfigured Wi-Fi infrastructure.
Number of Client Devices
- This metric can be per-BSSID and per-AP and will help you understand when load-balancing or power output is misconfigured (or broken) or when APs are not optimally located.
Data Rates (in use between Clients and APs)
- This is one of the most important performance-related metrics. Selective data rate configuration can be used to optimize performance in a big way, but can also cause connectivity problems with some clients. Data Rate configuration goes hand-in-hand with optimal AP placement, so drastic configuration changes without APs being moved can be problematic.
- It’s very common, even after optimal data rate configuration, to see strange data rate behavior – especially from client devices.
5GHz-capable clients using 2.4GHz
- It’s strongly recommended to get as many clients onto 5GHz as possible, and some dual-band client devices will resist moving to 5GHz.
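Several of the KPIs above boil down to simple ratios, so here is a minimal sketch of how a few of them could be computed from raw counters. The function names and sample counter values are hypothetical and not taken from any vendor’s product; the retry thresholds are the rules of thumb given in the list above.

```python
def percentage(part: int, whole: int) -> float:
    """Return part/whole as a percentage, or 0.0 if there is no data."""
    return 100.0 * part / whole if whole else 0.0


def retry_rate(retried_frames: int, total_frames: int) -> float:
    """L2 retransmissions as a percentage of all transmitted frames."""
    return percentage(retried_frames, total_frames)


def classify_retry_rate(pct: float) -> str:
    """Apply the rules of thumb: <10% good, 10-50% acceptable, >50% bad."""
    if pct < 10:
        return "good"
    if pct <= 50:
        return "acceptable"
    return "bad"


def beacon_availability(received: int, expected: int) -> float:
    """Beacons received as a percentage of beacons expected."""
    return percentage(received, expected)


def attach_success_rate(successes: int, attempts: int) -> float:
    """Successful attaches (authentication + association) per attempt."""
    return percentage(successes, attempts)


# Hypothetical counters for one AP over one measurement interval.
rate = retry_rate(retried_frames=1800, total_frames=12000)
print(f"retries: {rate:.1f}% ({classify_retry_rate(rate)})")
print(f"beacon availability: {beacon_availability(570, 600):.1f}%")
print(f"attach success: {attach_success_rate(47, 50):.1f}%")
```

The math is the easy part. The value of a monitoring system is in collecting these counters continuously, per-AP and per-STA, from the client’s point of view.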

Summary

If you can’t see it, you can’t measure it, adequately understand its effects, or plan for the future. When you monitor both L1 and L2, you can see cause and effect, trends, and solutions for interference and malfunctioning networks. When feasible, decisions should be made based on collected data, not wet-finger-in-the-air tests. Wi-Fi is now mission-critical or life-critical in many organizations and vertical markets. Why “not know” when you can “know you know”? Stop spending time and money groping around in the dark (blind troubleshooting) and trying to fix problems that you should never have had. Spend the money to proactively implement a performance monitoring solution that shows you the real deal 24/7. #JustSayin

**I would like to offer special thanks to Mike Graham, Sr. SE at 7signal Solutions, and David Tran, Director of Product Marketing at Fluke Networks, for their help in understanding the breadth of their respective products’ performance monitoring features.**