Classifiers Unclassified

Many network providers apply different policies to different kinds of network traffic. For example, T-Mobile's Binge On program zero-rates (i.e., does not charge against the monthly data quota) network traffic identified as video streaming, and also throttles this traffic to a maximum of 1.5 Mbps. However, a network provider generally does not know what app you are using; it sees only the app's network traffic. As a result, it has to make educated guesses based on the traffic that the app generates. To do so, network providers usually deploy one or more devices (typically called middleboxes) that map network traffic to applications. Specifically, such middleboxes include a classification rule that maps network traffic into a specific category, and an action that specifies what should be done to this category of traffic. Little is known about these classification rules, since middleboxes use proprietary, closed-source hardware and software.
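To make the rule-plus-action idea concrete, here is a minimal sketch of how such a middlebox policy could be modeled. It is an illustration only, not any vendor's implementation: the `Rule` type, the `classify` function, the keywords, and the action names are all hypothetical, and first-match ordering is an assumption.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Rule:
    """A hypothetical classification rule paired with its action."""
    field: str     # where to look, e.g. "Host" or "SNI"
    keyword: str   # text that must appear in that field
    category: str  # traffic category assigned on a match
    action: str    # what the provider does with that category


# Hypothetical rule table; the order encodes priority (first match wins).
RULES: List[Rule] = [
    Rule("Host", "examplevideo.com", "video", "zero-rate+throttle"),
    Rule("User-Agent", "ExampleMusic", "music", "zero-rate"),
]


def classify(flow_fields: Dict[str, str],
             rules: List[Rule] = RULES) -> Optional[Rule]:
    """Return the first rule whose keyword appears in the named field."""
    for rule in rules:
        if rule.keyword in flow_fields.get(rule.field, ""):
            return rule
    return None  # unclassified traffic gets the default policy
```

In this model, a flow with `Host: cdn.examplevideo.com` falls into the "video" category, while a flow that matches no keyword is left unclassified.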

In this work, we develop a general approach for identifying classification rules (i.e., the network provider's "educated guesses") that map network traffic to applications. Specifically, we use an efficient binary search and carefully generated flows to reduce the number of tests needed to reverse-engineer the rules. We also characterize the classification rules for HTTP(S) traffic implemented in today's carrier-grade middleboxes and identify examples of misclassification (traffic from application A being mistakenly labeled as application B). In summary, our analysis shows that different vendors use different matching rules, but all generally focus on a small number of fields inside HTTP(S) traffic.
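The binary-search idea can be sketched as follows. This is a simplified illustration, not the released code: it assumes a single contiguous keyword triggers classification, and `is_classified` is a hypothetical oracle standing in for replaying a modified flow through the network and observing whether the policy (e.g., zero-rating) is still applied.

```python
from typing import Callable, Tuple


def mask(payload: bytes, start: int, end: int) -> bytes:
    """Return payload with the bytes in [start, end) bit-inverted."""
    inverted = bytes(b ^ 0xFF for b in payload[start:end])
    return payload[:start] + inverted + payload[end:]


def find_keyword_span(payload: bytes,
                      is_classified: Callable[[bytes], bool]) -> Tuple[int, int]:
    """Locate the [start, end) byte range the classifier depends on.

    Assumes one contiguous keyword; uses O(log n) oracle calls.
    """
    if not is_classified(payload):
        raise ValueError("original payload is not classified")
    n = len(payload)

    # Largest prefix we can mask without breaking classification.
    lo, hi = 0, n
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if is_classified(mask(payload, 0, mid)):
            lo = mid
        else:
            hi = mid - 1
    start = lo

    # Smallest offset from which the whole tail can be masked safely.
    lo, hi = start, n
    while lo < hi:
        mid = (lo + hi) // 2
        if is_classified(mask(payload, mid, n)):
            hi = mid
        else:
            lo = mid + 1
    return start, lo
```

With a toy oracle that checks for a substring, the returned span covers exactly that substring. Each oracle call corresponds to one replayed flow, so the number of tests grows logarithmically with the payload size rather than linearly.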

Key Contributions

We develop a general methodology for identifying the matching rules used by a classifier.
We conduct a detailed study of the classification rules used by devices in a controlled setting and in the wild.
We find that the devices use simple text-based matching in HTTP and TLS handshakes.
We find that the devices exhibit simple matching rules with deterministic matching-rule priorities.
We publish the code below, which can be used to run this analysis on any network that does DPI-based zero-rating or shaping.
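As an illustration of what deterministic matching-rule priority means in practice, the sketch below crafts a single flow that contains the keywords of two rules and reports which label the classifier returns. All names are hypothetical: `classify` stands in for observing how the network treats a replayed flow, and the keyword tuples are invented examples.

```python
from typing import Callable, Dict, Optional, Tuple

# (header field, keyword, expected label); all values are illustrative.
Keyword = Tuple[str, str, str]


def winning_label(a: Keyword, b: Keyword,
                  classify: Callable[[Dict[str, str]], Optional[str]]
                  ) -> Optional[str]:
    """Build one flow that matches both rules and see which label wins."""
    fields: Dict[str, str] = {}
    for field, keyword, _label in (a, b):
        # If both keywords target the same field, place them side by side.
        fields[field] = (fields.get(field, "") + " " + keyword).strip()
    return classify(fields)
```

If the result is the same regardless of the order in which the two keywords are supplied, and stays the same across repeated trials, the priority between the two rules is deterministic.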

Complete list of keywords identified in HTTP/S traffic that we tested

List of identified keywords in Host headers
List of identified keywords in User-Agent headers
List of identified keywords in Content-Type headers
List of identified keywords in SNI fields

Code

The code used for the analysis can be found here.

Paper

Classifiers Unclassified: An Efficient Approach to Revealing IP Traffic Classification Rules
Fangfan Li, Arash Molavi Kakhki, David Choffnes, Phillipa Gill, Alan Mislove. To appear in Proceedings of the 16th ACM Internet Measurement Conference (IMC'16), Santa Monica, CA, November 2016 [pdf].