Traffic Control HOWTO Version 1.1 Martin A. Brown linux-ip.net Federico Bolelli Unimore <167439@studenti.unimore.it> Natale Patriciello Unimore © 2016, 2006, Martin A. Brown Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.1 or any later version published by the Free Software Foundation; with no invariant sections, with no Front-Cover Texts, with no Back-Cover Text. A copy of the license is located at http://www.gnu.org/ licenses/fdl.html. 2016-03-11 Revision History Revision 1.1.0 2016-01-30 MAB Incorporate new qdisc description sections written by Federico Bollelli and Natale Patriciello Revision 1.0.2 2006-10-28 MAB Add references to HFSC, alter author email addresses Revision 1.0.1 2003-11-17 MAB Added link to Leonardo Balliache's documentation Revision 1.0 2003-09-24 MAB reviewed and approved by TLDP Revision 0.7 2003-09-14 MAB incremental revisions, proofreading, ready for TLDP Revision 0.6 2003-09-09 MAB minor editing, corrections from Stef Coene Revision 0.5 2003-09-01 MAB HTB section mostly complete, more diagrams, LARTC pre-release Revision 0.4 2003-08-30 MAB added diagram Revision 0.3 2003-08-29 MAB substantial completion of classless, software, rules, elements and components sections Revision 0.2 2003-08-23 MAB major work on overview, elements, components and software sections Revision 0.1 2003-08-15 MAB initial revision (outline complete) Abstract Traffic control encompasses the sets of mechanisms and operations by which packets are queued for transmission/reception on a network interface. The operations include enqueuing, policing, classifying, scheduling, shaping and dropping. This HOWTO provides an introduction and overview of the capabilities and implementation of traffic control under Linux. ------------------------------------------------------------------------------- Table of Contents 1._Introduction_to_Linux_Traffic_Control 1.1._Target_audience_and_assumptions_about_the_reader 1.2._Conventions 1.3._Recommended_approach 1.4._Missing_content,_corrections_and_feedback 2._Overview_of_Concepts 2.1._What_is_it? 2.2._Why_use_it? 2.3._Advantages 2.4._Disdvantages 2.5._Queues 2.6._Flows 2.7._Tokens_and_buckets 2.8._Packets_and_frames 2.9._NIC,_Network_Interface_Controller 2.10._Starvation_and_Latency 2.11._Relationship_between_throughput_and_latency 3._Traditional_Elements_of_Traffic_Control 3.1._Shaping 3.2._Scheduling 3.3._Classifying 3.4._Policing 3.5._Dropping 3.6._Marking 4._Components_of_Linux_Traffic_Control 4.1._qdisc 4.2._class 4.3._filter 4.4._classifier 4.5._policer 4.6._drop 4.7._handle 4.8._txqueuelen 4.9._Driver_Queue_(aka_ring_buffer) 4.10._Byte_Queue_Limits_(BQL) 5._Software_and_Tools 5.1._Kernel_requirements 5.2._iproute2_tools_(tc) 5.3._tcng,_Traffic_Control_Next_Generation 5.4._Netfilter 5.5._IMQ,_Intermediate_Queuing_device 5.6._ethtool,_Driver_Queue 6._Classless_Queuing_Disciplines_(qdiscs) 6.1._FIFO,_First-In_First-Out_(pfifo_and_bfifo) 6.2._pfifo_fast,_the_default_Linux_qdisc 6.3._SFQ,_Stochastic_Fair_Queuing 6.4._ESFQ,_Extended_Stochastic_Fair_Queuing 6.5._RED,Random_Early_Drop 6.6._GRED,_Generic_Random_Early_Drop 6.7._TBF,_Token_Bucket_Filter 7._Classful_Queuing_Disciplines_(qdiscs) 7.1._HTB,_Hierarchical_Token_Bucket 7.2._HFSC,_Hierarchical_Fair_Service_Curve 7.3._PRIO,_priority_scheduler 7.4._CBQ,_Class_Based_Queuing_(CBQ) 7.5._WRR,_Weighted_Round_Robin 8._Rules,_Guidelines_and_Approaches 8.1._General_Rules_of_Linux_Traffic_Control 8.2._Handling_a_link_with_a_known_bandwidth 8.3._Handling_a_link_with_a_variable_(or_unknown)_bandwidth 8.4._Sharing/splitting_bandwidth_based_on_flows 8.5._Sharing/splitting_bandwidth_based_on_IP 9._Scripts_for_use_with_QoS/Traffic_Control 9.1._wondershaper 9.2._ADSL_Bandwidth_HOWTO_script_(myshaper) 9.3._htb.init 9.4._tcng.init 9.5._cbq.init 10._Diagram 10.1._General_diagram 11._Annotated_Traffic_Control_Links 1. Introduction to Linux Traffic Control Linux offers a very rich set of tools for managing and manipulating the transmission of packets. The larger Linux community is very familiar with the tools available under Linux for packet mangling and firewalling (netfilter, and before that, ipchains) as well as hundreds of network services which can run on the operating system. Few inside the community and fewer outside the Linux community are aware of the tremendous power of the traffic control subsystem which has grown and matured under kernels 2.2 and 2.4. This HOWTO purports to introduce the concepts_of_traffic_control, the traditional_elements_(in_general), the_components_of_the_Linux_traffic_control implementation and provide some guidelines . This HOWTO represents the collection, amalgamation and synthesis of the LARTC_HOWTO, documentation from individual projects and importantly the LARTC_mailing_list over a period of study. The impatient soul, who simply wishes to experiment right now, is recommended to the Traffic_Control_using_tcng_and_HTB_HOWTO and LARTC_HOWTO for immediate satisfaction. 1.1. Target audience and assumptions about the reader The target audience for this HOWTO is the network administrator or savvy home user who desires an introduction to the field of traffic control and an overview of the tools available under Linux for implementing traffic control. I assume that the reader is comfortable with UNIX concepts and the command line and has a basic knowledge of IP networking. Users who wish to implement traffic control may require the ability to patch, compile and install a kernel or software package [1]. For users with newer kernels (2.4.20+, see also Section 5.1,_“Kernel_requirements”), however, the ability to install and use software may be all that is required. Broadly speaking, this HOWTO was written with a sophisticated user in mind, perhaps one who has already had experience with traffic control under Linux. I assume that the reader may have no prior traffic control experience. 1.2. Conventions This text was written in DocBook (version_4.2) with vim. All formatting has been applied by xsltproc based on DocBook_XSL and LDP_XSL stylesheets. Typeface formatting and display conventions are similar to most printed and electronically distributed technical documentation. 1.3. Recommended approach I strongly recommend to the eager reader making a first foray into the discipline of traffic control, to become only casually familiar with the tc command line utility, before concentrating on tcng. The tcng software package defines an entire language for describing traffic control structures. At first, this language may seem daunting, but mastery of these basics will quickly provide the user with a much wider ability to employ (and deploy) traffic control configurations than the direct use of tc would afford. Where possible, I'll try to prefer describing the behaviour of the Linux traffic control system in an abstract manner, although in many cases I'll need to supply the syntax of one or the other common systems for defining these structures. I may not supply examples in both the tcng language and the tc command line, so the wise user will have some familiarity with both. 1.4. Missing content, corrections and feedback There is content yet missing from this HOWTO. In particular, the following items will be added at some point to this documentation. * A section of examples. * A section detailing the classifiers. * A section discussing the techniques for measuring traffic. * A section covering meters. * More details on tcng. * Descriptions of newer qdiscs, specifically, Controlled Delay (codel), Fair Queue Controlled Delay (fq_codel), Proportional Integrated controller Enhanced (pie), Stochastic Fair Blue (sfb), Heavy Hitter Filter (hhf), Choke (choke). I welcome suggestions, corrections and feedback at . All errors and omissions are strictly my fault. Although I have made every effort to verify the factual correctness of the content presented herein, I cannot accept any responsibility for actions taken under the influence of this documentation. 2. Overview of Concepts This section will introduce_traffic_control and examine_reasons_for_it, identify a few advantages and disadvantages and introduce key concepts used in traffic control. 2.1. What is it? Traffic control is the name given to the sets of queuing systems and mechanisms by which packets are received and transmitted on a router. This includes deciding (if and) which packets to accept at what rate on the input of an interface and determining which packets to transmit in what order at what rate on the output of an interface. In the simplest possible model, traffic control consists of a single queue which collects entering packets and dequeues them as quickly as the hardware (or underlying device) can accept them. This sort of queue is a FIFO. This is like a single toll booth for entering a highway. Every car must stop and pay the toll. Other cars wait their turn. Linux provides this simplest traffic control tool (FIFO), and in addition offers a wide variety of other tools that allow all sorts of control over packet handling. Note The default qdisc under Linux is the pfifo_fast, which is slightly more complex than the FIFO. There are examples of queues in all sorts of software. The queue is a way of organizing the pending tasks or data (see also Section 2.5,_“Queues”). Because network links typically carry data in a serialized fashion, a queue is required to manage the outbound data packets. In the case of a desktop machine and an efficient webserver sharing the same uplink to the Internet, the following contention for bandwidth may occur. The web server may be able to fill up the output queue on the router faster than the data can be transmitted across the link, at which point the router starts to drop packets (its buffer is full!). Now, the desktop machine (with an interactive application user) may be faced with packet loss and high latency. Note that high latency sometimes leads to screaming users! By separating the internal queues used to service these two different classes of application, there can be better sharing of the network resource between the two applications. Traffic control is a set of tools allowing an administrator granular control over these queues and the queuing mechanisms of a networked device. The power to rearrange traffic flows and packets with these tools is tremendous and can be complicated, but is no substitute for adequate bandwidth. The term Quality of Service (QoS) is often used as a synonym for traffic control at an IP-layer. 2.2. Why use it? Traffic control tools allow the implementer to apply preferences, organizational or business policies to packets or network flows transmitted into a network. This control allows stewardship over the network resources such as throughput or latency. Fundamentally, traffic control becomes a necessity because of packet switching in networks. For a brief digression, to explain the novelty and cleverness of packet switching, think about the circuit-switched telephone networks that were built over the entire 20th century. In order to set up a call, the network gear knew rules about call establishment and when a caller tried to connect, the network employed the rules to reserve a circuit for the entire duration of the call or connection. While one call was engaged, using that resource, no other call or caller could use that resource. This meant many individual pieces of equipment could block call setup because of resource unavailability. Let's return to packet-switched networks, a mid-20th century invention, later in wide use, and nearly ubiquitous in the 21st century. Packet-switched networks differ from circuit based networks in one very important regard. The unit of data handled by the network gear is not a circuit, but rather a small chunk of data called a packet. Inexactly speaking, the packet is a letter in an envelope with a destination address. The packet-switched network had only a very small amount of work to do, reading the destination identifier and transmitting the packet. Sometimes, packet-switched networks are described as stateless because they do not need to track all of the flows (analogy to a circuit) that are active in the network. In order to be function, the packet-handling machines must know how to reach the destinations addresses. One analogy is a package-handling service like your postal service, UPS or DHL. If there's a sudden influx of packets into a packet-switched network (or, by analogy, the increase of cards and packages sent by mail and other carriers at Christmas), the network can become slow or unresponsive. Lack of differentiation between importance of specific packets or network flows is, therefore, a weakness of such packet-switched networks. The network can be overloaded with data packets all competing. In simplest terms, the traffic control tools allow somebody to enqueue packets into the network differently based on attributes of the packet. The various different tools each solve a different problem and many can be combined, to implement complex rules to meet a preference or business goal. There are many practical reasons to consider traffic control, and many scenarios in which using traffic control makes sense. Below are some examples of common problems which can be solved or at least ameliorated with these tools. The list below is not an exhaustive list of the sorts of solutions available to users of traffic control, but shows the types of common problems that can be solved by using traffic control tools to maximize the usability of the network. Common traffic control solutions * Limit total bandwidth to a known rate; TBF, HTB with child class(es). * Limit the bandwidth of a particular user, service or client; HTB classes and classifying with a filter. * Maximize TCP throughput on an asymmetric link; prioritize transmission of ACK packets, wondershaper. * Reserve bandwidth for a particular application or user; HTB with children classes and classifying. * Prefer latency sensitive traffic; PRIO inside an HTB class. * Managed oversubscribed bandwidth; HTB with borrowing. * Allow equitable distribution of unreserved bandwidth; HTB with borrowing. * Ensure that a particular type of traffic is dropped; policer attached to a filter with a drop action. Remember that, sometimes, it is simply better to purchase more bandwidth. Traffic control does not solve all problems! 2.3. Advantages When properly employed, traffic control should lead to more predictable usage of network resources and less volatile contention for these resources. The network then meets the goals of the traffic control configuration. Bulk download traffic can be allocated a reasonable amount of bandwidth even as higher priority interactive traffic is simultaneously serviced. Even low priority data transfer such as mail can be allocated bandwidth without tremendously affecting the other classes of traffic. In a larger picture, if the traffic control configuration represents policy which has been communicated to the users, then users (and, by extension, applications) know what to expect from the network. 2.4. Disdvantages Complexity is easily one of the most significant disadvantages of using traffic control. There are ways to become familiar with traffic control tools which ease the learning curve about traffic control and its mechanisms, but identifying a traffic control misconfiguration can be quite a challenge. Traffic control when used appropriately can lead to more equitable distribution of network resources. It can just as easily be installed in an inappropriate manner leading to further and more divisive contention for resources. The computing resources required on a router to support a traffic control scenario need to be capable of handling the increased cost of maintaining the traffic control structures. Fortunately, this is a small incremental cost, but can become more significant as the configuration grows in size and complexity. For personal use, there's no training cost associated with the use of traffic control, but a company may find that purchasing more bandwidth is a simpler solution than employing traffic control. Training employees and ensuring depth of knowledge may be more costly than investing in more bandwidth. Although traffic control on packet-switched networks covers a larger conceptual area, you can think of traffic control as a way to provide [some of] the statefulness of a circuit-based network to a packet-switched. 2.5. Queues Queues form the backdrop for all of traffic control and are the integral concept behind scheduling. A queue is a location (or buffer) containing a finite number of items waiting for an action or service. In networking, a queue is the place where packets (our units) wait to be transmitted by the hardware (the service). In the simplest model, packets are transmitted in a first-come first-serve basis [2]. In the discipline of computer networking (and more generally computer science), this sort of a queue is known as a FIFO. Without any other mechanisms, a queue doesn't offer any promise for traffic control. There are only two interesting actions in a queue. Anything entering a queue is enqueued into the queue. To remove an item from a queue is to dequeue that item. A queue becomes much more interesting when coupled with other mechanisms which can delay packets, rearrange, drop and prioritize packets in multiple queues. A queue can also use subqueues, which allow for complexity of behaviour in a scheduling operation. From the perspective of the higher layer software, a packet is simply enqueued for transmission, and the manner and order in which the enqueued packets are transmitted is immaterial to the higher layer. So, to the higher layer, the entire traffic control system may appear as a single queue [3]. It is only by examining the internals of this layer that the traffic control structures become exposed and available. In the image below a simplified high level overview of the queues on the transmit path of the Linux network stack: Figure 1:Simplified high level overview of the queues on the transmit path of the Linux network stack. See 2.9 for details about NIC interface and 4.9 for details about driver_queue. 2.6. Flows A flow is a distinct connection or conversation between two hosts. Any unique set of packets between two hosts can be regarded as a flow. Under TCP the concept of a connection with a source IP and port and destination IP and port represents a flow. A UDP flow can be similarly defined. Traffic control mechanisms frequently separate traffic into classes of flows which can be aggregated and transmitted as an aggregated flow (consider DiffServ). Alternate mechanisms may attempt to divide bandwidth equally based on the individual flows. Flows become important when attempting to divide bandwidth equally among a set of competing flows, especially when some applications deliberately build a large number of flows. 2.7. Tokens and buckets Two of the key underpinnings of a shaping mechanisms are the interrelated concepts of tokens and buckets. In order to control the rate of dequeuing, an implementation can count the number of packets or bytes dequeued as each item is dequeued, although this requires complex usage of timers and measurements to limit accurately. Instead of calculating the current usage and time, one method, used widely in traffic control, is to generate tokens at a desired rate, and only dequeue packets or bytes if a token is available. Consider the analogy of an amusement park ride with a queue of people waiting to experience the ride. Let's imagine a track on which carts traverse a fixed track. The carts arrive at the head of the queue at a fixed rate. In order to enjoy the ride, each person must wait for an available cart. The cart is analogous to a token and the person is analogous to a packet. Again, this mechanism is a rate-limiting or shaping mechanism. Only a certain number of people can experience the ride in a particular period. To extend the analogy, imagine an empty line for the amusement park ride and a large number of carts sitting on the track ready to carry people. If a large number of people entered the line together many (maybe all) of them could experience the ride because of the carts available and waiting. The number of carts available is a concept analogous to the bucket. A bucket contains a number of tokens and can use all of the tokens in bucket without regard for passage of time. And to complete the analogy, the carts on the amusement park ride (our tokens) arrive at a fixed rate and are only kept available up to the size of the bucket. So, the bucket is filled with tokens according to the rate, and if the tokens are not used, the bucket can fill up. If tokens are used the bucket will not fill up. Buckets are a key concept in supporting bursty traffic such as HTTP. The TBF qdisc is a classical example of a shaper (the section on TBF includes a diagram which may help to visualize the token and bucket concepts). The TBF generates rate tokens and only transmits packets when a token is available. Tokens are a generic shaping concept. In the case that a queue does not need tokens immediately, the tokens can be collected until they are needed. To collect tokens indefinitely would negate any benefit of shaping so tokens are collected until a certain number of tokens has been reached. Now, the queue has tokens available for a large number of packets or bytes which need to be dequeued. These intangible tokens are stored in an intangible bucket, and the number of tokens that can be stored depends on the size of the bucket. This also means that a bucket full of tokens may be available at any instant. Very predictable regular traffic can be handled by small buckets. Larger buckets may be required for burstier traffic, unless one of the desired goals is to reduce the burstiness of the flows. In summary, tokens are generated at rate, and a maximum of a bucket's worth of tokens may be collected. This allows bursty traffic to be handled, while smoothing and shaping the transmitted traffic. The concepts of tokens and buckets are closely interrelated and are used in both TBF (one of the classless_qdiscs) and HTB (one of the classful_qdiscs). Within the tcng language, the use of two- and three-color meters is indubitably a token and bucket concept. 2.8. Packets and frames The terms for data sent across network changes depending on the layer the user is examining. This document will rather impolitely (and incorrectly) gloss over the technical distinction between packets and frames although they are outlined here. The word frame is typically used to describe a layer 2 (data link) unit of data to be forwarded to the next recipient. Ethernet interfaces, PPP interfaces, and T1 interfaces all name their layer 2 data unit a frame. The frame is actually the unit on which traffic control is performed. A packet, on the other hand, is a higher layer concept, representing layer 3 (network) units. The term packet is preferred in this documentation, although it is slightly inaccurate. 2.9. NIC, Network Interface Controller A network interface controller is a computer hardware component, differently from previous ones thar are software components, that connects a computer to a computer network. The network controller implements the electronic circuitry required to communicate using a specific data link layer and physical layer standard such as Ethernet, Fibre Channel, Wi-Fi or Token Ring. Traffic control must deal with the physical constraints and characteristics of the NIC interface. 2.9.1. Huge Packets from the Stack Most NICs have a fixed maximum transmission unit (MTU) which is the biggest frame which can be transmitted by the physical medium. For Ethernet the default MTU is 1500 bytes but some Ethernet networks support Jumbo Frames of up to 9000 bytes. Inside the IP network stack, the MTU can manifest as a limit on the size of the packets which are sent to the device for transmission. For example, if an application writes 2000 bytes to a TCP socket then the IP stack needs to create two IP packets to keep the packet size less than or equal to a 1500 byte MTU. For large data transfers the comparably small MTU causes a large number of small packets to be created and transferred through the driver_queue. In order to avoid the overhead associated with a large number of packets on the transmit path, the Linux kernel implements several optimizations: TCP segmentation offload (TSO), UDP fragmentation offload (UFO) and generic segmentation offload (GSO). All of these optimizations allow the IP stack to create packets which are larger than the MTU of the outgoing NIC. For IPv4, packets as large as the IPv4 maximum of 65,536 bytes can be created and queued to the driver_queue. In the case of TSO and UFO, the NIC hardware takes responsibility for breaking the single large packet into packets small enough to be transmitted on the physical interface. For NICs without hardware support, GSO performs the same operation in software immediately before queueing to the driver_queue. Recall from earlier that the driver_queue contains a fixed number of descriptors which each point to packets of varying sizes, Since TSO, UFO and GSO allow for much larger packets these optimizations have the side effect of greatly increasing the number of bytes which can be queued in the driver_queue. Figure 3 illustrates this concept. Figure 2:Large packets can be sent to the NIC when TSO, UFO or GSO are enabled. This can greatly increase the number of bytes in the driver_queue. 2.10. Starvation and Latency The queue between the IP stack and the hardware (see chapter_4.2 for detail about driver_queue or see chapter_5.5 for how manage it) introduces two problems: starvation and latency. If the NIC driver wakes to pull packets off of the queue for transmission and the queue is empty the hardware will miss a transmission opportunity thereby reducing the throughput of the system. This is referred to as starvation. Note that an empty queue when the system does not have anything to transmit is not starvation – this is normal. The complication associated with avoiding starvation is that the IP stack which is filling the queue and the hardware driver draining the queue run asynchronously. Worse, the duration between fill or drain events varies with the load on the system and external conditions such as the network interface’s physical medium. For example, on a busy system the IP stack will get fewer opportunities to add packets to the buffer which increases the chances that the hardware will drain the buffer before more packets are queued. For this reason it is advantageous to have a very large buffer to reduce the probability of starvation and ensures high throughput. While a large queue is necessary for a busy system to maintain high throughput, it has the downside of allowing for the introduction of a large amount of latency. Figure 3: Interactive packet (yellow) behind bulk flow packets (blue) Figure 3 shows a driver_queue which is almost full with TCP segments for a single high bandwidth, bulk traffic flow (blue). Queued last is a packet from a VoIP or gaming flow (yellow). Interactive applications like VoIP or gaming typically emit small packets at fixed intervals which are latency sensitive while a high bandwidth data transfer generates a higher packet rate and larger packets. This higher packet rate can fill the buffer between interactive packets causing the transmission of the interactive packet to be delayed. To further illustrate this behaviour consider a scenario based on the following assumptions: * A network interface which is capable of transmitting at 5 Mbit/sec or 5,000,000 bits/sec. * Each packet from the bulk flow is 1,500 bytes or 12,000 bits. * Each packet from the interactive flow is 500 bytes. * The depth of the queue is 128 descriptors * There are 127 bulk data packets and 1 interactive packet queued last. Given the above assumptions, the time required to drain the 127 bulk packets and create a transmission opportunity for the interactive packet is (127 * 12,000) / 5,000,000 = 0.304 seconds (304 milliseconds for those who think of latency in terms of ping results). This amount of latency is well beyond what is acceptable for interactive applications and this does not even represent the complete round trip time – it is only the time required transmit the packets queued before the interactive one. As described earlier, the size of the packets in the driver_queue can be larger than 1,500 bytes if TSO, UFO or GSO are enabled. This makes the latency problem correspondingly worse. Choosing the correct size for the driver_queue is a Goldilocks problem – it can’t be too small or throughput suffers, it can’t be too big or latency suffers. 2.11. Relationship between throughput and latency In all traffic control systems, there is a relationship between throughput and latency. The maximum information rate of a network link is termed bandwidth, but for the user of a network the actually achieved bandwidth has a dedicated term, throughput. latency the delay in time between a sender's transmission and the recipient's decoding or receiving of the data; always non-negative and non-zero (time doesn't move backwards, then) in principle, latency is unidirectional, however almost the entire Internet networking community talks about bidirectional delay —the delay in time between a sender's transmission of data and some sort of acknowledgement of receipt of that data; cf. ping measured in milliseconds (ms); on Ethernet, latencies are typically between 0.3 and 1.0 ms and on wide-area networks (i.e. to your ISP, across a large campus or to a remote server) between 5 to 300 ms throughput a measure of the total amount of data that can be transmitted successfully between a sender and receiver measured in bits per second; the measurement most often quoted by complaining users after buying a 10Mbit/s package from their provider and receiving 8.2Mbit/s. Note Latency and throughput are general computing terms. For example, application developers speak of user-perceived latency when trying to build responsive tools. Database and filesystem people speak about disk throughput. And, above the network layer, latency of a website name lookup in DNS is a major contributor to the perceived performance of a website. The remainder of this document concerns latency in the network domain, specifically the IP network layer. During the millenial fin de siècle, many developed world network service providers had learned that users were interested in the highest possible download throughput (the above mentioned 10Mbit/s bandwidth figure). In order to maximize this download throughput, gear vendors and providers commonly tuned their equipment to hold a large number of data packets. When the network was ready to accept another packet, the network gear was certain to have one in its queue and could simply send another packet. In practice, this meant that the user, who was measuring download throughput, would receive a high number and was likely to be happy. This was desirable for the provider because the delivered throughput could more likely meet the advertised number. This technique effectively maximized throughput, at the cost of latency. Imagine that a high priority packet is waiting at the end of the big queue of packets mentioned above. Perhaps, the theoretical latency of the packet on this network might be 100ms, but it needs to wait its turn in the queue to be transmitted. While the decision to maximize throughput has been wildly successful, the effect on latency is significant. Despite a general warning from Stuart Cheshire in the mid-1990s called It's_the Latency,_Stupid, it took the novel term, bufferbloat, widely publicized about 15 years later by Jim Getty in an ACM Queue article Bufferbloat:_Dark_Buffers in_the_Internet and a Bufferbloat_FAQ in his blog, to bring some focus onto the choice for maximizing throughput that both gear vendors and providers preferred. The relationship (tension) between latency and throughput in packet-switched networks have been well-known in the academic, networking and Linux development community. Linux traffic control core data structures date back to the 1990s and have been continuously developed and extended with new schedulers and features. 3. Traditional Elements of Traffic Control 3.1. Shaping Shapers delay packets to meet a desired rate. Shaping is the mechanism by which packets are delayed before transmission in an output queue to meet a desired output rate. This is one of the most common desires of users seeking bandwidth control solutions. The act of delaying a packet as part of a traffic control solution makes every shaping mechanism into a non-work-conserving mechanism, meaning roughly: "Work is required in order to delay packets." Viewed in reverse, a non-work-conserving queuing mechanism is performing a shaping function. A work-conserving queuing mechanism (see PRIO) would not be capable of delaying a packet. Shapers attempt to limit or ration traffic to meet but not exceed a configured rate (frequently measured in packets per second or bits/bytes per second). As a side effect, shapers can smooth out bursty traffic [4]. One of the advantages of shaping bandwidth is the ability to control latency of packets. The underlying mechanism for shaping to a rate is typically a token and bucket mechanism. See also Section 2.7,_“Tokens_and_buckets” for further detail on tokens and buckets. 3.2. Scheduling Schedulers arrange and/or rearrange packets for output. Scheduling is the mechanism by which packets are arranged (or rearranged) between input and output of a particular queue. The overwhelmingly most common scheduler is the FIFO (first-in first-out) scheduler. From a larger perspective, any set of traffic control mechanisms on an output queue can be regarded as a scheduler, because packets are arranged for output. Other generic scheduling mechanisms attempt to compensate for various networking conditions. A fair queuing algorithm (see SFQ) attempts to prevent any single client or flow from dominating the network usage. A round-robin algorithm (see WRR) gives each flow or client a turn to dequeue packets. Other sophisticated scheduling algorithms attempt to prevent backbone overload (see GRED) or refine other scheduling mechanisms (see ESFQ). 3.3. Classifying Classifiers sort or separate traffic into queues. Classifying is the mechanism by which packets are separated for different treatment, possibly different output queues. During the process of accepting, routing and transmitting a packet, a networking device can classify the packet a number of different ways. Classification can include marking the packet, which usually happens on the boundary of a network under a single administrative control or classification can occur on each hop individually. The Linux model (see Section 4.3,_“filter”) allows for a packet to cascade across a series of classifiers in a traffic control structure and to be classified in conjunction with policers (see also Section 4.5,_“policer”). 3.4. Policing Policers measure and limit traffic in a particular queue. Policing, as an element of traffic control, is simply a mechanism by which traffic can be limited. Policing is most frequently used on the network border to ensure that a peer is not consuming more than its allocated bandwidth. A policer will accept traffic to a certain rate, and then perform an action on traffic exceeding this rate. A rather harsh solution is to drop the traffic, although the traffic could be reclassified instead of being dropped. A policer is a yes/no question about the rate at which traffic is entering a queue. If the packet is about to enter a queue below a given rate, take one action (allow the enqueuing). If the packet is about to enter a queue above a given rate, take another action. Although the policer uses a token bucket mechanism internally, it does not have the capability to delay a packet as a shaping mechanism does. 3.5. Dropping Dropping discards an entire packet, flow or classification. Dropping a packet is a mechanism by which a packet is discarded. 3.6. Marking Marking is a mechanism by which the packet is altered. Note This is not fwmark. The iptables target MARK and the ipchains --mark are used to modify packet metadata, not the packet itself. Traffic control marking mechanisms install a DSCP on the packet itself, which is then used and respected by other routers inside an administrative domain (usually for DiffServ). 4. Components of Linux Traffic Control Table 1. Correlation between traffic control elements and Linux components ________________________________________________________________________________________ |traditional|Linux component | |element____|____________________________________________________________________________| |enqueuing;_|FIXME:_receiving_packets_from_userspace_and_network.________________________| |shaping____|The_class_offers_shaping_capabilities.______________________________________| |scheduling |A qdisc is a scheduler. Schedulers can be simple such as the FIFO or | |___________|complex,_containing_classes_and_other_qdiscs,_such_as_HTB.__________________| | |The filter object performs the classification through the agency of a | |classifying|classifier object. Strictly speaking, Linux classifiers cannot exist outside| |___________|of_a_filter.________________________________________________________________| |policing |A policer exists in the Linux traffic control implementation only as part of| |___________|a_filter.___________________________________________________________________| |dropping |To drop traffic requires a filter with a policer which uses “drop” as an| |___________|action._____________________________________________________________________| |marking____|The_dsmark_qdisc_is_used_for_marking._______________________________________| | |Between the scheduler's qdisc and the network_interface_controller_(NIC) | | |lies the driver_queue. The driver_queue gives the higher layers (IP stack | |enqueuing; |and traffic control subsystem) a location to queue data asynchronously for | | |the operation of the hardware. The size of that queue is automatically set | |___________|by_Byte_Queue_Limits_(BQL)._________________________________________________| 4.1. qdisc Simply put, a qdisc is a scheduler (Section 3.2,_“Scheduling”). Every output interface needs a scheduler of some kind, and the default scheduler is a FIFO. Other qdiscs available under Linux will rearrange the packets entering the scheduler's queue in accordance with that scheduler's rules. The qdisc is the major building block on which all of Linux traffic control is built, and is also called a queuing discipline. The classful_qdiscs can contain classes, and provide a handle to which to attach filters. There is no prohibition on using a classful qdisc without child classes, although this will usually consume cycles and other system resources for no benefit. The classless_qdiscs can contain no classes, nor is it possible to attach filter to a classless qdisc. Because a classless qdisc contains no children of any kind, there is no utility to classifying. This means that no filter can be attached to a classless qdisc. A source of terminology confusion is the usage of the terms root qdisc and ingress qdisc. These are not really queuing disciplines, but rather locations onto which traffic control structures can be attached for egress (outbound traffic) and ingress (inbound traffic). Each interface contains both. The primary and more common is the egress qdisc, known as the root qdisc. It can contain any of the queuing disciplines (qdiscs) with potential classes and class structures. The overwhelming majority of documentation applies to the root qdisc and its children. Traffic transmitted on an interface traverses the egress or root qdisc. For traffic accepted on an interface, the ingress qdisc is traversed. With its limited utility, it allows no child class to be created, and only exists as an object onto which a filter can be attached. For practical purposes, the ingress qdisc is merely a convenient object onto which to attach a policer to limit the amount of traffic accepted on a network interface. In short, you can do much more with an egress qdisc because it contains a real qdisc and the full power of the traffic control system. An ingress qdisc can only support a policer. The remainder of the documentation will concern itself with traffic control structures attached to the root qdisc unless otherwise specified. 4.2. class Classes only exist inside a classful qdisc (e.g., HTB and CBQ). Classes are immensely flexible and can always contain either multiple children classes or a single child qdisc [5]. There is no prohibition against a class containing a classful qdisc itself, which facilitates tremendously complex traffic control scenarios. Any class can also have an arbitrary number of filters attached to it, which allows the selection of a child class or the use of a filter to reclassify or drop traffic entering a particular class. A leaf class is a terminal class in a qdisc. It contains a qdisc (default FIFO) and will never contain a child class. Any class which contains a child class is an inner class (or root class) and not a leaf class. 4.3. filter The filter is the most complex component in the Linux traffic control system. The filter provides a convenient mechanism for gluing together several of the key elements of traffic control. The simplest and most obvious role of the filter is to classify (see Section 3.3,_“Classifying”) packets. Linux filters allow the user to classify packets into an output queue with either several different filters or a single filter. * A filter must contain a classifier phrase. * A filter may contain a policer phrase. Filters can be attached either to classful qdiscs or to classes, however the enqueued packet always enters the root qdisc first. After the filter attached to the root qdisc has been traversed, the packet may be directed to any subclasses (which can have their own filters) where the packet may undergo further classification. 4.4. classifier Filter objects, which can be manipulated using tc, can use several different classifying mechanisms, the most common of which is the u32 classifier. The u32 classifier allows the user to select packets based on attributes of the packet. The classifiers are tools which can be used as part of a filter to identify characteristics of a packet or a packet's metadata. The Linux classfier object is a direct analogue to the basic operation and elemental mechanism of traffic control classifying. 4.5. policer This elemental mechanism is only used in Linux traffic control as part of a filter. A policer calls one action above and another action below the specified rate. Clever use of policers can simulate a three-color meter. See also Section 10,_“Diagram”. Although both policing and shaping are basic elements of traffic control for limiting bandwidth usage a policer will never delay traffic. It can only perform an action based on specified criteria. See also Example 5,_“tc_filter”. 4.6. drop This basic traffic control mechanism is only used in Linux traffic control as part of a policer. Any policer attached to any filter could have a drop action. Note The only place in the Linux traffic control system where a packet can be explicitly dropped is a policer. A policer can limit packets enqueued at a specific rate, or it can be configured to drop all traffic matching a particular pattern [6]. There are, however, places within the traffic control system where a packet may be dropped as a side effect. For example, a packet will be dropped if the scheduler employed uses this method to control flows as the GRED does. Also, a shaper or scheduler which runs out of its allocated buffer space may have to drop a packet during a particularly bursty or overloaded period. 4.7. handle Every class and classful qdisc (see also Section 7,_“Classful_Queuing Disciplines_(qdiscs)”) requires a unique identifier within the traffic control structure. This unique identifier is known as a handle and has two constituent members, a major number and a minor number. These numbers can be assigned arbitrarily by the user in accordance with the following rules [7]. The numbering of handles for classes and qdiscs major This parameter is completely free of meaning to the kernel. The user may use an arbitrary numbering scheme, however all objects in the traffic control structure with the same parent must share a major handle number. Conventional numbering schemes start at 1 for objects attached directly to the root qdisc. minor This parameter unambiguously identifies the object as a qdisc if minor is 0. Any other value identifies the object as a class. All classes sharing a parent must have unique minor numbers. The special handle ffff:0 is reserved for the ingress qdisc. The handle is used as the target in classid and flowid phrases of tc filter statements. These handles are external identifiers for the objects, usable by userland applications. The kernel maintains internal identifiers for each object. 4.8. txqueuelen The current size of the transmission queue can be obtained from the ip and ifconfig commands. Confusingly, these commands name the transmission queue length differently (emphasized text below): $ifconfig eth0 eth0 Link encap:Ethernet HWaddr 00:18:F3:51:44:10 inet addr:69.41.199.58 Bcast:69.41.199.63 Mask:255.255.255.248 inet6 addr: fe80::218:f3ff:fe51:4410/64 Scope:Link UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1 RX packets:435033 errors:0 dropped:0 overruns:0 frame:0 TX packets:429919 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:1000 RX bytes:65651219 (62.6 MiB) TX bytes:132143593 (126.0 MiB) Interrupt:23 $ip link 1: lo: mtu 16436 qdisc noqueue state UNKNOWN link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00 2: eth0: mtu 1500 qdisc pfifo_fast state UP qlen 1000 link/ether 00:18:f3:51:44:10 brd ff:ff:ff:ff:ff:ff The length of the transmission queue in Linux defaults to 1000 packets which could represent a large amount of buffering especially at low bandwidths. (To understand why, see the discussion on latency and throughput, specifically bufferbloat). More interestingly, txqueuelen is only used as a default queue length for these queueing disciplines. * pfifo_fast (Linux default queueing discipline) * sch_fifo * sch_gred * sch_htb (only for the default queue) * sch_plug * sch_sfb * sch_teql Looking back at Figure 1, the txqueuelen parameter controls the size of the queues in the Queueing Discipline box for the QDiscs listed above. For most of these queueing disciplines, the “limit” argument on the tc command line overrides the txqueuelen default. In summary, if you do not use one of the above queueing disciplines or if you override the queue length then the txqueuelen value is meaningless. The length of the transmission queue is configured with the ip or ifconfig commands. ip link set txqueuelen 500 dev eth0 Notice that the ip command uses “txqueuelen” but when displaying the interface details it uses “qlen”. 4.9. Driver Queue (aka ring buffer) Between the IP stack and the network interface controller (NIC) lies the driver queue. This queue is typically implemented as a first-in, first-out (FIFO) ring buffer – just think of it as a fixed sized buffer. The driver queue does not contain packet data. Instead it consists of descriptors which point to other data structures called socket kernel buffers (SKBs) which hold the packet data and are used throughout the kernel. Figure 4:Partially full driver queue with descriptors pointing to SKBs The input source for the driver queue is the IP stack which queues complete IP packets. The packets may be generated locally or received on one NIC to be routed out another when the device is functioning as an IP router. Packets added to the driver queue by the IP stack are dequeued by the hardware driver and sent across a data bus to the NIC hardware for transmission. The reason the driver queue exists is to ensure that whenever the system has data to transmit, the data is available to the NIC for immediate transmission. That is, the driver queue gives the IP stack a location to queue data asynchronously from the operation of the hardware. One alternative design would be for the NIC to ask the IP stack for data whenever the physical medium is ready to transmit. Since responding to this request cannot be instantaneous this design wastes valuable transmission opportunities resulting in lower throughput. The opposite approach would be for the IP stack to wait after a packet is created until the hardware is ready to transmit. This is also not ideal because the IP stack cannot move on to other work. For detail how to set driver queue see chapter_5.5. 4.10. Byte Queue Limits (BQL) Byte Queue Limits (BQL) is a new feature in recent Linux kernels (> 3.3.0) which attempts to solve the problem of driver queue sizing automatically. This is accomplished by adding a layer which enables and disables queuing to the driver queue based on calculating the minimum buffer size required to avoid starvation under the current system conditions. Recall from earlier that the smaller the amount of queued data, the lower the maximum latency experienced by queued packets. It is key to understand that the actual size of the driver queue is not changed by BQL. Rather BQL calculates a limit of how much data (in bytes) can be queued at the current time. Any bytes over this limit must be held or dropped by the layers above the driver queue.. The BQL mechanism operates when two events occur: when packets are enqueued to the driver queue and when a transmission to the wire has completed. A simplified version of the BQL algorithm is outlined below. LIMIT refers to the value calculated by BQL. **** ** After adding packets to the queue **** if the number of queued bytes is over the current LIMIT value then disable the queueing of more data to the driver queue Notice that the amount of queued data can exceed LIMIT because data is queued before the LIMIT check occurs. Since a large number of bytes can be queued in a single operation when TSO, UFO or GSO (see chapter 2.9.1 aggiungi link for details) are enabled these throughput optimizations have the side effect of allowing a higher than desirable amount of data to be queued. If you care about latency you probably want to disable these features. The second stage of BQL is executed after the hardware has completed a transmission (simplified pseudo-code): **** ** When the hardware has completed sending a batch of packets ** (Referred to as the end of an interval) **** if the hardware was starved in the interval increase LIMIT else if the hardware was busy during the entire interval (not starved) and there are bytes to transmit decrease LIMIT by the number of bytes not transmitted in the interval if the number of queued bytes is less than LIMIT enable the queueing of more data to the buffer As you can see, BQL is based on testing whether the device was starved. If it was starved, then LIMIT is increased allowing more data to be queued which reduces the chance of starvation. If the device was busy for the entire interval and there are still bytes to be transferred in the queue then the queue is bigger than is necessary for the system under the current conditions and LIMIT is decreased to constrain the latency. A real world example may help provide a sense of how much BQL affects the amount of data which can be queued. On one of my servers the driver queue size defaults to 256 descriptors. Since the Ethernet MTU is 1,500 bytes this means up to 256 * 1,500 = 384,000 bytes can be queued to the driver queue (TSO, GSO etc are disabled or this would be much higher). However, the limit value calculated by BQL is 3,012 bytes. As you can see, BQL greatly constrains the amount of data which can be queued. An interesting aspect of BQL can be inferred from the first word in the name – byte. Unlike the size of the driver queue and most other packet queues, BQL operates on bytes. This is because the number of bytes has a more direct relationship with the time required to transmit to the physical medium than the number of packets or descriptors since the later are variably sized. BQL reduces network latency by limiting the amount of queued data to the minimum required to avoid starvation. It also has the very important side effect of moving the point where most packets are queued from the driver queue which is a simple FIFO to the queueing discipline (QDisc) layer which is capable of implementing much more complicated queueing strategies. The next section introduces the Linux QDisc layer. 4.10.1. Set BQL The BQL algorithm is self tuning so you probably don’t need to mess with this too much. However, if you are concerned about optimal latencies at low bitrates then you may want override the upper limit on the calculated LIMIT value. BQL state and configuration can be found in a /sys directory based on the location and name of the NIC. On my server the directory for eth0 is: /sys/devices/pci0000:00/0000:00:14.0/net/eth0/queues/tx-0/byte_queue_limits The files in this directory are: * hold_time: Time between modifying LIMIT in milliseconds. * inflight: The number of queued but not yet transmitted bytes. * limit: The LIMIT value calculated by BQL. 0 if BQL is not supported in the NIC driver. * limit_max: A configurable maximum value for LIMIT. Set this value lower to optimize for latency. * limit_min: A configurable minimum value for LIMIT. Set this value higher to optimize for throughput. To place a hard upper limit on the number of bytes which can be queued write the new value to the limit_max fie. echo "3000" > limit_max 5. Software and Tools 5.1. Kernel requirements Many distributions provide kernels with modular or monolithic support for traffic control (Quality of Service). Custom kernels may not already provide support (modular or not) for the required features. If not, this is a very brief listing of the required kernel options. The user who has little or no experience compiling a kernel is recommended to Kernel_HOWTO. Experienced kernel compilers should be able to determine which of the below options apply to the desired configuration, after reading a bit more about traffic control and planning. Example 1. Kernel compilation options [8] # # QoS and/or fair queueing # CONFIG_NET_SCHED=y CONFIG_NET_SCH_CBQ=m CONFIG_NET_SCH_HTB=m CONFIG_NET_SCH_CSZ=m CONFIG_NET_SCH_PRIO=m CONFIG_NET_SCH_RED=m CONFIG_NET_SCH_SFQ=m CONFIG_NET_SCH_TEQL=m CONFIG_NET_SCH_TBF=m CONFIG_NET_SCH_GRED=m CONFIG_NET_SCH_DSMARK=m CONFIG_NET_SCH_INGRESS=m CONFIG_NET_QOS=y CONFIG_NET_ESTIMATOR=y CONFIG_NET_CLS=y CONFIG_NET_CLS_TCINDEX=m CONFIG_NET_CLS_ROUTE4=m CONFIG_NET_CLS_ROUTE=y CONFIG_NET_CLS_FW=m CONFIG_NET_CLS_U32=m CONFIG_NET_CLS_RSVP=m CONFIG_NET_CLS_RSVP6=m CONFIG_NET_CLS_POLICE=y A kernel compiled with the above set of options will provide modular support for almost everything discussed in this documentation. The user may need to modprobe module before using a given feature. Again, the confused user is recommended to the Kernel_HOWTO, as this document cannot adequately address questions about the use of the Linux kernel. 5.2. iproute2 tools (tc) iproute2 is a suite of command line utilities which manipulate kernel structures for IP networking configuration on a machine. For technical documentation on these tools, see the iproute2_documentation and for a more expository discussion, the documentation at linux-ip.net. Of the tools in the iproute2 package, the binary tc is the only one used for traffic control. This HOWTO will ignore the other tools in the suite. Because it interacts with the kernel to direct the creation, deletion and modification of traffic control structures, the tc binary needs to be compiled with support for all of the qdiscs you wish to use. In particular, the HTB qdisc is not supported yet in the upstream iproute2 package. See Section 7.1, “HTB,_Hierarchical_Token_Bucket” for more information. The tc tool performs all of the configuration of the kernel structures required to support traffic control. As a result of its many uses, the command syntax can be described (at best) as arcane. The utility takes as its first non-option argument one of three Linux traffic control components, qdisc, class or filter. Example 2. tc command usage [root@leander]# tc Usage: tc [ OPTIONS ] OBJECT { COMMAND | help } where OBJECT := { qdisc | class | filter } OPTIONS := { -s[tatistics] | -d[etails] | -r[aw] } Each object accepts further and different options, and will be incompletely described and documented below. The hints in the examples below are designed to introduce the vagaries of tc command line syntax. For more examples, consult the LARTC_HOWTO. For even better understanding, consult the kernel and iproute2 code. Example 3. tc qdisc [root@leander]# tc qdisc add \ _1_ > dev eth0 \ _2_ > root \ _3_ > handle 1:0 \ _4_ > htb _5_ _1 Add a queuing discipline. The verb could also be del. _2 Specify the device onto which we are attaching the new queuing discipline. _3 This means “egress” to tc. The word root must be used, however. Another qdisc with limited functionality, the ingress qdisc can be attached to the same device. _4 The handle is a user-specified number of the form major:minor. The minor number for any queueing discipline handle must always be zero (0). An acceptable shorthand for a qdisc handle is the syntax "1:", where the minor number is assumed to be zero (0) if not specified. _5 This is the queuing discipline to attach, HTB in this example. Queuing discipline specific parameters will follow this. In the example here, we add no qdisc- specific parameters. Above was the simplest use of the tc utility for adding a queuing discipline to a device. Here's an example of the use of tc to add a class to an existing parent class. Example 4. tc class [root@leander]# tc class add \ _1_ > dev eth0 \ _2_ > parent 1:1 \ _3_ > classid 1:6 \ _4_ > htb \ _5_ > rate 256kbit \ _6_ > ceil 512kbit _7_ _1 Add a class. The verb could also be del. _2 Specify the device onto which we are attaching the new class. _3 Specify the parent handle to which we are attaching the new class. _4 This is a unique handle (major:minor) identifying this class. The minor number must be any non-zero (0) number. _5 Both of the classful_qdiscs require that any children classes be classes of the same type as the parent. Thus an HTB qdisc will contain HTB classes. _6 This is a class specific parameter. Consult Section 7.1,_“HTB,_Hierarchical 7_ Token_Bucket” for more detail on these parameters. Example 5. tc filter [root@leander]# tc filter add \ _1_ > dev eth0 \ _2_ > parent 1:0 \ _3_ > protocol ip \ _4_ > prio 5 \ _5_ > u32 \ _6_ > match ip port 22 0xffff \ _7_ > match ip tos 0x10 0xff \ _8_ > flowid 1:6 \ _9_ > police \ _10_ > rate 32000bps \ _11_ > burst 10240 \ _12_ > mpu 0 \ _13_ > action drop/continue _14_ _1_ Add a filter. The verb could also be del. _2_ Specify the device onto which we are attaching the new filter. _3_ Specify the parent handle to which we are attaching the new filter. _4_ This parameter is required. It's use should be obvious, although I don't know more. _5_ The prio parameter allows a given filter to be preferred above another. The pref is a synonym. _6_ This is a classifier, and is a required phrase in every tc filter command. _7_ _8_ These are parameters to the classifier. In this case, packets with a type of service flag (indicating interactive usage) and matching port 22 will be selected by this statement. _9_ The flowid specifies the handle of the target class (or qdisc) to which a matching filter should send its selected packets. _10_ This is the policer, and is an optional phrase in every tc filter command. _11_ The policer will perform one action above this rate, and another action below (see action_parameter). _12_ The burst is an exact analog to burst in HTB (burst is a buckets concept). _13_ The minimum policed unit. To count all traffic, use an mpu of zero (0). _14_ The action indicates what should be done if the rate based on the attributes of the policer. The first word specifies the action to take if the policer has been exceeded. The second word specifies action to take otherwise. As evidenced above, the tc command line utility has an arcane and complex syntax, even for simple operations such as these examples show. It should come as no surprised to the reader that there exists an easier way to configure Linux traffic control. See the next section, Section 5.3,_“tcng,_Traffic Control_Next_Generation”. 5.3. tcng, Traffic Control Next Generation FIXME; sing the praises of tcng. See also Traffic_Control_using_tcng_and_HTB HOWTO and tcng_documentation. Traffic control next generation (hereafter, tcng) provides all of the power of traffic control under Linux with twenty percent of the headache. 5.4. Netfilter Netfilter is a framework provided by the Linux kernel that allows various networking-related operations to be implemented in the form of customized handlers. Netfilter offers various functions and operations for packet filtering, network address translation, and port translation, which provide the functionality required for directing packets through a network, as well as for providing ability to prohibit packets from reaching sensitive locations within a computer network. Netfilter represents a set of hooks inside the Linux kernel, allowing specific kernel modules to register callback functions with the kernel's networking stack. Those functions, usually applied to the traffic in form of filtering and modification rules, are called for every packet that traverses the respective hook within the networking stack. 5.4.1. iptables iptables is a user-space application program that allows a system administrator to configure the tables provided by the Linux kernel firewall (implemented as different Netfilter modules) and the chains and rules it stores. Different kernel modules and programs are currently used for different protocols; iptables applies to IPv4, ip6tables to IPv6, arptables to ARP, and ebtables to Ethernet frames. iptables requires elevated privileges to operate and must be executed by user root, otherwise it fails to function. On most Linux systems, iptables is installed as /usr/sbin/iptables and documented in its man pages, which can be opened using man iptables when installed. It may also be found in /sbin/ iptables, but since iptables is more like a service rather than an "essential binary", the preferred location remains /usr/sbin. The term iptables is also commonly used to inclusively refer to the kernel- level components. x_tables is the name of the kernel module carrying the shared code portion used by all four modules that also provides the API used for extensions; subsequently, Xtables is more or less used to refer to the entire firewall (v4, v6, arp, and eb) architecture. Xtables allows the system administrator to define tables containing chains of rules for the treatment of packets. Each table is associated with a different kind of packet processing. Packets are processed by sequentially traversing the rules in chains. A rule in a chain can cause a goto or jump to another chain, and this can be repeated to whatever level of nesting is desired. (A jump is like a “call”, i.e. the point that was jumped from is remembered.) Every network packet arriving at or leaving from the computer traverses at least one chain. Figure 5:Packet flow paths. Packets start at a given box and will flow along a certain path, depending on the circumstances. The origin of the packet determines which chain it traverses initially. There are five predefined chains (mapping to the five available Netfilter hooks, see figure 5), though a table may not have all chains. Figure 6:netfilter’s hook Predefined chains have a policy, for example DROP, which is applied to the packet if it reaches the end of the chain. The system administrator can create as many other chains as desired. These chains have no policy; if a packet reaches the end of the chain it is returned to the chain which called it. A chain may be empty. * PREROUTING: Packets will enter this chain before a routing decision is made (point 1 in Figure 6). * INPUT: Packet is going to be locally delivered. It does not have anything to do with processes having an opened socket; local delivery is controlled by the "local-delivery" routing table: ip route show table local (point 2 Figure 6). * FORWARD: All packets that have been routed and were not for local delivery will traverse this chain (point 3 in Figure 6). * OUTPUT: Packets sent from the machine itself will be visiting this chain (point 5 in Figure 6) * POSTROUTING: Routing decision has been made. Packets enter this chain just before handing them off to the hardware (point 4 in Figure 6). Each rule in a chain contains the specification of which packets it matches. It may also contain a target (used for extensions) or verdict (one of the built-in decisions). As a packet traverses a chain, each rule in turn is examined. If a rule does not match the packet, the packet is passed to the next rule. If a rule does match the packet, the rule takes the action indicated by the target/ verdict, which may result in the packet being allowed to continue along the chain or it may not. Matches make up the large part of rulesets, as they contain the conditions packets are tested for. These can happen for about any layer in the OSI model, as with e.g. the --mac-source and -p tcp --dport parameters, and there are also protocol-independent matches, such as -m time. The packet continues to traverse the chain until either * a rule matches the packet and decides the ultimate fate of the packet, for example by calling one of the ACCEPT or DROP, or a module returning such an ultimate fate; or * a rule calls the RETURN verdict, in which case processing returns to the calling chain; or * the end of the chain is reached; traversal either continues in the parent chain (as if RETURN was used), or the base chain policy, which is an ultimate fate, is used. Targets also return a verdict like ACCEPT (NAT modules will do this) or DROP (e.g. the REJECT module), but may also imply CONTINUE (e.g. the LOG module; CONTINUE is an internal name) to continue with the next rule as if no target/ verdict was specified at all. 5.5. IMQ, Intermediate Queuing device The Intermediate queueing device is not a qdisc but its usage is tightly bound to qdiscs. Within linux, qdiscs are attached to network devices and everything that is queued to the device is first queued to the qdisc and then to driver queue. From this concept, two limitations arise: * Only egress shaping is possible (an ingress qdisc exists, but its possibilities are very limited compared to classful qdiscs, as seen before). * A qdisc can only see traffic of one interface, global limitations can't be placed. IMQ is there to help solve those two limitations. In short, you can put everything you choose in a qdisc. Specially marked packets get intercepted in netfilter NF_IP_PRE_ROUTING and NF_IP_POST_ROUTING hooks and pass through the qdisc attached to an imq device. An iptables target is used for marking the packets. This enables you to do ingress shaping as you can just mark packets coming in from somewhere and/or treat interfaces as classes to set global limits. You can also do lots of other stuff like just putting your http traffic in a qdisc, put new connection requests in a qdisc, exc. 5.5.1. Sample configuration The first thing that might come to mind is use ingress shaping to give yourself a high guaranteed bandwidth. Configuration is just like with any other interface: tc qdisc add dev imq0 root handle 1: htb default 20 tc class add dev imq0 parent 1: classid 1:1 htb rate 2mbit burst 15k tc class add dev imq0 parent 1:1 classid 1:10 htb rate 1mbit tc class add dev imq0 parent 1:1 classid 1:20 htb rate 1mbit tc qdisc add dev imq0 parent 1:10 handle 10: pfifo tc qdisc add dev imq0 parent 1:20 handle 20: sfq tc filter add dev imq0 parent 10:0 protocol ip prio 1 u32 match \ ip dst 10.0.0.230/32 flowid 1:10 In this example u32 is used for classification. Other classifiers should work as expected. Next traffic has to be selected and marked to be enqueued to imq0. iptables -t mangle -A PREROUTING -i eth0 -j IMQ --todev 0 ip link set imq0 up The IMQ iptables targets is valid in the PREROUTING and POSTROUTING chains of the mangle table. It's syntax is IMQ [ --todev n ] n : number of imq device An ip6tables target is also provided. Please note traffic is not enqueued when the target is hit but afterwards. The exact location where traffic enters the imq device depends on the direction of the traffic (in/out). These are the predefined netfilter hooks used by iptables: enum nf_ip_hook_priorities { NF_IP_PRI_FIRST = INT_MIN, NF_IP_PRI_CONNTRACK = -200, NF_IP_PRI_MANGLE = -150, NF_IP_PRI_NAT_DST = -100, NF_IP_PRI_FILTER = 0, NF_IP_PRI_NAT_SRC = 100, NF_IP_PRI_LAST = INT_MAX, }; For ingress traffic, imq registers itself with NF_IP_PRI_MANGLE + 1 priority which means packets enter the imq device directly after the mangle PREROUTING chain has been passed. For egress imq uses NF_IP_PRI_LAST which honours the fact that packets dropped by the filter table won't occupy bandwidth. 5.6. ethtool, Driver Queue The ethtool command is used to control the driver queue size for Ethernet devices. ethtool also provides low level interface statistics as well as the ability to enable and disable IP stack and driver features. The -g flag to ethtool displays the driver queue (ring) parameters (see Figure 1) : $ethtool -g eth0 Ring parameters for eth0: Pre-set maximums: RX: 16384 RX Mini: 0 RX Jumbo: 0 TX: 16384 Current hardware settings: RX: 512 RX Mini: 0 RX Jumbo: 0 TX: 256 You can see from the above output that the driver for this NIC defaults to 256 descriptors in the transmission queue. It was often recommended to reduce the size of the driver queue in order to reduce latency. With the introduction of BQL (assuming your NIC driver supports it) there is no longer any reason to modify the driver queue size (see the below for how to configure BQL). Ethtool also allows you to manage optimization features such as TSO,_UFO_and GSO. The -k flag displays the current offload settings and -K modifies them. $ethtool -k eth0 Offload parameters for eth0: rx-checksumming: off tx-checksumming: off scatter-gather: off tcp-segmentation-offload: off udp-fragmentation-offload: off generic-segmentation-offload: off generic-receive-offload: on large-receive-offload: off rx-vlan-offload: off tx-vlan-offload: off ntuple-filters: off receive-hashing: off Since TSO,_GSO,_UFO and GRO greatly increase the number of bytes which can be queued in the driver queue you should disable these optimizations if you want to optimize for latency over throughput. It’s doubtful you will notice any CPU impact or throughput decrease when disabling these features unless the system is handling very high data rates. 6. Classless Queuing Disciplines (qdiscs) Each of these queuing disciplines can be used as the primary qdisc on an interface, or can be used inside a leaf class of a classful_qdiscs. These are the fundamental schedulers used under Linux. Note that the default scheduler is the pfifo_fast. 6.1. FIFO, First-In First-Out (pfifo and bfifo) Note Although a FIFO is one of the simplest elements of a queueing system, neither pfifo nor bfifo is the default qdisc on Linux interfaces. Be certain to see Section 6.2,_“pfifo_fast,_the_default_Linux_qdisc” for the full details on the default (pfifo_fast) qdisc. 6.1.1. pfifo, bfifo Algorithm The FIFO algorithm forms the underlying basis for the default qdisc on all Linux network interfaces (pfifo_fast). It performs no shaping or rearranging of packets. It simply transmits packets as soon as it can after receiving and queuing them. This is also the qdisc used inside all newly created classes until another qdisc or a class replaces the FIFO. Figure 7: FIFO qdisc A list of packets is maintained, when a packet is enqueued it gets inserted at the tail of a list. When a packet needs to be sent out to the network, it is taken from the head of the list. 6.1.2. limit parameter A real FIFO qdisc must, however, have a size limit (a buffer size) to prevent it from overflowing in case it is unable to dequeue packets as quickly as it receives them. Linux implements two basic FIFO qdiscs, one based on bytes, and one on packets. Regardless of the type of FIFO used, the size of the queue is defined by the parameter limit. For a pfifo (Packet limited First In, First Out queue) the unit is understood to be packets and for a bfifo(Byte limited First In, First Out queue) the unit is understood to be bytes. For pfifo, defaults to the interface txqueuelen , as specified with ifconfig or ip. The range for this parameter is [0, UINT32_MAX]. For bfifo, it defaults to the txqueuelen multiplied by the interface MTU. The range for this parameter is [0, UINT32_MAX] bytes. Note: The link layer header was considered when counting packet length. Example 6. Specifying a limit for a packet or byte FIFO [root@leander]# cat bfifo.tcc /* * make a FIFO on eth0 with 10kbyte queue size * */ dev eth0 { egress { fifo (limit 10kB ); } } [root@leander]# tcc < bfifo.tcc # ================================ Device eth0 ================================ tc qdisc add dev eth0 handle 1:0 root dsmark indices 1 default_index 0 tc qdisc add dev eth0 handle 2:0 parent 1:0 bfifo limit 10240 [root@leander]# cat pfifo.tcc /* * make a FIFO on eth0 with 30 packet queue size * */ dev eth0 { egress { fifo (limit 30p ); } } [root@leander]# tcc < pfifo.tcc # ================================ Device eth0 ================================ tc qdisc add dev eth0 handle 1:0 root dsmark indices 1 default_index 0 tc qdisc add dev eth0 handle 2:0 parent 1:0 pfifo limit 30 6.1.3. tc –s qdisc ls The output of tc -s qdisc ls contains the limit, either in packets or in bytes, and the number of bytes and packets actually sent. An unsent and dropped packet only appears between braces and is not counted as 'Sent'. In this example, the queue length is 100 packets, 45894 bytes were sentover 681 packets. No packets were dropped, and as the pfifo queue does not slow down packets, there were also no overlimits: $ tc -s qdisc ls dev eth0 qdisc pfifo 8001: dev eth0 limit 100p Sent 45894 bytes 681 pkts (dropped 0, overlimits 0) If a backlog occurs, this is displayed as well. pfifo and bfifo, like all non-default qdiscs, maintain statistics. This might be a reason to prefer that over the default. 6.2. pfifo_fast, the default Linux qdisc The pfifo_fast (three-band first in, first out queue) qdisc is the default qdisc for all interfaces under Linux. Whenever an interface is created, the pfifo_fast qdisc is automatically used as a queue. If another qdisc is attached, it preempts the default pfifo_fast, which automatically returns to function when an existing qdisc is detached. Figure 8: pfifo_fast qdisc 6.2.1. pfifo_fast algorithm Based on a conventional FIFO qdisc, this qdisc also provides some prioritization. It provides three different bands (individual FIFOs) for separating traffic. The highest priority traffic (interactive flows) are placed into band 0 and are always serviced first. Similarly, band 1 is always emptied of pending packets before band 2 is dequeued. The algorithm is very similar to that of the classful prio_qdisc. The pfifo_fast qdisc is like three pfifo queues side by side, where an individual packet will be enqueued in one of the three FIFOs based on its Type of Service (ToS) bits. Not all three bands are dequeued simultaneously - as long as lower- numbered (higher priority) bands have traffic, higher-numbered bands are never dequeued. This can be used to prioritize interactive traffic or penalize 'lowest cost' traffic. Each band can be txqueuelen packets long, as configured with ifconfig or ip. Any additional packets arriving, once a specific band is full, are not enqueued but are instead dropped. See chapter_6.2.3 for complete details on how ToS bits are translated into bands. 6.2.2. txqueuelen parameter The length of the three bands depends on the interface txqueuelen, as specified with ifconfig or ip. 6.2.3. Bugs There is nothing configurable to the end user about the pfifo_fast qdisc. This is how it is configured by default: priomap determines how packet priorities, as assigned by the kernel, map to bands. Mapping occurs based on the ToS octet of the packet, which looks like this: The four ToS bits (the 'ToS field') are defined as: As there is 1 bit to the right of these four bits, the actual value of the ToS field is double the value of the ToS bits. Running tcpdump -v -v shows you the value of the entire ToS field, not just the four bits. It is the value you see in the first column of this table: Lots of numbers. The second column contains the value of the relevant four ToS bits, followed by their translated meaning. For example, 15 stands for a packet wanting Minimal Monetary Cost, Maximum Reliability, Maximum Throughput AND Minimum Delay. The fourth column lists the way the Linux kernel interprets the ToS bits, by showing to which Priority they are mapped. The last column shows the result of the default priomap. On the command line, the default priomap looks like this: 1, 2, 2, 2, 1, 2, 0, 0 , 1, 1, 1, 1, 1, 1, 1, 1 This means that priority 4, for example, gets mapped to band number 1. The priomap also allows you to list higher priorities (> 7) which do not correspond to ToS mappings, but which are set by other means. Moreover, differently from other non standard qdisc, pfifo_fast does not maintain statistics and does not show up in tc qdisc ls. This is because it is the automatic default in the absence of a configured qdisc. 6.3. SFQ, Stochastic Fair Queuing Stochastic Fairness Queueing is a classless queueing discipline available for traffic control with the tc command. SFQ does not shape traffic but only schedules the transmission of packets, based on 'flows'. The goal is to ensure fairness so that each flow is able to send data in turn, thus preventing any single flow from drowning out the rest. It accomplishes this by using a hash function to separate the traffic into separate (internally maintained) FIFOs which are dequeued in a round-robin fashion. Because there is the possibility for unfairness to manifest in the choice of hash function, this function is altered periodically (see_6.3.1_algorithm). Perturbation (the parameter perturb) sets this periodicity (see_6.3.3_parameters). This may in fact have some effect in mitigating a Denial of Service attempt. SFQ is work-conserving and therefore always delivers a packet if it has one available. So, the SFQ qdisc attempts to fairly distribute opportunity to transmit data to the network among an arbitrary number of flows. Figure 9: Stochastic Fair Queuing (SFQ) 6.3.1. SFQ Algorithm On enqueueing, each packet is assigned to a hash bucket, based on the packets hash value. This hash value is either obtained from an external flow classifier (use tc filter to set them), or a default internal classifier if no external classifier has been configured. When the internal classifier is used, sfq uses * Source address; * Destination address; * Source and Destination port; if these are available. SFQ knows about ipv4 and ipv6 and also UDP, TCP and ESP. Packets with other protocols are hashed based on the 32bits representation of their destination and source. A flow corresponds mostly to a TCP/IP connection. Each of these buckets should represent a unique flow. Because multiple flows may get hashed to the same bucket, sfqs internal hashing algorithm may be perturbed at configurable intervals so that the unfairness lasts only for a short while. Perturbation may however cause some inadvertent packet reordering to occur. After linux-3.3, there is no packet reordering problem, but possible packet drops if rehashing hits one limit (number of flows or packets per flow) When dequeuing, each hashbucket with data is queried in a round robin fashion. Before linux-3.3, the compile time maximum length of the SFQ is 128 packets, which can be spread over at most 128 buckets of 1024 available. In case of overflow, tail-drop is performed on the fullest bucket, thus maintaining fairness. After linux-3.3, maximum length of SFQ is 65535 packets, and divisor limit is 65536. In case of overflow, tail-drop is performed on the fullest bucket, unless headdrop was requested. 6.3.2. Synopsis tc qdisc ... [divisor hashtablesize] [limit packets] [perturb seconds] [quantum bytes] [flows number] [depth number] [head drop] [redflowlimit bytes] [min bytes] [max bytes] [avpkt bytes] [burst packets] [probability P] [ecn] [harddrop] 6.3.3. Parameters * divisor: can be used to set a different hash table size, available from kernel 2.6.39 onwards. The specified divisor must be a power of two and cannot be larger than 65536. Default value is 1024. * limit: upper limit of the SFQ. Can be used to reduce the default length of 127 packets. After linux-3.3, it can be raised. * depth: limit of packets per flow (after linux-3.3). Default to 127 and can be lowered. * perturb: interval in seconds for queue algorithm perturbation. Defaults to 0, which means that no perturbation occurs. Do not set too low for each perturbation may cause some packet reordering or losses. Advised value is 60. This value has no effect when external flow classification is used. Its better to increase divisor value to lower risk of hash collisions. * quantum: amount of bytes a flow is allowed to dequeue during a round of the round robin process. Defaults to the MTU of the interface which is also the advised value and the minimum value. * flows: after linux-3.3, it is possible to change the default limit of flows. Default value is 127. * headdrop: default SFQ behavior is to perform tail-drop of packets from a flow. You can ask a headdrop instead, as this is known to provide a better feedback for TCP flows * redflowlimit: configure the optional RED module on top of each SFQ flow. Random Early Detection principle is to perform packet marks or drops in a probabilistic way. Redflowlimit configures the hard limit on the real (not average) queue size per SFQ flow in bytes. * min: average queue size at which marking becomes a possibility. Defaults to max /3. * max: at this average queue size, the marking probability is maximal. Defaults to redflowlimit /4. * probability: maximum probability for marking, specified as a floating point number from 0.0 to 1.0. Default value is 0.02. * avpkt: specified in bytes. Used with burst to determine the time constant for average queue size calculations. Default value is 1000. * burst: used for determining how fast the average queue size is influenced by the real queue size. Default value is: (2 * min + max) / (3 * avpkt). * ecn: RED can either 'mark' or 'drop'. Explicit Congestion Notification allows RED to notify remote hosts that their rate exceeds the amount of bandwidth available. Non-ECN capable hosts can only be notified by dropping a packet. If this parameter is specified, packets which indicate that their hosts honor ECN will only be marked and not dropped, unless the queue size hits depth packets. * harddrop: if average flow queue size is above max bytes, this parameter forces a drop instead of ecn marking. 6.3.4. Example and Usage To attach to device ppp0: $ tc qdisc add dev ppp0 root sfq Please note that SFQ, like all non-shaping (work-conserving) qdiscs, is only useful if it owns the queue. This is the case when the link speed equals the actually available bandwidth. This holds for regular phone modems, ISDN connections and direct non-switched ethernet links. Most often, cable modems and DSL devices do not fall into this category. The same holds for when connected to a switch and trying to send data to a congested segment also connected to the switch. In this case, the effective queue does not reside within Linux and is therefore not available for scheduling. Embed SFQ in a classful qdisc to make sure it owns the queue. It is possible to use external classifiers with sfq, for example to hash traffic based only on source/destination ip addresses $ tc filter add ... flow hash keys src,dst perturb 30 divisor 1024 Note that the given divisor should match the one used by sfq. If you have changed the sfq default of 1024, use the same value for the flow hash filter, too. Example 7. SFQ with optional RED mode [root@leander]# tc qdisc add dev eth0 parent 1:1 handle 10: sfq limit 3000 flows 512 divisor 16384 redflowlimit 100000 min 8000 max 60000 probability 0.20 ecn headdrop Example 8. Creating an SFQ [root@leander]# cat sfq.tcc /* * make an SFQ on eth0 with a 10 second perturbation * */ dev eth0 { egress { sfq( perturb 10s ); } } [root@leander]# tcc < sfq.tcc # ================================ Device eth0 ================================ tc qdisc add dev eth0 handle 1:0 root dsmark indices 1 default_index 0 tc qdisc add dev eth0 handle 2:0 parent 1:0 sfq perturb 10 Unfortunately, some clever software (e.g. Kazaa and eMule among others) obliterate the benefit of this attempt at fair queuing by opening as many TCP sessions (flows) as can be sustained. In many networks, with well-behaved users, SFQ can adequately distribute the network resources to the contending flows, but other measures may be called for when obnoxious applications have invaded the network. See also Section 6.4,_“ESFQ,_Extended_Stochastic_Fair_Queuing” for an SFQ qdisc with more exposed parameters for the user to manipulate. 6.4. ESFQ, Extended Stochastic Fair Queuing Conceptually, this qdisc is no different than SFQ although it allows the user to control more parameters than its simpler cousin. This qdisc was conceived to overcome the shortcoming of SFQ identified above. By allowing the user to control which hashing algorithm is used for distributing access to network bandwidth, it is possible for the user to reach a fairer real distribution of bandwidth. Example 9. ESFQ usage Usage: ... esfq [ perturb SECS ] [ quantum BYTES ] [ depth FLOWS ] [ divisor HASHBITS ] [ limit PKTS ] [ hash HASHTYPE] Where: HASHTYPE := { classic | src | dst } FIXME; need practical experience and/or attestation here. 6.5. RED,Random Early Drop 6.5.1. Description Random Early Detection is a classless qdisc which manages its queue size smartly. Regular queues simply drop packets from the tail when they are full, which may not be the optimal behaviour. RED also performs tail drop, but does so in a more gradual way. Once the queue hits a certain average length, packets enqueued have a configurable chance of being marked (which may mean dropped). This chance increases linearly up to a point called the max average queue length, although the queue might get bigger. This has a host of benefits over simple taildrop, while not being processor intensive. It prevents synchronous retransmits after a burst in traffic, which cause further retransmits, etc. The goal is to have a small queue size, which is good for interactivity while not disturbing TCP/IP traffic with too many sudden drops after a burst of traffic. Depending on if ECN is configured, marking either means dropping or purely marking a packet as overlimit. 6.5.2. Algorithm The average queue size is used for determining the marking probability. This is calculated using an Exponential Weighted Moving Average, which can be more or less sensitive to bursts. When the average queue size is below min bytes, no packet will ever be marked. When it exceeds min, the probability of doing so climbs linearly up to probability, until the average queue size hits max bytes. Because probability is normally not set to 100%, the queue size might conceivably rise above max bytes, so the limit parameter is provided to set a hard maximum for the size of the queue. 6.5.3. Synopsis $ tc qdisc ... red limit bytes [min bytes] [max bytes] avpkt bytes [burst packets] [ecn] [harddrop] [bandwidth rate] [probability chance] [adaptive] 6.5.4. Parameters * min: average queue size at which marking becomes a possibility. Defaults to max/3. * max: at this average queue size, the marking probability is maximal. Should be at least twice min to prevent synchronous retransmits, higher for low min. Default to limit/4. * probability: maximum probability for marking, specified as a floating point number from 0.0 to 1.0. Suggested values are 0.01 or 0.02 (1 or 2%, respectively). Default is 0.02. * limit: hard limit on the real (not average) queue size in bytes. Further packets are dropped. Should be set higher than max+burst. It is advised to set this a few times higher than max. * burst: used for determining how fast the average queue size is influenced by the real queue size. Larger values make the calculation more sluggish, allowing longer bursts of traffic before marking starts. Real life experiments support the following guideline (min+min+max)/(3*avpkt). * Avpkt: specified in bytes. Used with burst to determine the time constant for average queue size calculations. 1000 is a good value. * bandwidth: this rate is used for calculating the average queue size after some idle time. Should be set to the bandwidth of your interface. Does not mean that RED will shape for you! Optional. Default is 10Mbit. * ecn: as mentioned before, RED can either 'mark' or 'drop'. Explicit Congestion Notification allows RED to notify remote hosts that their rate exceeds the amount of bandwidth available. Non-ECN capable hosts can only be notified by dropping a packet. If this parameter is specified, packets which indicate that their hosts honor ECN will only be marked and not dropped, unless the queue size hits limit bytes. Recommended. * harddrop: If average flow queue size is above max bytes, this parameterv forces a drop instead of ecn marking. * adaptive: (added in linux-3.3) Sets RED in adaptive mode as described in http://icir.org/floyd/papers/adaptiveRed.pdf. Goal of Adaptive RED is to make 'probability' dynamic value between 1% and 50% to reach the target average queue: (max - min) / 2. 6.5.5. Example # tc qdisc add dev eth0 parent 1:1 handle 10: red limit 400000 min 30000 max 90000 avpkt 1000 burst 55 ecn adaptive bandwidth 10Mbit 6.6. GRED, Generic Random Early Drop Generalized RED is used in DiffServ implementation and it has virtual queue (VQ) within physical queue. Currently, the number of virtual queues is limited to 16. GRED is configured in two steps. First the generic parameters are configured to select the number of virtual queues DPs and whether to turn on the RIO-like buffer sharing scheme. Also at this point, a default virtual queue is selected. The second step is used to set parameters for individual virtual queues. 6.6.1. Synopsis ... gred DP drop-probability limit BYTES min BYTES max BYTES avpkt BYTES burst PACKETS probability PROBABILITY bandwidth KBPS [prio value] OR ... gred setup DPs "num of DPs" default "default DP" [grio] 6.6.2. Parameter * setup: identifies that this is a generic setup for GRED; * DPs: is the number of virtual queues; * default: specifies default virtual queue; * grio: turns on the RIO-like buffering scheme; * limit: defines the virtual queue “physical” limit in bytes; * min: defines the minimum threshold value in bytes; * max: defines the maximum threshold value in bytes; * avpkt: is the average packet size in bytes; * bandwidth: is the wire-speed of the interface; * burst: is the number of average-sized packets allowed to burst; * probability: defines the drop probability in the range (0…); * DP: identifies the virtual queue assigned to these parameters; * prio: identifies the virtual queue priority if grio was set in general parameters; 6.7. TBF, Token Bucket Filter This qdisc is built on tokens and buckets. It simply shapes traffic transmitted on an interface. To limit the speed at which packets will be dequeued from a particular interface, the TBF qdisc is the perfect solution. It simply slows down transmitted traffic to the specified rate. Packets are only transmitted if there are sufficient tokens available. Otherwise, packets are deferred. Delaying packets in this fashion will introduce an artificial latency into the packet's round trip time. Figure 10: Token Bucket Filter (TBF) 6.7.1. Algorithm As the name implies, traffic is filtered based on the expenditure of tokens (non-work-conserving). Tokens roughly correspond to bytes, with the additional constraint that each packet consumes some tokens, no matter how small it is. This reflects the fact that even a zero-sized packet occupies the link for some time. On creation, the TBF is stocked with tokens which correspond to the amount of traffic that can be burst in one go. Tokens arrive at a steady rate, until the bucket is full. If no tokens are available, packets are queued, up to a configured limit. The TBF now calculates the token deficit, and throttles until the first packet in the queue can be sent. If it is not acceptable to burst out packets at maximum speed, a peakrate can be configured to limit the speed at which the bucket empties. This peakrate is implemented as a second TBF with a very small bucket, so that it doesn't burst. 6.7.2. Parameters * limit or latency: limit is the number of bytes that can be queued waiting for tokens to become available. You can also specify this the other way around by setting the latency parameter, which specifies the maximum amount of time a packet can sit in the TBF. The latter calculation takes into account the size of the bucket, the rate and possibly the peakrate (if set). These two parameters are mutually exclusive. * Burst: also known as buffer or maxburst. Size of the bucket, in bytes. This is the maximum amount of bytes that tokens can be available for instantaneously. In general, larger shaping rates require a larger buffer. For 10mbit/s on Intel, you need at least 10kbyte buffer if you want to reach your configured rate. If your buffer is too small, packets may be dropped because more tokens arrive per timer tick than fit in your bucket. The minimum buffer size can be calculated by dividing the rate by HZ. Token usage calculations are performed using a table which by default has a resolution of 8 packets. This resolution can be changed by specifying the cell size with the burst. For example, to specify a 6000 byte buffer with a 16 byte cell size, set a burst of 6000/16. You will probably never have to set this. Must be an integral power of 2. * Mpu: a zero-sized packet does not use zero bandwidth. For ethernet, no packet uses less than 64 bytes. The Minimum Packet Unit determines the minimal token usage (specified in bytes) for a packet. Defaults to zero. * Rate: the speed knob. Furthermore, if a peakrate is desired, the following parameters are available: * peakrate: maximum depletion rate of the bucket. The peakrate does not need to be set, it is only necessary if perfect millisecond timescale shaping is required. * mtu/minburst: specifies the size of the peakrate bucket. For perfect accuracy, should be set to the MTU of the interface. If a peakrate is needed, but some burstiness is acceptable, this size can be raised. A 3000 byte minburst allows around 3mbit/s of peakrate, given 1000 byte packets. Like the regular burstsize you can also specify a cell size. 6.7.3. Example Example 10. Creating a 256kbit/s TBF [root@leander]# cat tbf.tcc /* * make a 256kbit/s TBF on eth0 * */ dev eth0 { egress { tbf( rate 256 kbps, burst 20 kB, limit 20 kB, mtu 1514 B ); } } [root@leander]# tcc < tbf.tcc # ================================ Device eth0 ================================ tc qdisc add dev eth0 handle 1:0 root dsmark indices 1 default_index 0 tc qdisc add dev eth0 handle 2:0 parent 1:0 tbf burst 20480 limit 20480 mtu 1514 rate 32000bps 7. Classful Queuing Disciplines (qdiscs) The flexibility and control of Linux traffic control can be unleashed through the agency of the classful qdiscs. Remember that the classful queuing disciplines can have filters attached to them, allowing packets to be directed to particular classes and subqueues. There are several common terms to describe classes directly attached to the root qdisc and terminal classes. Classess attached to the root qdisc are known as root classes, and more generically inner classes. Any terminal class in a particular queuing discipline is known as a leaf class by analogy to the tree structure of the classes. Besides the use of figurative language depicting the structure as a tree, the language of family relationships is also quite common. 7.1. HTB, Hierarchical Token Bucket HTB is meant as a more understandable and intuitive replacement for the CBQ (see_chapter_7.4) qdisc in Linux. Both CBQ and HTB help you to control the use of the outbound bandwidth on a given link. Both allow you to use one physical link to simulate several slower links and to send different kinds oftraffic on different simulated links. In both cases, you have to specify how to divide the physical link into simulated links and how to decide which simulated link to use for a given packet to be sent. HTB uses the concepts of tokens and buckets along with the class-based system and filters to allow for complex and granular control over traffic. With a complex borrowing_model, HTB can perform a variety of sophisticated traffic control techniques. One of the easiest ways to use HTB immediately is that of shaping. By understanding tokens and buckets or by grasping the function of TBF, HTB should be merely a logical step. This queuing discipline allows the user to define the characteristics of the tokens and bucket used and allows the user to nest these buckets in an arbitrary fashion. When coupled with a classifying scheme, traffic can be controlled in a very granular fashion. Below is example output of the syntax for HTB on the command line with the tc tool. Although the syntax for tcng is a language of its own, the rules for HTB are the same. Example 11. tc usage for HTB Usage: ... qdisc add ... htb [default N] [r2q N] default minor id of class to which unclassified packets are sent {0} r2q DRR quantums are computed as rate in Bps/r2q {10} debug string of 16 numbers each 0-3 {0} ... class add ... htb rate R1 burst B1 [prio P] [slot S] [pslot PS] [ceil R2] [cburst B2] [mtu MTU] [quantum Q] rate rate allocated to this class (class can still borrow) burst max bytes burst which can be accumulated during idle period {computed} ceil definite upper class rate (no borrows) {rate} cburst burst but for ceil {computed} mtu max packet size we create rate map for {1600} prio priority of leaf; lower are served first {0} quantum how much bytes to serve from leaf at once {use r2q} TC HTB version 3.3 7.1.1. Software requirements Unlike almost all of the other software discussed, HTB is a newer queuing discipline and your distribution may not have all of the tools and capability you need to use HTB. The kernel must support HTB; kernel version 2.4.20 and later support it in the stock distribution, although earlier kernel versions require patching. To enable userland support for HTB, see HTB for an iproute2 patch to tc. 7.1.2. Shaping One of the most common applications of HTB involves shaping transmitted traffic to a specific rate. All shaping occurs in leaf classes. No shaping occurs in inner or root classes as they only exist to suggest how the borrowing_model should distribute available tokens. 7.1.3. Borrowing A fundamental part of the HTB qdisc is the borrowing mechanism. Children classes borrow tokens from their parents once they have exceeded rate. A child class will continue to attempt to borrow until it reaches ceil, at which point it will begin to queue packets for transmission until more tokens/ctokens are available. As there are only two primary types of classes which can be created with HTB the following table and diagram identify the various possible states and the behaviour of the borrowing mechanisms. Table 2. HTB class states and potential actions taken _____________________________________________________________________________ |type_of_class|class_state___|HTB_internal_state|action_taken_________________| | | | |Leaf class will dequeue | |leaf |< rate |HTB_CAN_SEND |queued bytes up to available | | | | |tokens (no more than burst | |_____________|______________|__________________|packets)_____________________| | | | |Leaf class will attempt to | | | | |borrow tokens/ctokens from | | | | |parent class. If tokens are | |leaf |> rate, < ceil|HTB_MAY_BORROW |available, they will be lent | | | | |in quantum increments and the| | | | |leaf class will dequeue up to| |_____________|______________|__________________|cburst_bytes_________________| | | | |No packets will be dequeued. | |leaf |> ceil |HTB_CANT_SEND |This will cause packet delay | | | | |and will increase latency to | |_____________|______________|__________________|meet_the_desired_rate._______| |inner, root |< rate |HTB_CAN_SEND |Inner class will lend tokens | |_____________|______________|__________________|to_children._________________| | | | |Inner class will attempt to | | | | |borrow tokens/ctokens from | |inner, root |> rate, < ceil|HTB_MAY_BORROW |parent class, lending them to| | | | |competing children in quantum| |_____________|______________|__________________|increments_per_request.______| | | | |Inner class will not attempt | |inner, root |> ceil |HTB_CANT_SEND |to borrow from its parent and| | | | |will not lend tokens/ctokens | |_____________|______________|__________________|to_children_classes._________| This diagram identifies the flow of borrowed tokens and the manner in which tokens are charged to parent classes. In order for the borrowing model to work, each class must have an accurate count of the number of tokens used by itself and all of its children. For this reason, any token used in a child or leaf class is charged to each parent class until the root class is reached. Any child class which wishes to borrow a token will request a token from its parent class, which if it is also over its rate will request to borrow from its parent class until either a token is located or the root class is reached. So the borrowing of tokens flows toward the leaf classes and the charging of the usage of tokens flows toward the root class. Figure 11: Hierarchical Token Bucket (HTB) Note in this diagram that there are several HTB root classes. Each of these root classes can simulate a virtual circuit. 7.1.4. HTB class parameters default An optional parameter with every HTB qdisc object, the default default is 0, which cause any unclassified traffic to be dequeued at hardware speed, completely bypassing any of the classes attached to the root qdisc. rate Used to set the minimum desired speed to which to limit transmitted traffic. This can be considered the equivalent of a committed information rate (CIR), or the guaranteed bandwidth for a given leaf class. ceil Used to set the maximum desired speed to which to limit the transmitted traffic. The borrowing model should illustrate how this parameter is used. This can be considered the equivalent of “burstable bandwidth”. burst This is the size of the rate bucket (see Tokens_and_buckets). HTB will dequeue burst bytes before awaiting the arrival of more tokens. cburst This is the size of the ceil bucket (see Tokens_and_buckets). HTB will dequeue cburst bytes before awaiting the arrival of more ctokens. quantum This is a key parameter used by HTB to control borrowing. Normally, the correct quantum is calculated by HTB, not specified by the user. Tweaking this parameter can have tremendous effects on borrowing and shaping under contention, because it is used both to split traffic between children classes over rate (but below ceil) and to transmit packets from these same classes. r2q Also, usually calculated for the user, r2q is a hint to HTB to help determine the optimal quantum for a particular class. mtu prio In the round-robin process, classes with the lowest priority field are tried for packets first. Mandatory field. prio Place of this class within the hierarchy. If attached directly to a qdisc and not to another class, minor can be omitted. Mandatory field. prio Like qdiscs, classes can be named. The major number must be equal to the major number of the qdisc to which it belongs. Optional, but needed if this class is going to have children. 7.1.5. HTB root parameters The root of a HTB qdisc class tree has the following parameters: parent major:minor | root This mandatory parameter determines the place of the HTB instance, either at the root of an interface or within an existing class. handle major: Like all other qdiscs, the HTB can be assigned a handle. Should consist only of a major number, followed by a colon. Optional, but very useful if classes will be generated within this qdisc. default minor-id Unclassified traffic gets sent to the class with this minor-id. 7.1.6. Rules Below are some general guidelines to using HTB culled from http:// www.docum.org/docum.org/ and the (new)_LARTC_mailing_list (see also the (old) LARTC_mailing_list_archive). These rules are simply a recommendation for beginners to maximize the benefit of HTB until gaining a better understanding of the practical application of HTB. * Shaping with HTB occurs only in leaf classes. See also Section 7.1.2, “Shaping”. * Because HTB does not shape in any class except the leaf class, the sum of the rates of leaf classes should not exceed the ceil of a parent class. Ideally, the sum of the rates of the children classes would match the rate of the parent class, allowing the parent class to distribute leftover bandwidth (ceil - rate) among the children classes. This key concept in employing HTB bears repeating. Only leaf classes actually shape packets; packets are only delayed in these leaf classes. The inner classes (all the way up to the root class) exist to define how borrowing/ lending occurs (see also Section 7.1.3,_“Borrowing”). * The quantum is only only used when a class is over rate but below ceil. * The quantum should be set at MTU or higher. HTB will dequeue a single packet at least per service opportunity even if quantum is too small. In such a case, it will not be able to calculate accurately the real bandwidth consumed [9]. * Parent classes lend tokens to children in increments of quantum, so for maximum granularity and most instantaneously evenly distributed bandwidth, quantum should be as low as possible while still no less than MTU. * A distinction between tokens and ctokens is only meaningful in a leaf class, because non-leaf classes only lend tokens to child classes. * HTB borrowing could more accurately be described as “using”. 7.1.7. Classification Like see before, within the one HTB instance many classes may exist. Each of these classes contains another qdisc, by default tc-pfifo.When enqueueing a packet, HTB starts at the root and uses various methods to determine which class should receive the data. In the absence of uncommon configuration options, the process is rather easy. At each node we look for an instruction, and then go to the class the instruction refers us to. If the class found is a barren leaf-node (without children), we enqueue the packet there. If it is not yet a leaf node, we do the whole thing over again starting from that node. The following actions are performed, in order at each node we visit, until one sends us to another node, or terminates the process. * Consult filters attached to the class. If sent to a leafnode, we are done. Otherwise, restart. * If none of the above returned with an instruction, enqueue at this node. This algorithm makes sure that a packet always ends up somewhere, even while you are busy building your configuration. 7.2. HFSC, Hierarchical Fair Service Curve The HFSC classful qdisc balances delay-sensitive traffic against throughput sensitive traffic. In a congested or backlogged state, the HFSC queuing discipline interleaves the delay-sensitive traffic when required according service curve definitions. Read about the Linux implementation in German, HFSC Scheduling_mit_Linux or read a translation into English, HFSC_Scheduling_with Linux. The original research article, A_Hierarchical_Fair_Service_Curve Algorithm_For_Link-Sharing,_Real-Time_and_Priority_Services, also remains available. This section will be completed at a later date. 7.3. PRIO, priority scheduler The PRIO classful qdisc works on a very simple precept. When it is ready to dequeue a packet, the first class is checked for a packet. If there's a packet, it gets dequeued. If there's no packet, then the next class is checked, until the queuing mechanism has no more classes to check. PRIO is a scheduler and never delays packets - it is a work-conserving qdisc, though the qdiscs contained in the classes may not be 7.3.1. Algorithm On creation with tc qdisc add, a fixed number of bands is created. Each band is a class, although is not possible to add classes with tc class add. The number of bands to be created is fixed at the creation of the qdisc itself. When dequeueing packets, band 0 is always checked first. If it has no packet to dequeue, then PRIO will try band 1, and so onwards. Maximum reliability packets should therefore go to band 0, minimum delay to band 1 and the rest to band 2. As the PRIO qdisc itself will have minor number 0, band 0 is actually major:1, band 1 is major:2, etc. For major, substitute the major number assigned to the qdisc on 'tc qdisc add' with the handle parameter. 7.3.2. Synopsis $ tc qdisc ... dev dev ( parent classid | root) [ handle major: ] prio [bands bands ] [ priomap band band band... ] [ estimator interval time‐constant ] 7.3.3. Classification Three methods are available to determine the target band in which a packet will be enqueued. * From userspace, a process with sufficient privileges can encode the destination class directly with SO_PRIORITY. * Programmatically, a tc filter attached to the root qdisc can point any traffic directly to a class. * And, typically, with reference to the priomap, a packet's priority, is derived from the Type of Service (ToS) assigned to the packet. Only the priomap is specific to this qdisc. 7.3.4. Configurable parameters bands total number of distinct bands. If changed from the default of 3, priomap must be updated as well. priomap a tc filter attached to the root qdisc can point traffic directly to a class The priomap specifies how this qdisc determines how a packet maps to a specific band. Mapping occurs based on the value of the ToS octet of a packet. 0 1 2 3 4 5 6 7 +-----+-----+-----+-----+-----+-----+-----+-----+ | PRECEDENCE | ToS | MBZ | RFC 791 +-----+-----+-----+-----+-----+-----+-----+-----+ 0 1 2 3 4 5 6 7 +-----+-----+-----+-----+-----+-----+-----+-----+ | DiffServ Code Point (DSCP) | (unused) | RFC 2474 +-----+-----+-----+-----+-----+-----+-----+-----+ The four ToS bits from the (the 'ToS field') are defined slightly differently in RFC 791 and RFC 2474. The later RFC supersedes the definitions of the former, but not all software, systems and terminology have caught up to that change. So, often packet analysis programs will still refer to Type of Service (ToS) instead of DiffServ Code Point (DSCP). Table 3. RFC_791 interpretation of IP ToS header ___________________________________________ |Binary|Decimal|Meaning_____________________| |1000__|8______|Minimize_delay_(md)_________| |0100__|4______|Maximize_throughput_(mt)____| |0010__|2______|Maximize_reliability_(mr)___| |0001__|1______|Minimize_monetary_cost_(mmc)| |0000__|0______|Normal_Service______________| As there is 1 bit to the right of these four bits, the actual value of the ToS field is double the value of the ToS bits. Running tcpdump -v -v shows you the value of the entire ToS field, not just the four bits. It is the value you see in the first column of this table: Table 4. Mapping ToS value to priomap band ___________________________________________________________________ |ToS_Field|ToS_Bits|Meaning_____________________|Linux_Priority|Band| |0x0______|0_______|Normal_Service______________|0_Best_Effort_|1___| |0x2______|1_______|Minimize_Monetary_Cost_(mmc)|1_Filler______|2___| |0x4______|2_______|Maximize_Reliability_(mr)___|0_Best_Effort_|1___| |0x6______|3_______|mmc+mr______________________|0_Best_Effort_|1___| |0x8______|4_______|Maximize_Throughput_(mt)____|2_Bulk________|2___| |0xa______|5_______|mmc+mt______________________|2_Bulk________|2___| |0xc______|6_______|mr+mt_______________________|2_Bulk________|2___| |0xe______|7_______|mmc+mr+mt___________________|2_Bulk________|2___| |0x10_____|8_______|Minimize_Delay_(md)_________|6_Interactive_|0___| |0x12_____|9_______|mmc+md______________________|6_Interactive_|0___| |0x14_____|10______|mr+md_______________________|6_Interactive_|0___| |0x16_____|11______|mmc+mr+md___________________|6_Interactive_|0___| |0x18_____|12______|mt+md_______________________|4_Int._Bulk___|1___| |0x1a_____|13______|mmc+mt+md___________________|4_Int._Bulk___|1___| |0x1c_____|14______|mr+mt+md____________________|4_Int._Bulk___|1___| |0x1e_____|15______|mmc+mr+mt+md________________|4_Int._Bulk___|1___| The second column contains the value of the relevant four ToS bits, followed by their translated meaning. For example, 15 stands for a packet wanting Minimal Monetary Cost, Maximum Reliability, Maximum Throughput AND Minimum Delay. The fourth column lists the way the Linux kernel interprets the ToS bits, by showing to which Priority they are mapped. The last column shows the result of the default priomap. On the command line, the default priomap looks like this: 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1 This means that priority 4, for example, gets mapped to band number 1. The priomap also allows you to list higher priorities (> 7) which do not correspond to ToS mappings, but which are set by other means. 7.3.5. Classes PRIO classes cannot be configured further - they are automatically created when the PRIO qdisc is attached. Each class however can contain yet a further qdisc. 7.3.6. Bugs Large amounts of traffic in the lower bands can cause starvation of higher bands. Can be prevented by attaching a shaper to these bands to make sure they cannot dominate the link. 7.4. CBQ, Class Based Queuing (CBQ) Class Based Queuing (CBQ) is the classic implementation (also called venerable) of a traffic control system. CBQ is a classful qdisc that implements a rich link sharing hierarchy of classes. It contains shaping elements as well as prioritizing capabilities. Shaping is performed by calculating link idle time based on the timing of dequeue events and knowledge of the underlying link layer bandwidth. 7.4.1. Shaping Algorithm Shaping is done using link idle time calculations, and actions taken if these calculations deviate from set limits. When shaping a 10mbit/s connection to 1mbit/s, the link will be idle 90% of the time. If it isn't, it needs to be throttled so that it is idle 90% of the time. From the kernel's perspective, this is hard to measure, so CBQ instead computes idle time from the number of microseconds that elapse between requests from the device driver for more data. Combined with the knowledge of packet sizes, this is used to approximate how full or empty the link is. This is rather circumspect and doesn't always arrive at proper results. The physical link bandwidth may be ill defined in case of not-quite-real network devices like PPP over Ethernet or PPTP over TCP/IP. The effective bandwidth in that case is probably determined by the efficiency of pipes to userspace - which not defined. During operations, the effective idletime is measured using an exponential weighted moving average (EWMA). This calculation of activity against idleness values recent packets exponentially more than predecessors. The EWMA is an effective calculation to deal with the problem that a system is either active or inactive. For example, the Unix system load average is calculated in the same way. The calculated idle time is subtracted from the EWMA measured one, the resulting number is called 'avgidle'. A perfectly loaded link has an avgidle of zero: packets arrive exactly at the calculated interval. An overloaded link has a negative avgidle and if it gets too negative, CBQ throttles and is then 'overlimit'. Conversely, an idle link might amass a huge avgidle, which would then allow infinite bandwidths after a few hours of silence. To prevent this, avgidle is capped at maxidle. If overlimit, in theory, the CBQ could throttle itself for exactly the amount of time that was calculated to pass between packets, and then pass one packet, and throttle again. Due to timer resolution constraints, this may not be feasible, see the minburst parameter below. 7.4.2. Classification Under one installed CBQ qdisc many classes may exist. Each of these classes contains another qdisc, by default tc-pfifo. When enqueueing a packet, CBQ starts at the root and uses various methods to determine which class should receive the data. If a verdict is reached, this process is repeated for the recipient class which might have further means of classifying traffic to its children, if any. CBQ has the following methods available to classify a packet to any child classes. * skb>priority class encoding. Can be set from userspace by an application with the SO_PRIORITY setsockopt. The skb->priority class encoding only applies if the skb->priority holds a major:minor handle of an existing class within this qdisc. * tc filters attached to the class. * The defmap of a class, as set with the split and defmap parameters. The defmap may contain instructions for each possible Linux packet priority. Each class also has a level. Leaf nodes, attached to the bottom of theclass hierarchy, have a level of 0. 7.4.3. Classification Algorithm Classification is a loop, which terminates when a leaf class is found. At any point the loop may jump to the fallback algorithm. The loop consists of the following steps: * If the packet is generated locally and has a valid classid encoded within its skb->priority, choose it and terminate. * Consult the tc filters, if any, attached to this child. If these return a class which is not a leaf class, restart loop from he class returned. If it is a leaf, choose it and terminate. * If the tc filters did not return a class, but did return a classid, try to find a class with that id within this qdisc. Checkif the found class is of a lower level than the current class. If so, and the returned class is not a leaf node, restart the loop at the found class. If it is a leaf node, terminate. If we found an upward reference to a higher level, enter the fallback algorithm. * If the tc filters did not return a class, nor a valid reference to one, consider the minor number of the reference to be the priority. Retrieve a class from the defmap of this class for the priority. If this did not contain a class, consult the defmap of this class for the BEST_EFFORT class. If this is an upward reference, or no BEST_EFFORT class was defined, enter the fallback algorithm. If a valid class was found, and it is not a leaf node, restart the loop at this class. If it is a leaf, choose it and terminate. If neither the priority distilled from the classid, nor the BEST_EFFORT priority yielded a class, enter the fallback algorithm. The fallback algorithm resides outside of the loop and is as follows. * Consult the defmap of the class at which the jump to fallback occured. If the defmap contains a class for the priority of the class (which is related to the ToS field), choose this class and terminate. * Consult the map for a class for the BEST_EFFORT priority. If found, choose it, and terminate. * Choose the class at which break out to the fallback algorithm occurred. Terminate. The packet is enqueued to the class which was chosen when either algorithm terminated. It is therefore possible for a packet to be enqueued not at a leaf node, but in the middle of the hierarchy. 7.4.4. Link Sharing Algorithm When dequeuing for sending to the network device, CBQ decides which of its classes will be allowed to send. It does so with a Weighted Round Robin process in which each class with packets gets a chance to send in turn. The WRR process starts by asking the highest priority classes (lowest numerically - highest semantically) for packets, and will continue to do so until they have no more data to offer, in which case the process repeats for lower priorities. Each class is not allowed to send at length though, they can only dequeue a configurable amount of data during each round. If a class is about to go overlimit, and it is not bounded it will try to borrow avgidle from siblings that are not isolated. This process is repeated from the bottom upwards. If a class is unable to borrow enough avgidle to send a packet, it is throttled and not asked for a packet for enough time for the avgidle to increase above zero. 7.4.5. Root Parameters The root qdisc of a CBQ class tree has the following parameters: parent root | major:minor this mandatory parameter determines the place of the CBQ instance, either at the root of an interface or within an existing class. handle major: like all other qdiscs, the CBQ can be assigned a handle. Should consist only of a major number, followed by a colon. This parameter is optional. avpkt bytes for calculations, the average packet size must be known. It is silently capped at a minimum of 2/3 of the interface MTU. This parameter is mandatory. bandwidth rate underlying available bandwidth; to determine the idle time, CBQ must know the bandwidth of your A) desired target bandwidth, B) underlying physical interface or C) parent qdisc. This is a vital parameter, more about it later. This parameter is mandatory. cell size the cell size determines the granularity of packet transmission time calculations. Must be an integral power of 2, defaults to 8. mpu bytes a zero sized packet may still take time to transmit. This value is the lower cap for packet transmission time calculations - packets smaller than this value are still deemed to have this size. Defaults to 0. ewma log CBQ calculates idleness using an Exponentially Weighted Moving Average (EWMA) which smooths out measurements easily accommodating short bursts. The log value determines how much smoothing occurs. Lower values imply greater sensitivity. Must be between 0 and 31. Defaults to 5. A CBQ qdisc does not shape out of its own accord. It only needs to know certain parameters about the underlying link. Actual shaping is done in classes. 7.4.6. Class Parameters Classes have a lot of parameters to configure their operation. parent major:minor place of this class within the hierarchy. If attached directly to a qdisc and not to another class, minor can be omitted. This parameter is mandatory. classid major:minor like qdiscs, classes can be named. The major number must be equal to the major number of the qdisc to which it belongs. Optional, but needed if this class is going to have children. weight weightvalue when dequeuing to the lower layer, classes are tried for traffic in a round-robin fashion. Classes with a higher configured qdisc will generally have more traffic to offer during each round, so it makes sense to allow it to dequeue more traffic. All weights under a class are normalized, so only the ratios matter. Defaults to the configured rate, unless the priority of this class is maximal, in which case it is set to 1. allot bytes allot specifies how many bytes a qdisc can dequeue during each round of the process. This parameter is weighted using the renormalized class weight described above. priority priovalue in the round-robin process, classes with the lowest priority field are tried for packets first. This parameter is mandatory. rate bitrate maximum aggregated rate at which this class (children inclusive) can transmit. The bitrate is specified using the tc way of specifying rates (e.g. '1544kbit'). This parameter is mandatory. bandwidth bitrate this is different from the bandwidth specified when creating a parent CBQ qdisc. The CBQ class bandwidth parameter is only used to determine maxidle and offtime, which, in turn, are only calculated when specifying maxburst or minburst. Thus, this parameter is only required if specifying maxburst or minburst. maxburst packetcount, this number of packets is used to calculate maxidle so that when avgidle is at maxidle, this number of average packets can be burst before avgidle drops to 0. Set it higher to be more tolerant of bursts. You can't set maxidle directly, only via this parameter. minburst packetcount as mentioned before, CBQ needs to throttle in case of overlimit. The ideal solution is to do so for exactly the calculated idle time, and pass 1 packet. However, Unix kernels generally have a hard time scheduling events shorter than 10ms, so it is better to throttle for a longer period, and then pass minburst packets in one go, and then sleep minburst times longer. The time to wait is called the offtime. Higher values of minburst lead to more accurate shaping in the long term, but to bigger bursts at millisecond timescales. minidle microseconds, minidle: if avgidle is below 0, we are overlimits and need to wait until avgidle will be big enough to send one packet. To prevent a sudden burst from shutting down the link for a prolonged period of time, avgidle is reset to minidle if it gets too low. Minidle is specified in negative microseconds, so 10 means that avgidle is capped at -10us. bounded | borrow, identifies a borrowing policy. Either the class will try to borrow bandwidth from its siblings or it will consider itself bounded. Mutually exclusive. isolated | sharing identifies a sharing policy. Either the class will engage in a sharing policy toward its siblings or it will consider itself isolated. Mutually exclusive. split major:minor and defmap bitmap[/bitmap]: if consulting filters attached to a class did not give a verdict, CBQ can also classify based on the packet's priority. There are 16 priorities available, numbered from 0 to 15. The defmap specifies which priorities this class wants to receive, specified as a bitmap. The Least Significant Bit corresponds to priority zero. The split parameter tells CBQ at which class the decision must be made, which should be a (grand)parent of the class you are adding. As an example, 'tc class add ... classid 10:1 cbq .. split 10:0 defmap c0' configures class 10:0 to send packets with priorities 6 and 7 to 10: 1. The complimentary configuration would then be: 'tc class add ... classid 10:2 cbq ... split 10:0 defmap 3f' Which would send all packets 0, 1, 2, 3, 4 and 5 to 10:1. estimator interval timeconstant: CBQ can measure how much bandwidth each class is using, which tc filters can use to classify packets with. In order to determine the bandwidth it uses a very simple estimator that measures once every interval microseconds how much traffic has passed. This again is a EWMA, for which the time constant can be specified, also in microseconds. The time constant corresponds to the sluggishness of the measurement or, conversely, to the sensitivity of the average to short bursts. Higher values mean less sensitivity. 7.5. WRR, Weighted Round Robin This qdisc is not included in the standard kernels. The WRR qdisc distributes bandwidth between its classes using the weighted round robin scheme. That is, like the CBQ qdisc it contains classes into which arbitrary qdiscs can be plugged. All classes which have sufficient demand will get bandwidth proportional to the weights associated with the classes. The weights can be set manually using the tc program. But they can also be made automatically decreasing for classes transferring much data. The qdisc has a built-in classifier which assigns packets coming from or sent to different machines to different classes. Either the MAC or IP and either source or destination addresses can be used. The MAC address can only be used when the Linux box is acting as an ethernet bridge, however. The classes are automatically assigned to machines based on the packets seen. The qdisc can be very useful at sites where a lot of unrelated individuals share an Internet connection. A set of scripts setting up a relevant behavior for such a site is a central part of the WRR distribution. 8. Rules, Guidelines and Approaches 8.1. General Rules of Linux Traffic Control There are a few general rules which ease the study of Linux traffic control. Traffic control structures under Linux are the same whether the initial configuration has been done with tcng or with tc. * Any router performing a shaping function should be the bottleneck on the link, and should be shaping slightly below the maximum available link bandwidth. This prevents queues from forming in other routers, affording maximum control of packet latency/deferral to the shaping device. * A device can only shape traffic it transmits [10]. Because the traffic has already been received on an input interface, the traffic cannot be shaped. A traditional solution to this problem is an ingress policer. * Every interface must have a qdisc. The default qdisc (the pfifo_fast qdisc) is used when another qdisc is not explicitly attached to the interface. * One of the classful_qdiscs added to an interface with no children classes typically only consumes CPU for no benefit. * Any newly created class contains a FIFO. This qdisc can be replaced explicitly with any other qdisc. The FIFO qdisc will be removed implicitly if a child class is attached to this class. * Classes directly attached to the root qdisc can be used to simulate virtual circuits. * A filter can be attached to classes or one of the classful_qdiscs. 8.2. Handling a link with a known bandwidth HTB is an ideal qdisc to use on a link with a known bandwidth, because the innermost (root-most) class can be set to the maximum bandwidth available on a given link. Flows can be further subdivided into children classes, allowing either guaranteed bandwidth to particular classes of traffic or allowing preference to specific kinds of traffic. 8.3. Handling a link with a variable (or unknown) bandwidth In theory, the PRIO scheduler is an ideal match for links with variable bandwidth, because it is a work-conserving qdisc (which means that it provides no shaping). In the case of a link with an unknown or fluctuating bandwidth, the PRIO scheduler simply prefers to dequeue any available packet in the highest priority band first, then falling to the lower priority queues. 8.4. Sharing/splitting bandwidth based on flows Of the many types of contention for network bandwidth, this is one of the easier types of contention to address in general. By using the SFQ qdisc, traffic in a particular queue can be separated into flows, each of which will be serviced fairly (inside that queue). Well-behaved applications (and users) will find that using SFQ and ESFQ are sufficient for most sharing needs. The Achilles heel of these fair queuing algorithms is a misbehaving user or application which opens many connections simultaneously (e.g., eMule, eDonkey, Kazaa). By creating a large number of individual flows, the application can dominate slots in the fair queuing algorithm. Restated, the fair queuing algorithm has no idea that a single application is generating the majority of the flows, and cannot penalize the user. Other methods are called for. 8.5. Sharing/splitting bandwidth based on IP For many administrators this is the ideal method of dividing bandwidth amongst their users. Unfortunately, there is no easy solution, and it becomes increasingly complex with the number of machine sharing a network link. To divide bandwidth equitably between N IP addresses, there must be N classes. 9. Scripts for use with QoS/Traffic Control 9.1. wondershaper More to come, see wondershaper. 9.2. ADSL Bandwidth HOWTO script (myshaper) More to come, see myshaper. 9.3. htb.init More to come, see htb.init. 9.4. tcng.init More to come, see tcng.init. 9.5. cbq.init More to come, see cbq.init. 10. Diagram 10.1. General diagram Below is a general diagram of the relationships of the components of a classful queuing discipline (HTB pictured). A larger version of the diagram is available. Example 12. An example HTB tcng configuration /* * * possible mock up of diagram shown at * http://linux-ip.net/traffic-control/htb-class.png * */ $m_web = trTCM ( cir 512 kbps, /* commited information rate */ cbs 10 kB, /* burst for CIR */ pir 1024 kbps, /* peak information rate */ pbs 10 kB /* burst for PIR */ ) ; dev eth0 { egress { class ( <$web> ) if tcp_dport == PORT_HTTP && __trTCM_green ( $m_web ); class ( <$bulk> ) if tcp_dport == PORT_HTTP && __trTCM_yellow ( $m_web ); drop if __trTCM_red( $m_web ); class ( <$bulk> ) if tcp_dport == PORT_SSH ; htb () { /* root qdisc */ class ( rate 1544kbps, ceil 1544kbps ) { /* root class */ $web = class ( rate 512kbps, ceil 512kbps ) { sfq ; } ; $bulk = class ( rate 512kbps, ceil 1544kbps ) { sfq ; } ; } } } } 11. Annotated Traffic Control Links This section identifies a number of links to documentation about traffic control and Linux traffic control software. Each link will be listed with a brief description of the content at that site. * HTB_site, HTB_user_guide and HTB_theory (Martin “devik” Devera) Hierarchical Token Bucket, HTB, is a classful queuing discipline. Widely used and supported it is also fairly well documented in the user guide and at Stef Coene's_site (see below). * General_Quality_of_Service_docs (Leonardo Balliache) There is a good deal of understandable and introductory documentation on his site, and in particular has some excellent overview material. See in particular, the detailed Linux_QoS document among others. * tcng_(Traffic_Control_Next_Generation) and tcng_manual (Werner Almesberger) The tcng software includes a language and a set of tools for creating and testing traffic control structures. In addition to generating tc commands as output, it is also capable of providing output for non-Linux applications. A key piece of the tcng suite which is ignored in this documentation is the tcsim traffic control simulator. The user manual provided with the tcng software has been converted to HTML with latex2html. The distribution comes with the TeX documentation. * iproute2 and iproute2_manual (Alexey Kuznetsov) This is a the source code for the iproute2 suite, which includes the essential tc binary. Note, that as of iproute2-2.4.7-now-ss020116-try.tar.gz, the package did not support HTB, so a patch available from the HTB site will be required. The manual documents the entire suite of tools, although the tc utility is not adequately documented here. The ambitious reader is recommended to the LARTC HOWTO after consuming this introduction. * Documentation,_graphs,_scripts_and_guidelines_to_traffic_control_under_Linux (Stef Coene) Stef Coene has been gathering statistics and test results, scripts and tips for the use of QoS under Linux. There are some particularly useful graphs and guidelines available for implementing traffic control at Stef's site. * LARTC_HOWTO (bert hubert, et. al.) The Linux Advanced Routing and Traffic Control HOWTO is one of the key sources of data about the sophisticated techniques which are available for use under Linux. The Traffic Control Introduction HOWTO should provide the reader with enough background in the language and concepts of traffic control. The LARTC HOWTO is the next place the reader should look for general traffic control information. * Guide_to_IP_Networking_with_Linux (Martin A. Brown) Not directly related to traffic control, this site includes articles and general documentation on the behaviour of the Linux IP layer. * Werner_Almesberger's_Papers Werner Almesberger is one of the main developers and champions of traffic control under Linux (he's also the author of tcng, above). One of the key documents describing the entire traffic control architecture of the Linux kernel is his Linux Traffic Control - Implementation Overview which is available in PDF or PS format. * Linux_DiffServ_project Mercilessly snipped from the main page of the DiffServ site... Differentiated Services (short: Diffserv) is an architecture for providing different types or levels of service for network traffic. One key characteristic of Diffserv is that flows are aggregated in the network, so that core routers only need to distinguish a comparably small number of aggregated flows, even if those flows contain thousands or millions of individual flows. ------------------------------------------------------------------------------- [1] See Section 5,_“Software_and_Tools” for more details on the use or installation of a particular traffic control mechanism, kernel or command line utility. [2] This queueing model has long been used in civilized countries to distribute scant food or provisions equitably. William Faulkner is reputed to have walked to the front of the line for to fetch his share of ice, proving that not everybody likes the FIFO model, and providing us a model for considering priority queuing. [3] Similarly, the entire traffic control system appears as a queue or scheduler to the higher layer which is enqueuing packets into this layer. [4] This smoothing effect is not always desirable, hence the HTB parameters burst and cburst. [5] A classful qdisc can only have children classes of its type. For example, an HTB qdisc can only have HTB classes as children. A CBQ qdisc cannot have HTB classes as children. [6] In this case, you'll have a filter which uses a classifier to select the packets you wish to drop. Then you'll use a policer with a with a drop action like this police rate 1bps burst 1 action drop/drop. [7] I do not know the range nor base of these numbers. I believe they are u32 hexadecimal, but need to confirm this. [8] The options listed in this example are taken from a 2.4.20 kernel source tree. The exact options may differ slightly from kernel release to kernel release depending on patches and new schedulers and classifiers. [9] HTB will report bandwidth usage in this scenario incorrectly. It will calculate the bandwidth used by quantum instead of the real dequeued packet size. This can skew results quickly. [10] In fact, the Intermediate_Queuing_Device_(IMQ) simulates an output device onto which traffic control structures can be attached. This clever solution allows a networking device to shape ingress traffic in the same fashion as egress traffic. Despite the apparent contradiction of the rule, IMQ appears as a device to the kernel. Thus, there has been no violation of the rule, but rather a sneaky reinterpretation of that rule.