
33.3. Major Cache Operations
The protocol-independent (DST) part of the cache is a set of dst_entry data structures. Most of the activities in this chapter happen through a
dst_entry structure. The IPv4 and IPv6 data structures rtable and rt6_info both include a dst_entry data structure.
The dst_entry structure offers a set of virtual functions in a field named dst_ops, which allows higher-layer protocols to run protocol-specific
functions that manipulate the entries. The DST code is located in net/core/dst.c and include/net/dst.h.
All the routines that manipulate dst_entry structures start with a dst_ prefix. Note that even though they operate on dst_entry structures, they
actually affect the outer rtable structures, too.
DST is initialized with dst_init, invoked at boot time by net_dev_init (see Chapter 5).
33.3.1. Cache Locking

Read-only operations, such as lookups, use a different locking mechanism from read-write operations such as insertion and deletion, but
they naturally have to cooperate. Here is how they are handled:
Read-only operations
These use the routines presented in the section "Cache Lookup" and are protected by a read-copy-update (RCU) read lock, as
in the following snapshot:
rcu_read_lock();

perform lookup

rcu_read_unlock();
This code actually does no locking, because read operations can proceed simultaneously without interfering with each other.
Read-write operations
The insertion of an entry (see the section "Adding Elements to the Cache") and the deletion of an entry (see the section "Deleting
DST Entries") use the spin lock embedded in each bucket's element and shown in Figure 33-1. Note that the provision of a
per-bucket lock lets different processors write simultaneously to different buckets.
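A minimal sketch of the write-side pattern just described; the bucket lock and chain fields follow the rt_hash_table layout used by rt_intern_hash, but the snippet is purely illustrative:
spin_lock_bh(&rt_hash_table[hash].lock);    /* serialize writers per bucket  */
/* ... unlink or insert rtable entries on rt_hash_table[hash].chain ...      */
spin_unlock_bh(&rt_hash_table[hash].lock);  /* other buckets remain writable */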
Chapter 1 explains the RCU algorithm used to implement locking in the routing table cache, and how read-write spin locks coexist with RCU.
33.3.2. Cache Entry Allocation and Reference Counts
A memory pool used to allocate new cache entries is created by ip_rt_init at boot time. Cache entries are allocated with dst_alloc, which returns a
void pointer that is cast by the creator to the right data type. Despite the function's name, it does not allocate dst_entry structures, but instead


allocates the larger entries that contain those structures: rtable structures for IPv4 (as shown in Figure 33-1), rt6_info for IPv6, and so on.
Because the function can be called to allocate structures of different sizes for different protocols, the size of the structure to allocate is
indicated by the entry_size field of the dst_ops VFT, described in the section "Interface Between the DST and Calling Protocols."
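As a hedged illustration (not a verbatim copy of route.c), this is roughly how IPv4 obtains a new cache entry; no cast is needed because dst_alloc returns a void pointer, and the allocation size comes from ipv4_dst_ops.entry_size:
/* Illustrative sketch: allocate an rtable-sized entry through the DST layer. */
struct rtable *rth = dst_alloc(&ipv4_dst_ops);
if (rth == NULL)
        goto no_route;                  /* hypothetical out-of-memory label */
rth->u.dst.flags = DST_HOST;            /* typical initialization taken from
                                           the routing table lookup results */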
33.3.3. Adding Elements to the Cache

Every time a cache lookup required to route an ingress or egress packet fails, the kernel consults the routing table and stores the result
into the routing cache. The kernel allocates a new cache entry with dst_alloc, initializes some of its fields based on the results from the routing
table, and finally calls rt_intern_hash to insert the new entry into the cache at the head of the bucket's list. A new route is also added to the
cache upon receipt of an ICMP REDIRECT message (see Chapter 25). Figures 33-2(a) and 33-2(b) show the logic of rt_intern_hash. When
the kernel is compiled with support for multipath caching, a cache miss may lead to the insertion of multiple routes into the cache, as
discussed in the section "Multipath Caching."
The function first checks whether the new route already exists by issuing a simple cache lookup. Even though the function was called
because a cache lookup failed, the route could have been added in the meantime by another CPU. If the lookup succeeds, the existing
cached route is simply moved to the head of the bucket's list. (This assumes the route is not associated with a multipath route; i.e., that its
DST_BALANCED flag is not set.) If the lookup fails, the new route is added to the cache.
As a simple way to keep the size of the cache under control, rt_intern_hash tries to remove an entry every time it adds a new one. Thus, while
browsing the bucket's list, rt_intern_hash keeps track of the most eligible route for deletion and measures the length of the bucket's list. A route
is removed only if it is eligible for deletion (that is, its reference count is 0) and only when the bucket's list is longer
than the configurable parameter ip_rt_gc_elasticity. If these conditions are met, rt_intern_hash invokes the rt_score routine to choose the best route to
remove. rt_score ranks routes, according to many criteria, into three classes, ranging from most-valuable routes (least eligible to be removed)
to least-valuable routes (most eligible to be removed)[*]:
[*] See the section "Examples of eligible cache victims" in Chapter 30.
Figure 33-2a. rt_intern_hash function
Routes that were inserted via ICMP redirects, are being monitored by user-space commands, or are scheduled for expiration.
Output routes (the ones used to route locally generated packets), broadcast routes, multicast routes, and routes to local
addresses (for packets generated by this host for itself).

All other routes in decreasing order of timestamp of last use: that is, least recently used routes are removed first.
rt_score simply stores the time the entry has not been used in the lower 30 bits of a local 32-bit variable, then sets the 31st bit for the first class
of routes and the 32nd bit for the second class of routes. The final value is a score that represents how important that route is considered
to be: the lower the score, the more likely the route is to be selected as a victim by rt_intern_hash.
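The following self-contained sketch captures the ranking idea; it is not the kernel's rt_score. Storing the idle time complemented is an assumption borrowed from the actual implementation, so that "lower score" consistently means "better eviction candidate":
#include <stdint.h>

/* Sketch only: 30 bits of (complemented) idle time plus one flag bit per
 * protected class; rt_intern_hash evicts the eligible entry with the
 * lowest score. */
static uint32_t route_score(uint32_t idle_jiffies, int first_class, int second_class)
{
        uint32_t score = ~idle_jiffies & ~(3U << 30);   /* older => lower score */

        if (first_class)            /* ICMP redirect/monitored/expiring routes */
                score |= 1U << 30;
        if (second_class)           /* output/broadcast/multicast/local routes */
                score |= 1U << 31;
        return score;
}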
Figure 33-2b. rt_intern_hash function
33.3.4. Binding the Route Cache to the ARP Cache
Most routing cache entries are bound to the ARP cache entry of the route's next hop. This means that a routing cache entry requires either
an existing ARP cache entry or a successful ARP lookup for the same next hop. In particular, the binding is done for output routes used to
route locally generated packets (identified by a NULL ingress device identifier) and for unicast forwarding routes. In both cases, ARP is
asked to resolve the next hop's L2 address. Forwarding to broadcast addresses, multicast addresses, and local host addresses does not
require an ARP resolution because the addresses are resolved using other means.
Egress routes that lead to broadcast and multicast addresses do not need associated ARP entries, because the associated L2 addresses
can be derived from the L3 addresses (see the section "Special Cases" in Chapter 26). Routes that lead to local addresses do not need
ARP either, because packets matching the route are delivered locally.
ARP binding for routes is created by arp_bind_neighbour. When that function fails due to lack of memory, rt_intern_hash forces an aggressive garbage
collection operation on the routing cache by calling rt_garbage_collect (see the section "Garbage Collection"). The aggressive garbage collection
is done by lowering the thresholds ip_rt_gc_elasticity and ip_rt_gc_min_interval and then calling rt_garbage_collect. The garbage collection is tried only once,
and only when rt_intern_hash has not been called from software interrupt context, because otherwise, it would be too costly in CPU time. Once
garbage collection has completed, the insertion of the new cache entries starts over from the cache lookup step.
33.3.5. Cache Lookup

Anytime there is a need to find a route, the kernel consults the routing cache first and falls back to the routing table if there is a cache

miss. The routing table lookup process is described in Chapter 35; in this section, we will look at the cache lookup.
The routing subsystem provides two different functions to do route lookups, one for ingress and one for egress:
ip_route_input
Used for input traffic, which could be either delivered locally or forwarded. The function determines how to handle generic
packets (whether to deliver locally, forward, drop, etc.) but is also used by other subsystems to decide how to handle their
ingress traffic. For instance, ARP uses this function to see whether an ARPOP_REQUEST should be answered (see Chapter 28).
ip_route_output_key
Used for output traffic, which is generated locally and could be either delivered locally or transmitted out.
Possible return values from the two routines include:
0
The routing lookup was successful. This case includes a cache miss that triggers a successful routing table lookup.
-ENOBUFS
The lookup failed due to a memory problem.
-ENODEV
The lookup key included a device identifier and it was invalid.
-EINVAL
Generic lookup failure.
The kernel also provides a set of wrappers around the two basic functions, used under specific conditions. See, for example, how TCP
uses ip_route_connect and ip_route_newports.
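To make the calling convention concrete, here is a hedged sketch of a typical egress lookup (error handling trimmed; the flowi fields shown are the same ones the cache lookup compares):
struct flowi fl = {
        .oif  = 0,                              /* egress device not known yet */
        .nl_u = { .ip4_u = { .daddr = daddr,
                             .saddr = saddr,
                             .tos   = RT_TOS(tos) } },
};
struct rtable *rt;
int err = ip_route_output_key(&rt, &fl);

if (err)
        return err;                             /* e.g., -ENOBUFS or -EINVAL */
/* rt now points to the matching (or newly created) cache entry. */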
Figure 33-3 shows the internals of two main routing cache lookup routines. The egress function shown in the figure is __ip_route_output_key, which
is indirectly called by ip_route_output_key.
Figure 33-3. (a) ip_route_input_key function; (b) __ip_route_output_key function

The routing cache is used to store both ingress and egress routes, so a cache lookup is tried in both cases. In case of a cache miss, the
functions call ip_route_input_slow or ip_route_output_slow, which consult the routing tables via the fib_lookup routine that we will cover in Chapter 35. The
names of the functions end in _slow to underline the difference in speed between a lookup that is satisfied from the cache and one that
requires a query of the routing tables. The two paths are also referred to as the fast and slow paths.
Once the routing decision has been taken, through either a cache hit or a routing table lookup, and whether it results in success or failure, the

lookup routines return with the skb->dst->input and skb->dst->output virtual functions initialized. skb->dst is the cache entry that
satisfied the routing request; in case of a cache miss, a new cache entry is created and linked to skb->dst.
The packet will then be further processed by calling either one or both of the virtual functions skb->dst->input (called via a simple wrapper
named dst_input) and skb->dst->output (called via a wrapper named dst_output). Figure 18-1 in Chapter 18 shows where those two virtual functions are
invoked in the IP stack, and what routines they can be initialized to depending on the direction of the traffic.
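Stripped of the special handling of certain return codes, the two wrappers in include/net/dst.h boil down to the following:
/* Simplified: the real wrappers also loop on special return values. */
static inline int dst_input(struct sk_buff *skb)
{
        return skb->dst->input(skb);    /* e.g., ip_local_deliver or ip_forward */
}

static inline int dst_output(struct sk_buff *skb)
{
        return skb->dst->output(skb);   /* e.g., ip_output for egress traffic   */
}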
Chapter 35 goes into detail on the slow routines for the routing table lookups. The next two sections describe the internals of the two
cache lookup routines in Figure 33-3. Their code is very similar; the only differences are:
On ingress, the device of the ingress route needs to match the ingress device, whereas the egress device is not yet known and
is therefore simply compared against the null device (0). The opposite applies to egress routes.
In case of a cache hit, the functions update the in_hit and out_hit counters, respectively, using the RT_CACHE_STAT_INC macro. Statistics
related to both the routing cache and the routing tables are described in Chapter 36.
Egress lookups need to take the RTO_ONLINK flag into account (see the section "Egress lookup").
Egress lookups support multipath caching, the feature introduced in the section "Cache Support for Multipath" in Chapter 31.
33.3.5.1. Ingress lookup

ip_route_input is used to route ingress packets. Here is its prototype and the meaning of its input parameters:
int ip_route_input(struct sk_buff *skb, u32 daddr, u32 saddr,
u8 tos, struct net_device *dev)
skb
Packet that triggered the route lookup. This packet does not necessarily have to be routed itself. For example, ARP uses
ip_route_input to consult the local routing table for other reasons. In this case, skb would be an ingress ARP request.
saddr
daddr
Source and destination addresses to use for the lookup.
tos
The TOS field of the IP header.
dev
Device the packet was received from.

ip_route_input selects the bucket of the hash table that should contain the route, based on the input criteria. It then browses the list of routes in
that bucket one by one, comparing all the necessary fields until it either finds a match or gets to the end without a match.
The lookup fields passed as input to ip_route_input are compared to the fields stored in the fl field[*] of the routing cache entry's rtable, as shown in
the following code extract. The bucket (hash variable) is chosen through a combination of input parameters. The route itself is represented
by the rth variable.
[*] See the description of the flowi structure in the section "Main Data Structures" in Chapter 32.
hash = rt_hash_code(daddr, saddr ^ (iif << 5), tos);

rcu_read_lock();
for (rth = rcu_dereference(rt_hash_table[hash].chain); rth;
     rth = rcu_dereference(rth->u.rt_next)) {
        if (rth->fl.fl4_dst == daddr &&
            rth->fl.fl4_src == saddr &&
            rth->fl.iif == iif &&
            rth->fl.oif == 0 &&
#ifdef CONFIG_IP_ROUTE_FWMARK
            rth->fl.fl4_fwmark == skb->nfmark &&
#endif
            rth->fl.fl4_tos == tos) {
                rth->u.dst.lastuse = jiffies;
                dst_hold(&rth->u.dst);
                rth->u.dst.__use++;
                RT_CACHE_STAT_INC(in_hit);
                rcu_read_unlock();
                skb->dst = (struct dst_entry*)rth;
                return 0;
        }
        RT_CACHE_STAT_INC(in_hlist_search);
}
rcu_read_unlock();
In the case of a cache miss for a destination address that is multicast, the packet is passed to the multicast handler ip_route_input_mc if one of
the following two conditions is met, and is dropped otherwise:
The destination address is a locally configured multicast address. This is checked with ip_check_mc.
The destination address is not locally configured, but the kernel is compiled with support for multicast routing (CONFIG_IP_MROUTE).
This decision is shown in the following code:
if (MULTICAST(daddr)) {
        struct in_device *in_dev;

        rcu_read_lock();
        if ((in_dev = __in_dev_get(dev)) != NULL) {
                int our = ip_check_mc(in_dev, daddr, saddr,
                                      skb->nh.iph->protocol);
                if (our
#ifdef CONFIG_IP_MROUTE
                    || (!LOCAL_MCAST(daddr) && IN_DEV_MFORWARD(in_dev))
#endif
                   ) {
                        rcu_read_unlock();
                        return ip_route_input_mc(skb, daddr, saddr,
                                                 tos, dev, our);
                }
        }
        rcu_read_unlock();
        return -EINVAL;
}
Finally, in the case of a cache miss for a destination address that is not multicast, ip_route_input calls ip_route_input_slow, which consults the routing
table:
return ip_route_input_slow(skb, daddr, saddr, tos, dev);
}
33.3.5.2. Egress lookup

__ip_route_output_key is used to route locally generated packets and is very similar to ip_route_input: it checks the cache first and relies on
ip_route_output_slow in the case of a cache miss. When the cache supports Multipath, a cache hit requires some more work: more than one
entry in the cache may be eligible for selection and the right one has to be selected based on the caching algorithm in use. The selection
is done with multipath_select_route. More details can be found in the section "Multipath Caching."
Here is its prototype and the meaning of its input parameters:
int __ip_route_output_key(struct rtable **rp, const struct flowi *flp)
rp
When the routine returns success, *rp is initialized to point to the cache entry that matched the search key flp.
flp
Search key.
A successful egress cache lookup needs to match the RTO_ONLINK flag, if it is set:
!((rth->fl.fl4_tos ^ flp->fl4_tos) &
        (IPTOS_RT_MASK | RTO_ONLINK)))
The preceding condition is true when both of the following conditions are met:
The TOS of the routing cache entry matches the one in the search key. Note that the TOS field is saved in bits 2, 3, 4, and 5
of the 8-bit tos variable (as shown in Figure 18-3 in Chapter 18).[*]
[*] The TOS field, as shown in Figure 18-3 in Chapter 18, is an 8-bit field, of which bit 0 is ignored and bits 1
through 7 are used. However, the routing code uses only bits 1, 2, 3, and 4. It does not take the precedence
component (bits 5, 6, 7) into consideration for egress routes. Those bits are masked out with the macro
RT_TOS.
The RTO_ONLINK flag is set on both the routing cache entry and the search key or on neither of them.

You will see the RTO_ONLINK flag in the section "Search Key Initialization" in Chapter 35. The flag is passed via the TOS variable, but it has
nothing to do with the IP header's TOS field; it simply uses an unused bit of the TOS field (see Figure 18-1 in Chapter 18). When the flag is
set, it means the destination is located in a local subnet and there is no need to do a routing lookup (or, in other words, a routing lookup
could fail but that would not be a problem). This is not a flag the administrator sets when configuring routes, but it is used when doing
routing lookups to specify that the route type searched must have scope RT_SCOPE_LINK, which means the destination is directly connected.
The flag is then saved in the associated routing cache entries when they are created. Lookups with the RTO_ONLINK flag set are made, for
example, by the following protocols:
ARP
When an administrator manually configures an ARP mapping, the kernel makes sure that the IP address belongs to one of the
locally configured subnets. For example, the command arp -s 10.0.0.1 11:22:33:44:55:66 adds the mapping of 10.0.0.1 to
11:22:33:44:55:66 to the ARP cache. This command would be rejected by the kernel if, according to its routing table, the IP
address 10.0.0.1 did not belong to one of the locally configured subnets (see arp_req_set and Chapter 26). A sketch of this lookup follows the list.
Raw IP and UDP
When sending data over a socket, the user can set the MSG_DONTROUTE flag. This flag is used when an application is transmitting a
packet out from a known interface to a destination that is directly connected (there is no need for a gateway), so the kernel does
not have to determine the egress device. This kind of transmission is used, for instance, by routing protocols and diagnostic
applications.
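For the ARP case above, the lookup looks roughly like the following sketch, modeled on arp_req_set (error handling trimmed):
/* Verify that 'ip' is directly reachable before accepting the static mapping. */
struct flowi fl = { .nl_u = { .ip4_u = { .daddr = ip,
                                         .tos   = RTO_ONLINK } } };
struct rtable *rt;
struct net_device *dev;
int err;

if ((err = ip_route_output_key(&rt, &fl)) != 0)
        return err;                 /* no RT_SCOPE_LINK route: reject the mapping */
dev = rt->u.dst.dev;                /* the device the static ARP entry will use   */
ip_rt_put(rt);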
33.4. Multipath Caching

The concepts behind this feature are introduced in the section "Cache Support for Multipath" in Chapter 31. When the kernel is compiled
with support for multipath caching, the lookup code adds multiple routes to the cache, as shown in the section "Multipath Caching" in
Chapter 35. In this section, we will examine the key routines used to implement this feature, and the interface provided by caching
algorithms.
33.4.1. Registering a Caching Algorithm


Caching algorithms are defined with an instance of the ip_mp_alg_ops data structure, which consists of function pointers. Depending on the
needs of the caching algorithm, not all function pointers may be initialized, but one is mandatory: mp_alg_select_route.
Algorithms register and unregister with the kernel, respectively, using multipath_alg_register and multipath_alg_unregister. All the algorithms are
implemented as modules in the net/ipv4/ directory.
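A registration modeled on the random algorithm module would look roughly as follows; only the mandatory mp_alg_select_route hook is filled in, and the enumerator and routine names are recalled from the 2.6 sources rather than quoted verbatim:
static struct ip_mp_alg_ops random_ops = {
        .mp_alg_select_route = random_select_route,   /* mandatory hook */
};

static int __init random_init(void)
{
        return multipath_alg_register(&random_ops, IP_MP_ALG_RANDOM);
}

static void __exit random_exit(void)
{
        multipath_alg_unregister(&random_ops, IP_MP_ALG_RANDOM);
}

module_init(random_init);
module_exit(random_exit);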
33.4.2. Interface Between the Routing Cache and Multipath
For each function pointer of the ip_mp_alg_ops data structure, the kernel defines a wrapper in include/net/ip_mp_alg.h. Here is when each one
is called:
multipath_select_route
This is the most important routine. It selects the right route from the ones in the cache that satisfy a given lookup (because they
are associated with the same multipath route). This routine is called by _ _ip_route_output_key, the lookup function we saw earlier.
multipath_flush
Clears any state kept by the algorithm when the cache is flushed. It is called by rt_cache_flush (see the section "Flushing the Routing
Cache").
multipath_set_nhinfo
Updates the state information kept by the algorithm when a new multipath route is cached.
multipath_remove
Removes the right routes in the cache when a multipath route is removed (for example, by rt_free).
None of the algorithms supports multipath_remove, and only the weighted random algorithm uses multipath_flush and multipath_set_nhinfo.
In later sections, we will see what state information the various algorithms need to keep, and how they implement the mp_alg_select_route
routine.
33.4.3. Helper Routines
Here are a couple of routines used by the multipath code:
multipath_comparekeys
Compares two route selectors. It is used mainly by the mp_alg_select_route algorithm's functions to find cached routes that are
associated with the same multipath route as another cached route.
rt_remove_balanced_routes
Given an input cached route, removes it and all the other cached routes on the same hash table's bucket that are associated with

the same multipath route. The last input parameter to rt_remove_balanced_routes returns the number of cached routes removed. The
function's return value is the next rtable instance in the hash bucket's list that follows the input parameter's route. This return value
is used by the caller to resume its scan on the table from the right position. When rt_remove_balanced_routes removes the last rtable
instance of the bucket's list, it returns NULL.
33.4.4. Common Elements Between Algorithms

Keeping the following three points in mind will help you understand the code that deals with multipath caching, and in particular, the
implementation of the mp_alg_select_route routine provided by the caching algorithms:
Entries of the routing cache associated with multipath routes can be recognized thanks to the DST_BALANCED flag, which is set prior
to their insertion into the cache (see the section "dst_entry Structure" in Chapter 36). We will see exactly how and when this is
done in Chapter 35. This flag is often used in the routing cache code to apply different actions, depending on whether a given
entry of the cache is associated with a multipath route.
The dst_entry structure used to define cached routes includes a timestamp of last use (dst->lastuse). Each time a cached route is
returned by a cache lookup, this timestamp is updated for the route. Cache entries associated with multipath routes need to be
handled specially. When the cache entry returned by a lookup is associated with a multipath route, all the other entries of the
cache associated with the same multipath route must have their timestamps updated, too. This is necessary to avoid having
routes purged by the garbage collection algorithm.
The input to the mp_alg_select_route routine is the first cache entry that matches the lookup key. Given how elements are added to the
routing table cache, all the other entries of the cache associated with the same multipath route are located within the same
bucket. For this reason, mp_alg_select_route will browse the bucket list starting from the input cache element and identify the other
routes thanks to the DST_BALANCED flag and the multipath_comparekeys routine.
33.4.5. Random Algorithm
This algorithm does not need to keep any state information, and therefore it does not need any memory to be allocated, nor does it take up
significant CPU time to make its decisions. All the algorithm does is browse the routes of the input table's bucket, count the number of
routes eligible for selection, generate a random number with the local routine random, and select the right cache entry based on that random
number.
The algorithm is defined in net/ipv4/multipath_random.c.
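The selection logic amounts to the following self-contained sketch (a user-space illustration of the idea, not the kernel code):
#include <stdlib.h>

struct entry { struct entry *next; int eligible; };

/* Count the eligible entries in the bucket, draw a random index, then walk
 * the list again and return the entry at that index. */
static struct entry *pick_random(struct entry *bucket)
{
        int count = 0, target;
        struct entry *e;

        for (e = bucket; e; e = e->next)
                if (e->eligible)
                        count++;
        if (count == 0)
                return NULL;

        target = rand() % count;
        for (e = bucket; e; e = e->next)
                if (e->eligible && target-- == 0)
                        return e;
        return NULL;                    /* not reached */
}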
33.4.6. Weighted Random Algorithm


This is the algorithm with the most complicated implementation. Each next hop of a multipath route can be assigned a weight. The
algorithm selects the right next hop (i.e., the right route in the cache) randomly and proportionally to the weights.
For each multipath route's next hop there is an instance of the fib_nh data structure that stores the weight, among other parameters. We will
see in Chapter 34 where those data structures are located in the routing table. In particular, you can refer to Figure 34-1 in that chapter.
The section "Weighted Random Algorithm" in Chapter 31 explains the basic concepts behind this algorithm. To help make a quick decision,
the algorithm builds a local database of information that it uses to access fib_nh instances and to read the weights of the next hops. Figure
33-4 shows what that database would look like after configuration of the following two multipath routes:
#ip route add 10.0.1.0/24 mpath wrandom nexthop via 192.168.1.1 weight 1
nexthop via 192.168.2.1 weight 2
#ip route add 10.0.2.0/24 mpath wrandom nexthop via 192.168.1.1 weight 5
nexthop via 192.168.2.1 weight 1
The database is actually not built right away when the multipath routes are defined: it is populated at lookup time.
Remember that the input to the mp_alg_select_route routine (wrandom_select_route in this case) is the first cached route of the routing cache that
matches the search key. All other eligible cached routes will be in the same routing cache bucket.
Selection of the route by mp_alg_select_route is accomplished in two steps:
1. mp_alg_select_route first browses the routing cache's bucket, and for each route, checks whether it is eligible for selection with the
multipath_comparekeys routine. In the meantime, it creates a local list of eligible cached routes, with the main goal of defining a line like
the one in Figure 31-4 in Chapter 31. Figure 33-5 shows what the list would look like for the example in that chapter. Each route
added to the list gets its weight using the database in Figure 33-4 and initializes the power field accordingly.
Figure 33-4. Next-hop database created by the weighted random algorithm
Figure 33-5b. Example of temporary list created for the next-hop selection
2. mp_alg_select_route generates a random number and, given the list of eligible routes, selects one route using the mechanism described
in the section "Weighted Random Algorithm" in Chapter 31.
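Step 2 is the classic weighted pick over the concatenated weight line; here is a self-contained sketch of the idea (user-space illustration, not the kernel code):
#include <stdlib.h>

struct candidate { int weight; };

/* Draw a point on the line [0, total_weight) and return the index of the
 * segment it falls into: candidates are chosen proportionally to their weight. */
static int pick_weighted(const struct candidate *c, int n)
{
        int total = 0, point, i;

        for (i = 0; i < n; i++)
                total += c[i].weight;
        if (total <= 0)
                return -1;

        point = rand() % total;
        for (i = 0; i < n; i++) {
                point -= c[i].weight;
                if (point < 0)
                        return i;
        }
        return n - 1;                   /* not reached with consistent weights */
}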
Let's see how a lookup on the state database works. Keep in mind that cached routes (that is, rtable instances) contain the next hop
router and the egress device. Given a cached route, __multipath_lookup_weight first selects the right bucket of state based on the egress device (state is
indexed by device). Once a bucket of state has been selected, the list of multipath_route elements is scanned, looking for one that
matches the gateway and device fields. Once the right multipath_route instance has been identified, the list of associated multipath_dest structures
is scanned, looking for one that matches the destination IP address of the input lookup key fl. From the matching multipath_dest instance, the
function can read the next-hop weight via the pointer nh_info that points to the right fib_nh instance.
The state database is populated by the multipath_set_nhinfo routine we saw in the section "Interface Between the Routing Cache and Multipath."
This algorithm is defined in net/ipv4/multipath_wrandom.c.
33.4.7. Round-Robin Algorithm

The round-robin algorithm does not need additional data structures to keep the state information it needs. All the required information is
retrieved from the dst->__use field of the dst_entry structure, which represents the number of times a cache lookup returned the route. The
selection of the right route therefore consists simply of browsing the routes of the input table's bucket, and selecting, among the eligible
routes, the one with the lowest value of __use.
The algorithm is defined in net/ipv4/multipath_rr.c.
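In other words, the selection is a minimum search over __use; a rough sketch of the idea (the eligibility test via DST_BALANCED and multipath_comparekeys is omitted):
/* Sketch only: among the entries of the bucket, keep the one that has been
 * handed out least often so far. */
struct rtable *best = NULL, *rth;

for (rth = rt_hash_table[hash].chain; rth; rth = rth->u.rt_next)
        if (best == NULL || rth->u.dst.__use < best->u.dst.__use)
                best = rth;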
33.4.8. Device Round-Robin Algorithm
The purpose and effect of this algorithm were explained in the section "Device Round-Robin Algorithm" in Chapter 31. This algorithm
selects the right egress device, and therefore the right entry in the cache for a given multipath route, with the drr_select_route routine as follows:
1. The global vector state keeps a counter for each device that indicates how many times it has been selected.
2. For each multipath route, only the first next hop on any given device is considered. This speeds up the decision but implies that
there is no load sharing between next hops that share the same egress device: for each device, only one next hop of any
multipath route is used.
3. While browsing the routes (i.e., next hops) for the computation of the lowest use count, routes associated with devices that have
not been used yet are given higher preference. When a new device is selected, a new entry is added to state.
4. The first route analyzed for the device with the lowest use count is selected.
The algorithm is defined in net/ipv4/multipath_drr.c.

33.5. Interface Between the DST and Calling Protocols

The DST cache is an independent subsystem; it has, for instance, its own garbage collection mechanism. As a subsystem, it provides a
set of functions that various protocols can use to change or tune its behavior. When external subsystems need to interact with the routing
cache, such as to notify it of an event or read the value of one of its parameters, they do it via a set of DST routines defined in the files
net/core/dst.c and include/net/dst.h. These routines are wrappers around a set of functions made available by the L3 protocol that owns
the cache, by initializing an instance of a dst_ops VFT, as shown in Figure 33-6.
Figure 33-6. dst_ops interface
The key structure presented by DST to higher layers is dst_entry; protocol-specific structures such as rtable are merely wrappers for this
structure. IP owns the routing cache, but other protocols often keep references to routing cache elements. All of those references refer to
dst_entry, not to its rtable wrapper. The sk_buff buffers also keep a reference to the dst_entry structure, not to the rtable structure. This
reference is used to store the result of the routing lookup.
The dst_entry and dst_ops structures are described in detail in the associated sections in Chapter 36. There is an instance of dst_ops for
each protocol; for example, IPv4 uses ipv4_dst_ops, initialized in net/ipv4/route.c:
struct dst_ops ipv4_dst_ops = {
        .family          = AF_INET,
        .protocol        = __constant_htons(ETH_P_IP),
        .gc              = rt_garbage_collect,
        .check           = ipv4_dst_check,
        .destroy         = ipv4_dst_destroy,
        .ifdown          = ipv4_dst_ifdown,
        .negative_advice = ipv4_negative_advice,
        .link_failure    = ipv4_link_failure,
        .update_pmtu     = ip_rt_update_pmtu,
        .entry_size      = sizeof(struct rtable),
};
Whenever the DST subsystem is notified of an event or a request is made via one of the DST interface routines, the protocol associated
with the affected dst_entry instance is notified by an invocation of the proper function among the ones the owning protocol provides through
its instance of the dst_ops VFT. For example, if ARP wants to notify the upper protocol about the unreachability of a given IPv4
address, it calls dst_link_failure for the associated dst_entry structure (remember that cached routes are associated with IP addresses,
not with networks), which will invoke the ipv4_link_failure routine registered by IPv4 via ipv4_dst_ops.
It is also possible for the calling protocol to intervene directly in DST's behavior. For example, when IPv4 asks DST to allocate a new
cache entry, DST may then realize there is a need to start garbage collection and invoke rt_garbage_collect, the routine provided by
IPv4 itself.
When a given type of notification requires some kind of processing common to all the protocols, the common logic may be implemented
directly inside the DST APIs instead of being replicated in each protocol's handler.
Some virtual functions in the DST's dst_ops structure are invoked through wrappers in higher layers; functions that do not have a
wrapper are invoked directly through the syntax dst->ops->function. Here is the meaning of the dst_ops virtual functions and a brief
description of the IPv4 subsystem's routines (listed in the preceding snapshot of code) that would be assigned to them:
gc
Takes care of garbage collection. It is run when the subsystem allocates a new cache entry with dst_alloc and that function
realizes there is a shortage of memory. The IPv4 routine rt_garbage_collect is described in the section "Synchronous
Cleanup."
check
A cached route whose dst_entry is marked as dead is normally not usable. However, there is one case, where IPsec is in use,
where that's not necessarily true. This routine is used to check whether an obsolete dst_entry is usable. For instance, look at
the ipv4_dst_check routine, which performs no check on the submitted dst_entry structure before removing it, and compare it
to the corresponding xfrm_dst_check routine used to do "xfrm" transforms for IPsec. Also see how routines such as
sk_dst_check (introduced in Chapter 21) check the status of a cached route. There is no wrapper for this function.
destroy
Called by dst_destroy, the routine that the DST runs to delete a dst_entry structure, and informs the calling protocol of the
deletion to give it a chance to do any necessary cleanup first. For example, the IPv4 routine ipv4_dst_destroy uses the
notification to release references to other data structures. dst_destroy is described in the section "Deleting DST Entries."
ifdown
Called by dst_ifdown, which is invoked by the DST subsystem itself when a device is shut down or unregistered. It is called
once for each affected cached route (see the section "External Events"). The IPv4 routine ipv4_dst_ifdown replaces the
rtable's pointer to the device's IP configuration idev with a pointer to the loopback device, because that is always sure to exist.
negative_advice

Called by the DST function dst_negative_advice, which is used to notify the DST about a problem with a dst_entry instance.
For example, TCP uses dst_negative_advice when it detects a write timeout.
The IPv4's routine ipv4_negative_advice uses this notification to delete the cached route. When the dst_entry is already
marked as dead (through its dst->obsolete flag, as we will see in the section "Deleting DST Entries"), ipv4_negative_advice
simply releases the rtable's reference to the dst_entry.
link_failure
Called by the DST function dst_link_failure, which is invoked when a transmission problem is detected due to an unreachable
destination.
As an example of this function's use, the neighbor protocols ARP and Neighbor Discovery (used by IPv4 and IPv6,
respectively) invoke it to indicate that they never received a reply to solicitation requests they generated to resolve an L3-to-L2
address association. (They can usually tell this because of a timeout; see, for example, arp_error_report in net/ipv4/arp.c for
the behavior of the ARP protocol.) Other higher-layer protocols, such as the various tunnels (IP over IP, etc.), do the same
when they have problems reaching the other end of a tunnel, which could be several hops away; see, for example,
ipip_tunnel_xmit in net/ipv4/ipip.c for the IP-over-IP tunneling protocol.
update_pmtu
Updates the PMTU of a cached route. It is usually invoked to handle the reception of an ICMP Fragmentation Needed
message. See the section "Processing Ingress ICMP_REDIRECT Messages" in Chapter 31. There is no wrapper for this
function.
get_mss
Returns the TCP maximum segment size that can be used on this route. IPv4 does not initialize this routine, and there is no
wrapper for this function. See the section "IPsec Transformations and the Use of dst_entry."
Besides the wrappers around the functions just shown, the DST also manipulates dst_entry instances through functions that do not need
to interact with other subsystems. For example, the section "Asynchronous Cleanup" shows dst_set_expires, and Chapter 26 shows how
dst_confirm is used to confirm the reachability of a neighbor. See the files net/core/dst.c and include/net/dst.h for more details.
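As an example of how thin these wrappers are, dst_link_failure in include/net/dst.h is essentially the following:
static inline void dst_link_failure(struct sk_buff *skb)
{
        struct dst_entry *dst = skb->dst;

        /* Hand the notification to the owning protocol's handler,
         * e.g. ipv4_link_failure for IPv4 routes. */
        if (dst && dst->ops && dst->ops->link_failure)
                dst->ops->link_failure(skb);
}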
33.5.1. IPsec Transformations and the Use of dst_entry

In the previous sections, we saw the most common use for dst_entry structures: to store the protocol-independent information regarding
a cached route, including the input and output methods that process the packets to be received or transmitted after a routing lookup.

Another use for dst_entry structures is made by IPsec, a suite of protocols used to provide secure services such as authentication and
confidentiality on top of IP. IPsec uses dst_entry structures to build what it calls transformation bundles. A transformation is an operation
to apply to a packet, such as encryption. A bundle is just a set of transformations defined as a sequence of operations. Once the IPsec
protocols decide on all the transformations to apply to the traffic that matches a given route, that information is stored in the routing
cache as a list of dst_entry structures.
Normally, a route is associated with a single dst_entry structure whose input and output fields describe how to process the matching
packets (forward, deliver locally, etc., as shown in Figure 18-1 in Chapter 18). But IPsec creates a list of dst_entry instances where only
the last instance uses input and output to actually apply the routing decisions; the previous instances use input and output to apply the
required transformations, as shown in Figure 33-7 (the model in the figure is a simplified one).
Figure 33-7. Use of dst_entry (a) without IPsec; (b) with IPsec
dst_entry lists are created using the child pointer in the structure. Another pointer named path, also used by IPsec, points to the last
element of the list (the one that would be created even when IPsec is not in use).
Each of the other dst_entry elements in the list (that is, each element except the last) is there to implement an IPsec transformation. Each
sets its path field to point to the last element. In addition, each sets its DST_NOHASH flag so that the DST subsystem knows it is not part
of the routing cache hash table and that another subsystem is taking care of it.
The implications of IPsec on routing lookups are as follows: both input and output routing lookups are affected by the data structure
layout shown for IPsec configuration in Figure 33-7(b). The result returned by a lookup is a pointer to the first dst_entry that implements a
transformation, not the last one representing the real routing information. This is because the first dst_entry instance represents the first
transformation to be applied, and the transformations must be applied in order.
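The relationship between child and path can be summarized with a small sketch (an illustration, not kernel code): following child pointers from the head of a bundle always ends at the entry that path points to.
/* Sketch: walk an IPsec bundle down to the dst_entry that carries the
 * real routing decision. */
static struct dst_entry *bundle_tail(struct dst_entry *dst)
{
        while (dst->child)
                dst = dst->child;
        return dst;             /* == dst->path for a well-formed bundle */
}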
You can find interactions between the IP or routing layer and IPsec in several other places:
For egress traffic, ip_route_output_flow (which is called by ip_route_output_key, introduced in the section "Cache Lookup")
includes extra code (i.e., a call to xfrm_lookup) to interact with IPsec.
For ingress traffic that is to be delivered locally, ip_local_deliver_finish calls xfrm4_policy_check to consult the IPsec policy
database.
ip_forward makes the same check for ingress traffic that needs to be forwarded.
Sometimes the IP code makes a direct call to the generic xfrm_xxx IPsec routines, and sometimes it uses IPv4 wrappers with the names
xfrm4_xxx.
33.5.2. External Events

When dst_init initializes the DST subsystem, it registers with the device event notification chain netdev_chain, introduced in Chapter 4. The
only two events the DST is interested in are the ones generated when a network device goes down (NETDEV_DOWN) and when a
device is unregistered (NETDEV_UNREGISTER). You can find the complete list of NETDEV_XXX events in include/linux/notifier.h.
When a device becomes unusable, either because it is not available anymore (for instance, it has been unregistered from the kernel), or
because it has simply been shut down for administrative reasons, all the routes using that device become unusable as well. This means
that both the routing tables and the routing cache need to be notified about this kind of event and react accordingly. We will see how the
routing tables are handled in Chapter 34. Here we will see how the routing cache is cleaned up. The dst_entry structures for cached
routes can be inserted in one of two places:
The routing cache.
The dst_garbage_list list. Here deleted routes wait for all their references to be released, to become eligible for deletion by
the garbage collection process.
The entries in the cache are taken care of by the notification handler fib_netdev_event (described in the section "Impacts on the routing
tables" in Chapter 32), which, among other actions, flushes the cache. The ones in the dst_garbage_list list are taken care of by the routine
that DST registers with the netdev_chain notification chain. As shown in the following snippet from net/core/dst.c, the handler DST uses
to process the received notifications is dst_dev_event:
static struct notifier_block dst_dev_notifier = {
.notifier_call = dst_dev_event,
};

void __init dst_init(void)
{
register_netdevice_notifier(&dst_dev_notifier);
}
dst_dev_event browses the dst_garbage_list list of dead dst_entry structures and invokes dst_ifdown for each one. The last input
parameter to dst_ifdown tells it what event it is being called to handle. Here is how it handles the two event types:
NETDEV_UNREGISTER
When the device is unregistered, all references to it have to be removed. dst_ifdown replaces them with references to the

loopback device, for both the dst_entry structure and its associated neighbour instance, if any.[*]
[*] See the section "L2 Header Caching" in Chapter 27.
NETDEV_DOWN
Because the device is down, traffic cannot be sent to it anymore. Therefore, the input and output routines of dst_entry are set
to dst_discard_in and dst_discard_out, respectively. These two routines simply discard any input buffer passed to them (i.e.,
any frame they are asked to process).
We saw in the section "IPsec Transformations and the Use of dst_entry" that a dst_entry structure could be linked to other ones through
the child pointer. dst_ifdown goes child by child and updates all of them. The input and output routines are updated only for the last entry,
because that entry is the one that uses the routines for reception or transmission.
We saw in Chapter 8 that unregistering a device triggers not only a NETDEV_UNREGISTER notification but also a NETDEV_DOWN
notification, because a device has to be shut down to be unregistered. This means that both events handled by dst_dev_event occur
when a device is unregistered. This explains why dst_ifdown checks its unregister parameter and deliberately skips part of its code when
the parameter is set, while running other parts only when it is set.
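The shape of the handler is roughly the following; this is a simplified sketch of net/core/dst.c, with the locking and the per-entry bookkeeping omitted:
static int dst_dev_event(struct notifier_block *this, unsigned long event,
                         void *ptr)
{
        struct net_device *dev = ptr;
        struct dst_entry *dst;

        switch (event) {
        case NETDEV_UNREGISTER:
        case NETDEV_DOWN:
                /* Walk the dead entries still waiting for their references
                 * to go away and detach each one from the device. */
                for (dst = dst_garbage_list; dst; dst = dst->next)
                        if (dst->dev == dev)
                                dst_ifdown(dst, dev, event != NETDEV_DOWN);
                break;
        }
        return NOTIFY_DONE;
}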
33.6. Flushing the Routing Cache

Whenever a change in the system takes place that could cause some of the information in the cache to become out of date, the kernel
flushes the routing cache. In many cases, only selected entries are out of date, but to keep things simple the kernel removes all entries.
The main events that trigger flushing are:
A device comes up or goes down
Some addresses that used to be reachable through a given device may not be reachable anymore, or may be reachable
through a different device with a better route.
An IP address is added to or removed from a device
We saw in the sections "Adding an IP address" and "Removing an IP address" in Chapter 32 that Linux creates a special route
for each locally configured IP address. When an address is removed, any associated route in the cache also has to be
removed. The removed address was most likely configured with a netmask different from /32, so all the cache entries
associated with addresses within the same subnet should go away[*] as well. Finally, if one of the addresses in the same
subnet was used as a gateway for other indirect routes, all of them should go away. Flushing the entire cache is simpler than
keeping track of all of these possible cases.
[*] This is not true when you remove a secondary address. See the section "Removing an IP address" in
Chapter 32.
The global forwarding status, or the forwarding status of a device, has changed
If you disable forwarding, you need to remove all the cached routes that were used to forward traffic. See the section
"Enabling and Disabling Forwarding" in Chapter 36.
A route is removed
All the cached entries associated with the deleted route need to be removed.
An administrative flush is requested via the /proc interface
This is described in the section "The /proc/sys/net/ipv4/route Directory" in Chapter 36.
The routine used to flush the cache is rt_run_flush, but it is never called directly. Requests to flush the cache are done via rt_cache_flush,
which will either flush the cache right away or start a timer, depending on the value of the input timeout provided by the caller:
Less than 0
The cache is flushed after the number of seconds specified by the kernel parameter ip_rt_min_delay, which can be tuned via
/proc as described in the section "The /proc/sys/net/ipv4/route Directory" in Chapter 36.
0
The cache is flushed right away.
Greater than 0
The cache is flushed after the specified amount of time.
Once a flush request is submitted, a flush is guaranteed to take place within ip_rt_max_delay seconds, which is set to 8 by default.
When a flush request is submitted and there is already one pending, the timer is restarted to reflect the new request; however, the new
request cannot ask the timer to expire later than ip_rt_max_delay seconds since the previous timer was fired. This is accomplished by

using the global variable rt_deadline.
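Putting the three cases together, the delay handling amounts to something like the following sketch; this is an approximation based on the description above, not a verbatim copy of rt_cache_flush, and the locking around rt_deadline is omitted:
/* Sketch only: negative delay means "use the default", zero means "now",
 * positive means "within that much time", all bounded by the deadline. */
void rt_cache_flush(int delay)
{
        unsigned long now = jiffies;

        if (delay < 0)
                delay = ip_rt_min_delay;

        if (delay <= 0) {
                rt_run_flush(0);                        /* flush immediately */
                return;
        }

        if (rt_deadline == 0)                           /* first pending request */
                rt_deadline = now + ip_rt_max_delay;
        if (now + delay > rt_deadline)                  /* never push past it    */
                delay = rt_deadline - now;

        mod_timer(&rt_flush_timer, now + delay);
}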
In addition, the cache is periodically flushed by means of a periodic timer, rt_secret_timer, that expires every ip_rt_secret_interval
seconds (see the section "The /proc/sys/net/ipv4/route Directory" in Chapter 36 for its default value). When the timer expires, the handler
rt_secret_rebuild flushes the cache and restarts the timer. ip_rt_secret_interval is configurable via /proc.