Understanding Linux Network Internals 2005, part 7


NUD_INCOMPLETE
A solicitation has been sent, but no reply has been received yet. In this state, there is no hardware address to use (not even an
old one, as there is with NUD_STALE).
NUD_REACHABLE
The address of the neighbor is cached and the latter is known to be reachable (there has been a proof of reachability).
NUD_FAILED
Marks a neighbor as unreachable because of a failed solicitation request, either the one generated when the entry was created
or the one triggered by the NUD_PROBE state.
NUD_STALE
NUD_DELAY
NUD_PROBE
Transitional states; they will be resolved when the local host determines whether the neighbor is reachable. See the section
"Reachability Confirmation."
The next set of values represents a group of special states that usually never change once assigned:
NUD_NOARP
This state is used to mark neighbors that do not need any protocol to resolve the L3-to-L2 mapping (see the section "Special
Cases"). The section "Start of the arp_constructor Function" in Chapter 28 shows how and why this state is set in IPv4/ARP. But
even though the name of this state suggests that it applies only to ARP, it can actually be used by any neighboring protocol.
NUD_PERMANENT
The L2 address of the neighbor has been statically configured (i.e., with user-space commands) and therefore there is no need
to use any neighboring protocol to take care of it. See the section "System Administration of Neighbors" in Chapter 29.
26.6.2.2. Derived states

In addition to the basic states listed in the previous section, the following derived values are defined just to make the code clearer when
there is a need to refer to multiple states with something in common:
NUD_VALID
An entry is considered to be in the NUD_VALID state if its state is any one of the following, which represent neighbors believed to
have an available address:
NUD_PERMANENT
NUD_NOARP
NUD_REACHABLE
NUD_PROBE
NUD_STALE
NUD_DELAY
NUD_CONNECTED
This is used for the subset of NUD_VALID states that do not have a confirmation process pending:
NUD_PERMANENT
NUD_NOARP
NUD_REACHABLE
NUD_IN_TIMER
The neighboring subsystem is running a timer for this entry, which happens when the status is unclear. The basic states that
correspond to this are:
NUD_INCOMPLETE
NUD_DELAY
NUD_PROBE
Let's look at an example of why a derived state is useful in kernel code. When a neighbor instance is removed, the host needs to stop all
the pending timers associated with that data structure. Instead of comparing the neighbor's state to the three states known to have a
pending timer associated with them, it is just cleaner to define NUD_IN_TIMER and compare the neighbor's state against it using the bitwise
operator &.
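The mask test described above can be sketched in a few lines of C. This is only an illustration: the numeric values are assumptions chosen so that each basic state occupies one bit, the derived masks follow the lists given in this section (real kernel headers may define them differently), and nud_has_pending_timer is a hypothetical helper name.

```c
#include <stdbool.h>

/* Basic NUD states as single-bit flags. The numeric values are
 * illustrative; consult the kernel headers for the real definitions. */
enum {
    NUD_NONE       = 0x00,
    NUD_INCOMPLETE = 0x01,
    NUD_REACHABLE  = 0x02,
    NUD_STALE      = 0x04,
    NUD_DELAY      = 0x08,
    NUD_PROBE      = 0x10,
    NUD_FAILED     = 0x20,
    NUD_NOARP      = 0x40,
    NUD_PERMANENT  = 0x80
};

/* Derived states, built from the lists given in the text. */
#define NUD_IN_TIMER  (NUD_INCOMPLETE | NUD_DELAY | NUD_PROBE)
#define NUD_VALID     (NUD_PERMANENT | NUD_NOARP | NUD_REACHABLE | \
                       NUD_PROBE | NUD_STALE | NUD_DELAY)
#define NUD_CONNECTED (NUD_PERMANENT | NUD_NOARP | NUD_REACHABLE)

/* Hypothetical helper: one mask test replaces three comparisons. */
static bool nud_has_pending_timer(unsigned int nud_state)
{
    return (nud_state & NUD_IN_TIMER) != 0;
}
```

Because every basic state is a distinct bit, adding a state to a derived group later only changes the macro, not the callers.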
26.6.2.3. Initial state

When a neighbor instance is created, the NUD_NONE state is assigned to it by default, but the state can also be explicitly set to something
different when the creation is caused by an explicit user command (see Chapter 29).
As explained in the section "Neighbor Initialization" in Chapter 27, the protocol's constructor method may also change the state depending on
the characteristics of the associated device (e.g., point-to-point) and L3 address (e.g., broadcast).
26.6.3. Reachability Confirmation

We saw in the section "Why Static Assignment of Addresses Is Not Sufficient" that it is possible for an L3-to-L2 mapping to change.
Because of this, it makes sense to confirm the information stored in the cache regularly, if the information has not been used for some
time. This is called reachability confirmation.
Note that a change in reachability status is not necessarily due to the reasons listed in the section "Reasons That Neighboring Protocols
Are Needed"; a router, bridge, or other network device may just be experiencing some problems. While the reachability confirmation is in
progress, the cached information is temporarily used under the assumption that it is most likely still valid.
The three NUD states NUD_STALE, NUD_DELAY, and NUD_PROBE support the task of reachability confirmation. The key reason for the use of these
states is that there is no need to start a reachability confirmation process until a packet needs to be sent to the associated neighbor.
Let's define once again the exact meaning of these three NUD states, and then look at the two ways a mapping can be confirmed:
NUD_STALE
The cache contains the address of the neighbor, but the latter has not been confirmed for a certain amount of time (see the
discussion of reachable_time in the section "neigh_parms Structure" in Chapter 29). The next time a packet is sent to the neighbor, the
reachability verification process will be started.
NUD_DELAY
This state, closely tied to NUD_STALE, represents an optimization that can reduce the number of transmissions of solicitation
requests.
This state is entered when a packet is sent to a neighbor whose associated entry is in the NUD_STALE state. The NUD_DELAY state
represents a window of time where external sources could confirm the reachability of the neighbor. The simplest sort of external
confirmation is when the neighbor in question sends a packet, thus indicating that it is running and accessible.
This state gives some time to the upper network layers to provide a reachability confirmation, which may relieve the kernel from
sending a solicitation request and thus save both bandwidth and CPU usage. This state may look like a small optimization, but if
you think in terms of big networks, you can imagine the gain it can provide.
If no confirmation is received, the entry is put into the next state, NUD_PROBE, which resolves the status of the neighbor through
explicit solicitation requests or whatever other mechanism a protocol might use.
NUD_PROBE
When the neighbor has been in the NUD_DELAY state for the allotted amount of time and no proof of reachability has been received,
its state is changed to NUD_PROBE and the solicitation process starts.
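The interplay of the three states can be sketched as a toy state machine. This is not the kernel's code: the struct, its delay_started field, and the toy_* function names are hypothetical. The sketch also assumes that an external confirmation is recorded as a timestamp and evaluated when the delay timer fires, and it ignores counter wraparound.

```c
enum nud_state { NUD_REACHABLE, NUD_STALE, NUD_DELAY, NUD_PROBE };

/* Toy neighbor entry; the layout is hypothetical. */
struct toy_neigh {
    enum nud_state state;
    unsigned long confirmed;      /* time of the last reachability proof */
    unsigned long delay_started;  /* time the entry entered NUD_DELAY */
};

/* A packet is about to be sent to the neighbor. */
static void toy_on_transmit(struct toy_neigh *n, unsigned long now)
{
    if (n->state == NUD_STALE) {
        n->state = NUD_DELAY;     /* open the window for external proof */
        n->delay_started = now;
    }
}

/* An external source reports reachability: only record the time. */
static void toy_confirm(struct toy_neigh *n, unsigned long now)
{
    n->confirmed = now;
}

/* The delay timer expired: choose between REACHABLE and PROBE. */
static void toy_on_delay_timeout(struct toy_neigh *n)
{
    if (n->state != NUD_DELAY)
        return;
    if (n->confirmed >= n->delay_started)
        n->state = NUD_REACHABLE; /* confirmed during the window */
    else
        n->state = NUD_PROBE;     /* start explicit solicitations */
}
```

An entry that is never used stays in NUD_STALE indefinitely in this sketch, which matches the point that no confirmation work is done until a packet actually needs to be sent.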
The reachability status of a neighbor can be confirmed in two main ways. As we will see, these two methods do not have the same level of
authority. They are:
Confirmation from a unicast solicitation's reply

When your host receives a solicitation reply in answer to a solicitation request it previously sent out, it means that the neighbor
received the request and was able to send back a reply; this in turn means that either it already had your L2 address or it learned
your address from your request (see the section "Creating a neighbour Entry" in Chapter 27). It also means that there is a working
path in both directions. Note, however, that this is true only when the solicitation's reply is sent as a unicast packet. The reception
of a broadcast reply would move the state to NUD_STALE rather than NUD_REACHABLE. (You can find more discussion of this from the
standpoint of ARP in the section "Processing Ingress ARP Packets" in Chapter 28.)
External confirmation
If your host is sure it received a packet from the neighbor in response to something previously sent, it can assume the neighbor is
still reachable. Figure 26-14 shows an example, where the TCP layer of Host A confirms the reachability of Host B when it
receives a SYN/ACK in reply to its SYN. Note that if Host B was not a neighbor of Host A, the reception of the SYN/ACK from
Host B would confirm the reachability of the next hop gateway used by Host A to reach Host B.
Figure 26-14. Example of external neighbor reachability confirmation
Confirmation is done via dst_confirm, which confirms the validity of the routing table cache entry used to route the SYN packet
toward Host B. dst_confirm is a simple wrapper around neigh_confirm, which accomplishes the task we described earlier: it confirms the
reachability of the neighbor and therefore the L3-to-L2 mapping. Note that neigh_confirm only updates the neigh->confirmed timestamp; it
will be the neigh_timer_handler function (which is executed by the expiration of the timer started when the neighbor entered the NUD_DELAY
state) that actually upgrades the neighbor entry's state to NUD_REACHABLE.[*]
[*] The delay between the reception of the confirmation from the L4 layer and the setting of the state to NUD_REACHABLE does not
affect traffic in any way.
Note that the correlation between the two packets in Figure 26-14 could not be performed at the IP layer because the latter
doesn't have any knowledge of data streams. This is why the L4 layer takes care of the confirmation. TCP SYN/ACK exchanges
are only one example of an L4 protocol providing external confirmation. Given a socket, and therefore the associated routing
cache entry and its next-hop gateway, a user-space application can confirm the reachability of the gateway by using the
MSG_CONFIRM option with transmission calls such as send and sendmsg.
While the reception of a solicitation's reply can move the state to NUD_REACHABLE regardless of the current state, external
confirmations can be used only when the current state is NUD_STALE. This means that if the entry had just been created and it was
in the NUD_INCOMPLETE state, external confirmations would not be allowed to confirm the reachability of the neighbor (see Figure
26-13).
Note that NUD_DELAY/NUD_PROBE and NUD_NONE can lead to NUD_REACHABLE, as shown in Figure 26-13; however, from NUD_NONE to get to
NUD_REACHABLE, you need full proof of reachability, while from NUD_DELAY/NUD_PROBE, any kind of confirmation is sufficient.
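From user space, the MSG_CONFIRM option mentioned above is passed in the flags argument of a transmission call. The following is a minimal sketch; send_confirmed is a hypothetical wrapper name, and sendto is used here only because it accepts the same flags as send and sendmsg. MSG_CONFIRM is Linux-specific, and an application should set it only when it really has evidence of forward progress (for example, it just received a reply from the peer).

```c
#include <sys/types.h>
#include <sys/socket.h>

/* Hypothetical wrapper: send a datagram and tell the kernel that the
 * application has evidence the peer (or the next hop toward it) is
 * still reachable. Returns whatever sendto() returns. */
static ssize_t send_confirmed(int fd, const void *buf, size_t len,
                              const struct sockaddr *dst, socklen_t dst_len)
{
    return sendto(fd, buf, len, MSG_CONFIRM, dst, dst_len);
}
```

A request/response application would typically use the plain call for its first datagram and switch to a confirming call once answers start arriving.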
Chapter 27. Neighboring Subsystem: Infrastructure
In Chapter 26, we saw the main problems that the neighboring protocols are asked to solve. You also learned that the Linux kernel
abstracted out parts of the solution into a common infrastructure shared by various neighboring protocols. In this chapter, we will see
how the infrastructure is designed. In particular, we will see how protocols interface to the common infrastructure, how caching and
proxying are implemented, and how external subsystems such as higher-layer protocols notify the neighboring protocols about
interesting events. We will conclude the chapter with a description of how L3 protocols such as IPv4 actually interface with their
neighboring protocols, and how queuing is implemented for buffers awaiting address resolution.
27.1. Main Data Structures

To understand the code for the neighboring infrastructure, we first need to describe a few data structures used heavily in the neighboring
subsystem, and see how they interact with each other.
Most of the definitions for these structures can be found in the file include/net/neighbour.h. Note that the Linux kernel code uses the British
spelling neighbour for data structures and functions related to this subsystem. When speaking generically of neighbors, this book sticks to
the American spelling, which is the spelling found in RFCs and other official documents.
struct neighbour
Stores information about a neighbor, such as the L2 and L3 addresses, the NUD state, the device through which the neighbor
can be reached, etc. Note that a neighbour entry is associated not with a host, but with an L3 address. There can be more than
one L3 address for a host. For example, routers, among other systems, have multiple interfaces and therefore multiple L3
addresses.
struct neigh_table
Describes a neighboring protocol's parameters and functions. There is one instance of this structure for each neighboring
protocol. All of the structures are inserted into a global list pointed to by the static variable neigh_tables and protected by the lock
neigh_tbl_lock. This lock protects the integrity of the list, but not the content of each entry.
struct neigh_parms
A set of parameters that can be used to tune the behavior of a neighboring protocol on a per-device basis. Since more than one
protocol can be enabled on most interfaces (for instance, IPv4 and IPv6), more than one neigh_parms structure can be associated
with a net_device structure.
struct neigh_ops
A set of functions that represents the interface between the L3 protocols such as IP and dev_queue_xmit, the API introduced in
Chapter 11 and described briefly in the upcoming section "Common Interface Between L3 Protocols and Neighboring Protocols."
The virtual functions can change based on the context in which they are used (that is, on the status of the neighbor, as described
in Chapter 26).
struct hh_cache
Caches link layer headers to speed up transmission. It is faster to copy a cached header into a buffer in one shot than to fill in its
fields one by one. Not all device drivers implement header caching. See the section "L2 Header Caching."
struct rtable
struct dst_entry
When a host needs to route a packet, it first consults its cache and then, in the case of a cache miss, it queries the routing table.
Every time the host queries the routing table, the result is saved into the cache. The IPv4 routing cache is composed of rtable
structures. Each instance is associated with a different destination IP address. Among the fields of the rtable structure are the
destination address, the next hop (router), and a structure of type dst_entry that is used to store the protocol-independent
information. dst_entry includes a pointer to the neighbour structure associated with the next hop. I cover the dst_entry data structure in detail
in Chapter 36. In the rest of this chapter, I will often refer to dst_entry structures as elements of the routing table cache, even though
dst_entry is actually only a field of the rtable structure.
Figure 27-1 shows how dst_entry structures are linked to hh_cache and neighbour structures.
Figure 27-1. Relationship among dst_entry, neighbour, and hh_cache structures
The neighboring code also uses some other small data structures. For instance, struct pneigh_entry is used by destination-based proxying, and
struct neigh_statistics is used to collect statistics about neighboring protocols. The first structure is described in the section "Acting As a Proxy,"
and the second one is described in the section "Statistics" in Chapter 29. Figure 27-2 also includes the following data structure types,
described in greater detail in Chapters 22 and 23:
in_device, inet6_dev
Used to store the IPv4 and IPv6 configurations of a device, respectively.
net_device
There is one net_device structure for each network device recognized by the kernel. See Chapter 8.
Figure 27-2 shows the relationships between the most important data structures. Right now it might seem a big mess, but it will make
much more sense by the end of this chapter.
Here are the main points shown in Figure 27-2:
In the central part of the figure, you can see that each network device has a pointer to a data structure that holds the
configuration for each L3 protocol configured on the device. In the example shown in the figure, IPv6 is configured on one device
and IPv4 is configured on both. Both the in_device structure (IPv4 configuration) and inet6_dev structure (IPv6 configuration) include a
pointer to the configuration used by their neighboring protocols, respectively ARP and ND.
All of the neigh_parms structures used by any given protocol are linked together in a unidirectional list whose root is stored in the
protocol's neigh_table structure.
The top and bottom of the figure show that each protocol keeps two hash tables. The first one, hash_buckets, caches the L3-to-L2
mappings resolved by the protocol or statically configured. The second one, phash_bucket, stores those IP addresses that are
proxied, as described in the section "Per-Device Proxying and Per-Destination Proxying." Note that phash_bucket is not a cache, so its
elements do not expire and don't need confirmation. Each pneigh_entry structure includes a pointer (not depicted in Figure 27-2) to
its associated net_device structure. Figure 27-6 gives more detail on the structure of the cache hash_buckets.
Each neighbour instance is associated with one or more hh_cache structures, if the device supports header caching. The section "L2
Header Caching," and Figures 27-1 and 27-10, give more details about the relationship between neighbour and hh_cache structures.
Figure 27-2. Data structures' relationships
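The pointer relationships described in this section can be summarized with heavily simplified declarations. These are not the kernel's definitions: every type carries a toy_ prefix, most fields are omitted, the field names and sizes are assumptions made for illustration, and toy_route_header is a hypothetical helper.

```c
#include <stddef.h>

struct toy_hh_cache {
    struct toy_hh_cache *hh_next;      /* next cached header, if any */
    unsigned char        hh_data[16];  /* prebuilt L2 header bytes */
};

struct toy_neighbour {
    unsigned char        ha[6];        /* resolved L2 address */
    unsigned char        nud_state;    /* current NUD state */
    struct toy_hh_cache *hh;           /* cached headers, may be NULL */
};

struct toy_dst_entry {
    struct toy_neighbour *neighbour;   /* neighbor entry of the next hop */
    struct toy_hh_cache  *hh;          /* cached header for this route */
};

struct toy_rtable {
    struct toy_dst_entry dst;          /* protocol-independent part */
    unsigned int         rt_dst;       /* destination IPv4 address */
    unsigned int         rt_gateway;   /* next hop (router) */
};

/* Hypothetical helper: find a cached L2 header starting from a routing
 * cache entry, falling back to the neighbor's own list. */
static const struct toy_hh_cache *toy_route_header(const struct toy_rtable *rt)
{
    if (rt->dst.hh != NULL)
        return rt->dst.hh;
    return rt->dst.neighbour != NULL ? rt->dst.neighbour->hh : NULL;
}
```

Embedding the dst_entry inside the routing cache entry, as sketched here, is what lets the text treat dst_entry structures as elements of the routing table cache.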

27.2. Common Interface Between L3 Protocols and Neighboring Protocols

The Linux kernel has a generic neighboring layer that connects L3 protocols to the main L2 transmit function (dev_queue_xmit) via a virtual
function table (VFT). A VFT is the mechanism frequently used in the Linux kernel for allowing subsystems to use different functions at
different times. The VFT for the neighboring subsystem is implemented as a data structure named neigh_ops. A pointer to one of these
structures is embedded as a field named ops in each neighbour structure.
The flexibility of the VFT interface allows different L3 protocols to use different neighboring protocols. This in turn allows different
neighboring protocols to behave quite differently while allowing the neighboring subsystem to provide a common generic interface
between the neighboring protocols and the L3 protocols.
In this section, we examine the VFT-based interface between the L3 protocols and the neighboring protocols, the advantages of using the
VFT, when it is first initialized, and how it is updated during the lifetime of a neighbor. The section concludes with a brief overview of the
functions used to control the initialization of the VFT. To better understand this section, you are invited to first read the section "neigh_ops
Structure" in Chapter 29.
Let's start with an overview of how the routines in the VFT are invoked. Given a neighbour instance and its embedded VFT neighbour->ops, the
function to which the output field points could in theory be invoked directly like this:
neigh->ops->output
But this construct is not found in the Linux code because even this is not general enough. The function in the output field of the neigh_ops
structure is only one of four functions that perform similar tasks, each function having its own field in neigh_ops. The individual protocol has to
decide which of the four functions to use. The proper function depends on events, the context, and the configuration of the interface and
device. So, to leave the neighboring infrastructure protocol-independent, the neighbour structure contains its own output field. The individual
protocol assigns the proper function from one of the fields in neigh->ops to neigh->output. This allows the code to be simpler and clearer. For
instance, instead of doing:
    if (neighbour is not reachable)
        neigh->ops->output(skb)
    else
        if (the device used to reach the neighbor can use cached headers)
            neigh->ops->hh_output(skb)
        else
            neigh->ops->connected_output(skb)
the neighboring infrastructure can just call:

neigh->output
as long as neigh->output has been initialized by the protocol to the right neigh_ops method. Of course, each neighboring protocol uses its own logic
to initialize neigh->output; it does not necessarily have to follow the rules in this snapshot.
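The indirection just described can be sketched with a toy function table. The toy_ types, the two stub routines, and their return codes are all hypothetical; only the idea of a fixed ops pointer plus a switchable output pointer comes from the text.

```c
struct toy_skb { int len; };

typedef int (*toy_xmit_fn)(struct toy_skb *);

/* Per-protocol table of transmit routines (a virtual function table). */
struct toy_neigh_ops {
    toy_xmit_fn output;            /* slow path: may need to resolve */
    toy_xmit_fn connected_output;  /* fast path: address known to be valid */
};

struct toy_neighbour {
    const struct toy_neigh_ops *ops; /* fixed when the entry is created */
    toy_xmit_fn output;              /* routine currently selected */
};

/* Stub routines; the return value only identifies the path taken. */
static int toy_resolve_output(struct toy_skb *skb)   { (void)skb; return 1; }
static int toy_connected_output(struct toy_skb *skb) { (void)skb; return 2; }

static const struct toy_neigh_ops toy_generic_ops = {
    .output           = toy_resolve_output,
    .connected_output = toy_connected_output,
};

/* Callers never choose a routine; they call through neigh->output. */
static int toy_send(struct toy_neighbour *neigh, struct toy_skb *skb)
{
    return neigh->output(skb);
}
```

Switching paths is then a single pointer assignment on the neighbor, with no change to any caller of toy_send.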
When a neighbor is created, its neighbour->ops field is initialized to the proper neigh_ops structure, as shown in Figure 27-3(a). This assignment
does not change during the neighbor's lifetime. However, as depicted in Figure 27-3(b), neigh->output can be changed to different functions
many times during the lifetime of the neighbor structure, driven both by events that take place during protocol operation, and (much less
often) by user commands. The following sections will go into detail on both initializations shown in Figure 27-3.
Figure 27-3. (a) Initialization of neigh->ops; (b) initialization of neigh->output
27.2.1. Initialization of neigh->ops
On certain types of devices, the initialization of the functions listed in Figure 27-3(b) could be further optimized to speed up transmissions.
These include, for instance, the situations described in the section "Special Cases" in Chapter 26, where there is no need to map an L3
address to an L2 address. In those cases, the neighboring subsystem can almost be bypassed altogether and only the queue_xmit function
described in Chapter 11 is needed. The protocol code needs to know this kind of detail, but the general neighboring infrastructure does
not, so the protocol can just initialize neigh->output to neigh->ops->queue_xmit and everything remains transparent to the upper layers. Simple!
For this reason, each protocol provides for three different instances of the neigh_ops VFT:
A generic table that can be used in any context (xxx_generic_ops). This is the one that is normally used to handle neighbors whose L2
addresses need to be resolved.
An optimized set of functions that can be used when the device driver provides its own set of functions to manipulate L2 headers
and thus take advantage of the speedup coming from the use of cached headers (xxx_hh_ops).
A table that can be used when the device does not need to map L3 addresses to L2 addresses (xxx_direct_ops). An example is the
use of ISDN with raw IP encapsulation.
When the neighbor instance is created, the protocol initializes the neigh_ops VFT to the right instance depending on several factors. See the
section "neigh_ops Structure" in Chapter 29.
In the specific case of IPv4/ARP, a fourth instance of neigh_ops called arp_broken_ops is used to initialize those neighbour instances associated with
old devices that have not been adapted to the new neighboring infrastructure and therefore would not work otherwise. This once again
shows how generic the neighboring infrastructure is: by initializing the neigh_ops VFT in the right way, the kernel is even able to use the old
ARP code.
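The choice among the three standard instances can be pictured as a small decision function. This is a sketch under assumptions: the toy_device fields and the order of the checks are hypothetical, and a real protocol constructor takes more factors into account.

```c
enum toy_ops_kind { TOY_GENERIC_OPS, TOY_HH_OPS, TOY_DIRECT_OPS };

/* Hypothetical device properties relevant to the choice. */
struct toy_device {
    int needs_l2_address;   /* 0 when no L3-to-L2 mapping is required */
    int caches_l2_headers;  /* driver can build and cache L2 headers */
};

static enum toy_ops_kind toy_pick_ops(const struct toy_device *dev)
{
    if (!dev->needs_l2_address)
        return TOY_DIRECT_OPS;   /* nothing to resolve */
    if (dev->caches_l2_headers)
        return TOY_HH_OPS;       /* benefit from cached headers */
    return TOY_GENERIC_OPS;      /* default: resolve, then transmit */
}
```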
27.2.2. Initialization of neigh->output and neigh->nud_state

The state of a neighbor (neigh->nud_state) and the neigh->output function depend on each other. When nud_state changes, output often has to be
updated accordingly. As a simple example, if the state becomes stale, confirmation of reachability is required. But the neighboring
infrastructure doesn't waste time confirming reachability right away; there might be no further traffic and the effort might be wasted.
Instead, the neighboring infrastructure stops using the optimized output function that blindly plugs in the current address, and switches to the
slower output function that checks the address. In the example in Figure 27-3(a), we would change connected_output from c1 to o1.
For help in understanding this section, check Figure 26-13 in Chapter 26 for the possible states that neigh->nud_state can assume, based on
device type and protocol events.
The neighboring subsystem provides a generic routine, neigh_update, that moves a neighbor to the state provided as an input argument. A
later section in this chapter describes neigh_update in detail, but let's first look at the most common changes of state and the helper routines
that can be called, either directly or via neigh_update, to take care of them.
Let's start with the most common case: a device that needs a neighboring protocol, an address that does not belong to any of the special
cases described in Chapter 26, and a change of state caused by a transition (that is, we exclude creation and deletion).[*] Figure 26-12 in
Chapter 26 can then be simplified to produce Figure 27-4. The figure also shows the kernel functions where the transitions are handled.
However, not all of the transitions made by calls to neigh_update are shown, because most are too generic to add any value to the figure; only
the transition triggered by the reception of a solicitation reply is shown.
[*] For the first initialization of neigh->output, check the source code of the constructor routines (e.g., arp_constructor/ndisc_constructor
for ARP/ND). For ARP, see the section "Initialization of a neighbour Structure" in Chapter 28.
Figure 27-4. Possible state transitions for a neighbor that has been resolved at least once
Note that some of the transitions in Figure 27-4 are asynchronous: they are taken care of by a timer and are therefore triggered by
timestamp comparisons.[*] Other transitions are taken care of synchronously by the protocols (e.g., neigh_event_send[†]).
[*] The routines used to compare timestamps, such as time_after_eq and time_before_eq, are defined in include/linux/jiffies.h.
[†] Part of neigh_event_send is also depicted in Figure 27-13 as part of the expanded neigh_resolve_output flowchart.
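Timestamp comparisons on a free-running tick counter have to survive wraparound, which is why dedicated routines are used instead of a plain >= test. The function below is a sketch of that idea written as an ordinary C function; toy_time_after_eq is a hypothetical name, and the kernel's own versions are macros with additional type checking.

```c
/* True if tick value a is at or after tick value b, even if the counter
 * wrapped around between the two. Valid as long as the two values are
 * less than half the counter range apart. */
static int toy_time_after_eq(unsigned long a, unsigned long b)
{
    return (long)(a - b) >= 0;
}
```

The subtraction is done on unsigned values, so it wraps in a well-defined way; interpreting the difference as signed then gives the ordering.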
27.2.2.1. Common state changes: neigh_connect and neigh_suspect

The main ways a neighbor can enter the NUD_REACHABLE state (all described in Chapter 26) are:
Reception of a solicitation reply
When a solicitation reply is received, either to resolve a mapping for the first time or to confirm a neighbor in the NUD_PROBE state,
the protocol updates neigh->nud_state via neigh_update. This update is synchronous and happens right away.
L4 confirmation
The first time neigh_timer_handler is executed after the reception of an L4 reachability confirmation, the state is changed to
NUD_REACHABLE (see the section "Reachability Confirmation" in Chapter 26). An L4 confirmation is asynchronous and may be
slightly delayed.
Manual configuration
When a new neighbour structure is created by the user through a system administration command, this command can specify the
state, and NUD_REACHABLE is a valid state. In this case, neigh_connect is invoked via neigh_update.
Whenever the NUD_REACHABLE state is entered, the neighboring infrastructure calls the neigh_connect function to make the neigh->output function point
to neigh_ops->connected_output.
When a neighbor in the NUD_REACHABLE state moves to NUD_STALE or NUD_DELAY, or is simply initialized to a state different from one of the states
in NUD_CONNECTED (for example, by a call to neigh_update), the kernel invokes neigh_suspect to enforce confirmation of reachability (see the section
"Reachability Confirmation" in Chapter 26). neigh_suspect does this by setting neighbour->output to neigh_ops->output.
Both neigh_connect and neigh_suspect also update the hh_output function of all of the hh_cache structures linked to the input
neighbour instance (see Figure 27-1). Neither function, however, updates the NUD state of a neighbour instance, because that is already taken
care of by their callers. Later in this chapter I'll use the forms "connect the neighbor" and "suspect the neighbor" to refer to the invocation of
neigh_connect and neigh_suspect, respectively, for that neighbor.
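What "connecting" and "suspecting" a neighbor amount to can be sketched as follows. The toy_ types and stub routines are hypothetical, and the sketch assumes that each cached header carries its own hh_output pointer that is switched together with the neighbor's output pointer. Neither function touches the NUD state, in line with the description above.

```c
#include <stddef.h>

struct toy_skb;
typedef int (*toy_xmit_fn)(struct toy_skb *);

struct toy_neigh_ops {
    toy_xmit_fn output;            /* checks or resolves the address first */
    toy_xmit_fn connected_output;  /* trusts the cached address */
    toy_xmit_fn hh_output;         /* trusts the cached L2 header */
};

struct toy_hh_cache {
    struct toy_hh_cache *hh_next;
    toy_xmit_fn hh_output;
};

struct toy_neighbour {
    const struct toy_neigh_ops *ops;
    toy_xmit_fn output;
    struct toy_hh_cache *hh;       /* list of cached headers */
};

/* Stub transmit routines used only to tell the paths apart. */
static int toy_slow(struct toy_skb *skb)    { (void)skb; return 1; }
static int toy_fast(struct toy_skb *skb)    { (void)skb; return 2; }
static int toy_hh_fast(struct toy_skb *skb) { (void)skb; return 3; }

static const struct toy_neigh_ops toy_ops = { toy_slow, toy_fast, toy_hh_fast };

/* "Connect the neighbor": use the fast routines everywhere. */
static void toy_neigh_connect(struct toy_neighbour *n)
{
    n->output = n->ops->connected_output;
    for (struct toy_hh_cache *hh = n->hh; hh != NULL; hh = hh->hh_next)
        hh->hh_output = n->ops->hh_output;
}

/* "Suspect the neighbor": fall back to the routine that verifies. */
static void toy_neigh_suspect(struct toy_neighbour *n)
{
    n->output = n->ops->output;
    for (struct toy_hh_cache *hh = n->hh; hh != NULL; hh = hh->hh_next)
        hh->hh_output = n->ops->output;
}
```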

Some transitions (changes of NUD state) can happen at any time and more than once during the lifetime of a neighbour instance. Others can
take place only once. With some knowledge of networking, it is not hard to look at Figure 26-13 in Chapter 26 and identify the transitions
that belong to each of the two categories. For those neighbour instances initialized to permanent states (for instance, NUD_NOARP), neigh->output can
be initialized to neigh_ops->connected_output right away and it will never change.
27.2.2.2. Routines used for neigh->output

As explained in the previous section, neigh->output is initialized by the neighbor's constructor function, and later is manipulated as a consequence
of protocol events via the two routines neigh_connect and neigh_suspect. neigh->output is always set to one of the virtual functions of neigh_ops. This
section lists the functions that can be assigned to the neigh_ops virtual functions. The dev_queue_xmit function, which is not really part of the
neighboring subsystem, is defined in net/core/dev.c. The other routines are defined in net/core/neighbour.c.
dev_queue_xmit
The L3 layer always calls this function when transmitting a packet, regardless of the kind of device or L2 and L3 protocols used.
A neighboring protocol initializes the function pointers of neigh_ops to dev_queue_xmit when all the information needed to transmit on the
egress device is present and there is no extra work for the neighboring subsystem to do. If you look at arp_direct_ops in Chapter 28,
you can see that all four transmission virtual functions are set to dev_queue_xmit. That function is described in Chapter 11.
neigh_connected_output
This function just fills in the L2 header and then calls neigh_ops->queue_xmit. Therefore, it expects the L2 address to be resolved. It is
used by neighbour structures in the NUD_CONNECTED state.
neigh_resolve_output
This function resolves the L3 address to the L2 address before transmitting, so it is used when that association is not ready yet
or needs to be confirmed. Except for the situations in the section "Special Cases" in Chapter 26, neigh_resolve_output is usually the
default routine used when a new neighbour structure is created and its L3 address needs to be resolved.
neigh_compat_output
This function is present for backward compatibility. Before the neighboring infrastructure was introduced, it was possible to call
dev_queue_xmit even if the L2 address was not ready yet.
neigh_blackhole
This function is used to handle the temporary case where a neighbour structure cannot be removed because someone is still
holding a reference to it. neigh_blackhole discards any packet received in input. This is necessary to ensure that no attempt to
transmit a packet to the neighbor will take place, because the neighbor's data structures are about to be removed. See the
section "Neighbor Deletion."
The section "Initialization of a neighbour Structure" in Chapter 28 shows how ARP uses these functions to initialize the different instances
of the neigh_ops VFT. The choices made by the functions are also shown in the flowchart in Figure 27-13.
27.2.3. Updating a Neighbor's Information: neigh_update

neigh_update, defined in net/core/neighbour.c, is a generic function that can be used to update the link layer address of a neighbour structure. This
is its prototype, with a brief description of the input parameters:
int neigh_update(struct neighbour *neigh, const u8 *lladdr, u8 new,
u32 flags)
neigh
Pointer to the neighbour structure to update.
lladdr
New link layer (L2) address. lladdr may not always be initialized to a new value. For instance, when neigh_update is called to delete a
neighbour structure (by setting its state to NUD_FAILED, as described in the section "Neighbor Deletion"), it is passed a NULL value for
lladdr.
new
New NUD state.
flags
Used to convey information such as whether an existing link layer address can be overridden, etc. Here are the available flags,
from include/net/neighbour.h:
NEIGH_UPDATE_F_ADMIN
Administrative change. This means the change derives from a user-space command (see the section "System Administration of
Neighbors" in Chapter 29).
NEIGH_UPDATE_F_OVERRIDE
The current L2 address can be overridden by lladdr. Administrative changes use this flag to distinguish between replace and add
commands, among other things (see Table 29-1 in Chapter 29). Protocol code can use this flag to enforce a minimum lifetime for
an L2 address (see, for example, the section "Final Common Processing" in Chapter 28).

The next three flags are used only by IPv6 code:
NEIGH_UPDATE_F_ISROUTER
The neighbor is a router. This flag is used to initialize the IPv6 flag NTF_ROUTER in neighbour->flags.
NEIGH_UPDATE_F_OVERRIDE_ISROUTER
The IPv6 NTF_ROUTER flag can be overridden.
NEIGH_UPDATE_F_WEAK_OVERRIDE
If the link layer address lladdr supplied in input differs from the current known link layer address of the neighbor neigh->ha, the address
is suspected (i.e., its state is moved to NUD_STALE so that reachability confirmation is triggered).
IPv6's ND protocol uses flags in the protocol header that can influence the setting of the NEIGH_UPDATE_F_XXX flags just listed. The
discussion that follows skips over the parts of neigh_update that deal with the IPv6-only flags.
neigh_update is used by all of the administrative interfaces to change the link layer address of a neighbour structure, as shown in Figure 29-1 in
Chapter 29. The function can also be used by the neighboring protocols themselves, but it is not the only function that changes state.
Figures 27-5(a) and 27-5(b) show a high-level description of neigh_update's internals. The flowchart is divided into different areas, each area
taking care of a different task:
Sanity checks
Changes applied to a neighbor whose current state is not NUD_VALID
Selection of the L2 address to use for a change applied to a neighbor whose current state is NUD_VALID
Setting a new link layer address
Change of NUD state
Handling an arp_queue queue
The following subsections explain the code in detail.
27.2.3.1. neigh_update optimization

Before changing the state of a neighbor, neigh_update first checks to see whether it is possible to avoid the change. An optimization discards
the change of state if both of the following conditions are met (see (c)):
The link layer address has not been modified (that is, the input lladdr is the same as the current neigh->ha).
The new state is NUD_STALE and the current one is NUD_CONNECTED, which means that the current state is actually better than the new
one.
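The two conditions can be read as a single predicate. The following user-space C sketch is an assumption-laden illustration, not the kernel's test: sketch_can_skip_update and its parameters are hypothetical names, and the NUD_* bit values and the composition of NUD_CONNECTED are assumed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Assumed bit values for the NUD states used in this sketch. */
enum {
    NUD_REACHABLE = 0x02,
    NUD_STALE     = 0x04,
    NUD_NOARP     = 0x40,
    NUD_PERMANENT = 0x80
};
/* Assumed composition of the derived NUD_CONNECTED state. */
#define NUD_CONNECTED (NUD_PERMANENT | NUD_NOARP | NUD_REACHABLE)

/*
 * Returns true when the requested change can be dropped: the address is
 * unchanged and the request would only downgrade a connected entry to
 * NUD_STALE.
 */
static bool sketch_can_skip_update(const unsigned char *cur_ha,
                                   const unsigned char *lladdr,
                                   size_t addr_len,
                                   unsigned int old_state,
                                   unsigned int new_state)
{
    assert(cur_ha != NULL && lladdr != NULL);

    bool same_addr = memcmp(cur_ha, lladdr, addr_len) == 0;
    bool downgrade = new_state == NUD_STALE &&
                     (old_state & NUD_CONNECTED) != 0;

    return same_addr && downgrade;
}
```

When the predicate holds, the caller would simply keep the old state instead of applying the new one.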

Figure 27-5a. neigh_update function
27.2.3.2. Initial neigh_update operations
In this section, we trace the decisions made by neigh_update as it handles various values for the current state (neighbour->nud_state) and the
requested state (the new parameter).
Figure 27-5b. neigh_update function
Only administrative commands (NEIGH_UPDATE_F_ADMIN) can change the state of a neighbor that is currently in the NUD_NOARP or NUD_PERMANENT
state. A sanity check at the beginning of neigh_update causes it to exit right away if these constraints are violated.
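The sanity check amounts to a small guard on the old state and the flags. A minimal user-space sketch follows; the function name sketch_update_permitted is hypothetical, and the numeric values given to the NUD_* states and to NEIGH_UPDATE_F_ADMIN are assumptions made for the example.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed bit values for the NUD states used in this sketch. */
enum {
    NUD_REACHABLE = 0x02,
    NUD_NOARP     = 0x40,
    NUD_PERMANENT = 0x80
};
/* Assumed value of the administrative-change flag. */
#define NEIGH_UPDATE_F_ADMIN 0x80000000U

/*
 * Returns false when a non-administrative caller tries to touch an entry
 * whose state is NUD_NOARP or NUD_PERMANENT.
 */
static bool sketch_update_permitted(unsigned int old_state, unsigned int flags)
{
    bool pinned = (old_state & (NUD_NOARP | NUD_PERMANENT)) != 0;
    bool admin  = (flags & NEIGH_UPDATE_F_ADMIN) != 0;

    return admin || !pinned;
}
```

A caller modeled on neigh_update would evaluate this guard first and return immediately, without touching the entry, when it yields false.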
When the new state new is not a valid one (that is, if it is NUD_NONE or NUD_INCOMPLETE), the neighbor timer is stopped if it is running, and the entry is
marked suspect (that is, requiring reachability confirmation) through neigh_suspect if the old state was NUD_CONNECTED. See the section
"Initialization of neigh->output and neigh->nud_state." When the new state is a valid one, the neighbor timer is restarted if the new state
requires it (NUD_IN_TIMER).
When neigh_update is asked to change the NUD state to a value different from the current one, which is normally the case, it needs to check
whether the state is changing from a value included in NUD_VALID to another value not in NUD_VALID (remember that NUD_VALID is a derived state
that includes multiple NUD_XXX values). In particular, when the old state was not NUD_VALID and the new one is NUD_VALID, the host has to
transmit all of the packets that are waiting in the neighbor's arp_queue queue. Since the state of the neighbor could change while doing this
(because the host may be a symmetric multiprocessing, or SMP, system), the state of the neighbor is rechecked before sending each
packet.
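The queue flush with its per-packet recheck can be sketched as follows. This is a simplified user-space model, not the kernel's loop: sketch_pkt, sketch_neigh, sketch_xmit, and the fail_after hook (which imitates another CPU invalidating the entry in the middle of the flush) are all hypothetical, locking is omitted, and the NUD_* values and the composition of NUD_VALID are assumed.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed bit values for the NUD states. */
enum {
    NUD_INCOMPLETE = 0x01,
    NUD_REACHABLE  = 0x02,
    NUD_STALE      = 0x04,
    NUD_DELAY      = 0x08,
    NUD_PROBE      = 0x10,
    NUD_FAILED     = 0x20,
    NUD_NOARP      = 0x40,
    NUD_PERMANENT  = 0x80
};
/* Assumed composition of the derived NUD_VALID state. */
#define NUD_VALID (NUD_PERMANENT | NUD_NOARP | NUD_REACHABLE | \
                   NUD_PROBE | NUD_STALE | NUD_DELAY)

struct sketch_pkt {
    int id;
    struct sketch_pkt *next;
};

struct sketch_neigh {
    unsigned int state;
    struct sketch_pkt *queue;  /* stand-in for arp_queue */
    int sent;                  /* packets handed to the transmit hook */
    int fail_after;            /* invalidate after this many sends; < 0 disables */
};

/* Hypothetical transmit hook that can invalidate the entry as a side effect. */
static void sketch_xmit(struct sketch_neigh *n, struct sketch_pkt *p)
{
    (void)p;
    n->sent++;
    if (n->fail_after >= 0 && n->sent >= n->fail_after)
        n->state = NUD_FAILED;
}

/*
 * Flush the pending queue when the entry has just become valid. The state
 * is tested again before every packet because it may change during the loop.
 */
static int sketch_flush_queue(struct sketch_neigh *n, unsigned int old_state)
{
    int flushed = 0;

    assert(n != NULL);
    if (old_state & NUD_VALID)
        return 0;              /* the entry was already valid: nothing new */

    while ((n->state & NUD_VALID) && n->queue != NULL) {
        struct sketch_pkt *p = n->queue;

        n->queue = p->next;
        p->next = NULL;
        sketch_xmit(n, p);
        flushed++;
    }
    return flushed;
}
```

Packets that remain queued after an early exit stay on the list, so a later transition back to a valid state can still send them.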
27.2.3.3. Changes of link layer address

The reason for calling neigh_update is to change the NUD state, but it can also change the destination link layer address by which a neighbor
is reached. The function will do this if a new link layer address is provided (that is, if the lladdr parameter is not NULL) and if the input
parameter flags allows it. When the link layer address is changed, all of the cached headers need to be updated accordingly. This is taken
care of by neigh_update_hhs.
When no link layer address is supplied to neigh_update (i.e., lladdr is NULL), and the current NUD state is not a valid one, neigh_update discards the
input frame skb and returns with an error (no change of state is applied if there is no valid link layer address for the neighbor).
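The address-handling rules in this subsection can be condensed into one helper. The sketch below is a user-space approximation under stated assumptions: sketch_apply_lladdr, SKETCH_ALEN, the hh_refreshes counter (standing in for a call to neigh_update_hhs), and the return codes are hypothetical, and the numeric values of the NUD_* states and of NEIGH_UPDATE_F_OVERRIDE are assumed. It reads "flags allows it" as requiring NEIGH_UPDATE_F_OVERRIDE only when a valid entry already holds a different address.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_ALEN 6  /* hypothetical fixed L2 address length */

/* Assumed bit values for the NUD states. */
enum {
    NUD_INCOMPLETE = 0x01,
    NUD_REACHABLE  = 0x02,
    NUD_STALE      = 0x04,
    NUD_DELAY      = 0x08,
    NUD_PROBE      = 0x10,
    NUD_NOARP      = 0x40,
    NUD_PERMANENT  = 0x80
};
/* Assumed composition of the derived NUD_VALID state. */
#define NUD_VALID (NUD_PERMANENT | NUD_NOARP | NUD_REACHABLE | \
                   NUD_PROBE | NUD_STALE | NUD_DELAY)
/* Assumed value of the override flag. */
#define NEIGH_UPDATE_F_OVERRIDE 0x00000001U

struct sketch_neigh {
    unsigned int state;
    unsigned char ha[SKETCH_ALEN];  /* current L2 address */
    int hh_refreshes;               /* stand-in for neigh_update_hhs calls */
};

/* Returns 1 if the address was replaced, 0 if kept, -1 if none is usable. */
static int sketch_apply_lladdr(struct sketch_neigh *n,
                               const unsigned char *lladdr,
                               unsigned int flags)
{
    assert(n != NULL);

    if (lladdr == NULL)  /* no address supplied by the caller */
        return (n->state & NUD_VALID) ? 0 : -1;

    if (n->state & NUD_VALID) {
        if (memcmp(lladdr, n->ha, SKETCH_ALEN) == 0)
            return 0;    /* same address: nothing to do */
        if (!(flags & NEIGH_UPDATE_F_OVERRIDE))
            return 0;    /* caller may not replace a valid address */
    }

    memcpy(n->ha, lladdr, SKETCH_ALEN);
    n->hh_refreshes++;   /* cached L2 headers must be rebuilt */
    return 1;
}
```

The -1 result corresponds to the error case described above, in which no state change is applied.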
27.2.3.4. Notifications to arpd
Some sites with large networks choose to manage ARP requests through a user-space daemon called arpd instead of making the kernel
do it. When the kernel is compiled with support for arpd, and its use is configured (that is, app_probes > 0), neigh_update notifies the daemon about
the following events:[*]
When a state is modified from NUD_VALID to a state that is not valid
When the link layer address is changed
[*] See the section "ARPD" in Chapter 28, and the section "neigh_parms Structure" in Chapter 29.
27.3. General Tasks of the Neighboring Infrastructure
This section describes a few general concepts that you should be familiar with before delving into specific functions within the neighboring
infrastructure: caching, reference counting, and timers.
27.3.1. Caching

The neighboring layer implements two kinds of caching:
Neighbor mappings
As with any other kind of data that can be used multiple times, it makes sense to cache the results of the L3-to-L2 mappings.
Negative results (where an attempt to resolve the address failed) are not cached. But the neighbour structures associated with
failed mappings are set to the NUD_FAILED state so that the garbage collection timer can clean them up (see the section
"Garbage Collection").
L2 headers
The neighboring infrastructure caches L2 headers to speed up the time required to encapsulate an L3 packet into an L2 frame.
Otherwise, the infrastructure would have to initialize each field of the L2 header one by one.
Because the caching of neighbor mappings is central to the operation of the neighboring subsystem, this section describes it in detail.
(The later section "L2 Header Caching" describes L2 header caching.) The contents of a neighbour structure are described in the section
"neighbour Structure" in Chapter 29, and the structure's creation and deletion are described in later sections in this chapter. Here we will
stay at a higher level, describing how those structures are organized and accessed by the neighboring infrastructure.
The neighboring infrastructure places neighbour structures into caches, one per protocol, which are implemented as typical hash tables
where elements that collide into the same bucket are linked into a singly linked list. New elements are added at the head of the lists (see
the function neigh_create in the section "The neigh_create Function's Parameters"). The inputs to the hash function that distributes
elements into buckets are the L3 address, the associated device, and a random value that is recomputed regularly to reduce the
effectiveness of a hypothetical Denial of Service (DoS) attack. Figure 27-6 shows the structure of the cache. In Figure 27-2, you can see
its relationship to other key data structures, such as the per-protocol neigh_table structure.
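To make the role of the three hash inputs concrete, here is a small user-space sketch. The mixing function is a generic integer mixer chosen for the example and is not the hash the kernel uses; sketch_mix, sketch_neigh_hash, and sketch_bucket are hypothetical names, and a 32-bit L3 address and an integer interface index are assumed as the key and the device identity.

```c
#include <assert.h>
#include <stdint.h>

/* Generic 32-bit mixer, used here only as a stand-in for the real hash. */
static uint32_t sketch_mix(uint32_t x)
{
    x ^= x >> 16;
    x *= 0x45d9f3bU;
    x ^= x >> 16;
    x *= 0x45d9f3bU;
    x ^= x >> 16;
    return x;
}

/*
 * Combine the L3 address, the device, and the per-table random value.
 * Changing hash_rnd changes the result for the same (address, device) pair,
 * which is what makes bucket collisions hard to predict from outside.
 */
static uint32_t sketch_neigh_hash(uint32_t l3_addr, int ifindex,
                                  uint32_t hash_rnd)
{
    return sketch_mix(l3_addr ^ sketch_mix((uint32_t)ifindex ^ hash_rnd));
}

/* The table size is a power of 2, so a mask selects the bucket. */
static uint32_t sketch_bucket(uint32_t hash, uint32_t hash_mask)
{
    return hash & hash_mask;
}
```

With a table of eight buckets, for example, the caller would pass a hash_mask of 7 and use the result as the index of the list to walk.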
Hash tables are allocated and freed with neigh_hash_alloc and neigh_hash_free, respectively. Each hash table is created with a size of
two elements at protocol initialization time (see neigh_table_init). When the number of elements in the table grows bigger than the number
of buckets, the table is reorganized as follows. First, the size of the table is doubled (thus, the size of the hash table is always a power of
2).
Figure 27-6. neighbour's cache
Second, the random value used for hashing is recalculated. Finally, the elements are redistributed throughout the table using the same previously
mentioned variables: L3 address, device, and random number. This extension of the hash table is performed by neigh_hash_grow, which
is called by neigh_create when necessary.
Note that extension of the hash table is easily triggered. Therefore, it rarely has more than one or two structures per bucket.
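The growth policy described above (start with two buckets, double when the entries outnumber the buckets, pick a new random value, and rehash) can be sketched in user-space C. All sketch_* names are hypothetical, rand() merely stands in for the kernel's random source, the hash is a placeholder, and locking and garbage collection are left out.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

struct sketch_neigh {
    uint32_t key;                   /* L3 address */
    int ifindex;                    /* device identity */
    struct sketch_neigh *next;      /* bucket chain */
};

struct sketch_table {
    struct sketch_neigh **buckets;
    uint32_t hash_mask;             /* number of buckets minus 1 */
    uint32_t hash_rnd;              /* per-table random value */
    unsigned int entries;
};

/* Placeholder hash over key, device, and random value. */
static uint32_t sketch_hash(const struct sketch_table *t, uint32_t key,
                            int ifindex)
{
    uint32_t x = key ^ (uint32_t)ifindex ^ t->hash_rnd;

    x ^= x >> 16;
    x *= 0x45d9f3bU;
    x ^= x >> 16;
    return x & t->hash_mask;
}

static int sketch_table_init(struct sketch_table *t)
{
    t->buckets = calloc(2, sizeof(*t->buckets));  /* initial size: two */
    if (t->buckets == NULL)
        return -1;
    t->hash_mask = 1;
    t->hash_rnd = (uint32_t)rand();
    t->entries = 0;
    return 0;
}

/* Double the table, pick a new random value, and redistribute the entries. */
static int sketch_grow(struct sketch_table *t)
{
    uint32_t old_size = t->hash_mask + 1, i;
    struct sketch_neigh **old = t->buckets;
    struct sketch_neigh **fresh = calloc(old_size * 2, sizeof(*fresh));

    if (fresh == NULL)
        return -1;
    t->buckets = fresh;
    t->hash_mask = old_size * 2 - 1;
    t->hash_rnd = (uint32_t)rand();
    for (i = 0; i < old_size; i++) {
        struct sketch_neigh *n = old[i], *next;

        for (; n != NULL; n = next) {
            uint32_t h = sketch_hash(t, n->key, n->ifindex);

            next = n->next;
            n->next = fresh[h];     /* relink at the head of the new bucket */
            fresh[h] = n;
        }
    }
    free(old);
    return 0;
}

/* Add an entry at the head of its bucket, growing the table first if needed. */
static int sketch_insert(struct sketch_table *t, struct sketch_neigh *n)
{
    uint32_t h;

    t->entries++;
    if (t->entries > t->hash_mask + 1 && sketch_grow(t) < 0) {
        t->entries--;
        return -1;
    }
    h = sketch_hash(t, n->key, n->ifindex);
    n->next = t->buckets[h];
    t->buckets[h] = n;
    return 0;
}

/* Count the entries reachable through the buckets. */
static unsigned int sketch_count(const struct sketch_table *t)
{
    unsigned int total = 0;
    uint32_t i;

    for (i = 0; i <= t->hash_mask; i++) {
        const struct sketch_neigh *n;

        for (n = t->buckets[i]; n != NULL; n = n->next)
            total++;
    }
    return total;
}
```

Because the table doubles as soon as the entries outnumber the buckets, the average chain length in this model stays at or below one, which matches the observation that buckets rarely hold more than one or two structures.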
The maximum number of elements in a table is controlled by the gc_threshX variables described in the section "Garbage Collection."
These limits are needed to prevent possible DoS attacks.
When the neighboring subsystem needs to search a hash table for a neighbor, the search key is the destination L3 address (primary_key)
together with the device (dev) through which the neighbor can be reached. Because different protocols may use keys of different lengths,
the common lookup APIs need to take into account the key length. Therefore, the key length is stored in the neigh_table structure.
The main function used to query a neighbor protocol's cache is neigh_lookup. There are two others, both wrappers around neigh_lookup,
that can either force the creation of a neighbour entry if the lookup fails or decide whether to create one according to an input parameter.
Here is a brief description of the three routines:
neigh_lookup

Checks whether the element being searched for exists, and returns a pointer to it when successful.
struct neighbour *neigh_lookup(struct neigh_table *tbl, const void *pkey,
                               struct net_device *dev)
{
    struct neighbour *n;
    int key_len = tbl->key_len;
    /* Select the bucket from the L3 key and the device. */
    u32 hash_val = tbl->hash(pkey, dev) & tbl->hash_mask;

    read_lock_bh(&tbl->lock);
    for (n = tbl->hash_buckets[hash_val]; n; n = n->next) {
        /* A match requires the same device and the same L3 key. */
        if (dev == n->dev &&
            !memcmp(n->primary_key, pkey, key_len)) {
            neigh_hold(n);  /* take a reference for the caller */
            NEIGH_CACHE_STAT_INC(tbl, hits);
            break;
        }
    }
    read_unlock_bh(&tbl->lock);
    return n;  /* NULL if no entry matched */
}
__neigh_lookup