Understanding Linux Network Internals (2005), part 6

IPSTATS_MIB_OUTREQUESTS
Number of packets that the system tried to transmit (successfully or not), not including forwarded packets. This
field is updated in ip_output (and in ip_mc_output for multicast).
IPSTATS_MIB_OUTDISCARDS
Number of packets whose transmission failed. This field is updated in several places, including ip_append_data,
ip_push_pending_frames, and raw_send_hdrinc.
IPSTATS_MIB_OUTNOROUTES
Number of locally generated packets discarded because there was no route to transmit them. Normally this field is
updated after a failure of ip_route_output_flow. ip_queue_xmit is one of the functions that can update it.
IPSTATS_MIB_OUTMCASTPKTS
Number of transmitted multicast packets. Not used by IPv4 at the moment.
Fields related to defragmentation
IPSTATS_MIB_REASMTIMEOUT
Number of packets that failed defragmentation because some of the fragments were not received in time. The
value reflects the number of complete packets, not the number of fragments. This field is updated in ip_expire,
which is the timer function executed when an IP fragment list is dropped due to a timeout. Note that this counter is
not used as defined in the two RFCs mentioned at the beginning of this section.
IPSTATS_MIB_REASMREQDS
Number of fragments received (and therefore the number of attempted reassemblies). This field is updated in
ip_defrag.
IPSTATS_MIB_REASMFAILS
Number of packets that failed the defragmentation. This field is updated in several places (__ip_evictor, ip_expire,
ip_frag_reasm, and ip_defrag) for different reasons.
IPSTATS_MIB_REASMOKS
Number of packets successfully defragmented. This field is updated in ip_frag_reasm.
Fields related to fragmentation
IPSTATS_MIB_FRAGFAILS
Number of failed fragmentation efforts. This field is updated in ip_fragment (and in ipmr_queue_xmit for multicast).
IPSTATS_MIB_FRAGOKS
Number of packets successfully fragmented. This field is updated in ip_fragment.
IPSTATS_MIB_FRAGCREATES
Number of fragments created. This field is updated in ip_fragment.
The values of these counters are exported in the /proc/net/snmp file.
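As a rough illustration of how a user-space tool might read one of these counters, the sketch below looks up a named column in a pair of lines laid out like the "Ip:" lines of /proc/net/snmp (one line of column names followed by one line of values). The function name snmp_lookup and the helper next_token are hypothetical, and reading the file itself is left out so that the parsing logic stands alone.

```c
#include <stdlib.h>
#include <string.h>

/* Skip leading whitespace, report where the next token starts and how
 * long it is, and return a pointer just past it. */
static const char *next_token(const char *s, const char **tok, size_t *len)
{
    s += strspn(s, " \t\n");
    *tok = s;
    *len = strcspn(s, " \t\n");
    return s + *len;
}

/* Hypothetical helper: walk the names line and the values line in
 * parallel and store the value whose column name equals `field`.
 * Returns 0 on success, -1 if the field is not present. */
int snmp_lookup(const char *names, const char *values,
                const char *field, unsigned long long *out)
{
    const char *n, *v;
    size_t nlen, vlen, flen = strlen(field);

    for (;;) {
        names = next_token(names, &n, &nlen);
        values = next_token(values, &v, &vlen);
        if (nlen == 0 || vlen == 0)
            return -1;              /* ran out of columns */
        if (nlen == flen && memcmp(n, field, flen) == 0) {
            *out = strtoull(v, NULL, 10);
            return 0;
        }
    }
}
```

A caller would typically read the two consecutive lines that start with "Ip:" from the file and pass them in unchanged; the leading "Ip:" token is simply treated as one more column that never matches a counter name.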
Each CPU keeps its own accounting information about the packets it processes. Furthermore, it keeps two counters: one for events in
interrupt context and the other for events outside interrupt context. Therefore, the ip_statistics array includes two elements per CPU, one
for interrupt context and one for noninterrupt context. Not all of the events can happen in both contexts, but to keep things simple and
clear, the vector has been defined with twice the size; those elements that do not make sense in one of the two contexts are
simply left unused.
Because some pieces of code can be executed both in interrupt context and outside interrupt context, the kernel provides three different
macros to add an event to the IP statistics vector:
#define IP_INC_STATS(field)      SNMP_INC_STATS(ip_statistics, field)
#define IP_INC_STATS_BH(field)   SNMP_INC_STATS_BH(ip_statistics, field)
#define IP_INC_STATS_USER(field) SNMP_INC_STATS_USER(ip_statistics, field)
The first can be used in either context, because it checks internally whether it was called in interrupt context and updates the right
element accordingly. The second and the third macros are to be used for events that happened in and outside interrupt context,
respectively. The macros IP_INC_STATS, IP_INC_STATS_BH, and IP_INC_STATS_USER are defined in include/net/ip.h, and the three
associated SNMP_INC_XXX macros are defined in include/net/snmp.h.
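The idea of one counter set per CPU and per context can be modeled in user space as follows. This is only a sketch under simplifying assumptions: the CPU number and the context are passed in explicitly, the CPU count is fixed, and every name ending in _sketch (as well as the field indices) is hypothetical. The kernel's real data layout and the way it detects the current CPU and context are not reproduced here.

```c
#include <stddef.h>

#define NR_CPUS_SKETCH 4                    /* hypothetical CPU count */

enum { CTX_BH = 0, CTX_USER = 1 };          /* interrupt / noninterrupt slot */
enum { MIB_OUTREQUESTS, MIB_REASMOKS, MIB_MAX_SKETCH };

struct ip_mib_sketch {
    unsigned long mibs[MIB_MAX_SKETCH];
};

/* Two elements per CPU: one per context, as described in the text. */
static struct ip_mib_sketch ip_stats_sketch[NR_CPUS_SKETCH][2];

/* Event known to happen in interrupt context. */
void inc_stats_bh(int cpu, int field)
{
    ip_stats_sketch[cpu][CTX_BH].mibs[field]++;
}

/* Event known to happen outside interrupt context. */
void inc_stats_user(int cpu, int field)
{
    ip_stats_sketch[cpu][CTX_USER].mibs[field]++;
}

/* Usable from either context: picks the slot at run time. */
void inc_stats(int cpu, int in_interrupt, int field)
{
    if (in_interrupt)
        inc_stats_bh(cpu, field);
    else
        inc_stats_user(cpu, field);
}

/* A reader has to sum every CPU and both contexts to get one value. */
unsigned long fold_field(int field)
{
    unsigned long sum = 0;
    int cpu, ctx;

    for (cpu = 0; cpu < NR_CPUS_SKETCH; cpu++)
        for (ctx = 0; ctx < 2; ctx++)
            sum += ip_stats_sketch[cpu][ctx].mibs[field];
    return sum;
}
```

Because no two writers share a slot in this layout, an increment needs no locking; the cost is moved to the reader, which has to fold all the slots together.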
23.4. IP Configuration

The Linux IP protocol can be tuned and configured manually by a system administrator in different ways. This tuning includes
changes both to the protocol itself and to device configuration. The four main interfaces are:
ioctl calls made via ifconfig
ifconfig is the older Unix-legacy tool for configuring IP on network devices.
RTNetlink via ip
ip, which is part of the IPROUTE2 package, is the newer tool that Linux offers for configuring IP on network devices.

/proc filesystem
Protocol behavior can be tuned via a collection of files in the directory /proc/sys/net/ipv4.
RARP/BOOTP/DHCP
These three protocols can be used to dynamically assign an IP configuration to a host and its interfaces.
The last set of protocols in the preceding list has an interesting twist. They are normally implemented in user space, but Linux also has
a simple kernel-space implementation that is useful when used together with the nfsroot boot option. The latter allows the kernel to
mount the root directory (/) via NFS. To do that, it needs an IP configuration at boot time before the system is able to initialize the IP
configuration from user space (which, by the way, could be stored in a remote partition and not even be available to the system when it
mounts the root directory). Via kernel boot options, it is possible to give nfsroot a static configuration, or specify what protocols (yes, more
than one can be used concurrently) to use to obtain the configuration. The IP configuration code is in net/ipv4/ipconfig.c, and the one
used by nfsroot is in fs/nfs/nfsroot.c. The two files cross-reference variables and functions, but they are actually simple to read. We will
not cover them, because network filesystems and user-space clients are outside the scope of this book. Once you know how to read
__setup macros (described in Chapter 7), reading the code should become a piece of cake. It is clear and well commented.
The third item in the list, /proc, is covered later in the section "Tuning via /proc Filesystem."
In this section, I will say a bit about the kernel interfaces that support the behavior of the first two items, ifconfig and ip. The purpose here
is not to cover the internals of the user-space commands or the associated kernel counterparts that handle configuration requests. It is to
show how user space and kernel space communicate, and the kernel functions that are invoked in response to a user-space command.
23.4.1. Main Functions That Manipulate IP Addresses and Configuration

In net/ipv4/devinet.c, you can find several functions that can be used to add an IP address to a network interface, delete an address from
an interface, modify an address, retrieve the IP configuration of a device given its device index or net_device data structure, etc. Here I
introduce only a few of the functions that will be useful, to help you to understand the functions described later when we talk about the ip
and ifconfig user-space tools.
Before reading these descriptions of functions, it would be worthwhile reviewing the key data structures used by the IP layer, introduced
in Chapter 19 and described in detail later in this chapter. For instance, a single IP address is represented by an in_ifaddr structure and the
complete IPv4 configuration of a device by an in_device structure.
inetdev_init and inetdev_destroy
inetdev_init is invoked when the first IP configuration is applied to a device. It allocates the in_device structure and links it to
the associated net_device instance. It also creates a directory in /proc/sys/net/ipv4/conf/ (see the section "Tuning via /proc
Filesystem").
The IP configuration can be removed with inetdev_destroy, which simply undoes whatever was done in inetdev_init, plus
removes all of the linked in_ifaddr structures. The latter are removed with inet_free_ifa, which also decrements the reference
count on the in_device structure with in_dev_put. When the last reference is released, probably with the last call to
inet_free_ifa, the in_device instance is freed with in_dev_finish_destroy.
inet_alloc_ifa and inet_free_ifa
These two functions allocate and free, respectively, an in_ifaddr data structure. A new one is allocated when a user adds a
new address to an interface. A deletion can be triggered by the removal of a single address, or by the removal of a
device's entire IP configuration. Both routines use the read-copy update (RCU) mechanism as a means to enforce mutual
exclusion.
inet_insert_ifa and inet_del_ifa
inet_insert_ifa adds a new in_ifaddr structure to the list within in_device. It detects duplicates and marks the address as
secondary if it finds out that it falls within another address's subnet. Suppose, for instance, that eth0 already has the address
10.0.0.1/24. When a new 10.0.0.2/24 address is added, it will be recognized as secondary with respect to the first. Primary
addresses are also used to feed the entropy of the kernel random number generator with net_srandom. More information on
primary and secondary addresses can be found in Chapter 30.
inet_del_ifa simply removes an in_ifaddr structure from the associated in_device instance. If the address is
primary, all of the associated secondary addresses are removed too, unless the administrator has explicitly configured the
device via its /proc/sys/net/ipv4/conf/dev_name/promote_secondaries file not to remove secondary addresses. In that case, a
secondary address is promoted to primary when the associated primary address is removed. Given the in_device
instance, this configuration can be accessed with the IN_DEV_PROMOTE_SECONDARIES macro. The inet_del_ifa function
accepts an extra input parameter that tells it whether the in_device structure should be freed when the last
in_ifaddr instance has been removed. While it is normal to remove the empty in_device structure, sometimes a caller might
not want to, such as when it knows it is going to add a new in_ifaddr soon.
In both cases, addition and deletion, successful completion leads to a Netlink broadcast notification with rtmsg_ifa (see the
section "Change Notification: rtmsg_ifa") and a notification to the other kernel subsystems via the inetaddr_chain notification
chain (see Chapter 4).
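The duplicate check and the primary/secondary classification described for inet_insert_ifa can be sketched in user space as shown below. This is a simplified model, not the kernel routine: addresses and masks are plain host-order integers, the list is a fixed-size array, and all names (addr_sketch, dev_sketch, insert_addr, FLAG_SECONDARY) are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_ADDRS      8
#define FLAG_SECONDARY 0x01u   /* hypothetical flag value */

struct addr_sketch {
    uint32_t address;          /* host byte order, for readability */
    uint32_t mask;
    unsigned flags;
};

struct dev_sketch {
    struct addr_sketch list[MAX_ADDRS];
    size_t count;
};

/* Add an address to the device's list.
 * Returns 0 on success, -1 if it is a duplicate or the list is full. */
int insert_addr(struct dev_sketch *dev, uint32_t address, uint32_t mask)
{
    unsigned flags = 0;
    size_t i;

    if (dev->count == MAX_ADDRS)
        return -1;

    for (i = 0; i < dev->count; i++) {
        const struct addr_sketch *cur = &dev->list[i];

        if (cur->address == address && cur->mask == mask)
            return -1;                      /* duplicate */

        /* Falls within the subnet of an existing primary address:
         * the new address becomes secondary. */
        if (!(cur->flags & FLAG_SECONDARY) &&
            ((cur->address ^ address) & cur->mask) == 0)
            flags = FLAG_SECONDARY;
    }

    dev->list[dev->count].address = address;
    dev->list[dev->count].mask = mask;
    dev->list[dev->count].flags = flags;
    dev->count++;
    return 0;
}
```

With 10.0.0.1/24 (0x0A000001, mask 0xFFFFFF00) already in the list, inserting 10.0.0.2/24 yields an entry flagged as secondary, which matches the eth0 example given above.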
inet_set_ifa
This is a wrapper for inet_insert_ifa that creates an in_device structure if none exists for the associated device, and sets the
scope of the address to local (RT_SCOPE_HOST) for addresses like 127.x.x.x. Refer to the section "Scope" in Chapter 30 for
more details on scopes.
Many other, smaller functions can be used to make the code more readable. Here are a few of them:
inet_select_addr
This function is used to select an IP address among the ones configured on a given device. The function accepts an optional
scope as a parameter, which can be used to narrow down the lookup domain. We will see where this function is useful in
Chapter 35.
inet_make_mask and inet_mask_len
Given the number of 1s the netmask is composed of, inet_make_mask creates the associated netmask. For example, an input
of 24 would generate the netmask with the dotted-decimal representation 255.255.255.0.
inet_mask_len is the converse, returning the number of 1s in a netmask. For instance, 255.255.0.0 would return 16.
inet_ifa_match
Given an IP address and a netmask, inet_ifa_match checks whether a given second IP address falls within the same subnet.
This function is often used to classify secondary addresses and to check whether a given IP address belongs to one of the
locally configured subnets. See, for instance, inet_del_ifa.
for_primary_ifa and for_ifa
These two macros can be used to browse all of the in_ifaddr instances associated with a given in_device
structure. for_primary_ifa considers only primary addresses, and for_ifa goes through all of them.
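The netmask helpers described above reduce to a few bit operations. The following user-space sketch mirrors what inet_make_mask, inet_mask_len, and inet_ifa_match are described as doing; the names make_mask, mask_len, and addr_in_subnet are hypothetical, and values are kept in host byte order for readability (an assumption of this sketch, since addresses on the wire use network byte order).

```c
#include <stdint.h>

/* Build a netmask from a prefix length: 24 -> 0xFFFFFF00 (255.255.255.0). */
uint32_t make_mask(int prefix_len)
{
    if (prefix_len <= 0)
        return 0;
    if (prefix_len >= 32)
        return 0xFFFFFFFFu;
    return (uint32_t)~((UINT32_C(1) << (32 - prefix_len)) - 1);
}

/* Count the leading 1 bits of a netmask: 0xFFFF0000 (255.255.0.0) -> 16. */
int mask_len(uint32_t mask)
{
    int len = 0;

    while (mask & 0x80000000u) {
        len++;
        mask <<= 1;
    }
    return len;
}

/* Nonzero if addr lies in the subnet defined by ifa_address/ifa_mask. */
int addr_in_subnet(uint32_t addr, uint32_t ifa_address, uint32_t ifa_mask)
{
    return ((addr ^ ifa_address) & ifa_mask) == 0;
}
```

The XOR in addr_in_subnet leaves a 1 in every bit position where the two addresses differ; masking keeps only the network part, so a zero result means both addresses share the same prefix.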
23.4.2. Change Notification: rtmsg_ifa

Netlink provides the RTMGRP_IPV4_IFADDR multicast group to user-space applications interested in changes to the locally configured
IP addresses. The kernel uses the rtmsg_ifa function to notify those applications that registered to the group when any change takes
place on the local IP addresses. The function can be called when two types of events occur:
RTM_NEWADDR
A new address has been configured on a device.
RTM_DELADDR
An address has been removed from a device.
The generated message is initialized with inet_fill_ifaddr, the same function used to handle dump requests from user space (with
commands such as ip addr list). The message includes the address being added or removed, and the device associated with it.
So, who is interested in this kind of notification? Routing protocols are a major example. If you are using Zebra, the routing protocols you
have configured would like to remove all of the routes that are directly or indirectly dependent on an address that has gone away. In
Chapter 31, you will learn more about the way routing protocols interact with the kernel routing subsystem.
23.4.3. inetaddr_chain Notification Chain

The IP subsystem uses the inetaddr_chain notification chain to notify other kernel subsystems about changes to the IP configuration of
the local devices. A kernel subsystem can register and unregister itself with inetaddr_chain by means of the register_inetaddr_notifier and
unregister_inetaddr_notifier functions. Here are two examples of users for this notification chain:
Routing
See the section "External Events" in Chapter 32.
Netfilter masquerading
When a local IP address is used by the Netfilter's masquerading feature, and that address disappears, all of the connections
that are using that address must be dropped (see net/ipv4/netfilter/ipt_MASQUERADE.c).
The NETDEV_DOWN and NETDEV_UP events are sent when an IP address is removed from and added to a local device,
respectively. Such notifications are generated by the inet_del_ifa and inet_insert_ifa routines introduced in the section "Main Functions
That Manipulate IP Addresses and Configuration."
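The register/notify pattern behind a notification chain can be modeled with a linked list of callbacks. The sketch below is a deliberately minimal user-space model: it ignores the priorities, return-value handling, and locking of the kernel's notifier chains, and every name in it (notifier_sketch, chain_register, chain_call, EV_UP, EV_DOWN, and so on) is hypothetical.

```c
#include <stddef.h>

/* Hypothetical event codes standing in for NETDEV_UP and NETDEV_DOWN. */
enum { EV_UP = 1, EV_DOWN = 2 };

struct notifier_sketch {
    int (*call)(struct notifier_sketch *self, unsigned long event, void *data);
    struct notifier_sketch *next;
};

static struct notifier_sketch *addr_chain;  /* head of the chain */

void chain_register(struct notifier_sketch *nb)
{
    nb->next = addr_chain;
    addr_chain = nb;
}

void chain_unregister(struct notifier_sketch *nb)
{
    struct notifier_sketch **pp = &addr_chain;

    while (*pp != NULL) {
        if (*pp == nb) {
            *pp = nb->next;
            return;
        }
        pp = &(*pp)->next;
    }
}

/* Invoke every registered callback; returns how many were called. */
int chain_call(unsigned long event, void *data)
{
    struct notifier_sketch *nb;
    int called = 0;

    for (nb = addr_chain; nb != NULL; nb = nb->next) {
        nb->call(nb, event, data);
        called++;
    }
    return called;
}

/* Example subscriber: counts address removals, much as a masquerading
 * module would react to an address that has gone away. */
int down_events_seen;

int count_down(struct notifier_sketch *self, unsigned long event, void *data)
{
    (void)self;
    (void)data;
    if (event == EV_DOWN)
        down_events_seen++;
    return 0;
}
```

A subscriber fills in a notifier_sketch with its callback, registers it once, and from then on is called for every event, filtering for the ones it cares about inside the callback.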
23.4.4. IP Configuration via ip

Traditionally, Unix system administrators configured interfaces and routes manually using ifconfig, route, and other commands. Currently
Linux provides an umbrella ip command to handle IP configuration, with a number of subcommands.
In this section we will see how IPROUTE2 handles the main addressing operations, such as adding and removing an address. Once you
are familiar with these operations, you can easily understand and read through the code for the others.
Figure 23-2 shows the files and the main functions of the IPROUTE2 package that are involved with IP address configuration activities.
The labels on the lines are ip keywords, and the nodes show the function invoked and the file the latter belongs to. For instance, the
command ip address add would be handled by ipaddr_modify.
Figure 23-2. IPROUTE2 files and functions for address configuration
Table 23-1 shows the association between the operation specified with a command-line keyword (e.g., add) and the handler run by
the kernel. For instance, when the kernel receives a request for an RTM_NEWADDR operation, it knows it is associated with an add
command and therefore invokes inet_rtm_newaddr. Some kernel operations are overloaded, and for these, the kernel needs extra flags
to figure out exactly what the user-space command is asking for. See Chapter 36 for an example. This association is defined in
net/ipv4/devinet.c in the inet_rtnetlink_table structure. For an introduction to RTNetlink, refer to Chapter 3.
Table 23-1. ip address commands and associated kernel operations
CLI keyword        Operation      Kernel handler
add                RTM_NEWADDR    inet_rtm_newaddr
delete             RTM_DELADDR    inet_rtm_deladdr
list, lst, show    RTM_GETADDR    inet_dumpifaddr
flush              RTM_GETADDR    inet_dumpifaddr
The list and flush commands need some explanation. list is simply a request to the kernel to dump information, for instance, about a given
device, and flush is a request to clear the entire IP configuration on the device.
The two functions inet_rtm_newaddr and inet_rtm_deladdr are wrappers for the generic functions inet_insert_ifa and inet_del_ifa that we
introduced in the section "Main Functions That Manipulate IP Addresses and Configuration." All the wrappers do is translate the request
that comes from user space into an input understandable by the two more-general functions. They also filter bad requests that are
associated with nonexistent devices.
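The association in Table 23-1, together with the filtering done by the wrappers, amounts to a table of function pointers indexed by operation. The sketch below illustrates that shape only. The operation indices are hypothetical and are not the real RTM_* numeric values, the handlers return made-up codes instead of touching any address list, and the device lookup is a stand-in that recognizes just two names.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical operation indices; NOT the real RTM_* values. */
enum addr_op { OP_NEWADDR, OP_DELADDR, OP_GETADDR, OP_COUNT };

/* Hypothetical result codes used to tell the handlers apart. */
enum {
    DID_ADD = 1, DID_DELETE = 2, DID_DUMP = 3,
    ERR_NO_HANDLER = -1, ERR_NO_DEVICE = -2
};

typedef int (*addr_handler)(const char *dev_name);

/* Stand-in for a device lookup: only these two names "exist". */
int device_exists(const char *dev_name)
{
    return strcmp(dev_name, "lo") == 0 || strcmp(dev_name, "eth0") == 0;
}

/* Wrappers: reject requests for unknown devices, then hand over to the
 * generic routine (represented here only by a result code). */
int sketch_newaddr(const char *dev_name)
{
    return device_exists(dev_name) ? DID_ADD : ERR_NO_DEVICE;
}

int sketch_deladdr(const char *dev_name)
{
    return device_exists(dev_name) ? DID_DELETE : ERR_NO_DEVICE;
}

int sketch_dumpaddr(const char *dev_name)
{
    (void)dev_name;
    return DID_DUMP;
}

/* One slot per operation, mirroring the rows of Table 23-1. */
static const addr_handler addr_table[OP_COUNT] = {
    [OP_NEWADDR] = sketch_newaddr,
    [OP_DELADDR] = sketch_deladdr,
    [OP_GETADDR] = sketch_dumpaddr,
};

int dispatch_addr_request(int op, const char *dev_name)
{
    if (op < 0 || op >= OP_COUNT || addr_table[op] == NULL)
        return ERR_NO_HANDLER;
    return addr_table[op](dev_name);
}
```

Note that list and flush share one row in this model too: both arrive as the same dump operation, and it is the user-space tool that decides what to do with the returned addresses.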
23.4.5. IP Configuration via ifconfig

ifconfig is implemented in the ifconfig.c user-space file (part of the net-tools package). Unlike ip, ifconfig uses ioctl calls to interface to the
kernel. However, a set of functions are used by both the ip and ifconfig handlers. In Chapter 3, we had an overview of how ioctl calls are
handled by the kernel. Here all we need to know is that the requests related to IPv4 configuration are handled by the inet_ioctl function in
net/ipv4/af_inet.c. Based on the ioctl code you can see what helper functions inet_ioctl uses to process the user-space commands (e.g.,
devinet_ioctl).
As for IPROUTE2, user-space requests from ifconfig are handled on the kernel side by wrappers that end up calling the functions in the
section "Main Functions That Manipulate IP Addresses and Configuration."
23.5. IP-over-IP
IP-over-IP, also called IP tunneling (or IPIP), consists of transmitting IP packets inside other IP packets. This protocol is useful in some
very interesting cases, including in a Virtual Private Network (VPN). Of course, nothing comes for free; you can well imagine the extra
weight of the doubling of the protocol: because each IP packet has two IP headers, the overhead becomes huge for small packets.
There are subtle complexities in implementation, too. For instance, what is the relationship between the IP options of the two headers?
If you consider just the IPv4 and IPv6 protocols, you already have four possible combinations of tunneling. But not all of these
combinations are likely to be used.
To make things more complex (I should actually say "flexible"), keep in mind that there is no limit to the number of recursions in
tunneling.[*]
[*] IPv6 defines the "tunnel encapsulation limit" as the maximum number of nested encapsulations. See section 6.6 of RFC 2473.
The different tunnel interfaces that can be created in Linux are not covered in this book. However, given the background on the IP
implementation in this part of the book, you can study the code in net/ipv4/ipip.c and include/net/ipip.h to derive the implementation
details.
23.6. IPv4: What's Wrong with It?

We saw in the section "IP Protocol: The Big Picture" in Chapter 18 what the main tasks of the IP protocol are. IPv4 was designed almost
25 years ago (in 1981), and given the speed with which the Internet and network services have evolved since then, the protocol is
showing its age. Because IPv4 was not originally designed with today's big network topologies and commercial uses in mind, it has
shown several limitations over the years. These have been only partially solved, sometimes with special extensions to the protocol (e.g.,
classless interdomain routing, DiffServ Code Point (DSCP) replacement of ToS, congestion notification, etc.), and other times by
defining specialized external protocols such as IPsec.
Thanks to the experience gained with IPv4, the new IPv6 version of the protocol has been designed to address the known shortcomings
of IPv4, taking into consideration such aspects as:
Functionality

Ease of configuration
Performance
Transition from IPv4 networks to IPv6 networks
Security
Naturally, the committees designing the new protocol have tried to keep IPv4 and IPv6 as compatible as possible, and the transition from
one to another as painless as possible. This compatibility and interaction have to be handled not only at the application layer, but also at
the kernel layer.
When analyzing IPv4 packet transmission, we saw that fragmentation and options processing were the two most expensive tasks. It
should not come as a surprise, therefore, that IPv6 addressed both points:
Fragmentation has been limited in IPv6: an IP packet can be fragmented only at the source.
The presence of IP options may sometimes inhibit the fast processing path: this is true for both software routers like Linux on
a PC and commercial hardware IP implementations. For a commercial implementation, it could mean that IP packets without
options can be forwarded in hardware at much higher speed, and the ones with options have to be handled in software. The
way options are handled by IPv6 is also different: IPv6 uses the concept of extensions, whose main advantage is that not all
of the routers have to process them.
One other big limitation of IPv4 is the 32-bit size of its addresses and the limited hierarchy they come with. Network Address Translation
(NAT) is only a short-term solution that partially solves the problem. NAT comes with some limitations, which are listed here:
Each protocol has to be treated specially, so some protocols don't always work passing through a NAT router (e.g., H323).
The NAT router becomes a single point of failure. Because it needs to keep state information for all the connections passing
through it, designing a network with redundancy or security in mind is not easy.
Its tasks are complex and computationally heavy when there is a need to support those complex protocols that have not been
designed with NAT support in mind (these are considered to be "not NAT-friendly").[*]
[*] You can read RFC 3235 if you would like to see what is considered a NAT-friendly protocol or application.

The limited number of addresses in IPv4 also contributes (because of its limited hierarchy) to the creation of huge routing tables. A core
router can have up to hundreds of thousands of routes. This trend is bad, for a couple of reasons:
The routes require lots of memory.
Lookups are slower.
Classless interdomain routing helps in reducing the size of the routing tables, but cannot solve the limited address space problem of IPv4.
In IPv6, the address has been made four times bigger in size, which does not mean four times as many addresses, but rather 2^96 times
as many! This potentially brings systems outside the NAT router and makes them full-fledged citizens of the Internet, with implications for
new types of applications.
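The factor follows directly from the two address widths, 32 bits for IPv4 and 128 bits for IPv6. Written out, with the decimal approximation added only for scale:

```latex
\[
  \frac{2^{128}}{2^{32}} \;=\; 2^{128-32} \;=\; 2^{96} \;\approx\; 7.9 \times 10^{28}
\]
```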
IPv4 was not designed with security in mind. Because of this, several approaches of different granularity have been developed:
application end-to-end solutions such as Secure Sockets Layer (SSL), host end-to-end solutions such as IPsec, etc. Each has its own
pros and cons. SSL requires the applications to be written to use that security layer (which sits on top of TCP), whereas IPsec (which is
what most people identify VPNs with) does not: IPsec sits at the L3 layer and therefore is transparent to applications. IPsec can be used
by both IPv4 and IPv6, but it fits better with IPv6.
With IPv6, the neighboring subsystem has changed as well. It is called neighbor discovery, and represents the counterpart to ARP for IPv4.
The QoS component is also expanded.
With IPv4 networks, it is already possible to carry out automatic host configuration, thanks to protocols such as DHCP; however, some
constraints make that solution less Plug and Play (PnP) than it should be. This issue has been solved by IPv6 too, with the so-called
autoconfiguration feature.
23.7. Tuning via /proc Filesystem

The /proc filesystem was introduced in Chapter 3; it provides a simple interface for users to view and change kernel parameters and is
the model for the newer sysfs directory. It contains a huge number of files (or rather, virtual data structures that look to the user just like
files) that map to variables and functions inside the kernel and that can be used to tune the behavior of the networking component of the
kernel as well.
The files used for IPv4 tuning are located mainly in two directories:
/proc/sys/net/ipv4/

Table 23-2 shows some of the files in this directory that are used by IPv4. The kernel variables associated with those files are
declared in net/ipv4/sysctl_net_ipv4.c and are statically registered at boot time (see Chapter 3). Note that the directory
contains many more files than the ones in Table 23-2. Most of the extra files are associated with L4 protocols, especially TCP.
/proc/sys/net/ipv4/conf/
This directory contains a subdirectory for each network device recognized by the kernel, plus other special directories (see
Figure 36-4 in Chapter 36). Those subdirectories include configuration parameters that are device specific; among them are
accept_redirects, send_redirects, accept_source_route, and forwarding. These will be covered in Chapter 36, with the
exception of promote_secondaries, which is described in the section "Main Functions That Manipulate IP Addresses and
Configuration."
Table 23-2. IPv4-related files in /proc/sys/net/ipv4
/proc filename            Associated kernel variable         Default value
ip_forward                ipv4_devconf.forwarding            0
ip_no_pmtu_disc           ipv4_config.no_pmtu_disc           0
ip_autoconfig             ipv4_config.autoconfig             0
ip_default_ttl            sysctl_ip_default_ttl              IPDEFTTL (64)
ip_nonlocal_bind          sysctl_ip_nonlocal_bind            0
ip_local_port_range       sysctl_ip_local_port_range[0]      1
                          sysctl_ip_local_port_range[1]      65535 [a]
ipfrag_high_thresh        sysctl_ipfrag_high_thresh          256K
ipfrag_low_thresh         sysctl_ipfrag_low_thresh           192K
ipfrag_time               sysctl_ipfrag_time                 IP_FRAG_TIME (30 * HZ)
ipfrag_secret_interval    sysctl_ipfrag_secret_interval      10 * 60 * HZ
ip_dynaddr                sysctl_ip_dynaddr                  0
inet_peer_gc_maxtime      inet_peer_gc_maxtime               120 * HZ
inet_peer_gc_mintime      inet_peer_gc_mintime               10 * HZ
inet_peer_maxttl          inet_peer_maxttl                   10 * 60 * HZ
inet_peer_minttl          inet_peer_minttl                   120 * HZ
inet_peer_threshold       inet_peer_threshold                65536 + 128 [b]
[a] These values are updated by tcp_init at boot time based on the amount of memory available in the system. Even if they are updated
by TCP, they are used by any L4 protocol that uses ports.
[b] This value is updated by inet_initpeers at boot time based on the amount of memory available in the system.
The first three elements in Table 23-2 are members of two data structures of type ipv4_devconf and ipv4_config, located, respectively, in
include/linux/inetdevice.h and include/net/ip.h and described later in this chapter. The other elements of those structures are either
exported elsewhere or not exported at all (we will cover them in the associated chapters). The meaning of the files and kernel variables
is as follows:
ip_forward
Set to a nonzero value to enable the device to forward traffic. See the section "Enabling and Disabling Forwarding" in Chapter 36.
ip_no_pmtu_disc
When 0, path MTU discovery is enabled.
ip_autoconfig
This is set to 1 when the IP configuration of the host was done via a protocol such as DHCP. See the section "IP
Configuration."
ip_default_ttl
This is the default value of the IP TTL field used for unicast traffic. Multicast traffic uses the default value of 1 and does not
have an equivalent sysctl variable to set it.
ip_nonlocal_bind
When nonzero, it is possible for an application to bind to an address that is not local to the host. This allows, for instance,
binding a socket to an address even if the associated interface is down.
ip_local_port_range
Range of ports that can be used for outgoing connections.
ipfrag_high_thresh
ipfrag_low_thresh
Thresholds used to limit the amount of memory used by incoming IP fragments. When the memory used by fragments
reaches ipfrag_high_thresh, old entries are removed until the memory used declines to ipfrag_low_thresh. See the section
"Garbage Collection."
ipfrag_time
Maximum amount of time incoming IP fragments are kept in memory before expiring.
ipfrag_secret_interval
Interval after which the incoming IP fragments that are in the hash table are extracted and reinserted with a different hash
function. See the section "Hash Table Reorganization" in Chapter 22.
ip_dynaddr
This variable is used to handle the case of sockets bound to addresses associated with dial-on-demand interfaces that do not
receive any reply until the interface comes up. If ip_dynaddr is set, the sockets will retry binding.
inet_peer_threshold

Maximum number of inet_peer structures that can be allocated.
inet_peer_gc_maxtime
inet_peer_gc_mintime
Amount of time between regular garbage collection passes. Since the amount of memory usable by the inet_peer structures
is limited (by inet_peer_threshold), there is a regular timer that expires unused entries based on these two variables.
inet_peer_gc_maxtime is used when the system is not heavily loaded, and inet_peer_gc_mintime is used in the opposite case.
Thus, the more entries there are, the more frequently the timer expires.
inet_peer_maxttl
inet_peer_minttl
Maximum and minimum TTL of inet_peer entries. These values are supposed to be bigger than sysctl_ipfrag_time, for obvious
reasons.
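The interplay between ipfrag_high_thresh and ipfrag_low_thresh described above is a classic high/low watermark scheme. The sketch below models it in user space under simplifying assumptions: the pending fragment queues sit in an array ordered from oldest to newest, eviction starts once usage reaches the high threshold (the exact comparison the kernel uses is not claimed here), and the names frag_queue_sketch and evict_fragments are hypothetical.

```c
#include <stddef.h>

struct frag_queue_sketch {
    size_t mem;     /* bytes held by this partially reassembled packet */
    int alive;      /* 1 while still queued, 0 once evicted */
};

/* If usage has reached `high`, drop queues oldest-first (index 0 is the
 * oldest) until usage is no larger than `low`.
 * Returns the number of queues evicted. */
size_t evict_fragments(struct frag_queue_sketch *q, size_t n,
                       size_t *mem_used, size_t high, size_t low)
{
    size_t evicted = 0;
    size_t i;

    if (*mem_used < high)
        return 0;                   /* below the high watermark: no work */

    for (i = 0; i < n && *mem_used > low; i++) {
        if (!q[i].alive)
            continue;
        q[i].alive = 0;
        *mem_used -= q[i].mem;
        evicted++;
    }
    return evicted;
}
```

With the default thresholds from Table 23-2 (256K and 192K) and four queues of 70,000 bytes each, usage starts at 280,000 bytes, which is above the high threshold; two queues are evicted, bringing usage to 140,000 bytes, and the next call does nothing because usage is below the high threshold again. The gap between the two thresholds is what prevents the collector from running on every incoming fragment.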
23.8. Data Structures Featured in This Part of the Book

The section "Main IPv4 Data Structures" in Chapter 19 gave a brief overview of the main data structures. This section has a detailed
description of each data structure type. Figure 23-3 shows the file that defines each data structure.
23.8.1. iphdr Structure
The meaning of its fields has already been covered in the section "IP Header" in Chapter 18.
23.8.2. ip_options Structure

This structure represents the options for a packet that needs to be transmitted or forwarded. The options are stored in this structure
because it is easier to read than the corresponding portion of the IP header itself.
Figure 23-3. Distribution of data structures in kernel files
Let's go field by field. They should be fairly simple to understand if you have read the section "IP Options" in Chapter 18. After this
description, you will be able to understand more easily how the parsing is done and how its results are used by the IP layer subsystems,
such as the code that processes incoming IP packets. Some of the bit fields are grouped together into an unsigned char; the
declarations of these end with :1.
unsigned char optlen
Length of the set of options. As explained in Chapter 18, this is limited to a maximum of 40 bytes by the definition of the IP
header.
unsigned char is_changed:1
Set if the IP header has been modified (such as an IP address or a timestamp). This is useful to know because if the packet
has to be forwarded, this field indicates that the IP checksum has to be recomputed.
__u32 faddr
unsigned char is_strictroute:1
unsigned char srr
unsigned char srr_is_hit:1
faddr is meaningful only for transmitted packets (that is, those generated locally) and only for those using source routing. The
value of faddr is set to the first of the IP addresses provided for source routing. See the section "Option: Strict and Loose
Source Routing" in Chapter 19.
is_strictroute is a flag set to true when Strict Source Route is among the options.
srr contains the offset of the Source Route option in the header. If the option is not used, the value is zero.
srr_is_hit is true if the packet was source routed and the IP address of the receiving interface is one of the addresses in the
source route list (see ip_options_rcv_srr in net/ipv4/ip_options.c).
unsigned char rr
When rr is nonzero, Record Route is one of the IP options and the value of this field represents the offset inside the IP header
where the option starts. This field is used together with rr_needaddr.
unsigned char rr_needaddr:1
When rr_needaddr is true, Record Route is one of the IP options and there is still room in the header for another route;
therefore, the current node should copy the IP address of the outgoing interface into the IP header at the offset specified by
rr.
unsigned char ts
When ts is nonzero, Timestamp is one of the IP options and this field represents the offset inside the IP header where the
option starts. This field is used together with ts_needaddr and ts_needtime.
unsigned char is_setbyuser:1
This field makes sense only for transmitted packets and is set when the options were passed from user space with the
system call setsockopt. Currently, however, it is never used.
unsigned char is_data:1
unsigned char _data[0]
These fields are used in two situations: when the local node transmits a locally generated packet, and when the local node
replies to an ICMP echo request. In these cases, is_data is true and _data refers to an area containing the options to append
to the IP header. The [0] declaration is a common C convention for a variable-length trailing array: it takes no space in the
structure itself and simply names the memory that immediately follows it.
When forwarding a packet, the options are in the associated skb buffer (see the ip_options_get function in the
net/ipv4/ip_options.c file).
unsigned char ts_needtime:1
When this flag is true, Timestamp is one of the IP options and there is still room in the header for another timestamp;
therefore, the current node should add the time of transmission into the IP header at the offset specified by ts.
unsigned char ts_needaddr:1
Used with ts and ts_needtime to indicate that the IP address of the egress device should also be copied into the IP header.
unsigned char router_alert
When this field is nonzero, Router Alert is one of the IP options.
unsigned char __pad1, __pad2
Because memory accesses are faster when the location is aligned to a 32-bit boundary, the Linux kernel data structures are
often padded out with unused fields called __padn in order to make their sizes a multiple of 32 bits. This is the only purpose
of __pad1 and __pad2; they are not used otherwise.
The srr, rr, and ts fields are also useful when parsing the options, because they make it possible to detect options that are present
more than once, which is illegal (see the section "Option Parsing" in Chapter 19).
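The role of the srr, rr, and ts offsets in catching duplicated options can be sketched as follows. This is a simplified, user-space illustration and not the kernel's parser: the structure opt_offsets, the helpers record_once and scan_options, and the OPT_* names are all hypothetical. Only the option type values (7, 68, 131, 137) and the 20-byte base header length come from the IPv4 specification. Offsets are counted from the start of the IP header, so a valid offset is never zero.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Option type values defined by the IPv4 specification. */
#define OPT_END   0
#define OPT_NOOP  1
#define OPT_RR    7     /* Record Route */
#define OPT_TS    68    /* Timestamp */
#define OPT_LSRR  131   /* Loose Source Route */
#define OPT_SSRR  137   /* Strict Source Route */

#define IP_BASE_HDR_LEN 20  /* IPv4 header without options */

/* Hypothetical, reduced version of ip_options: offsets only. */
struct opt_offsets {
    unsigned char srr;            /* offset of Source Route, 0 if absent */
    unsigned char rr;             /* offset of Record Route, 0 if absent */
    unsigned char ts;             /* offset of Timestamp, 0 if absent */
    unsigned char is_strictroute; /* 1 if the source route is strict */
};

/* Record an offset, failing if the same option was already seen. */
static int record_once(unsigned char *field, size_t offset)
{
    if (*field != 0)
        return -1;
    *field = (unsigned char)offset;
    return 0;
}

/*
 * Walk the options area (at most 40 bytes) and fill in *out.
 * Returns 0 on success, -1 on a malformed or duplicated option.
 */
static int scan_options(const unsigned char *opts, size_t optlen,
                        struct opt_offsets *out)
{
    size_t i = 0;

    assert(opts != NULL && out != NULL);
    memset(out, 0, sizeof(*out));
    if (optlen > 40)
        return -1;

    while (i < optlen) {
        unsigned char type = opts[i];
        size_t len;

        if (type == OPT_END)
            break;
        if (type == OPT_NOOP) {     /* single-byte option */
            i++;
            continue;
        }
        /* Every other option carries a length byte. */
        if (i + 1 >= optlen)
            return -1;
        len = opts[i + 1];
        if (len < 2 || i + len > optlen)
            return -1;

        switch (type) {
        case OPT_SSRR:
            out->is_strictroute = 1;
            /* fall through */
        case OPT_LSRR:
            if (record_once(&out->srr, IP_BASE_HDR_LEN + i) < 0)
                return -1;
            break;
        case OPT_RR:
            if (record_once(&out->rr, IP_BASE_HDR_LEN + i) < 0)
                return -1;
            break;
        case OPT_TS:
            if (record_once(&out->ts, IP_BASE_HDR_LEN + i) < 0)
                return -1;
            break;
        default:
            break;                  /* options not tracked by this sketch */
        }
        i += len;
    }
    return 0;
}
```

Because a recorded offset doubles as a "seen" marker, no separate per-option flag is needed to reject a second Record Route, Timestamp, or Source Route option.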
23.8.3. ipcm_cookie Structure

This structure combines various pieces of information needed to transmit a packet.
struct ipcm_cookie
{
    u32 addr;
    int oif;
    struct ip_options *opt;
};
The destination IP address is addr, the egress device is oif if defined, and the IP options are in an ip_options structure. Note that addr is
the only field that is always set. oif is 0 if there are no constraints on which device to use.
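As a small illustration of how these three fields relate, the sketch below mirrors the structure in user space and fills it with defaults. The names ipcm_cookie_sketch, cookie_init, and cookie_bind_device are hypothetical, and using NULL for "no options" is an assumption of this sketch rather than something stated above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct ip_options;                 /* opaque in this sketch */

/* User-space mirror of the structure, using a fixed-width type. */
struct ipcm_cookie_sketch {
    uint32_t addr;                 /* destination IP address (always set) */
    int oif;                       /* egress device index, 0 = no constraint */
    struct ip_options *opt;        /* IP options; NULL here means none */
};

/* Hypothetical helper: only the destination address is mandatory. */
static struct ipcm_cookie_sketch cookie_init(uint32_t daddr)
{
    struct ipcm_cookie_sketch ipc;

    ipc.addr = daddr;
    ipc.oif = 0;
    ipc.opt = NULL;
    return ipc;
}

/* Hypothetical helper: constrain transmission to one device. */
static void cookie_bind_device(struct ipcm_cookie_sketch *ipc, int ifindex)
{
    assert(ipc != NULL && ifindex > 0);
    ipc->oif = ifindex;
}
```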
23.8.4. ipq Structure

Here is the description of the fields of the ipq structure. For the sake of simplicity, not all fields are shown in Figure 22-1 in Chapter 22.
struct ipq *next
When the fragments are put into the ipq_hash hash table, conflicting elements (elements with the same hash value) are
linked together with this field. Note that this field does not indicate the order of fragments within the packet; it is used simply
as a standard way to organize the hash table. The order of fragments within the packet is controlled by the fragments field
(see Figure 22-1 in Chapter 22).
struct ipq **pprev
Pointer back to the head of the list of IP packets that have the same hash value.
struct list_head lru_list
All of the ipq structures are kept sorted in a global list, ipq_lru_list, based on a least-recently-used criterion. This list is useful
when performing garbage collection. This field is used to link the ipq structure to such a list.
u32 user
The reason why an IP packet is to be defragmented, which indirectly says what kernel subsystem asked for the
defragmentation. The list of allowed values for IP_DEFRAG_XXX is in include/net/ip.h. The most common one is
IP_DEFRAG_LOCAL_DELIVER, which is used when defragmenting ingress packets that are to be delivered locally.
u32 saddr
u32 daddr
u16 id
u8 protocol
These parameters represent the source IP address, destination IP address, IP packet ID, and L4 protocol identifier,
respectively. As described in Chapter 18, these four parameters identify the original IP packet a fragment belongs to. For that
reason, they are also the parameters used by the hash function to optimally spread elements throughout the hash table.
u8 last_in
Stores three flags, whose possible values are:
COMPLETE
All of the fragments have been received and can therefore be joined together to obtain the original IP packet. This
flag can also be used to mark those ipq structures that have been chosen for deletion (see ipq_kill in
net/ipv4/ip_fragment.c).
FIRST_IN
The first of the fragments (the one with offset=0) has been received. The first fragment is the only one carrying all
of the options that were in the original IP packet.
LAST_IN
The last of the fragments (the one with MF=0) has been received. The last fragment is important because it is the
one that tells us the size of the original IP packet.
struct sk_buff *fragments
List of fragments received so far.
int len
Offset where the fragment with the biggest offset ends. When the last fragment is received (the one with MF=0), len will tell
the size of the original IP packet.
int meat
Represents how many bytes of the original packet we have received so far. When its value is the same as len, the packet has
been completely received.
spinlock_t lock
Protects the structure from race conditions. It could happen, for instance, that different IP fragments are received at the same
time by different NICs handled by different CPUs.
atomic_t refcnt
Counter used to keep track of external references to this packet. As an example of its purpose, the timer in the timer field increments
refcnt to make sure that no one frees the ipq structure while the timer is still pending; otherwise, the timer might
expire and try to access a data structure that does not exist anymore. You can imagine the consequences.
struct timer_list timer
Chapter 18 explained why IP fragments cannot stay forever in memory and should be removed after some time if
defragmentation is not possible. This field is the timer that takes care of that.
int iif
ID of the device from which the last fragment was received. When a list of fragments expires, this field is used to decide which
device to use to transmit the FRAGMENTATION REASSEMBLY TIMEOUT ICMP message (see ip_expire in the
net/ipv4/ip_fragment.c file).
struct timeval stamp
Time when the last fragment was received (see ip_frag_queue in net/ipv4/ip_fragment.c).
The ipq_hash table is protected by ipfrag_lock, which can be taken in either shared (read-only) or exclusive (read-write) mode. Do not
confuse this lock with the one embedded in each ipq element.
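The interplay of last_in, len, and meat can be sketched in a few lines. This is a user-space illustration under simplifying assumptions (no overlapping or duplicated fragments, no locking): the structure frag_queue_sketch and the two helpers are hypothetical, and the numeric values given to the flags are illustrative. Note that meat equals len as soon as the fragments received so far are contiguous from offset 0, so the sketch also requires both FIRST_IN and LAST_IN before declaring the packet ready.

```c
#include <assert.h>
#include <stddef.h>

/* Flag bits stored in last_in; the numeric values are illustrative. */
#define LAST_IN  1
#define FIRST_IN 2
#define COMPLETE 4

/* Hypothetical, reduced version of ipq. */
struct frag_queue_sketch {
    unsigned char last_in;  /* FIRST_IN / LAST_IN / COMPLETE bits */
    int len;                /* end of the fragment with the biggest offset */
    int meat;               /* bytes of the original packet received so far */
};

/*
 * Account for one fragment covering [offset, offset + size).
 * more_fragments is the MF bit from the fragment's IP header.
 */
static void frag_received(struct frag_queue_sketch *q,
                          int offset, int size, int more_fragments)
{
    assert(q != NULL && offset >= 0 && size > 0);

    if (offset == 0)
        q->last_in |= FIRST_IN;
    if (!more_fragments)
        q->last_in |= LAST_IN;
    if (offset + size > q->len)
        q->len = offset + size;
    q->meat += size;
}

/* The packet can be rebuilt once both ends are in and no byte is missing. */
static int frag_queue_ready(const struct frag_queue_sketch *q)
{
    return q->last_in == (FIRST_IN | LAST_IN) && q->meat == q->len;
}
```

Fragments may arrive in any order: if the last fragment arrives first, len is already final but meat lags behind until the earlier fragments show up.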
23.8.5. inet_peer Structure

The kernel keeps an instance of this structure for each remote host it has been talking to in the recent past. In the section "Long-Living IP
Peer Information," you saw how it is used. All instances of inet_peer structures are kept in an AVL tree, a structure optimized for frequent
lookups. The functions used to manipulate inet_peer instances are in net/ipv4/inetpeer.c.
struct inet_peer *avl_left
struct inet_peer *avl_right
Left and right pointers to the two subtrees.
__u16 avl_height
Height of the AVL tree.
struct inet_peer *unused_next
struct inet_peer **unused_prevp
Used to link the node into a list that contains elements that expired. unused_prevp is used to check whether the node is in
that list.
A node can be put into that list and then taken back out of it several times without ever being removed completely. See the
section "Garbage Collection."
unsigned long dtime
Time when this element was added to the unused list inet_peer_unused_head via inet_putpeer.
atomic_t refcnt
Reference count for the element. Among the users of this structure are the routing subsystem and the TCP layer.
__u32 v4daddr
IP address of the remote peer.
__u16 ip_id_count
IP packet ID to use next for this peer (see inet_getid in include/net/inetpeer.h).
__u32 tcp_ts
unsigned long tcp_ts_stamp
Used by TCP to manage timestamps.
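How a per-peer counter such as ip_id_count can hand out IP packet IDs is sketched below. This is not the kernel's inet_getid: the names peer_sketch and peer_next_id are hypothetical, the more parameter (extra IDs reserved in one call) is an assumption of this sketch, and the locking a real implementation needs is omitted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, reduced version of inet_peer. */
struct peer_sketch {
    uint32_t v4daddr;      /* IP address of the remote peer */
    uint16_t ip_id_count;  /* IP packet ID to use next for this peer */
};

/*
 * Return the ID for the next packet sent to this peer and advance the
 * counter by 1 + more. The counter is 16 bits wide, like the IP ID field,
 * so it wraps around naturally.
 */
static uint16_t peer_next_id(struct peer_sketch *p, unsigned int more)
{
    uint16_t id;

    assert(p != NULL);
    id = p->ip_id_count;
    p->ip_id_count = (uint16_t)(p->ip_id_count + 1 + more);
    return id;
}
```

Keeping the counter per peer means that the IDs seen by one remote host advance independently of the traffic sent to any other host.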
23.8.6. ipstats_mib Structure

The SNMP protocol employs a type of object called an MIB to collect statistics about systems. A data structure called ipstats_mib keeps
statistics on the IP layer. The section "IP Statistics" covered this structure in more detail.
23.8.7. in_device Structure
The in_device structure stores all of the IPv4-related configuration for a network device, such as changes made by a user with the ifconfig
or ip command. This structure is linked to the net_device structure via net_device->ip_ptr and can be retrieved with in_dev_get and
__in_dev_get. The difference between those two functions is that the first one takes care of all of the necessary locking, and the second
one assumes the caller has taken care of it already.
Since in_dev_get internally increases a reference count on the in_device structure when it succeeds (i.e., when a device is configured to
support IPv4), its caller is supposed to decrement the reference count with in_dev_put when it is done with the structure.
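The get/put discipline described above can be sketched in user space as follows. The types and functions here (in_device_sketch, net_device_sketch, sketch_in_dev_get, sketch_in_dev_put) are hypothetical stand-ins, not the kernel's definitions: the counter is a plain int where the kernel uses an atomic_t, and the locking performed by the real in_dev_get is omitted.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, reduced versions of in_device and net_device. */
struct in_device_sketch {
    int refcnt;                       /* an atomic counter in the kernel */
};

struct net_device_sketch {
    struct in_device_sketch *ip_ptr;  /* NULL if IPv4 is not configured */
};

/*
 * Return the IPv4 configuration of dev and take a reference on it,
 * or return NULL if the device has no IPv4 configuration.
 */
static struct in_device_sketch *sketch_in_dev_get(struct net_device_sketch *dev)
{
    struct in_device_sketch *in_dev;

    assert(dev != NULL);
    in_dev = dev->ip_ptr;
    if (in_dev != NULL)
        in_dev->refcnt++;
    return in_dev;
}

/*
 * Drop a reference taken with sketch_in_dev_get. Returns 1 when the last
 * reference is gone and the structure could be freed, 0 otherwise.
 */
static int sketch_in_dev_put(struct in_device_sketch *in_dev)
{
    assert(in_dev != NULL && in_dev->refcnt > 0);
    return --in_dev->refcnt == 0;
}
```

A caller pairs every successful get with exactly one put, and must not call put when get returned NULL.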
The structure is allocated and linked to the device with inetdev_init, which is called when the first IPv4 address is configured on the
device. Here are the meanings of its fields:
struct net_device *dev
Pointer back to the associated net_device structure.
atomic_t refcnt
Reference count. The structure cannot be freed until this field is 0.
int dead
This field is set to mark the device as dead. This is useful to detect those cases where the entry cannot be destroyed because
it has a nonzero reference count, but a destroy action has been initiated. The two most common events that trigger the
removal of an in_device structure are:
Unregistration of the device (see Chapter 8)
Removal of the last configured IP address from the device (see inet_del_ifa in net/ipv4/devinet.c)
struct in_ifaddr *ifa_list
List of IPv4 addresses configured on the device. The in_ifaddr instances are kept sorted by scope (bigger scope first), and
elements with the same scope are kept sorted by address type (primary first). The in_ifaddr data structure is further
described in the section "in_ifaddr Structure."
struct neigh_parms *arp_parms
The meaning of this field is described in detail in Part VI.
struct ipv4_devconf cnf
See the section "ipv4_devconf Structure."
struct rcu_head rcu_head
Used by the RCU mechanism, which lets the structure be read without taking a lock and defers freeing it until no reader can still be using it.
The rest of the fields are used by the multicast code. For instance, mc_list stores the device's multicast configuration and it is the
multicast counterpart of ifa_list. mr_v1_seen and mr_v2_seen are timestamps used by the IGMP protocol to keep track of the reception of
versions 1 and 2 IGMP packets.
23.8.8. in_ifaddr Structure

When configuring an IPv4 address on an interface, the kernel creates an in_ifaddr structure that includes the 4-byte address along with
several other fields. Here are their meanings:
struct in_ifaddr *ifa_next
Pointer to the next element in the list. The list contains all of the addresses configured on the device.
struct in_device *ifa_dev
Pointer back to the associated in_device structure.
u32 ifa_local
u32 ifa_address
The values of these two fields depend on whether the address is assigned to a tunnel interface. If so, ifa_local and
ifa_address are the local and remote addresses of the tunnel, respectively. If not, both contain the address of the local
interface.
u32 ifa_mask
unsigned char ifa_prefixlen