Transmission Control Protocol - TCP
Here you will find everything about the Transmission Control Protocol (TCP), one of the main components of the TCP/IP stack and a topic frequently tested in public service examinations. The protocol provides reliable, in-order delivery of a stream of bytes, making it suitable for applications like file transfer and e-mail. It is so important in the Internet protocol suite that the entire suite is sometimes referred to as "the TCP/IP protocol suite."
Reason for TCP
The Internet Protocol (IP) works by exchanging groups of information called packets. Packets are short sequences of bytes that contain a header and a body. The header describes the packet's destination, and routers on the Internet pass each packet along in generally the right direction until it arrives at its final destination; the body contains application data.
In cases of congestion, IP can discard packets. For efficiency reasons, two consecutive packets can also take different routes to the destination, in which case they may arrive in the wrong order.
TCP software libraries use IP and provide applications with a simpler interface: they hide most of the underlying packet structure, rearrange out-of-order packets, act to minimize network congestion, and retransmit any packets that have been discarded.
Thus TCP very significantly simplifies the task of writing many applications.
Applicability of TCP
TCP is used extensively by many of the Internet's most popular application protocols and resulting applications, including the World Wide Web, E-mail, File Transfer Protocol, Secure Shell, and some streaming media applications.
However, because TCP is optimized for accurate delivery rather than timely delivery, TCP sometimes incurs long delays while waiting for out-of-order messages or retransmissions of lost messages, and it is not particularly suitable for real-time applications such as Voice over IP. For such applications, protocols like the Real-time Transport Protocol (RTP) running over the User Datagram Protocol (UDP) are usually recommended instead.
Using TCP, applications on networked hosts can create connections to one another, over which they can exchange streams of data using Stream Sockets. TCP also distinguishes data for multiple connections by concurrent applications (e.g., Web server and e-mail server) running on the same host.
In the Internet protocol suite, TCP is the intermediate layer between the Internet Protocol (IP) below it, and an application above it. Applications often need reliable pipe-like connections to each other, whereas the Internet Protocol does not provide such streams, but rather only best effort delivery (i.e., unreliable packets). TCP does the task of the transport layer in the simplified OSI model of computer networks. The other main transport-level Internet protocols are UDP and SCTP.
Applications send streams of octets (8-bit bytes) to TCP for delivery through the network, and TCP divides the byte stream into appropriately sized segments (usually delineated by the maximum transmission unit (MTU) size of the data link layer of the network to which the computer is attached). TCP then passes the resulting packets to the Internet Protocol, for delivery through a network to the TCP module of the entity at the other end. TCP checks that no packets are lost by giving each packet a sequence number, which is also used to make sure that the data is delivered to the entity at the other end in the correct order. The TCP module at the far end sends back an acknowledgment for packets that have been successfully received; a timer at the sending TCP causes a timeout if an acknowledgment is not received within a reasonable round-trip time (RTT), and the (presumably) lost data is then retransmitted. TCP checks that no bytes are corrupted by using a checksum, which is computed at the sender for each block of data before it is sent and checked at the receiver.
Unlike TCP's traditional counterpart, the User Datagram Protocol (UDP), which can start sending packets immediately, TCP requires a connection to be established before data is sent. TCP connections have three phases: connection establishment, data transfer, and connection termination.
Before describing these three phases, a note about the various states of a connection end-point or Internet socket:
LISTEN represents waiting for a connection request from any remote TCP and port (usually set by TCP servers).
SYN-SENT represents waiting for the remote TCP to send back a TCP packet with the SYN and ACK flags set (usually set by TCP clients).
SYN-RECEIVED represents waiting for the remote TCP to send back an acknowledgment after having sent back a connection acknowledgment to the remote TCP (usually set by TCP servers).
ESTABLISHED represents that the port is ready to receive/send data from/to the remote TCP (set by TCP clients and servers).
TIME-WAIT represents waiting for enough time to pass to be sure the remote TCP received the acknowledgment of its connection termination request. According to RFC 793, a connection can stay in TIME-WAIT for a maximum of four minutes.
To establish a connection, TCP uses a three-way handshake. Before a client attempts to connect with a server, the server must first bind to a port to open it up for connections: this is called a passive open. Once the passive open is established, a client may initiate an active open. The three-way (or 3-step) handshake then proceeds as follows:
The active open is performed by the client sending a SYN to the server.
In response, the server replies with a SYN-ACK.
Finally the client sends an ACK back to the server.
At this point, both the client and server have received an acknowledgment of the connection.
In more detail: the initiating host (client) sends a synchronization packet (SYN flag set) to initiate a connection. Every SYN packet carries a Sequence Number, a 32-bit field in the TCP segment header. Let the Sequence Number value for this session be x.
The other host receives the packet, records the Sequence Number x from the client, and replies with an acknowledgment and synchronization (SYN-ACK). The Acknowledgment Number is a 32-bit field in the TCP segment header; it contains the next sequence number that this host is expecting to receive (x + 1). The host also initiates a return session by including its own initial Sequence Number, y, in the same segment.
The initiating host responds with the next Sequence Number (x + 1) and an Acknowledgment Number of y + 1, which is the other host's Sequence Number plus 1.
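The exchange of sequence and acknowledgment numbers described above can be sketched in a few lines of Python. This is an illustrative model only, not a real TCP implementation: `Segment` and `three_way_handshake` are hypothetical names, and the initial sequence numbers x and y are passed in as arguments rather than chosen by a network stack.

```python
from typing import List, NamedTuple

MOD = 2 ** 32  # sequence numbers are 32-bit fields and wrap around


class Segment(NamedTuple):
    """Hypothetical, simplified view of a TCP segment (illustration only)."""
    sender: str
    flags: str
    seq: int
    ack: int  # only meaningful when the ACK flag is set


def three_way_handshake(x: int, y: int) -> List[Segment]:
    """Return the three segments exchanged for client ISN x and server ISN y."""
    syn = Segment("client", "SYN", x, 0)
    # The server acknowledges x by asking for x + 1 and offers its own ISN y.
    syn_ack = Segment("server", "SYN-ACK", y, (syn.seq + 1) % MOD)
    # The client continues at x + 1 and acknowledges y by asking for y + 1.
    ack = Segment("client", "ACK", (x + 1) % MOD, (syn_ack.seq + 1) % MOD)
    return [syn, syn_ack, ack]


for segment in three_way_handshake(x=1000, y=5000):
    print(segment)
```

With x = 1000 and y = 5000, the SYN-ACK carries acknowledgment number 1001 and the final ACK carries sequence number 1001 and acknowledgment number 5001.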
Vulnerability to Denial of Service:
By using a spoofed IP address and repeatedly sending SYN packets, attackers can cause the server to consume large amounts of resources keeping track of the bogus connections. Proposed solutions to this problem include SYN cookies and cryptographic puzzles.
There are a few key features that set TCP apart from User Datagram Protocol:
Ordered data transfer
Retransmission of lost packets
Discarding duplicate packets
Error-free data transfer
Ordered data transfer, retransmission of lost packets and discarding duplicate packets
In the first two steps of the 3-way handshaking, both computers exchange an initial sequence number (ISN). This number can be arbitrary. This sequence number identifies the order of the bytes sent from each computer so that the data transferred is in order regardless of any fragmentation or disordering that occurs during transmission. For every byte transmitted the sequence number must be incremented.
Conceptually, each byte sent is assigned a sequence number, and the receiver then sends an acknowledgment back to the sender that effectively states that it received that byte. In practice, only the first data byte of a segment is assigned a sequence number, which is placed in the sequence number field, and the receiver replies with an acknowledgment value naming the next byte it expects to receive.
For example, if computer A sends 4 bytes with a sequence number of 100 (conceptually, the four bytes would have sequence numbers 100, 101, 102, and 103 assigned), then the receiver would send back an acknowledgment of 104, since that is the next byte it expects to receive in the next packet. By sending an acknowledgment of 104, the receiver is signaling that it received bytes 100, 101, 102, and 103 correctly. If instead only bytes 100 and 101 had arrived (for example, because the last two bytes travelled in a separate segment that was lost), an acknowledgment value of 102 would be sent. Note that the checksum covers a whole segment, so a corrupted segment is discarded in its entirety rather than acknowledged in part.
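The acknowledgment rule in this example can be expressed as a small Python function. It is a minimal sketch under simplifying assumptions: `cumulative_ack` is a hypothetical helper, the receiver's state is modelled as a set of byte sequence numbers, and 32-bit wraparound is ignored.

```python
def cumulative_ack(first_seq, received):
    """Return the acknowledgment value: the next byte the receiver expects.

    first_seq: sequence number of the first byte the receiver is waiting for.
    received:  set of byte sequence numbers that arrived intact.
    """
    expected = first_seq
    while expected in received:   # advance over every contiguous byte received
        expected += 1
    return expected


print(cumulative_ack(100, {100, 101, 102, 103}))  # 104
print(cumulative_ack(100, {100, 101}))            # 102
```

The acknowledgment only ever names the first missing byte, which is why a single gap hides everything received after it.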
However, a problem can occasionally arise when packets are lost. For example, 10,000 bytes are sent in 10 different TCP packets, and the first packet is lost during transmission. The sender would then have to resend all 10,000 bytes; the recipient cannot say that it received bytes 1,000 to 9,999 but only that it failed to receive the first packet, containing bytes 0 to 999. In order to solve this problem, an option of selective acknowledgment (SACK) has been added. This option allows the receiver to acknowledge isolated blocks of packets that were received correctly, rather than the sequence number of the last packet received successively, as in the basic TCP acknowledgment. Each block is conveyed by the starting and ending sequence numbers. In the example above, the receiver would send SACK with sequence numbers 1,000 and 10,000. The sender will thus retransmit only the first packet.
The SACK option is not mandatory and is used only if both parties support it. This is negotiated when the connection is established. SACK uses the optional part of the TCP header (see TCP segment structure). The use of SACK is widespread: all popular TCP stacks support it. Selective acknowledgment is also used in SCTP.
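To make the 10,000-byte example concrete, the following Python sketch computes both the cumulative acknowledgment and the SACK blocks a receiver could report. The function name `ack_and_sack` and the representation of segments as `(start, end)` ranges with an exclusive end are assumptions made for illustration; a real stack also limits how many blocks fit in the option space, which is not modelled here.

```python
def ack_and_sack(expected, segments):
    """Compute the cumulative ACK and the SACK blocks for received data.

    expected: next in-order byte the receiver is waiting for.
    segments: list of (start, end) byte ranges received, end exclusive.
    Returns (ack, blocks); blocks are merged ranges beyond the cumulative ACK.
    """
    merged = []
    for start, end in sorted(segments):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)  # extend a contiguous block
        else:
            merged.append([start, end])

    ack = expected
    blocks = []
    for start, end in merged:
        if start <= ack < end:
            ack = end                    # in-order data advances the ACK
        elif start > ack:
            blocks.append((start, end))  # out-of-order data is reported via SACK
    return ack, blocks


# Ten 1,000-byte segments are sent; the first one (bytes 0 to 999) is lost.
arrived = [(i * 1000, (i + 1) * 1000) for i in range(1, 10)]
print(ack_and_sack(0, arrived))  # (0, [(1000, 10000)])
```

The result matches the text: the acknowledgment stays at 0, while the single SACK block from 1,000 to 10,000 tells the sender that only the first packet needs to be retransmitted.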
Error-free data transfer
Sequence numbers and acknowledgments cover discarding duplicate packets, retransmission of lost packets, and ordered-data transfer. To assure correctness a checksum field is included (see TCP segment structure for details on checksumming).
The TCP checksum is a quite weak check by modern standards. Data link layers with high bit error rates may require additional link error correction/detection capabilities. If TCP were to be redesigned today, it would most probably have a 32-bit cyclic redundancy check specified as an error check instead of the current checksum. The weak checksum is partially compensated for by the common use of a CRC or better integrity check at layer 2, below both TCP and IP, such as is used in PPP or the Ethernet frame. However, this does not mean that the 16-bit TCP checksum is redundant: remarkably, surveys of Internet traffic have shown that software and hardware errors that introduce errors in packets between CRC-protected hops are common, and that the end-to-end 16-bit TCP checksum catches most of these simple errors. This is the end-to-end principle at work.
[Figure: a simplified TCP state diagram. See the TCP EFSM diagram for a more detailed state diagram including the states inside the ESTABLISHED state.]
The final part to TCP is congestion control. TCP uses a number of mechanisms to achieve high performance and avoid 'congestion collapse', where network performance can fall by several orders of magnitude. These mechanisms control the rate of data entering the network, keeping the data flow below a rate that would trigger collapse.
Acknowledgments for data sent, or lack of acknowledgments, are used by senders to implicitly interpret network conditions between the TCP sender and receiver. Coupled with timers, TCP senders and receivers can alter the behavior of the flow of data. This is more generally referred to as flow control, congestion control and/or network congestion avoidance.
Modern implementations of TCP contain four intertwined algorithms: slow start, congestion avoidance, fast retransmit, and fast recovery (RFC 2581).
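As a rough illustration of how the first two of these algorithms interact, the sketch below traces a congestion window over successive round trips. It is a deliberately simplified model, not the RFC 2581 algorithm in full: the function `cwnd_trace` and its parameters are hypothetical, the window is counted in whole segments, every loss is treated as a timeout, and fast retransmit and fast recovery are not modelled.

```python
def cwnd_trace(rounds, ssthresh, loss_rounds=frozenset(), mss=1):
    """Trace the congestion window (in MSS units) over a number of round trips.

    Slow start doubles the window each round trip until it reaches ssthresh;
    congestion avoidance then adds one MSS per round trip. A loss (treated as
    a timeout here) halves ssthresh and restarts slow start from one MSS.
    """
    cwnd = mss
    trace = []
    for rtt in range(rounds):
        trace.append(cwnd)
        if rtt in loss_rounds:
            ssthresh = max(cwnd // 2, 2 * mss)  # remember half the window
            cwnd = mss                          # restart from one segment
        elif cwnd < ssthresh:
            cwnd = min(cwnd * 2, ssthresh)      # slow start: exponential growth
        else:
            cwnd += mss                         # congestion avoidance: linear growth
    return trace


print(cwnd_trace(rounds=8, ssthresh=8, loss_rounds={5}))
# [1, 2, 4, 8, 9, 10, 1, 2]
```

The trace shows the characteristic shape: rapid growth up to the threshold, cautious growth beyond it, and a sharp reduction when a loss is detected.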
Enhancing TCP to reliably handle loss, minimize errors, manage congestion and go fast in very high-speed environments are ongoing areas of research and standards development.
TCP window size
TCP sequence numbers and windows behave very much like a clock. The window, whose width (in bytes) is defined by the receiving host, shifts each time the receiver receives and acknowledges a segment of data. Once it runs out of sequence numbers, it loops back to 0. The TCP receive window size is the amount of received data (in bytes) that can be buffered during a connection. The sending host can send only up to that amount of data before it must wait for an acknowledgment and window update from the receiving host. When a receiver advertises a window size of 0, the sender stops sending data and starts the persist timer. The persist timer protects TCP from a deadlock that could arise if the window size update from the receiver is lost and the receiver has no more data to send while the sender is waiting for that update. When the persist timer expires, the TCP sender sends a small packet so that the receiver acknowledges it with the new window size, allowing TCP to recover from such situations.
For more efficient use of high bandwidth networks, a larger TCP window size may be used. The TCP window size field controls the flow of data and, being 16 bits wide, is limited to 65,535 bytes.
Since the size field cannot be expanded, a scaling factor is used. The TCP window scale option, as defined in RFC 1323, is an option used to increase the maximum window size from 65,535 bytes to 1 Gigabyte. Scaling up to larger window sizes is a part of what is necessary for TCP Tuning.
The window scale option is used only during the TCP 3-way handshake. The window scale value represents the number of bits to left-shift the 16-bit window size field. The window scale value can be set from 0 (no shift) to 14.
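The arithmetic behind the scale factor is a simple left shift, sketched here in Python. The helper `effective_window` is a hypothetical name; the limits it checks (a 16-bit field and a shift of 0 to 14) are the ones stated above.

```python
FIELD_MAX = 0xFFFF   # largest value of the 16-bit window size field
MAX_SHIFT = 14       # largest permitted window scale value


def effective_window(window_field, shift):
    """Return the window in bytes for a given field value and scale shift."""
    if not 0 <= window_field <= FIELD_MAX:
        raise ValueError("window field must fit in 16 bits")
    if not 0 <= shift <= MAX_SHIFT:
        raise ValueError("window scale must be between 0 and 14")
    return window_field << shift


print(effective_window(65535, 0))   # 65535
print(effective_window(65535, 14))  # 1073725440
```

With the maximum shift of 14, the largest window is 65,535 × 2^14 = 1,073,725,440 bytes, just under 1 gigabyte.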
Many routers and packet firewalls rewrite the window scaling factor during a transmission. This causes the sending and receiving sides to assume different TCP window sizes, and the result is unstable traffic that is very slow. The problem is visible at sending and receiving sites whose traffic passes through such broken routers.
For more information on the problems this can cause, especially with Linux and Windows Vista systems, see the main topic, TCP window scale option.
The connection termination phase uses, at most, a four-way handshake, with each side of the connection terminating independently. When an endpoint wishes to stop its half of the connection, it transmits a FIN packet, which the other end acknowledges with an ACK. Therefore, a typical tear down requires a pair of FIN and ACK segments from each TCP endpoint.
A connection can be "half-closed", in which case one side has terminated its end, but the other has not. The side that has terminated can no longer send any data into the connection, but the other side can.
It is also possible to terminate the connection with a 3-way handshake: host A sends a FIN, host B replies with a FIN and ACK (merely combining two steps into one), and host A replies with an ACK. This is perhaps the most common method.
It is also possible for both hosts to send FINs simultaneously, after which each just has to ACK. This could be considered a 2-way handshake, since the FIN/ACK sequence is done in parallel for both directions.
Some host TCP stacks, such as those of Linux and HP-UX, implement a "half-duplex" close sequence. If such a host actively closes a connection while it still has unread incoming data that the stack has already received from the link, it sends a RST instead of a FIN (Section 4.2.2.13 in RFC 1122). This allows a TCP application to be sure that the remote application has read all the data the former sent, by waiting for the FIN from the remote side when that side actively closes the connection. Unfortunately, the remote TCP stack cannot distinguish between a connection-aborting RST and this data-loss RST: both cause the remote stack to throw away all the data it has received but the application has not yet read.
Some application protocols may violate the OSI model layers, using the TCP open/close handshaking for the application protocol open/close handshaking - these may find the RST problem on active close. As an example:
s = connect(remote);
For a usual program flow like the one above, a TCP/IP stack like that described above does not guarantee that all the data will arrive at the other application unless the programmer is sure that the remote side will not send anything.
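One common way to avoid closing with unread data is to shut down only the sending direction, read until the peer closes its side, and only then close the socket. The Python sketch below illustrates that pattern; it is an assumption about how an application might cope with the problem, not something the stacks above prescribe. The function name `send_and_close` is hypothetical, and `socket.socketpair()` is used as a local stand-in for a TCP connection so that the example runs without a network.

```python
import socket


def send_and_close(sock, data):
    """Send data, then close gracefully instead of closing with unread input."""
    sock.sendall(data)
    sock.shutdown(socket.SHUT_WR)   # signal "no more data from this side"
    leftover = b""
    while True:                     # drain input until the peer closes its side
        chunk = sock.recv(4096)
        if not chunk:
            break
        leftover += chunk
    sock.close()
    return leftover


# Demonstration on a local socket pair (a stand-in for a TCP connection).
a, b = socket.socketpair()
b.sendall(b"late reply")            # peer data the sender has not read yet
b.shutdown(socket.SHUT_WR)
print(send_and_close(a, b"request"))
print(b.recv(4096))
b.close()
```

Because the sender reads everything the peer sent before calling `close()`, no unread data remains in its receive buffer at close time, which is the condition that triggers the RST described above.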
TCP uses the notion of port numbers to identify sending and receiving application end-points on a host, or Internet sockets. Each side of a TCP connection has an associated 16-bit unsigned port number (1-65535) reserved by the sending or receiving application. Arriving TCP data packets are identified as belonging to a specific TCP connection by its sockets, that is, the combination of source host address, source port, destination host address, and destination port. This means that a server computer can provide several clients with several services simultaneously, as long as a client takes care of initiating any simultaneous connections to one destination port from different source ports.
Port numbers are categorized into three basic categories: well-known, registered, and dynamic/private. The well-known ports are assigned by the Internet Assigned Numbers Authority (IANA) and are typically used by system-level or root processes. Well-known applications running as servers and passively listening for connections typically use these ports. Some examples include: FTP (21), ssh (22), TELNET (23), SMTP (25) and HTTP (80). Registered ports are typically used by end user applications as ephemeral source ports when contacting servers, but they can also identify named services that have been registered by a third party. Dynamic/private ports can also be used by end user applications, but are less commonly so. Dynamic/private ports do not contain any meaning outside of any particular TCP connection.
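The three categories can be expressed as a simple range check. The numeric boundaries used below (0 to 1023, 1024 to 49151, and 49152 to 65535) are the ranges IANA publishes for these categories; they are not stated in the text above, and `port_category` is a hypothetical helper name.

```python
def port_category(port):
    """Classify a TCP port number into its IANA category."""
    if not 0 <= port <= 65535:
        raise ValueError("port numbers are 16-bit unsigned values")
    if port <= 1023:
        return "well-known"
    if port <= 49151:
        return "registered"
    return "dynamic/private"


for port in (21, 22, 23, 25, 80, 8080, 50000):
    print(port, port_category(port))
```

All of the example services listed above (FTP, ssh, TELNET, SMTP, and HTTP) fall in the well-known range.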
Development of TCP
TCP is a complex and evolving protocol. However, while significant enhancements have been made and proposed over the years, its most basic operation has not changed significantly since its first specification RFC 675 in 1974, and the v4 specification RFC 793, published in September 1981. RFC 1122, Host Requirements for Internet Hosts, clarified a number of TCP protocol implementation requirements. RFC 2581, TCP Congestion Control, one of the most important TCP related RFCs in recent years, describes updated algorithms to be used in order to avoid undue congestion. In 2001, RFC 3168 was written to describe explicit congestion notification (ECN), a congestion avoidance signalling mechanism. Common applications that use TCP include HTTP (World Wide Web), SMTP (e-mail) and FTP (file transfer).
The original TCP congestion control was called TCP Tahoe. Several alternative congestion control algorithms have been proposed:
BIC TCP by Lisong Xu, Khaled Harfoush, and Injong Rhee at North Carolina State University
Compound TCP by K. Tan, J. Song, Q. Zhang, and M. Sridharan at Microsoft Research
CUBIC by Injong Rhee, and Lisong Xu
Fast TCP by Cheng Jin, David X. Wei and Steven H. Low at Caltech
H-TCP by D. Leith and R. Shorten at Hamilton Institute
High Speed TCP proposed by S. Floyd in RFC 3649
HSTCP-LP by A. Kuzmanovic, E. W. Knightly, and R. Les Cottrell
NewReno, proposed by S. Floyd, T. Henderson and A. Gurtov in RFC 3782
Scalable TCP by Tom Kelly
TCP Hybla by Carlo Caini and Rosario Firrincieli at University of Bologna
TCP-Illinois by Shao Liu, Tamer Basar and R. Srikant
TCP-LP by Aleksandar Kuzmanovic
TCP Reno, from BSD (4.3BSD)
TCP Vegas by Lawrence S. Brakmo and Larry L. Peterson at University of Arizona
TCP Veno by C. P. Fu, S. C. Liew
TCP Westwood by Saverio Mascolo, Claudio Casetti, Mario Gerla, M. Y. Sanadidi, and Ren Wang
TCP Westwood+ by A. Dell’Aera, L. A. Grieco, S. Mascolo
XCP by Aaron Falk, Dina Katabi
YeAH-TCP by Andrea Baiocchi, Angelo P. Castellani and Francesco Vacirca.
Ensemble Flow Congestion Management (EFCM), Fuzzy Explicit Window Adaptation (FEWA), Enhanced TCP (ETCP)
An extension mechanism TCP Interactive (iTCP) allows applications to subscribe to TCP events and respond accordingly enabling various functional extensions to TCP including application assisted congestion control.
TCP over wireless
TCP has been optimized for wired networks. Any packet loss is considered to be the result of congestion, and the window size is reduced dramatically as a precaution. However, wireless links are known to experience sporadic and usually temporary losses due to fading, shadowing, hand-off, and similar effects that cannot be considered congestion. Erroneous back-off of the window size due to wireless packet loss is followed by a congestion avoidance phase in which the window size grows back only conservatively, which causes the radio link to be underutilized. Extensive research has been done on how to combat these harmful effects. Suggested solutions can be categorized as end-to-end solutions (which require modifications at the client and/or server), link layer solutions (such as RLP in CDMA2000), or proxy-based solutions (which require some changes in the network without modifying end nodes).
Hardware TCP implementations
One way to overcome the processing power requirements of TCP is building hardware implementations of it, widely known as TCP Offload Engines (TOE). The main problem of TOEs is that they are hard to integrate into computing systems, requiring extensive changes in the operating system of the computer or device. The first company to develop such a device was Alacritech.
A packet sniffer, which intercepts TCP traffic on a network link, can be useful in debugging networks, network stacks and applications which use TCP by showing the user what packets are passing through a link. Some networking stacks support the SO_DEBUG socket option, which can be enabled on the socket using setsockopt. That option dumps all the packets, TCP states and events on that socket which will be helpful in debugging. netstat is another utility that can be used for debugging.
Alternatives to TCP
For many applications TCP is not appropriate. One big problem (at least with normal implementations) is that the application cannot get at the packets coming after a lost packet until the retransmitted copy of the lost packet is received. This causes problems for real-time applications such as streaming multimedia (such as Internet radio), real-time multiplayer games and voice over IP (VoIP) where it is sometimes more useful to get most of the data in a timely fashion than it is to get all of the data in order.
Also for embedded systems, network booting and servers that serve simple requests from huge numbers of clients (e.g. DNS servers) the complexity of TCP can be a problem. Finally some tricks such as transmitting data between two hosts that are both behind NAT (using STUN or similar systems) are far simpler without a relatively complex protocol like TCP in the way.
Generally where TCP is unsuitable the User Datagram Protocol (UDP) is used. This provides the application multiplexing and checksums that TCP does, but does not handle building streams or retransmission giving the application developer the ability to code those in a way suitable for the situation and/or to replace them with other methods like forward error correction or interpolation.
SCTP is another IP protocol that provides reliable stream-oriented services not so dissimilar from TCP. It is newer and considerably more complex than TCP, so it has not yet seen widespread deployment; however, it is especially designed for situations where reliability and near-real-time considerations are important.
Venturi Transport Protocol (VTP) is a patented proprietary protocol that is designed to replace TCP transparently in order to overcome perceived inefficiencies related to wireless data transport.
TCP also has some issues in high bandwidth utilization environments. The TCP congestion avoidance algorithm works very well for ad-hoc environments where it is not known who will be sending data, but if the environment is predictable, a timing based protocol such as ATM can avoid the overhead of the retransmits that TCP needs.
TCP segment structure
A TCP segment consists of two sections: a header and data.
The header consists of 11 fields, of which only 10 are required. The eleventh field, aptly named options, is optional.
TCP Header
Bit offset | Bits 0–3    | Bits 4–7 | Bits 8–15 | Bits 16–31
0          | Source port                        | Destination port
32         | Sequence number
64         | Acknowledgment number
96         | Data offset | Reserved | Flags     | Window
128        | Checksum                           | Urgent pointer
160        | Options (optional)
Source port – identifies the sending port
Destination port – identifies the receiving port
Sequence number – has a dual role
If the SYN flag is present then this is the initial sequence number and the first data byte is the sequence number plus 1
if the SYN flag is not present then the first data byte is the sequence number
Acknowledgment number – if the ACK flag is set then the value of this field is the sequence number that the sender of the acknowledgment expects next.
Data offset – specifies the size of the TCP header in 32-bit words. The minimum size header is 5 words and the maximum is 15 words thus giving the minimum size of 20 bytes and maximum of 60 bytes. This field gets its name from the fact that it is also the offset from the start of the TCP packet to the data.
Reserved – for future use and should be set to zero
Flags (aka Control bits) – contains eight 1-bit flags
CWR – Congestion Window Reduced (CWR) flag is set by the sending host to indicate that it received a TCP segment with the ECE flag set (added to header by RFC 3168).
ECE (ECN-Echo) – indicates that the TCP peer is ECN capable during the 3-way handshake (added to header by RFC 3168).
URG – indicates that the URGent pointer field is significant
ACK – indicates that the ACKnowledgment field is significant
PSH – Push function
RST – Reset the connection
SYN – Synchronize sequence numbers
FIN – No more data from sender
Window – the number of bytes, beginning with the one indicated in the acknowledgment field, that the sender of this segment is currently willing to receive.
Checksum – The 16-bit checksum field is used for error-checking of the header and data
Urgent pointer – if the URG flag is set, then this 16-bit field is an offset from the sequence number indicating the last urgent data byte
Options – the total length of the option field must be a multiple of a 32-bit word and the data offset field adjusted appropriately
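The field layout above maps directly onto a fixed 20-byte structure followed by optional data, which the following Python sketch decodes with the standard `struct` module. The function `parse_tcp_header`, the dictionary keys, and the sample SYN segment are all illustrative choices rather than part of any standard API.

```python
import struct

# Flag names from the least significant bit of the flags byte upward.
FLAG_NAMES = ["FIN", "SYN", "RST", "PSH", "ACK", "URG", "ECE", "CWR"]


def parse_tcp_header(segment: bytes) -> dict:
    """Decode the fixed 20-byte TCP header plus any options and data."""
    if len(segment) < 20:
        raise ValueError("a TCP header is at least 20 bytes long")
    (src, dst, seq, ack, offset_byte, flag_byte,
     window, checksum, urgent) = struct.unpack("!HHIIBBHHH", segment[:20])
    data_offset = offset_byte >> 4            # header length in 32-bit words
    if data_offset < 5:
        raise ValueError("data offset must be at least 5 words")
    header_len = data_offset * 4              # header length in bytes
    flags = [name for bit, name in enumerate(FLAG_NAMES) if flag_byte & (1 << bit)]
    return {
        "source_port": src,
        "destination_port": dst,
        "sequence_number": seq,
        "acknowledgment_number": ack,
        "data_offset": data_offset,
        "flags": flags,
        "window": window,
        "checksum": checksum,
        "urgent_pointer": urgent,
        "options": segment[20:header_len],
        "data": segment[header_len:],
    }


# Hypothetical SYN segment: port 12345 to port 80, sequence number 1000.
syn = struct.pack("!HHIIBBHHH", 12345, 80, 1000, 0, 5 << 4, 0x02, 65535, 0, 0)
print(parse_tcp_header(syn))
```

The `!` prefix in the format string selects network (big-endian) byte order, which is how the fields appear on the wire.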
Fields used to compute the checksum
TCP checksum using IPv4
When TCP runs over IPv4, the method used to compute the checksum is defined in RFC 793:
The checksum field is the 16 bit one's complement of the one's complement sum of all 16-bit words in the header and text. If a segment contains an odd number of header and text octets to be checksummed, the last octet is padded on the right with zeros to form a 16-bit word for checksum purposes. The pad is not transmitted as part of the segment. While computing the checksum, the checksum field itself is replaced with zeros.
In other words, all 16-bit words are summed together using one's complement arithmetic (with the checksum field set to zero). The sum is then one's complemented, and this final value is inserted as the checksum field. Algorithmically, this is the same as for IPv6; the difference is in the data covered by the checksum. For IPv4, the checksum is computed over a pseudo-header that mimics the IPv4 header, followed by the TCP header and data, as shown in the table below.
TCP pseudo-header (IPv4)
Bit offset | Bits 0–3    | Bits 4–7 | Bits 8–15 | Bits 16–31
0          | Source address
32         | Destination address
64         | Zeros                  | Protocol  | TCP length
96         | Source port                        | Destination port
128        | Sequence number
160        | Acknowledgment number
192        | Data offset | Reserved | Flags     | Window
224        | Checksum                           | Urgent pointer
256        | Options (optional)
The source and destination addresses are those in the IPv4 header. The protocol value is the IP protocol number for TCP, which is 6. The TCP length field is the length of the TCP header and data.
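The checksum procedure described above can be sketched in Python as follows. This is a minimal illustration, not a production implementation: the function names `ones_complement_sum` and `tcp_checksum_ipv4` are hypothetical, the addresses are documentation examples, and the caller is assumed to pass the TCP header and data with the checksum field already set to zero.

```python
import socket
import struct


def ones_complement_sum(data: bytes) -> int:
    """Return the 16-bit one's complement sum of data."""
    if len(data) % 2:
        data += b"\x00"                 # pad on the right; the pad is not sent
    total = 0
    for (word,) in struct.iter_unpack("!H", data):
        total += word
    while total >> 16:                  # fold carries back in (end-around carry)
        total = (total & 0xFFFF) + (total >> 16)
    return total


def tcp_checksum_ipv4(src_ip: str, dst_ip: str, segment: bytes) -> int:
    """Checksum over the IPv4 pseudo-header plus the TCP header and data."""
    pseudo_header = struct.pack(
        "!4s4sBBH",
        socket.inet_aton(src_ip),       # source address
        socket.inet_aton(dst_ip),       # destination address
        0,                              # zeros
        6,                              # protocol number for TCP
        len(segment),                   # TCP length: header plus data
    )
    return ~ones_complement_sum(pseudo_header + segment) & 0xFFFF


# Hypothetical SYN header (checksum field zero) followed by five data bytes.
header = struct.pack("!HHIIBBHHH", 12345, 80, 1000, 0, 5 << 4, 0x02, 65535, 0, 0)
print(hex(tcp_checksum_ipv4("192.0.2.1", "192.0.2.2", header + b"hello")))
```

A receiver can verify a segment with the same routine: when the transmitted checksum is left in place, the computation over a valid segment yields zero.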
TCP checksum using IPv6
When TCP runs over IPv6, the method used to compute the checksum is changed, as per RFC 2460:
Any transport or other upper-layer protocol that includes the addresses from the IP header in its checksum computation must be modified for use over IPv6, to include the 128-bit IPv6 addresses instead of 32-bit IPv4 addresses.
For IPv6, the checksum is computed over a pseudo-header that mimics the IPv6 header, followed by the TCP header and data, as shown in the table below.
TCP pseudo-header (IPv6)
Bit offset | Bits 0–7              | Bits 8–15 | Bits 16–23 | Bits 24–31
0          | Source address
128        | Destination address
256        | TCP length
288        | Zeros                                          | Next header
320        | Source port                       | Destination port
352        | Sequence number
384        | Acknowledgment number
416        | Data offset, Reserved | Flags     | Window
448        | Checksum                          | Urgent pointer
480        | Options (optional)
Source address – the one in the IPv6 header
Destination address – the final destination; if the IPv6 packet doesn't contain a Routing header, that will be the destination address in the IPv6 header, otherwise, at the originating node, it will be the address in the last element of the Routing header, and, at the receiving node, it will be the destination address in the IPv6 header.
TCP length – the length of the TCP header and data;
Next Header – the protocol value for TCP
The last field is not a part of the header. The contents of this field are whatever the upper layer protocol wants but this protocol is not set in the header and is presumed based on the port selection.
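For comparison with the IPv4 case, the IPv6 pseudo-header fields listed above can be assembled as in the Python sketch below. The function name `ipv6_pseudo_header` is hypothetical and the addresses are documentation examples; the layout follows the table above (two 128-bit addresses, a 32-bit TCP length, three zero bytes, and the next header value).

```python
import socket
import struct

TCP_PROTOCOL = 6  # next header value for TCP


def ipv6_pseudo_header(src_ip: str, dst_ip: str, tcp_length: int) -> bytes:
    """Build the 40-byte IPv6 pseudo-header used in the TCP checksum."""
    return (
        socket.inet_pton(socket.AF_INET6, src_ip)       # source address
        + socket.inet_pton(socket.AF_INET6, dst_ip)     # destination address
        + struct.pack("!I3xB", tcp_length, TCP_PROTOCOL)  # length, zeros, next header
    )


pseudo = ipv6_pseudo_header("2001:db8::1", "2001:db8::2", 20)
print(len(pseudo), pseudo.hex())
```

The checksum itself is then computed over these 40 bytes followed by the TCP header and data, using the same one's complement sum as for IPv4.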
TCP Maximum Segment Size (MSS) and Relationship to IP Datagram Size
TCP segments are the messages that carry data between TCP devices. The Data field is where the actual data being transmitted is carried, and since the length of the Data field in TCP is variable, this raises an interesting question: how much data should we put into each segment? With protocols that accept data in blocks from the higher layers there isn't as much of a question, but TCP accepts data as a constant stream from the applications that use it. This means it must decide how many bytes to put into each message that it sends.
A primary determinant of how much data to send in a segment is the current status of the sliding window mechanism on the part of the receiver. When Device A receives a TCP segment from Device B, it examines the value of the Window field to learn the limit on how much data Device B is allowing Device A to send in its next segment. There are also important issues in the selection and adjustment of window size that impact the operation of the TCP system as a whole, which are discussed in the reliability section.
In addition to the dictates of the current window size, each TCP device also has associated with it a ceiling on TCP segment size: a size that will never be exceeded regardless of how large the current window is. This is called the maximum segment size (MSS). When deciding how much data to put into a segment, each device in the TCP connection will choose the amount based on the current window size, in conjunction with the various algorithms described in the reliability section, but the amount of data will never exceed the MSS of the device to which it is sending.
Note: I need to point out that the name “maximum segment size” is in fact misleading. The value actually refers to the maximum amount of data that a segment can hold—it does not include the TCP headers. So if the MSS is 100, the actual maximum segment size could be 120 (for a regular TCP header) or larger (if the segment includes TCP options).
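The relationship between the window, the MSS, and the size of each segment can be illustrated with a short Python sketch. It is a simplification under stated assumptions: `segment_stream` is a hypothetical helper, the window is treated as a fixed per-segment limit, and the 20-byte header assumes no TCP options.

```python
TCP_HEADER = 20  # bytes in a regular TCP header without options


def segment_stream(data: bytes, mss: int, window: int):
    """Split a byte stream into payloads no larger than min(mss, window)."""
    if mss <= 0 or window <= 0:
        raise ValueError("mss and window must be positive")
    limit = min(mss, window)
    return [data[i:i + limit] for i in range(0, len(data), limit)]


payloads = segment_stream(b"x" * 250, mss=100, window=4096)
print([len(p) for p in payloads])                   # [100, 100, 50]
print(max(len(p) for p in payloads) + TCP_HEADER)   # 120
```

With an MSS of 100, no payload exceeds 100 bytes, yet the largest segment on the wire is 120 bytes once the header is counted, which is the point the note above makes.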
TCP Header Format
TCP segments are sent as internet datagrams. The Internet Protocol header carries several information fields, including the source and destination host addresses. A TCP header follows the internet header, supplying information specific to the TCP protocol. This division allows for the existence of host level protocols other than TCP.
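    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |          Source Port          |       Destination Port        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                        Sequence Number                        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                    Acknowledgment Number                      |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |  Data |           |U|A|P|R|S|F|                               |
   | Offset| Reserved  |R|C|S|S|Y|I|            Window             |
   |       |           |G|K|H|T|N|N|                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |           Checksum            |         Urgent Pointer        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                    Options                    |    Padding    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                             data                              |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

                            TCP Header Format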
Note that one tick mark represents one bit position.
Source Port: 16 bits
The source port number.
Destination Port: 16 bits
The destination port number.
Sequence Number: 32 bits
The sequence number of the first data octet in this segment (except
when SYN is present). If SYN is present the sequence number is the
initial sequence number (ISN) and the first data octet is ISN+1.
Acknowledgment Number: 32 bits
If the ACK control bit is set this field contains the value of the
next sequence number the sender of the segment is expecting to
receive. Once a connection is established this is always sent.
Data Offset: 4 bits
The number of 32 bit words in the TCP Header. This indicates where
the data begins. The TCP header (even one including options) is an
integral number of 32 bits long.
Reserved: 6 bits
Reserved for future use. Must be zero.
Control Bits: 6 bits (from left to right):
URG: Urgent Pointer field significant
ACK: Acknowledgment field significant
PSH: Push Function
RST: Reset the connection
SYN: Synchronize sequence numbers
FIN: No more data from sender
Window: 16 bits
The number of data octets beginning with the one indicated in the
acknowledgment field which the sender of this segment is willing to
accept.
Checksum: 16 bits
The checksum field is the 16 bit one's complement of the one's
complement sum of all 16 bit words in the header and text. If a
segment contains an odd number of header and text octets to be
checksummed, the last octet is padded on the right with zeros to
form a 16 bit word for checksum purposes. The pad is not
transmitted as part of the segment. While computing the checksum,
the checksum field itself is replaced with zeros.
The checksum also covers a 96 bit pseudo header conceptually
prefixed to the TCP header. This pseudo header contains the Source
Address, the Destination Address, the Protocol, and TCP length.
This gives the TCP protection against misrouted segments. This
information is carried in the Internet Protocol and is transferred
across the TCP/Network interface in the arguments or results of
calls by the TCP on the IP.
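   +--------+--------+--------+--------+
   |           Source Address          |
   +--------+--------+--------+--------+
   |         Destination Address       |
   +--------+--------+--------+--------+
   |  zero  |  PTCL  |    TCP Length   |
   +--------+--------+--------+--------+
The TCP Length is the TCP header length plus the data length in
octets (this is not an explicitly transmitted quantity, but is
computed), and it does not count the 12 octets of the pseudo
header.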
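The checksum procedure described above can be sketched as follows. This is a minimal illustration for IPv4 only; the function names are hypothetical, and the caller is assumed to pass a segment whose checksum field (octets 16 and 17) is already zeroed.

```python
import socket
import struct

TCP_PROTOCOL_NUMBER = 6  # value of the PTCL field for TCP


def ones_complement_sum(data: bytes) -> int:
    """16-bit one's complement sum, padding an odd trailing octet with zero."""
    if len(data) % 2:
        data += b"\x00"
    total = 0
    for (word,) in struct.iter_unpack("!H", data):
        total += word
        total = (total & 0xFFFF) + (total >> 16)  # end-around carry
    return total


def pseudo_header(src_ip: str, dst_ip: str, tcp_length: int) -> bytes:
    """The 96-bit pseudo header: addresses, zero, PTCL, TCP length."""
    return (socket.inet_aton(src_ip) + socket.inet_aton(dst_ip)
            + struct.pack("!BBH", 0, TCP_PROTOCOL_NUMBER, tcp_length))


def tcp_checksum(src_ip: str, dst_ip: str, segment: bytes) -> int:
    """Checksum of a segment whose checksum field is currently zero."""
    covered = pseudo_header(src_ip, dst_ip, len(segment)) + segment
    return ~ones_complement_sum(covered) & 0xFFFF


def verify_tcp_checksum(src_ip: str, dst_ip: str, segment: bytes) -> bool:
    """With the checksum filled in, the sum over everything is all ones."""
    covered = pseudo_header(src_ip, dst_ip, len(segment)) + segment
    return ones_complement_sum(covered) == 0xFFFF
```

Because the addresses are part of the sum, a segment delivered to the wrong destination address fails verification, which is the protection against misrouted segments mentioned above.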
Urgent Pointer: 16 bits
This field communicates the current value of the urgent pointer as a
positive offset from the sequence number in this segment. The
urgent pointer points to the sequence number of the octet following
the urgent data. This field is only to be interpreted in segments with
the URG control bit set.
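To make the fixed field layout concrete, here is a small sketch that unpacks the first 20 octets of a segment with Python's `struct` module. The `TcpHeader` type and `parse_tcp_header` function are illustrative names, not part of any standard library.

```python
import struct
from typing import FrozenSet, NamedTuple

# Control bits, from left to right, with their masks in the 16-bit word
# that also holds Data Offset (top 4 bits) and Reserved (next 6 bits).
FLAG_BITS = (("URG", 0x20), ("ACK", 0x10), ("PSH", 0x08),
             ("RST", 0x04), ("SYN", 0x02), ("FIN", 0x01))


class TcpHeader(NamedTuple):
    src_port: int
    dst_port: int
    seq: int
    ack: int
    data_offset: int          # header length in 32-bit words
    flags: FrozenSet[str]
    window: int
    checksum: int
    urgent_pointer: int


def parse_tcp_header(segment: bytes) -> TcpHeader:
    """Decode the fixed 20-octet part of a TCP header (options not parsed)."""
    if len(segment) < 20:
        raise ValueError("a TCP header is at least 20 octets long")
    (src, dst, seq, ack, off_flags,
     window, checksum, urgent) = struct.unpack("!HHIIHHHH", segment[:20])
    flags = frozenset(name for name, bit in FLAG_BITS if off_flags & bit)
    return TcpHeader(src, dst, seq, ack, off_flags >> 12, flags,
                     window, checksum, urgent)
```

The data begins at octet `data_offset * 4` of the segment, so a header without options has a data offset of 5.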
Options: variable
Options may occupy space at the end of the TCP header and are a
multiple of 8 bits in length. All options are included in the
checksum. An option may begin on any octet boundary. There are two
cases for the format of an option:
Case 1: A single octet of option-kind.
Case 2: An octet of option-kind, an octet of option-length, and
the actual option-data octets.
The option-length counts the two octets of option-kind and
option-length as well as the option-data octets.
Note that the list of options may be shorter than the data offset
field might imply. The content of the header beyond the
End-of-Option option must be header padding (i.e., zero).
A TCP must implement all options.
Currently defined options include (kind indicated in octal):
Kind     Length    Meaning
----     ------    -------
 0         -       End of option list.
 1         -       No-Operation.
 2         4       Maximum Segment Size.
Specific Option Definitions
End of Option List
This option code indicates the end of the option list. This
might not coincide with the end of the TCP header according to
the Data Offset field. This is used at the end of all options,
not the end of each option, and need only be used if the end of
the options would not otherwise coincide with the end of the TCP
header.
No-Operation
This option code may be used between options, for example, to
align the beginning of a subsequent option on a word boundary.
There is no guarantee that senders will use this option, so
receivers must be prepared to process options even if they do
not begin on a word boundary.
Maximum Segment Size
   +--------+--------+---------+--------+
   |00000010|00000100|   max seg size   |
   +--------+--------+---------+--------+
    Kind=2   Length=4
Maximum Segment Size Option Data: 16 bits
If this option is present, then it communicates the maximum
receive segment size at the TCP which sends this segment.
This field must only be sent in the initial connection request
(i.e., in segments with the SYN control bit set). If this
option is not used, any segment size is allowed.
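The two option formats can be walked with a short loop. The sketch below is an illustration only (the function names are hypothetical); it handles the three option kinds listed above and treats any other kind as a generic kind/length/data option.

```python
from typing import List, Optional, Tuple

END_OF_OPTION_LIST = 0
NO_OPERATION = 1
MAXIMUM_SEGMENT_SIZE = 2


def parse_tcp_options(options: bytes) -> List[Tuple[int, bytes]]:
    """Return (kind, option-data) pairs from the options area of a header."""
    parsed = []
    i = 0
    while i < len(options):
        kind = options[i]
        if kind == END_OF_OPTION_LIST:
            break                      # the rest is header padding
        if kind == NO_OPERATION:
            parsed.append((kind, b""))
            i += 1                     # single-octet option
            continue
        if i + 1 >= len(options):
            raise ValueError("option is missing its length octet")
        length = options[i + 1]        # counts kind and length octets too
        if length < 2 or i + length > len(options):
            raise ValueError("bad option length")
        parsed.append((kind, options[i + 2:i + length]))
        i += length
    return parsed


def mss_from_options(options: bytes) -> Optional[int]:
    """Extract the 16-bit MSS value, or None if the option is absent."""
    for kind, data in parse_tcp_options(options):
        if kind == MAXIMUM_SEGMENT_SIZE and len(data) == 2:
            return int.from_bytes(data, "big")
    return None
```

For example, the octets `02 04 05 B4` encode an MSS option with the value 1460.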
Padding: variable
The TCP header padding is used to ensure that the TCP header ends
and data begins on a 32 bit boundary. The padding is composed of
zeros.
Sequence Numbers
A fundamental notion in the design is that every octet of data sent over a TCP connection has a sequence number. Since every octet is sequenced, each of them can be acknowledged. The acknowledgment mechanism employed is cumulative so that an acknowledgment of sequence number X indicates that all octets up to but not including X have been received. This mechanism allows for straightforward duplicate detection in the presence of retransmission. Octets within a segment are numbered so that the first data octet immediately following the header has the lowest number, and the following octets are numbered consecutively.
It is essential to remember that the actual sequence number space is finite, though very large. This space ranges from 0 to 2**32 - 1. Since the space is finite, all arithmetic dealing with sequence numbers must be performed modulo 2**32. This unsigned arithmetic preserves the relationship of sequence numbers as they cycle from 2**32 - 1 to 0 again. There are some subtleties to computer modulo arithmetic, so great care should be taken in programming the comparison of such values. The symbol "=<" means "less than or equal" (modulo 2**32).
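One common way to program these comparisons is to look at the distance between two sequence numbers modulo 2**32 and treat anything less than half the space as "ahead". The sketch below assumes that convention; the helper names are illustrative and the half-space rule is a design choice made here, not something stated in the text above.

```python
SEQ_MOD = 2**32       # size of the sequence number space
HALF_SPACE = 2**31    # distances below this count as "ahead"


def seq_add(seq: int, n: int) -> int:
    """Advance a sequence number by n octets, wrapping at 2**32."""
    return (seq + n) % SEQ_MOD


def seq_lt(a: int, b: int) -> bool:
    """True if a comes strictly before b in modular order."""
    return 0 < (b - a) % SEQ_MOD < HALF_SPACE


def seq_leq(a: int, b: int) -> bool:
    """The "=<" relation: a equals b or comes before it."""
    return (b - a) % SEQ_MOD < HALF_SPACE
```

With these helpers, `seq_lt(2**32 - 10, 5)` is true even though 5 is numerically smaller, because the sequence space has wrapped from 2**32 - 1 back to 0.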
The typical kinds of sequence number comparisons which the TCP must perform include:
(a) Determining that an acknowledgment refers to some sequence
number sent but not yet acknowledged.
(b) Determining that all sequence numbers occupied by a segment
have been acknowledged (e.g., to remove the segment from a
retransmission queue).
(c) Determining that an incoming segment contains sequence numbers
which are expected (i.e., that the segment "overlaps" the
receive window).
In response to sending data the TCP will receive acknowledgments. The following comparisons are needed to process the acknowledgments.
SND.UNA = oldest unacknowledged sequence number
SND.NXT = next sequence number to be sent
SEG.ACK = acknowledgment from the receiving TCP (next sequence
number expected by the receiving TCP)
SEG.SEQ = first sequence number of a segment
SEG.LEN = the number of octets occupied by the data in the segment
(counting SYN and FIN)
SEG.SEQ+SEG.LEN-1 = last sequence number of a segment
A new acknowledgment (called an "acceptable ack"), is one for which the inequality below holds:
SND.UNA < SEG.ACK =< SND.NXT
A segment on the retransmission queue is fully acknowledged if the sum of its sequence number and length is less than or equal to the acknowledgment value in the incoming segment.
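The two acknowledgment tests can be written with offsets measured modulo 2**32, so that they keep working when the sequence space wraps. This is a sketch with hypothetical function names; the half-space bound in the second function is an added assumption to reject acknowledgments that lie behind the segment.

```python
SEQ_MOD = 2**32


def is_acceptable_ack(snd_una: int, seg_ack: int, snd_nxt: int) -> bool:
    """SND.UNA < SEG.ACK =< SND.NXT, evaluated relative to SND.UNA."""
    outstanding = (snd_nxt - snd_una) % SEQ_MOD   # octets sent, not yet acked
    newly_acked = (seg_ack - snd_una) % SEQ_MOD
    return 0 < newly_acked <= outstanding


def is_fully_acknowledged(seg_seq: int, seg_len: int, seg_ack: int) -> bool:
    """SEG.SEQ + SEG.LEN =< SEG.ACK for a segment on the retransmission queue."""
    covered = (seg_ack - seg_seq) % SEQ_MOD       # how far the ack reaches
    return seg_len <= covered < SEQ_MOD // 2
```

For instance, with SND.UNA = 100 and SND.NXT = 200, an ACK of 150 is acceptable, while an ACK of 100 (a duplicate) or 201 (something not yet sent) is not.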
When data is received the following comparisons are needed:
RCV.NXT = next sequence number expected on an incoming segment, and
is the left or lower edge of the receive window
RCV.NXT+RCV.WND-1 = last sequence number expected on an incoming
segment, and is the right or upper edge of the receive window
SEG.SEQ = first sequence number occupied by the incoming segment
SEG.SEQ+SEG.LEN-1 = last sequence number occupied by the incoming
segment
A segment is judged to occupy a portion of valid receive sequence space if
RCV.NXT =< SEG.SEQ < RCV.NXT+RCV.WND
or
RCV.NXT =< SEG.SEQ+SEG.LEN-1 < RCV.NXT+RCV.WND
The first part of this test checks to see if the beginning of the segment falls in the window, the second part of the test checks to see if the end of the segment falls in the window; if the segment passes either part of the test it contains data in the window.
Actually, it is a little more complicated than this. Due to zero windows and zero length segments, we have four cases for the acceptability of an incoming segment:
Segment Receive  Test
Length  Window
------- -------  -------------------------------------------
   0       0     SEG.SEQ = RCV.NXT
   0      >0     RCV.NXT =< SEG.SEQ < RCV.NXT+RCV.WND
  >0       0     not acceptable
  >0      >0     RCV.NXT =< SEG.SEQ < RCV.NXT+RCV.WND
              or RCV.NXT =< SEG.SEQ+SEG.LEN-1 < RCV.NXT+RCV.WND
Note that when the receive window is zero no segments should be acceptable except ACK segments. Thus, it is possible for a TCP to maintain a zero receive window while transmitting data and receiving ACKs. However, even when the receive window is zero, a TCP must process the RST and URG fields of all incoming segments.
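The four acceptability cases translate directly into code. The sketch below uses hypothetical function names and modular offsets from RCV.NXT so that the window test still works across a wrap of the sequence space.

```python
SEQ_MOD = 2**32


def in_receive_window(seq: int, rcv_nxt: int, rcv_wnd: int) -> bool:
    """RCV.NXT =< seq < RCV.NXT+RCV.WND, computed modulo 2**32."""
    return (seq - rcv_nxt) % SEQ_MOD < rcv_wnd


def segment_is_acceptable(seg_seq: int, seg_len: int,
                          rcv_nxt: int, rcv_wnd: int) -> bool:
    """Apply the four-case test for an incoming segment."""
    if seg_len == 0 and rcv_wnd == 0:
        return seg_seq == rcv_nxt
    if seg_len == 0:
        return in_receive_window(seg_seq, rcv_nxt, rcv_wnd)
    if rcv_wnd == 0:
        return False                   # data cannot fit in a zero window
    last = (seg_seq + seg_len - 1) % SEQ_MOD
    return (in_receive_window(seg_seq, rcv_nxt, rcv_wnd)
            or in_receive_window(last, rcv_nxt, rcv_wnd))
```

A segment that starts before RCV.NXT but ends inside the window, such as `segment_is_acceptable(990, 20, 1000, 500)`, is accepted because its last octet falls in the window.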
We have taken advantage of the numbering scheme to protect certain control information as well. This is achieved by implicitly including some control flags in the sequence space so they can be retransmitted and acknowledged without confusion (i.e., one and only one copy of the control will be acted upon). Control information is not physically carried in the segment data space. Consequently, we must adopt rules for implicitly assigning sequence numbers to control. The SYN and FIN are the only controls requiring this protection, and these controls are used only at connection opening and closing. For sequence number purposes, the SYN is considered to occur before the first actual data octet of the segment in which it occurs, while the FIN is considered to occur after the last actual data octet in a segment in which it occurs. The segment length (SEG.LEN) includes both data and sequence space occupying controls. When a SYN is present then SEG.SEQ is the sequence number of the SYN.
Initial Sequence Number Selection
The protocol places no restriction on a particular connection being used over and over again. A connection is defined by a pair of sockets. New instances of a connection will be referred to as incarnations of the connection. The problem that arises from this is -- "how does the TCP identify duplicate segments from previous incarnations of the connection?" This problem becomes apparent if the connection is being opened and closed in quick succession, or if the connection breaks with loss of memory and is then reestablished.
To avoid confusion we must prevent segments from one incarnation of a connection from being used while the same sequence numbers may still be present in the network from an earlier incarnation. We want to assure this, even if a TCP crashes and loses all knowledge of the sequence numbers it has been using. When new connections are created, an initial sequence number (ISN) generator is employed which selects a new 32 bit ISN. The generator is bound to a (possibly fictitious) 32 bit clock whose low order bit is incremented roughly every 4 microseconds. Thus, the ISN cycles approximately every 4.55 hours. Since we assume that segments will stay in the network no more than the Maximum Segment Lifetime (MSL) and that the MSL is less than 4.55 hours we can reasonably assume that ISN's will be unique.
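A clock-driven generator of this kind can be sketched in a few lines. The names below are illustrative, and the clock source is left to the caller. Note that straightforward arithmetic with a 4 microsecond tick gives a cycle of about 4.77 hours, a little longer than the 4.55 hours quoted above; the conclusion that the cycle is far longer than the MSL is unaffected.

```python
SEQ_SPACE = 2**32
TICKS_PER_SECOND = 250_000   # one increment roughly every 4 microseconds


def isn_from_clock(seconds: float) -> int:
    """Map a clock reading in seconds to a 32-bit initial sequence number."""
    return int(seconds * TICKS_PER_SECOND) % SEQ_SPACE


def isn_cycle_hours() -> float:
    """Time for the 32-bit ISN clock to wrap around, in hours."""
    return SEQ_SPACE / TICKS_PER_SECOND / 3600
```

Calling `isn_from_clock(time.monotonic())` would be one way to drive it from a real clock; two connections opened a second apart then start 250,000 sequence numbers apart.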
For each connection there is a send sequence number and a receive sequence number. The initial send sequence number (ISS) is chosen by the data sending TCP, and the initial receive sequence number (IRS) is learned during the connection establishing procedure.
For a connection to be established or initialized, the two TCPs must synchronize on each other's initial sequence numbers. This is done in an exchange of connection establishing segments carrying a control bit called "SYN" (for synchronize) and the initial sequence numbers. As a shorthand, segments carrying the SYN bit are also called "SYNs". Hence, the solution requires a suitable mechanism for picking an initial sequence number and a slightly involved handshake to exchange the ISN's.
The synchronization requires each side to send its own initial sequence number and to receive a confirmation of it in acknowledgment from the other side. Each side must also receive the other side's initial sequence number and send a confirming acknowledgment.
1) A --> B SYN my sequence number is X
2) A <-- B ACK your sequence number is X
3) A <-- B SYN my sequence number is Y
4) A --> B ACK your sequence number is Y
Because steps 2 and 3 can be combined in a single message this is called the three way (or three message) handshake.
A three way handshake is necessary because sequence numbers are not tied to a global clock in the network, and TCPs may have different mechanisms for picking the ISN's. The receiver of the first SYN has no way of knowing whether the segment was an old delayed one or not, unless it remembers the last sequence number used on the connection (which is not always possible), and so it must ask the sender to verify this SYN. The three way handshake and the advantages of a clock-driven scheme are discussed in .
Knowing When to Keep Quiet
To be sure that a TCP does not create a segment that carries a sequence number which may be duplicated by an old segment remaining in the network, the TCP must keep quiet for a maximum segment lifetime (MSL) before assigning any sequence numbers upon starting up or recovering from a crash in which memory of sequence numbers in use was lost. For this specification the MSL is taken to be 2 minutes. This is an engineering choice, and may be changed if experience indicates it is desirable to do so. Note that if a TCP is reinitialized in some sense, yet retains its memory of sequence numbers in use, then it need not wait at all; it must only be sure to use sequence numbers larger than those recently used.
The TCP Quiet Time Concept
This specification provides that hosts which "crash" without
retaining any knowledge of the last sequence numbers transmitted on
each active (i.e., not closed) connection shall delay emitting any
TCP segments for at least the agreed Maximum Segment Lifetime (MSL)
in the internet system of which the host is a part. In the
paragraphs below, an explanation for this specification is given.
TCP implementors may violate the "quiet time" restriction, but only
at the risk of causing some old data to be accepted as new or new
data rejected as old duplicated by some receivers in the internet
system.
TCPs consume sequence number space each time a segment is formed and
entered into the network output queue at a source host. The
duplicate detection and sequencing algorithm in the TCP protocol
relies on the unique binding of segment data to sequence space to
the extent that sequence numbers will not cycle through all 2**32
values before the segment data bound to those sequence numbers has
been delivered and acknowledged by the receiver and all duplicate
copies of the segments have "drained" from the internet. Without
such an assumption, two distinct TCP segments could conceivably be
assigned the same or overlapping sequence numbers, causing confusion
at the receiver as to which data is new and which is old. Remember
that each segment is bound to as many consecutive sequence numbers
as there are octets of data in the segment.
Under normal conditions, TCPs keep track of the next sequence number
to emit and the oldest awaiting acknowledgment so as to avoid
mistakenly using a sequence number over before its first use has
been acknowledged. This alone does not guarantee that old duplicate
data is drained from the net, so the sequence space has been made
very large to reduce the probability that a wandering duplicate will
cause trouble upon arrival. At 2 megabits/sec. it takes 4.5 hours
to use up 2**32 octets of sequence space. Since the maximum segment
lifetime in the net is not likely to exceed a few tens of seconds,
this is deemed ample protection for foreseeable nets, even if data
rates escalate to 10's of megabits/sec. At 100 megabits/sec, the
cycle time is 5.4 minutes which may be a little short, but still
within reason.
The basic duplicate detection and sequencing algorithm in TCP can be
defeated, however, if a source TCP does not have any memory of the
sequence numbers it last used on a given connection. For example, if
the TCP were to start all connections with sequence number 0, then
upon crashing and restarting, a TCP might re-form an earlier
connection (possibly after half-open connection resolution) and emit
packets with sequence numbers identical to or overlapping with
packets still in the network which were emitted on an earlier
incarnation of the same connection. In the absence of knowledge
about the sequence numbers used on a particular connection, the TCP
specification recommends that the source delay for MSL seconds
before emitting segments on the connection, to allow time for
segments from the earlier connection incarnation to drain from the
system.
Even hosts which can remember the time of day and use it to select
initial sequence number values are not immune from this problem
(i.e., even if time of day is used to select an initial sequence
number for each new connection incarnation).
Suppose, for example, that a connection is opened starting with
sequence number S. Suppose that this connection is not used much
and that eventually the initial sequence number function (ISN(t))
takes on a value equal to the sequence number, say S1, of the last
segment sent by this TCP on a particular connection. Now suppose,
at this instant, the host crashes, recovers, and establishes a new
incarnation of the connection. The initial sequence number chosen is
S1 = ISN(t) -- last used sequence number on old incarnation of
connection! If the recovery occurs quickly enough, any old
duplicates in the net bearing sequence numbers in the neighborhood
of S1 may arrive and be treated as new packets by the receiver of
the new incarnation of the connection.
The problem is that the recovering host may not know for how long it
crashed nor does it know whether there are still old duplicates in
the system from earlier connection incarnations.
One way to deal with this problem is to deliberately delay emitting
segments for one MSL after recovery from a crash; this is the "quiet
time" specification. Hosts which prefer to avoid waiting and are
willing to risk possible confusion of old and new packets at a given
destination may choose not to wait for the "quiet time".
Implementors may provide TCP users with the ability to select on a
connection by connection basis whether to wait after a crash, or may
informally implement the "quiet time" for all connections.
Obviously, even where a user selects to "wait," this is not
necessary after the host has been "up" for at least MSL seconds.
To summarize: every segment emitted occupies one or more sequence
numbers in the sequence space, the numbers occupied by a segment are
"busy" or "in use" until MSL seconds have passed, upon crashing a
block of space-time is occupied by the octets of the last emitted
segment, if a new connection is started too soon and uses any of the
sequence numbers in the space-time footprint of the last segment of
the previous connection incarnation, there is a potential sequence
number overlap area which could cause confusion at the receiver.
TCP Connection Open
The "three-way handshake" is the procedure used to establish a connection. This procedure normally is initiated by one TCP and responded to by another TCP. The procedure also works if two TCPs simultaneously initiate the procedure. When a simultaneous attempt occurs, each TCP receives a "SYN" segment which carries no acknowledgment after it has sent a "SYN". Of course, the arrival of an old duplicate "SYN" segment can potentially make it appear, to the recipient, that a simultaneous connection initiation is in progress. Proper use of "reset" segments can disambiguate these cases.
Several examples of connection initiation follow. Although these examples do not show connection synchronization using data-carrying segments, this is perfectly legitimate, so long as the receiving TCP doesn't deliver the data to the user until it is clear the data is valid (i.e., the data must be buffered at the receiver until the connection reaches the ESTABLISHED state). The three-way handshake reduces the possibility of false connections. It is the implementation of a trade-off between memory and messages to provide information for this checking.
The simplest three-way handshake is shown in figure 7 below. The figures should be interpreted in the following way. Each line is numbered for reference purposes. Right arrows (-->) indicate departure of a TCP segment from TCP A to TCP B, or arrival of a segment at B from A. Left arrows (<--), indicate the reverse. Ellipsis (...) indicates a segment which is still in the network (delayed). An "XXX" indicates a segment which is lost or rejected. Comments appear in parentheses. TCP states represent the state AFTER the departure or arrival of the segment (whose contents are shown in the center of each line). Segment contents are shown in abbreviated form, with sequence number, control flags, and ACK field. Other fields such as window, addresses, lengths, and text have been left out in the interest of clarity.
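      TCP A                                                TCP B

  1.  CLOSED                                               LISTEN

  2.  SYN-SENT    --> <SEQ=100><CTL=SYN>               --> SYN-RECEIVED

  3.  ESTABLISHED <-- <SEQ=300><ACK=101><CTL=SYN,ACK>  <-- SYN-RECEIVED

  4.  ESTABLISHED --> <SEQ=101><ACK=301><CTL=ACK>       --> ESTABLISHED

  5.  ESTABLISHED --> <SEQ=101><ACK=301><CTL=ACK><DATA> --> ESTABLISHED

          Basic 3-Way Handshake for Connection Synchronization

                                Figure 7.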
In line 2 of figure 7, TCP A begins by sending a SYN segment indicating that it will use sequence numbers starting with sequence number 100. In line 3, TCP B sends a SYN and acknowledges the SYN it received from TCP A. Note that the acknowledgment field indicates TCP B is now expecting to hear sequence 101, acknowledging the SYN which occupied sequence 100.
At line 4, TCP A responds with an empty segment containing an ACK for TCP B's SYN; and in line 5, TCP A sends some data. Note that the sequence number of the segment in line 5 is the same as in line 4 because the ACK does not occupy sequence number space (if it did, we would wind up ACKing ACK's!).
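The sequence and acknowledgment numbers in this exchange follow mechanically from the two initial sequence numbers. The sketch below (a hypothetical helper, not a network implementation) lists the segments of a basic handshake followed by one optional data segment.

```python
SEQ_MOD = 2**32


def three_way_handshake(isn_a: int, isn_b: int, data: bytes = b""):
    """Return the segments as (direction, seq, ack, control) tuples.

    ack is None where the ACK bit is not set. A SYN occupies one
    sequence number, so each side acknowledges ISN + 1.
    """
    a_next = (isn_a + 1) % SEQ_MOD
    b_next = (isn_b + 1) % SEQ_MOD
    segments = [
        ("A->B", isn_a, None, "SYN"),
        ("B->A", isn_b, a_next, "SYN,ACK"),
        ("A->B", a_next, b_next, "ACK"),
    ]
    if data:
        # Same sequence number as the bare ACK, which consumed none.
        segments.append(("A->B", a_next, b_next, "ACK+DATA"))
    return segments
```

With initial sequence numbers 100 and 300, the third and fourth segments both carry sequence number 101, matching the observation that an ACK does not occupy sequence number space.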
Simultaneous initiation is only slightly more complex, as is shown in figure 8. Each TCP cycles from CLOSED to SYN-SENT to SYN-RECEIVED to ESTABLISHED.
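      TCP A                                            TCP B

  1.  CLOSED                                           CLOSED

  2.  SYN-SENT     --> <SEQ=100><CTL=SYN>              ...

  3.  SYN-RECEIVED <-- <SEQ=300><CTL=SYN>              <-- SYN-SENT

  4.               ... <SEQ=100><CTL=SYN>              --> SYN-RECEIVED

  5.  SYN-RECEIVED --> <SEQ=100><ACK=301><CTL=SYN,ACK> ...

  6.  ESTABLISHED  <-- <SEQ=300><ACK=101><CTL=SYN,ACK> <-- SYN-RECEIVED

  7.               ... <SEQ=101><ACK=301><CTL=ACK>     --> ESTABLISHED

                Simultaneous Connection Synchronization

                               Figure 8.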
The principal reason for the three-way handshake is to prevent old duplicate connection initiations from causing confusion. To deal with this, a special control message, reset, has been devised. If the receiving TCP is in a non-synchronized state (i.e., SYN-SENT, SYN-RECEIVED), it returns to LISTEN on receiving an acceptable reset. If the TCP is in one of the synchronized states (ESTABLISHED, FIN-WAIT-1, FIN-WAIT-2, CLOSE-WAIT, CLOSING, LAST-ACK, TIME-WAIT), it aborts the connection and informs its user. We discuss this latter case under "half-open" connections below.
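      TCP A                                                TCP B

  1.  CLOSED                                               LISTEN

  2.  SYN-SENT    --> <SEQ=100><CTL=SYN>               ...

  3.  (duplicate) ... <SEQ=90><CTL=SYN>               --> SYN-RECEIVED

  4.  SYN-SENT    <-- <SEQ=300><ACK=91><CTL=SYN,ACK>  <-- SYN-RECEIVED

  5.  SYN-SENT    --> <SEQ=91><CTL=RST>               --> LISTEN

  6.              ... <SEQ=100><CTL=SYN>               --> SYN-RECEIVED

  7.  SYN-SENT    <-- <SEQ=400><ACK=101><CTL=SYN,ACK>  <-- SYN-RECEIVED

  8.  ESTABLISHED --> <SEQ=101><ACK=401><CTL=ACK>      --> ESTABLISHED

                    Recovery from Old Duplicate SYN

                               Figure 9.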
As a simple example of recovery from old duplicates, consider figure 9. At line 3, an old duplicate SYN arrives at TCP B. TCP B cannot tell that this is an old duplicate, so it responds normally (line 4). TCP A detects that the ACK field is incorrect and returns a RST (reset) with its SEQ field selected to make the segment believable. TCP B, on receiving the RST, returns to the LISTEN state. When the original SYN (pun intended) finally arrives at line 6, the synchronization proceeds normally. If the SYN at line 6 had arrived before the RST, a more complex exchange might have occurred with RST's sent in both directions.
Half-Open Connections and Other Anomalies
An established connection is said to be "half-open" if one of the TCPs has closed or aborted the connection at its end without the knowledge of the other, or if the two ends of the connection have become desynchronized owing to a crash that resulted in loss of memory. Such connections will automatically become reset if an attempt is made to send data in either direction. However, half-open connections are expected to be unusual, and the recovery procedure is mildly involved.
If at site A the connection no longer exists, then an attempt by the user at site B to send any data on it will result in the site B TCP receiving a reset control message. Such a message indicates to the site B TCP that something is wrong, and it is expected to abort the connection.
Assume that two user processes A and B are communicating with one another when a crash occurs causing loss of memory to A's TCP. Depending on the operating system supporting A's TCP, it is likely that some error recovery mechanism exists. When the TCP is up again, A is likely to start again from the beginning or from a recovery point. As a result, A will probably try to OPEN the connection again or try to SEND on the connection it believes open. In the latter case, it receives the error message "connection not open" from the local (A's) TCP. In an attempt to establish the connection, A's TCP will send a segment containing SYN. This scenario leads to the example shown in figure 10. After TCP A crashes, the user attempts to re-open the connection. TCP B, in the meantime, thinks the connection is open.
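      TCP A                                           TCP B

  1.  (CRASH)                               (send 300,receive 100)

  2.  CLOSED                                           ESTABLISHED

  3.  SYN-SENT --> <SEQ=400><CTL=SYN>              --> (??)

  4.  (!!)     <-- <SEQ=300><ACK=100><CTL=ACK>     <-- ESTABLISHED

  5.  SYN-SENT --> <SEQ=100><CTL=RST>              --> (Abort!!)

  6.  SYN-SENT                                         CLOSED

  7.  SYN-SENT --> <SEQ=400><CTL=SYN>              -->

                     Half-Open Connection Discovery

                               Figure 10.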
When the SYN arrives at line 3, TCP B, being in a synchronized state, and the incoming segment outside the window, responds with an acknowledgment indicating what sequence it next expects to hear (ACK 100). TCP A sees that this segment does not acknowledge anything it sent and, being unsynchronized, sends a reset (RST) because it has detected a half-open connection. TCP B aborts at line 5. TCP A will continue to try to establish the connection; the problem is now reduced to the basic 3-way handshake of figure 7.
An interesting alternative case occurs when TCP A crashes and TCP B tries to send data on what it thinks is a synchronized connection. This is illustrated in figure 11. In this case, the data arriving at TCP A from TCP B (line 2) is unacceptable because no such connection exists, so TCP A sends a RST. The RST is acceptable so TCP B processes it and aborts the connection.
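        TCP A                                              TCP B

  1.  (CRASH)                                   (send 300,receive 100)

  2.  (??)    <-- <SEQ=300><ACK=100><DATA=10><CTL=ACK> <-- ESTABLISHED

  3.          --> <SEQ=100><CTL=RST>                   --> (ABORT!!)

           Active Side Causes Half-Open Connection Discovery

                               Figure 11.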
In figure 12, we find the two TCPs A and B with passive connections waiting for SYN. An old duplicate arriving at TCP B (line 2) stirs B into action. A SYN-ACK is returned (line 3) and causes TCP A to generate a RST (the ACK in line 3 is not acceptable). TCP B accepts the reset and returns to its passive LISTEN state.
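      TCP A                                         TCP B

  1.  LISTEN                                        LISTEN

  2.       ... <SEQ=Z><CTL=SYN>                -->  SYN-RECEIVED

  3.  (??) <-- <SEQ=X><ACK=Z+1><CTL=SYN,ACK>   <--  SYN-RECEIVED

  4.       --> <SEQ=Z+1><CTL=RST>              -->  (return to LISTEN!)

  5.  LISTEN                                        LISTEN

       Old Duplicate SYN Initiates a Reset on two Passive Sockets

                               Figure 12.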
A variety of other cases are possible, all of which are accounted for by the following rules for RST generation and processing.
As a general rule, reset (RST) must be sent whenever a segment arrives which apparently is not intended for the current connection. A reset must not be sent if it is not clear that this is the case.
There are three groups of states:
1. If the connection does not exist (CLOSED) then a reset is sent
in response to any incoming segment except another reset. In
particular, SYNs addressed to a non-existent connection are rejected
by this means.
If the incoming segment has an ACK field, the reset takes its
sequence number from the ACK field of the segment, otherwise the
reset has sequence number zero and the ACK field is set to the sum
of the sequence number and segment length of the incoming segment.
The connection remains in the CLOSED state.
2. If the connection is in any non-synchronized state (LISTEN,
SYN-SENT, SYN-RECEIVED), and the incoming segment acknowledges
something not yet sent (the segment carries an unacceptable ACK), or
if an incoming segment has a security level or compartment which
does not exactly match the level and compartment requested for the
connection, a reset is sent.
If our SYN has not been acknowledged and the precedence level of the
incoming segment is higher than the precedence level requested then
either raise the local precedence level (if allowed by the user and
the system) or send a reset; or if the precedence level of the
incoming segment is lower than the precedence level requested then
continue as if the precedence matched exactly (if the remote TCP
cannot raise the precedence level to match ours this will be
detected in the next segment it sends, and the connection will be
terminated then). If our SYN has been acknowledged (perhaps in this
incoming segment) the precedence level of the incoming segment must
match the local precedence level exactly, if it does not a reset
must be sent.
If the incoming segment has an ACK field, the reset takes its
sequence number from the ACK field of the segment, otherwise the
reset has sequence number zero and the ACK field is set to the sum
of the sequence number and segment length of the incoming segment.
The connection remains in the same state.
3. If the connection is in a synchronized state (ESTABLISHED,
FIN-WAIT-1, FIN-WAIT-2, CLOSE-WAIT, CLOSING, LAST-ACK, TIME-WAIT),
any unacceptable segment (out of window sequence number or
unacceptable acknowledgment number) must elicit only an empty
acknowledgment segment containing the current send-sequence number
and an acknowledgment indicating the next sequence number expected
to be received, and the connection remains in the same state.
If an incoming segment has a security level, or compartment, or
precedence which does not exactly match the level, and compartment,
and precedence requested for the connection, a reset is sent and
connection goes to the CLOSED state. The reset takes its sequence
number from the ACK field of the incoming segment.
In all states except SYN-SENT, all reset (RST) segments are validated by checking their SEQ-fields. A reset is valid if its sequence number is in the window. In the SYN-SENT state (a RST received in response to an initial SYN), the RST is acceptable if the ACK field acknowledges the SYN.
The receiver of a RST first validates it, then changes state. If the receiver was in the LISTEN state, it ignores it. If the receiver was in SYN-RECEIVED state and had previously been in the LISTEN state, then the receiver returns to the LISTEN state, otherwise the receiver aborts the connection and goes to the CLOSED state. If the receiver was in any other state, it aborts the connection and advises the user and goes to the CLOSED state.
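The state change on receipt of a reset can be captured in a small function. This is a sketch that covers only the transition described in the paragraph above; it assumes the RST has already been validated, and the `was_passive_open` flag is an illustrative name for "had previously been in the LISTEN state".

```python
def state_after_valid_rst(state: str, was_passive_open: bool = False) -> str:
    """Next connection state after an already-validated RST arrives."""
    if state == "LISTEN":
        return "LISTEN"        # the reset is ignored
    if state == "SYN-RECEIVED" and was_passive_open:
        return "LISTEN"        # go back to waiting for a new SYN
    return "CLOSED"            # abort the connection (and advise the user)
```

For example, the TCP B of figure 9 is in SYN-RECEIVED after a passive open, so a valid reset sends it back to LISTEN rather than to CLOSED.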
TCP Connection Close
CLOSE is an operation meaning "I have no more data to send." The notion of closing a full-duplex connection is subject to ambiguous interpretation, of course, since it may not be obvious how to treat the receiving side of the connection. We have chosen to treat CLOSE in a simplex fashion. The user who CLOSEs may continue to RECEIVE until he is told that the other side has CLOSED also. Thus, a program could initiate several SENDs followed by a CLOSE, and then continue to RECEIVE until signaled that a RECEIVE failed because the other side has CLOSED. We assume that the TCP will signal a user, even if no RECEIVEs are outstanding, that the other side has closed, so the user can terminate his side gracefully. A TCP will reliably deliver all buffers SENT before the connection was CLOSED so a user who expects no data in return need only wait to hear the connection was CLOSED successfully to know that all his data was received at the destination TCP. Users must keep reading connections they close for sending until the TCP says no more data. There are essentially three cases:
1) The user initiates by telling the TCP to CLOSE the connection
2) The remote TCP initiates by sending a FIN control signal
3) Both users CLOSE simultaneously
Case 1: Local user initiates the close
In this case, a FIN segment can be constructed and placed on the
outgoing segment queue. No further SENDs from the user will be
accepted by the TCP, and it enters the FIN-WAIT-1 state. RECEIVEs
are allowed in this state. All segments preceding and including FIN
will be retransmitted until acknowledged. When the other TCP has
both acknowledged the FIN and sent a FIN of its own, the first TCP
can ACK this FIN. Note that a TCP receiving a FIN will ACK but not
send its own FIN until its user has CLOSED the connection also.
Case 2: TCP receives a FIN from the network
If an unsolicited FIN arrives from the network, the receiving TCP
can ACK it and tell the user that the connection is closing. The
user will respond with a CLOSE, upon which the TCP can send a FIN to
the other TCP after sending any remaining data. The TCP then waits
until its own FIN is acknowledged whereupon it deletes the
connection. If an ACK is not forthcoming, after the user timeout
the connection is aborted and the user is told.
Case 3: both users close simultaneously
A simultaneous CLOSE by users at both ends of a connection causes
FIN segments to be exchanged. When all segments preceding the FINs
have been processed and acknowledged, each TCP can ACK the FIN it
has received. Both will, upon receiving these ACKs, delete the
connection.
      TCP A                     TCP B
  1.  ESTABLISHED               ESTABLISHED
  3.  FIN-WAIT-2  <--
  5.  TIME-WAIT   -->
  6.  (2 MSL)
          Normal Close Sequence
      TCP A                     TCP B
  1.  ESTABLISHED               ESTABLISHED
  2.  (Close)                   (Close)
  3.  CLOSING     -->
  4.  TIME-WAIT                 TIME-WAIT
      (2 MSL)                   (2 MSL)
          Simultaneous Close Sequence
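The three close cases can be traced with a small transition table. This is a sketch under stated assumptions: the state names follow the TCP specification (RFC 793), including CLOSE-WAIT and LAST-ACK, which the surviving diagram rows do not show, while the event names (`CLOSE`, `RCV_FIN`, `RCV_ACK_OF_FIN`, `TIMEOUT_2MSL`) are hypothetical labels chosen for illustration. Data transfer, retransmission, and the case where a FIN and the ACK of our own FIN arrive in the same segment are left out.

```python
# (current state, event) -> next state, for the closing half of the state machine
CLOSE_TRANSITIONS = {
    ("ESTABLISHED", "CLOSE"): "FIN-WAIT-1",        # local user closes, FIN is sent
    ("ESTABLISHED", "RCV_FIN"): "CLOSE-WAIT",      # remote FIN arrives, it is ACKed
    ("FIN-WAIT-1", "RCV_ACK_OF_FIN"): "FIN-WAIT-2",
    ("FIN-WAIT-1", "RCV_FIN"): "CLOSING",          # simultaneous close
    ("FIN-WAIT-2", "RCV_FIN"): "TIME-WAIT",
    ("CLOSING", "RCV_ACK_OF_FIN"): "TIME-WAIT",
    ("CLOSE-WAIT", "CLOSE"): "LAST-ACK",           # local user closes, FIN is sent
    ("LAST-ACK", "RCV_ACK_OF_FIN"): "CLOSED",
    ("TIME-WAIT", "TIMEOUT_2MSL"): "CLOSED",
}


def trace(start, events):
    """Return the list of states visited, starting from `start`."""
    states = [start]
    for event in events:
        states.append(CLOSE_TRANSITIONS[(states[-1], event)])
    return states


# Case 1: the local user initiates the close.
print(trace("ESTABLISHED", ["CLOSE", "RCV_ACK_OF_FIN", "RCV_FIN", "TIMEOUT_2MSL"]))
# Case 2: a FIN arrives from the network first.
print(trace("ESTABLISHED", ["RCV_FIN", "CLOSE", "RCV_ACK_OF_FIN"]))
# Case 3: both users close simultaneously.
print(trace("ESTABLISHED", ["CLOSE", "RCV_FIN", "RCV_ACK_OF_FIN", "TIMEOUT_2MSL"]))
```

An event that is not listed for the current state raises `KeyError`, which makes unhandled combinations visible rather than silently ignored.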
Packet decoders are the most useful software tools for examining TCP operation, and TCPdump is one such tool: free software designed specifically for TCP.
TCPdump is the UNIX version of a packet decoder, and Lawrence Berkeley Labs is the place to look for it. Originally written by Van Jacobson to analyze TCP performance problems, it is still a decent tool for that task, but many features have been added since then.
Getting TCPdump to work on a UNIX system can be a chore. TCPdump must be able to put the interface (typically an Ethernet) into promiscuous mode to read all the network traffic. Currently supported systems include SunOS, Ultrix, and most BSDs. Linux is not supported, though there have been reports of a port.
The simplest way to use TCPdump is to run it with just an `-i' switch to specify which network interface should be used. This will dump summary information for every Internet packet received or transmitted on the interface. However, TCPdump provides several important options, as well as the ability to specify an expression to restrict the range of packets you wish to study.
Rather than rehash here what is better documented elsewhere, I suggest you read TCPdump's exceptionally well written manual page, particularly if you intend to use TCPdump for analyzing TCP, DNS, NFS, SLIP, or Appletalk.
Problems You Might Encounter
Check to make sure you're specifying the correct network interface with the -i option, which I suggest you always use explicitly. If you're having DNS problems, TCPdump might hang trying to look up DNS names for IP addresses; try the -f or -n options to disable this feature. If you still see nothing, check the kernel interface: TCPdump might be misconfigured for your system.
At the end of its run, TCPdump will inform you if any packets were dropped in the kernel. If this becomes a problem, it's likely that your host can't keep up with the network traffic and decode it at the same time. Try using TCPdump's -w option to bypass the decoding and write the raw packets to a file, then come back later and decode the file with the -r switch. You can also try using -s to reduce the capture snapshot size.
Messages that end like [|rip] and [|domain]
Messages ending with [|proto] indicate that the packet couldn't be completely decoded because the capture snapshot size (the so-called "snarf length") was too small. Increase it with the -s switch.
Decoding RIP traces
TCPdump assumes that UDP packets sourced from or targeted at port 520 conform to the Routing Information Protocol (RIP), the distance-vector interior IP routing protocol, of which several versions are in use. RIP packets can be explicitly requested from TCPdump by specifying the clause udp port route.
For each RIP packet, TCPdump prints the RIP command. If the RIP command is rip-resp, the routing information in the packet is printed.
If the RIP decode ends with [|rip], the packet was truncated and though it contained additional routing entries, they could not be decoded. Use the -s switch to enlarge the capture snapshot size. According to RFC 1058, the maximum size of a RIP packet is 512 bytes, excluding the IP header (usually 20 bytes) and the UDP header (usually 8 bytes). Using -s540 should capture even the largest RIP packets.
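The snapshot-size arithmetic can be written out as below. This is a small sketch, not part of TCPdump: `rip_snaplen` is a hypothetical helper, and the `link_hdr` parameter is an added assumption for captures where the snapshot is counted from the start of the link-layer frame (for example, a 14-byte Ethernet header), a case the figure of 540 above does not include.

```python
def rip_snaplen(rip_max=512, ip_hdr=20, udp_hdr=8, link_hdr=0):
    """Bytes to capture so that the largest RIP packet is not truncated."""
    return rip_max + ip_hdr + udp_hdr + link_hdr


print(rip_snaplen())             # 540, the value used with -s in the text
print(rip_snaplen(link_hdr=14))  # 554, if a 14-byte Ethernet header is counted too
```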
Network Working Group J. Postel
Request for Comments: 879 ISI
The TCP Maximum Segment Size
and Related Topics
This memo discusses the TCP Maximum Segment Size Option and related topics. The purpose is to clarify some aspects of TCP and its interaction with IP. This memo is a clarification to the TCP specification, and contains information that may be considered as "advice to implementers".
This memo discusses the TCP Maximum Segment Size and its relation to
the IP Maximum Datagram Size. TCP is specified in reference [1]. IP
is specified in references [2,3].
This discussion is necessary because the current specification of
this TCP option is ambiguous.
Much of the difficulty with understanding these sizes and their
relationship has been due to the variable size of the IP and TCP
headers.
There have been some assumptions made about using other than the
default size for datagrams with some unfortunate results.
HOSTS MUST NOT SEND DATAGRAMS LARGER THAN 576 OCTETS UNLESS THEY
HAVE SPECIFIC KNOWLEDGE THAT THE DESTINATION HOST IS PREPARED TO
ACCEPT LARGER DATAGRAMS.
This is a long established rule.
To resolve the ambiguity in the TCP Maximum Segment Size option
definition the following rule is established:
THE TCP MAXIMUM SEGMENT SIZE IS THE IP MAXIMUM DATAGRAM SIZE MINUS
FORTY.
The default IP Maximum Datagram Size is 576.
The default TCP Maximum Segment Size is 536.
2. The IP Maximum Datagram Size
Hosts are not required to reassemble infinitely large IP datagrams.
The maximum size datagram that all hosts are required to accept or
reassemble from fragments is 576 octets. The maximum size reassembly
buffer every host must have is 576 octets. Hosts are allowed to
accept larger datagrams and assemble fragments into larger datagrams;
hosts may have buffers as large as they please.
Hosts must not send datagrams larger than 576 octets unless they have
specific knowledge that the destination host is prepared to accept
larger datagrams.
3. The TCP Maximum Segment Size Option
TCP provides an option that may be used at the time a connection is
established (only) to indicate the maximum size TCP segment that can
be accepted on that connection. This Maximum Segment Size (MSS)
announcement (often mistakenly called a negotiation) is sent from the
data receiver to the data sender and says "I can accept TCP segments
up to size X". The size (X) may be larger or smaller than the
default. The MSS can be used completely independently in each
direction of data flow. The result may be quite different maximum
sizes in the two directions.
The MSS counts only data octets in the segment, it does not count the
TCP header or the IP header.
A footnote: The MSS value counts only data octets, thus it does not
count the TCP SYN and FIN control bits even though SYN and FIN do
consume TCP sequence numbers.
4. The Relationship of TCP Segments and IP Datagrams
TCP segments are transmitted as the data in IP datagrams. The
correspondence between TCP segments and IP datagrams must be one to
one. This is because TCP expects to find exactly one complete TCP
segment in each block of data turned over to it by IP, and IP must
turn over a block of data for each datagram received (or completely
reassembled).
5. Layering and Modularity
TCP is an end to end reliable data stream protocol with error
control, flow control, etc. TCP remembers many things about the
state of a connection.
IP is a one shot datagram protocol. IP has no memory of the
datagrams transmitted. It is not appropriate for IP to keep any
information about the maximum datagram size a particular destination
host might be capable of accepting.
TCP and IP are distinct layers in the protocol architecture, and are
often implemented in distinct program modules.
Some people seem to think that there must be no communication between
protocol layers or program modules. There must be communication
between layers and modules, but it should be carefully specified and
controlled. One problem in understanding the correct view of
communication between protocol layers or program modules in general,
or between TCP and IP in particular is that the documents on
protocols are not very clear about it. This is often because the
documents are about the protocol exchanges between machines, not the
program architecture within a machine, and the desire to allow many
program architectures with different organization of tasks into
modules.
6. IP Information Requirements
There is no general requirement that IP keep information on a per
host basis.
IP must make a decision about which directly attached network address
to send each datagram to. This is simply mapping an IP address into
a directly attached network address.
There are two cases to consider: the destination is on the same
network, and the destination is on a different network.
Same Network
For some networks the directly attached network address can
be computed from the IP address for destination hosts on the
directly attached network.
For other networks the mapping must be done by table look up
(however the table is initialized and maintained, for
example, [4]).
Different Network
The IP address must be mapped to the directly attached network
address of a gateway. For networks with one gateway to the
rest of the Internet the host need only determine and remember
the gateway address and use it for sending all datagrams to
other networks.
For networks with multiple gateways to the rest of the
Internet, the host must decide which gateway to use for each
datagram sent. It need only check the destination network of
the IP address and keep information on which gateway to use for
each network.
The IP does, in some cases, keep per host routing information for
other hosts on the directly attached network. The IP does, in some
cases, keep per network routing information.
A Special Case
There are two ICMP messages that convey information about
particular hosts. These are subtypes of the Destination
Unreachable and the Redirect ICMP messages. These messages are
expected only in very unusual circumstances. To make effective
use of these messages the receiving host would have to keep
information about the specific hosts reported on. Because these
messages are quite rare it is strongly recommended that this be
done through an exception mechanism rather than having the IP keep
per host tables for all hosts.
7. The Relationship between IP Datagram and TCP Segment Sizes
The relationship between the value of the maximum IP datagram size
and the maximum TCP segment size is obscure. The problem is that
both the IP header and the TCP header may vary in length. The TCP
Maximum Segment Size option (MSS) is defined to specify the maximum
number of data octets in a TCP segment exclusive of TCP (or IP)
header.
To notify the data sender of the largest TCP segment it is possible
to receive the calculation of the MSS value to send is:
MSS = MTU - sizeof(TCPHDR) - sizeof(IPHDR)
On receipt of the MSS option the calculation of the size of segment
that can be sent is:
SndMaxSegSiz = MIN((MTU - sizeof(TCPHDR) - sizeof(IPHDR)), MSS)
where MSS is the value in the option, and MTU is the Maximum
Transmission Unit (or the maximum packet size) allowed on the
directly attached network.
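The send-side calculation can be sketched as follows, on the reading that the sender uses the smaller of the announced MSS and what its own directly attached network allows. The function name and parameters are hypothetical; 20-octet headers and a default MSS of 536 (used when no option was received) are assumed.

```python
def snd_max_seg_size(mtu, mss_option=None, ip_hdr=20, tcp_hdr=20, default_mss=536):
    """Largest number of data octets the sender should put in one segment."""
    local_limit = mtu - tcp_hdr - ip_hdr  # what fits in one local packet
    mss = default_mss if mss_option is None else mss_option
    return min(local_limit, mss)


print(snd_max_seg_size(1500, mss_option=1460))  # both ends on 1500-octet networks
print(snd_max_seg_size(1500))                   # no MSS option received
print(snd_max_seg_size(576, mss_option=1460))   # local network is the bottleneck
```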
This begs the question, though. What value should be used for the
"sizeof(TCPHDR)" and for the "sizeof(IPHDR)"?
There are three reasonable positions to take: the conservative, the
moderate, and the liberal.
The conservative or pessimistic position assumes the worst -- that
both the IP header and the TCP header are maximum size, that is, 60
octets each.
MSS = MTU - 60 - 60 = MTU - 120
If MTU is 576 then MSS = 456
The moderate position assumes that the IP header is maximum size (60
octets) and the TCP header is minimum size (20 octets), because there
are no TCP header options currently defined that would normally be
sent at the same time as data segments.
MSS = MTU - 60 - 20 = MTU - 80
If MTU is 576 then MSS = 496
The liberal or optimistic position assumes the best -- that both the
IP header and the TCP header are minimum size, that is, 20 octets
MSS = MTU - 20 - 20 = MTU - 40
If MTU is 576 then MSS = 536
If nothing is said about MSS, the data sender may cram as much as
possible into a 576 octet datagram, and if the datagram has
minimum headers (which is most likely), the result will be 536
data octets in the TCP segment. The rule relating MSS to the
maximum datagram size ought to be consistent with this.
A practical point is raised in favor of the liberal position too.
Since the use of minimum IP and TCP headers is very likely in the
very large percentage of cases, it seems wasteful to limit the TCP
segment data to so much less than could be transmitted at once,
especially since it is less than 512 octets.
For comparison: 536/576 is 93% data, 496/576 is 86% data, 456/576
is 79% data.
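The three positions and the comparison above can be reproduced with a few lines of arithmetic. The names `POSITIONS` and `mss_for` are hypothetical; the header sizes are the ones stated in the text.

```python
# position -> (assumed IP header size, assumed TCP header size), in octets
POSITIONS = {
    "conservative": (60, 60),
    "moderate": (60, 20),
    "liberal": (20, 20),
}


def mss_for(mtu, position):
    """MSS = MTU - sizeof(IPHDR) - sizeof(TCPHDR) under the given position."""
    ip_hdr, tcp_hdr = POSITIONS[position]
    return mtu - ip_hdr - tcp_hdr


for name in POSITIONS:
    mss = mss_for(576, name)
    print(f"{name:12s} MSS = {mss}  data share = {mss / 576:.0%}")
```

With an MTU of 576 this prints MSS values of 456, 496, and 536, with data shares of 79%, 86%, and 93%.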
8. Maximum Packet Size
Each network has some maximum packet size, or maximum transmission
unit (MTU). Ultimately there is some limit imposed by the
technology, but often the limit is an engineering choice or even an
administrative choice. Different installations of the same network
product do not have to use the same maximum packet size. Even within
one installation not all hosts must use the same packet size (this way
lies madness, though).
Some IP implementers have assumed that all hosts on the directly
attached network will be the same or at least run the same
implementation. This is a dangerous assumption. It has often
developed that after a small homogeneous set of hosts has become
operational additional hosts of different types are introduced into
the environment. And it has often developed that it is desired to
use a copy of the implementation in a different inhomogeneous
environment.
Designers of gateways should be prepared for the fact that successful
gateways will be copied and used in other situations and
installations. Gateways must be prepared to accept datagrams as
large as can be sent in the maximum packets of the directly attached
networks. Gateway implementations should be easily configured for
installation in different circumstances.
A footnote: The MTUs of some popular networks (note that the actual
limit in some installations may be set lower by administrative
policy):
ARPANET, MILNET = 1007
Ethernet (10Mb) = 1500
Proteon PRONET = 2046
9. Source Fragmentation
A source host would not normally create datagram fragments. Under
normal circumstances datagram fragments only arise when a gateway
must send a datagram into a network with a smaller maximum packet
size than the datagram. In this case the gateway must fragment the
datagram (unless it is marked "don't fragment" in which case it is
discarded, with the option of sending an ICMP message to the source
reporting the problem).
It might be desirable for the source host to send datagram fragments
if the maximum segment size (default or negotiated) allowed by the
data receiver were larger than the maximum packet size allowed by the
directly attached network. However, such datagram fragments must not
combine to a size larger than allowed by the destination host.
For example, if the receiving TCP announced that it would accept
segments up to 5000 octets (in cooperation with the receiving IP)
then the sending TCP could give such a large segment to the
sending IP provided the sending IP would send it in datagram
fragments that fit in the packets of the directly attached
network.
There are some conditions where source host fragmentation would be
necessary.
If the host is attached to a network with a small packet size (for
example 256 octets), and it supports an application defined to
send fixed sized messages larger than that packet size (for
example TFTP [5]).
If the host receives ICMP Echo messages with data it is required
to send an ICMP Echo-Reply message with the same data. If the
amount of data in the Echo were larger than the packet size of the
directly attached network the following steps might be required:
(1) receive the fragments, (2) reassemble the datagram, (3)
interpret the Echo, (4) create an Echo-Reply, (5) fragment it, and
(6) send the fragments.
10. Gateway Fragmentation
Gateways must be prepared to do fragmentation. It is not an optional
feature for a gateway.
Gateways have no information about the size of datagrams destination
hosts are prepared to accept. It would be inappropriate for gateways
to attempt to keep such information.
Gateways must be prepared to accept the largest datagrams that are
allowed on each of the directly attached networks, even if it is
larger than 576 octets.
Gateways must be prepared to fragment datagrams to fit into the
packets of the next network, even if it is smaller than 576 octets.
If a source host thought to take advantage of the local network's
ability to carry larger datagrams but doesn't have the slightest idea
if the destination host can accept larger than default datagrams and
expects the gateway to fragment the datagram into default size fragments, then the source host is misguided. If indeed, the
destination host can't accept larger than default datagrams, it
probably can't reassemble them either. If the gateway either passes
on the large datagram whole or fragments into default size fragments
the destination will not accept it. Thus, this mode of behavior by
source hosts must be outlawed.
A larger than default datagram can only arrive at a gateway because
the source host knows that the destination host can handle such large
datagrams (probably because the destination host announced it to the
source host in a TCP MSS option). Thus, the gateway should pass on
this large datagram in one piece or in the largest fragments that fit
into the next network.
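Passing a datagram on "in the largest fragments that fit into the next network" can be sketched as below. This is an illustration under stated assumptions, not a gateway implementation: `fragment` is a hypothetical function, IP headers are taken to be the minimum 20 octets with no options, and the sketch relies on the IP rule (RFC 791) that fragment offsets are counted in units of 8 octets, so every fragment except the last carries a multiple of 8 data octets.

```python
def fragment(data_len, next_mtu, ip_hdr_len=20):
    """Split data_len octets of datagram data for a network with the given MTU.

    Returns a list of (offset_in_octets, length, more_fragments) tuples.
    """
    if data_len + ip_hdr_len <= next_mtu:
        return [(0, data_len, False)]  # fits whole, pass on in one piece
    # Largest data size per fragment, rounded down to a multiple of 8 octets.
    max_data = (next_mtu - ip_hdr_len) // 8 * 8
    if max_data <= 0:
        raise ValueError("MTU too small to carry any data")
    fragments = []
    offset = 0
    while offset < data_len:
        length = min(max_data, data_len - offset)
        more = offset + length < data_len
        fragments.append((offset, length, more))
        offset += length
    return fragments


# A 1500-octet datagram (20 header + 1480 data) entering a 576-octet network.
for frag in fragment(1480, 576):
    print(frag)
```

For this example the data is carried as 552 + 552 + 376 octets, each fragment with its own 20-octet IP header.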
An interesting footnote is that even though the gateways may know
about the 576 rule, it is irrelevant to them.
11. Inter-Layer Communication
The Network Driver (ND) or interface should know the Maximum
Transmission Unit (MTU) of the directly attached network.
The IP should ask the Network Driver for the Maximum Transmission
Unit.
The TCP should ask the IP for the Maximum Datagram Data Size (MDDS).
This is the MTU minus the IP header length (MDDS = MTU - IPHdrLen).
When opening a connection TCP can send an MSS option with the value
equal to MDDS - TCPHdrLen.
TCP should determine the Maximum Segment Data Size (MSDS) from either
the default or the received value of the MSS option.
TCP should determine if source fragmentation is possible (by asking
the IP) and desirable.
If so TCP may hand to IP segments (including the TCP header) up to
MSDS + TCPHdrLen.
If not TCP may hand to IP segments (including the TCP header) up
to the lesser of (MSDS + TCPHdrLen) and MDDS.
IP checks the length of data passed to it by TCP. If the length is
less than or equal to MDDS, IP attaches the IP header and hands it to
the ND. Otherwise the IP must do source fragmentation.
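The chain of questions between the layers can be written as three small functions. This is a sketch only: the function names are hypothetical, the quantities (MTU, MDDS, MSS, MSDS, TCPHdrLen) are those defined above, and minimum 20-octet headers are assumed as defaults.

```python
def mdds(mtu, ip_hdr_len=20):
    """Maximum Datagram Data Size: what IP reports to TCP."""
    return mtu - ip_hdr_len


def mss_to_announce(mtu, ip_hdr_len=20, tcp_hdr_len=20):
    """Value TCP can send in the MSS option when opening a connection."""
    return mdds(mtu, ip_hdr_len) - tcp_hdr_len


def max_handoff(msds, mtu, can_fragment, ip_hdr_len=20, tcp_hdr_len=20):
    """Largest segment (TCP header included) that TCP may hand to IP."""
    segment = msds + tcp_hdr_len
    if can_fragment:
        return segment  # IP will do source fragmentation if needed
    return min(segment, mdds(mtu, ip_hdr_len))


print(mss_to_announce(1500))                         # 1460 on a 1500-octet network
print(max_handoff(5000, 1500, can_fragment=True))    # 5020
print(max_handoff(5000, 1500, can_fragment=False))   # 1480
```

Here `msds` is either the default or the value received in the peer's MSS option, so the two directions of a connection can end up with different limits.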
12. What is the Default MSS?
Another way of asking this question is "What transmitted value for
MSS has exactly the same effect as not transmitting the option at
all?"
In terms of the previous section:
The default assumption is that the Maximum Transmission Unit is
576 octets.
MTU = 576
The Maximum Datagram Data Size (MDDS) is the MTU minus the IP
header length.
MDDS = MTU - IPHdrLen = 576 - 20 = 556
When opening a connection TCP can send an MSS option with the
value equal to MDDS - TCPHdrLen.
MSS = MDDS - TCPHdrLen = 556 - 20 = 536
TCP should determine the Maximum Segment Data Size (MSDS) from
either the default or the received value of the MSS option.
Default MSS = 536, then MSDS = 536
TCP should determine if source fragmentation is possible and
desirable.
If so TCP may hand to IP segments (including the TCP header) up
to MSDS + TCPHdrLen (536 + 20 = 556).
If not TCP may hand to IP segments (including the TCP header)
up to the lesser of (MSDS + TCPHdrLen (536 + 20 = 556)) and
MDDS (556).
13. The Truth
The rule relating the maximum IP datagram size and the maximum TCP
segment size is:
TCP Maximum Segment Size = IP Maximum Datagram Size - 40
The rule must match the default case.
If the TCP Maximum Segment Size option is not transmitted then the
data sender is allowed to send IP datagrams of maximum size (576)
with a minimum IP header (20) and a minimum TCP header (20) and
thereby be able to stuff 536 octets of data into each TCP segment.
The definition of the MSS option can be stated:
The maximum number of data octets that may be received by the
sender of this TCP option in TCP segments with no TCP header
options transmitted in IP datagrams with no IP header options.
14. The Consequences
When TCP is used in a situation when either the IP or TCP headers are
not minimum and yet the maximum IP datagram that can be received
remains 576 octets then the TCP Maximum Segment Size option must be
used to reduce the limit on data octets allowed in a TCP segment.
For example, if the IP Security option (11 octets) were in use and
the IP maximum datagram size remained at 576 octets, then the TCP
should send the MSS with a value of 525 (536-11).
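The adjustment for header options can be sketched as follows. `mss_to_send` and its parameters are hypothetical names. The `pad_to_words` flag is an added assumption, not part of the rule above: it rounds each option length up to a multiple of 4 octets, since IP and TCP header lengths are expressed in 32-bit words, and is off by default so that the sketch reproduces the figure in the text.

```python
def mss_to_send(max_datagram=576, ip_options=0, tcp_options=0, pad_to_words=False):
    """MSS value to announce when options enlarge the IP or TCP header."""
    if pad_to_words:
        # Round option lengths up to whole 32-bit words (assumption, see lead-in).
        ip_options = -(-ip_options // 4) * 4
        tcp_options = -(-tcp_options // 4) * 4
    return max_datagram - 40 - ip_options - tcp_options


print(mss_to_send(ip_options=11))                     # 525, as in the example above
print(mss_to_send(ip_options=11, pad_to_words=True))  # 524 with padding counted
```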
[1] Postel, J., ed., "Transmission Control Protocol - DARPA Internet
Program Protocol Specification", RFC 793, USC/Information
Sciences Institute, September 1981.
[2] Postel, J., ed., "Internet Protocol - DARPA Internet Program
Protocol Specification", RFC 791, USC/Information Sciences
Institute, September 1981.
[3] Postel, J., "Internet Control Message Protocol - DARPA Internet
Program Protocol Specification", RFC 792, USC/Information
Sciences Institute, September 1981.
[4] Plummer, D., "An Ethernet Address Resolution Protocol or
Converting Network Protocol Addresses to 48-bit Ethernet
Addresses for Transmission on Ethernet Hardware", RFC 826,
MIT/LCS, November 1982.
[5] Sollins, K., "The TFTP Protocol (Revision 2)", RFC 783, MIT/LCS,