16.2. How It All Works

16.2.1. Internet Communication between Client and Server

The HTTP Server

We discuss the client/server model and the TCP/IP protocols for regulating network operations in Chapter 20, "Send It Over the Net and Sock It to 'Em!" On the Internet, communication is also handled by a TCP/IP connection, and the Web is based on this model. The server side responds to client (browser) requests and provides feedback by sending back a document, by executing a CGI program, or by issuing an error message. The network protocol that the Web uses so that the server and client know how to talk to each other is the Hypertext Transfer Protocol, or HTTP. HTTP does not replace TCP/IP; it runs on top of it. HTTP objects are mapped onto the transport data units, a process that is beyond the scope of this discussion and goes unnoticed by the typical Web user. (See www.cis.ohio-state.edu/cgi-bin/rfc/rfc2068.html for a technical description of HTTP.) HTTP was built for the Web to handle hypermedia information; it is object oriented and stateless. In object-oriented terminology, the documents and files are called objects, and the operations that are associated with HTTP are called methods. When a protocol is stateless, neither the client nor the server stores information about the other between requests; each manages its own state information.

Once a TCP/IP connection is established between the Web server and client, the client requests some service from the server. Web servers normally listen on the well-known TCP port 80. The client tells the server what types of data it can handle by sending Accept headers with its requests. For example, one client may accept only HTML text, whereas another client might accept sounds and images as well as text. The server will try to handle the request (requests and responses are in ASCII text) and send back whatever information it can to the client (browser).

Example 16.1.

(Client's (Browser) Request)
     GET /pub HTTP/1.1
     Connection: Keep-Alive
     User-Agent: Mozilla/4.0 Gold
     Host: servername.com
     Accept: image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, */*

Example 16.2.

(Server's Response)
     HTTP/1.1 200 OK
     Server: Apache/1.2b8
     Date: Mon, 22 Jan 2007 13:43:22 GMT
     Last-modified: Mon, 01 Dec 2007 12:15:33
     Content-length: 288
     Accept-Ranges: bytes
     Connection: close
     Content-type: text/html

     <HTML><HEAD><TITLE>Hello World!</TITLE>
     ---continue with body---
     </HTML>
     Connection closed by foreign host.

The response consists of a status line, a header, and data. The status line confirms what HTTP version was used and gives the status code describing the result of the server's attempt (did it succeed or fail?). The header part of the message indicates what type of data is being returned (for example, the content type may be text/html) and how many bytes are being sent. The data part contains the actual text being sent.
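The three parts of a response described above can be pulled apart with a few lines of code. The following is a minimal sketch in Python, not a full HTTP parser; the function name parse_response and the shortened sample text are assumptions made for illustration, loosely based on Example 16.2.

```python
# A shortened response, modeled on Example 16.2 (header lines end in CRLF,
# and an empty line separates the header from the data).
RAW_RESPONSE = (
    "HTTP/1.1 200 OK\r\n"
    "Server: Apache/1.2b8\r\n"
    "Content-length: 288\r\n"
    "Connection: close\r\n"
    "Content-type: text/html\r\n"
    "\r\n"
    "<HTML><HEAD><TITLE>Hello World!</TITLE>"
)


def parse_response(raw):
    """Return (version, status_code, reason, headers, body) for a raw response."""
    head, _, body = raw.partition("\r\n\r\n")
    status_line, *header_lines = head.split("\r\n")
    version, code, reason = status_line.split(" ", 2)
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        # Header names are case-insensitive, so store them in lowercase.
        headers[name.strip().lower()] = value.strip()
    return version, int(code), reason, headers, body


version, code, reason, headers, body = parse_response(RAW_RESPONSE)
print(version, code, reason)
print(headers["content-type"], headers["content-length"])
```

Storing header names in lowercase is a design choice for this sketch: it lets "Content-type" and "Content-Type" be looked up the same way.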

The user then sees a formatted page on the screen, which may contain highlighted hyperlinks to some other page. Regardless of whether the user clicks on a hyperlink, once the document is displayed, that transaction is completed, and the TCP/IP connection will be closed. Once closed, a new connection will be started if there is another request. What happened in the last transaction is of no interest to either client or server; in other words, the protocol is stateless.

HTTP is also used to communicate between browsers, proxies, and gateways to other Internet systems supported by FTP, Gopher, WAIS, and NNTP protocols.

HTTP Status Codes and the Access Log File

When the server responds to the client, it sends information that includes the way it handled the request. Most Web browsers handle these codes silently if they fall in the range between 100 and 300. The codes within the 100 range are informational, indicating that the client's request is being processed. The most common status code is 200, indicating success, which means the information requested was accepted and fulfilled.

Check your server's access log to see what status codes were sent by your server after a transaction was completed.^[1] The following example consists of excerpts taken from the Apache server's access log, called access.log. This log reports information about a request handled by the server and the status code generated as a result of the request.

^[1] For more detailed information on status codes, see www.w3.org/Protocols/HTTP/HTRESP.html.

Table 16.1. HTTP Status Codes
Status Code	Message
100	Continue
200	Success, OK
204	No Content
301	Document Moved
304	Document Not Modified, No Message Body
400	Bad Request
401	Unauthorized
403	Forbidden
404	Not Found
405	Method Not Allowed
500	Internal Server Error
501	Not Implemented
503	Service Unavailable

See http://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html.
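The first digit of a status code identifies its general class, which is how a client can react sensibly even to a code it does not know in detail. The sketch below, in Python, groups codes by that first digit; the describe function and the CLASSES table are names invented for this illustration, and the reason phrases come from Python's standard library, so they may be worded differently from the messages in Table 16.1.

```python
from http import HTTPStatus

# The class of a status code is determined by its first digit.
CLASSES = {
    1: "informational",
    2: "success",
    3: "redirection",
    4: "client error",
    5: "server error",
}


def describe(code):
    """Return a short description such as '404 Not Found (client error)'."""
    category = CLASSES.get(code // 100, "unknown")
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        # The code is not one the standard library knows about.
        phrase = "Unknown"
    return f"{code} {phrase} ({category})"


for code in (100, 200, 304, 404, 500):
    print(describe(code))
```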

Example 16.3.

(From Apache's Access log)
1  127.0.0.1 - - [22/May/2007:20:50:42 -0700] "GET /cgi-bin/firstcgi.pl HTTP/1.1" 200 235
2  127.0.0.1 - - [22/May/2007:20:50:43 -0700] "GET /Williewonker.jpg HTTP/1.1" 304 -
3  127.0.0.1 - - [22/May/2007:20:50:52 -0700] "GET /cgi-bin/env.plx HTTP/1.1" 500 623

Explanation

1. The remote host (the client that made the request) is 127.0.0.1, followed by two dashes indicating unknown values: the client's identity and the authenticated user ID. Next comes the time the request was logged. The type of request is GET (see "The GET Method" on page 541), and the file accessed was firstcgi.pl. The protocol is HTTP/1.1. The status code sent by the server was 200, indicating success; the request was fulfilled. The number of bytes sent was 235.
2. The status code 304 indicates that the requested document has not been modified since the client last retrieved it, so the server sends no message body. The document, in this case, is a JPEG image file.
3. The status code 500 indicates an Internal Server Error, meaning that there was some internal error, such as a syntax error in the Perl program or an incorrect #! line, which must contain the full path to your Perl interpreter. The browser's request was not fulfilled. The number of bytes sent was 623.
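If you want to examine an access log with a program instead of by eye, each entry can be split into the fields just described. The following Python sketch assumes the entries follow the layout shown in Example 16.3; the pattern and the parse_log_line function are written for this illustration and are not part of Apache. A server configured with a different log format would need a different pattern.

```python
import re

# Fields: remote host, identity, user ID, [time], "request line", status, bytes.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)


def parse_log_line(line):
    """Return a dict of fields for one log entry, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    entry["status"] = int(entry["status"])
    # A dash in the size field means no bytes were sent in the body.
    entry["size"] = 0 if entry["size"] == "-" else int(entry["size"])
    return entry


line = ('127.0.0.1 - - [22/May/2007:20:50:42 -0700] '
        '"GET /cgi-bin/firstcgi.pl HTTP/1.1" 200 235')
entry = parse_log_line(line)
print(entry["host"], entry["method"], entry["path"], entry["status"], entry["size"])
```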

The URL (Uniform Resource Locator)

URLs are what you use to get around on the Web. You click on a link and you are transported to some new page, or you type a URL in the browser's Location box and a file opens up or a script runs. A URL is a virtual address that specifies the location of pages, objects, scripts, etc. It begins with the name of an existing protocol, such as HTTP, Gopher, FTP, mailto, file, Telnet, or news (see Table 16.2). A typical URL for the popular Web HTTP protocol looks like this:

http://www.comp.com/dir/text.html

Table 16.2. Web Protocols
Protocol	Function	Example
http:	HyperText Transfer Protocol; open Web page or start CGI script	http://www.nnic.noaa.gov/cgi-bin/netcast.cgi
ftp:	File Transfer Protocol	ftp://jague.gsfc.nasa.gov/pub
mailto:	Mail protocol by e-mail address	mailto:debbiej@aol.com
file:	Open a local file	file:///opt/apache/htdocs/file.html
telnet:	Open a Telnet session	telnet://nickym@netcom.com
news:	Open a news session by news server name or address	news:alt.fan.john-lennon

The two basic pieces of information provided in the URL are the protocol, http, and the data needed by the protocol, www.comp.com/dir/text.html. The parts of the URL are further defined in Table 16.3.

Table 16.3. Parts of a URL
Part	Description
protocol	Service such as HTTP, Gopher, FTP, Telnet, news, etc.
host/IP number	DNS host name or its IP number
port	TCP port number used by server, normally port 80
path	Path and filename reference for the object on a server
parameters	Specific parameters used by the object on a server
query	The query string for a CGI script
fragment	Reference to subset of the object

The default HTTP network port is 80; if an HTTP server resides on a different network port, say 12345 on www.comp.com, the URL becomes

http://www.comp.com:12345/dir/text.html
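The parts of a URL listed in Table 16.3 can be seen by splitting a URL in a program. The sketch below uses Python's standard urllib.parse module; the parameters, query, and fragment in the sample URL (type=a, name=Fred, top) are made up for illustration, and the port is separated from the hostname by a colon.

```python
from urllib.parse import urlparse

# A hypothetical URL that contains every part named in Table 16.3.
url = "http://www.comp.com:12345/dir/text.html;type=a?name=Fred#top"
parts = urlparse(url)

print("protocol:  ", parts.scheme)
print("host:      ", parts.hostname)
print("port:      ", parts.port)
print("path:      ", parts.path)
print("parameters:", parts.params)
print("query:     ", parts.query)
print("fragment:  ", parts.fragment)

# When the URL names no port, none is reported; the browser falls back
# to the default for the protocol (80 for HTTP).
default_port = urlparse("http://www.comp.com/dir/text.html").port
print("port when omitted:", default_port)
```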

Not all parts of a URL are necessary. If you are typing the URL of a document in the Location box of the Netscape browser, the URL may not need the port number, parameters, query, or fragment parts. If the URL is part of a hotlink in the HTML document, it may contain a relative path to the next document, that is, a path relative to the location of the current document. If the user has filled in a form, the URL may contain information appended after a question mark. The appearance of the URL really depends on what protocol you are using and what operation you are trying to accomplish.

Example 16.4.

1  http://www.cis.ohio-state.edu/htbin/rfc2068.html
2  http://127.0.0.1/Sample.html
3  ftp://ptgp023@ptgpftp.pearsoned.com/quigley
4  file:///c:/wamp/www/family.jpg
5  http://localhost/cgi-bin/form.cgi?name=Fred+Thompson

Explanation

1. The protocol is http.
   The hostname www.cis.ohio-state.edu consists of the following:^[a]
      The hostname, translated to an IP address by the Domain Name System (DNS).
      The domain name, ohio-state.edu.
      The top-level domain name, edu.
   The directory where the HTML file is stored is htbin.
   The file to be retrieved is rfc2068.html, an HTML document.
2. The protocol is http.
   The IP address is used instead of the hostname; this is the IP address for a local host.
   The file is in the server's document root. The file consists of HTML text.
3. The protocol is ftp.
   The ftp server is ptgpftp.pearsoned.com.
   The top-level domain is com.
   The directory is quigley.
4. The protocol is file. A local file will be opened.
   The hostname is missing, so the URL refers to the local host.
   The full path to the file family.jpg is listed.
5. The information after the question mark is the query part of the URL, which may have resulted from submitting input into a form. The query string is URL encoded. In this example, a plus sign has replaced the space between Fred and Thompson. The server stores this query in an environment variable called QUERY_STRING. It will be passed on to a CGI program called from the HTML document. (See "The GET Method" on page 541.)

^[a] Most Web servers run on hostnames starting with www, but this is only a convention.
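The URL encoding of the query string in the last URL of Example 16.4 can be reversed in a program. The following Python sketch imitates what a CGI program would do with the QUERY_STRING environment variable; the form_fields function and the fake_environ dictionary are stand-ins invented for this illustration, since no real server is involved here.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def form_fields(environ):
    """Decode the URL-encoded query that a server would place in QUERY_STRING."""
    return parse_qs(environ.get("QUERY_STRING", ""))


url = "http://localhost/cgi-bin/form.cgi?name=Fred+Thompson"
query = urlsplit(url).query            # the part after the question mark

# Stand-in for the environment a server would hand to a CGI program.
fake_environ = {"QUERY_STRING": query}
fields = form_fields(fake_environ)

print(query)
print(fields["name"][0])               # the plus sign is decoded to a space

# Going the other way: encoding form input turns the space back into a plus.
encoded = urlencode({"name": "Fred Thompson"})
print(encoded)
```

Each value comes back in a list because a form may submit the same field name more than once.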

File URLs and the Server's Root Directory

If the protocol used in the URL is file, the browser assumes that the file is on the local machine. A full pathname, ending in the filename, is included in the URL. When the protocol is followed by a server name, all pathnames are relative to the document root of the server. The server root is the server's main directory, where the configuration, error, and log files are kept. The document root is the directory where you store HTML documents, images, and any other documents that will be served up by the server; e.g., a directory called "htdocs" or "www".

The leading slash that precedes the path is not really part of the path as it is in a UNIX absolute path, which starts at the root directory. Rather, the leading slash is used to separate the path from the hostname. An example of a URL leading to a document in the server's document root:

     http://localhost/index.html

The full pathname for this might be

     C:/wamp/www/index.html

A shorthand method for linking to a document on the same server is called a partial, or relative, URL. For example, if a document at http://www.myserver/stories/webjoke.html contains a link to images/webjoke.gif, this is a relative URL. The browser will expand the relative URL to its absolute URL, http://www.myserver/stories/images/webjoke.gif, and make a request for that document if asked.
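The way a browser expands a relative URL can be reproduced with the urljoin function from Python's standard urllib.parse module. This is only a sketch of the expansion step, using the URLs from the paragraph above; the last two links (/index.html and ../index.html) are extra examples added for illustration.

```python
from urllib.parse import urljoin

# The document that contains the link.
base = "http://www.myserver/stories/webjoke.html"

# A relative URL is resolved against the directory of the containing document.
image_url = urljoin(base, "images/webjoke.gif")
print(image_url)

# A link that starts with a slash is resolved against the server's document root.
root_url = urljoin(base, "/index.html")
print(root_url)

# Two dots move up one directory from the containing document.
parent_url = urljoin(base, "../index.html")
print(parent_url)
```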