Glossary L to R

Log archive Log files Log formats MIME Ping Port Proxy Queries Rank Referral RFC Robot

Log File Archive

Archiving Access Activity

Many web servers keep their raw log files only for a limited time period. This information may be crucial in determining what has happened at a particular time. If the web server crashes or the log files get lost then you need the security of an off-site log backup. Hosting companies may well back up the web site files but not the logs: the logs can grow to a very large size, so backing them up is an expensive option for a host to provide. Some servers will keep only the previous month or week of logs on the site before they are replaced by the next period's.

Web servers normally store information about the accesses made to the site in a log file archive. This enables a web master to investigate problems. Standard web hosting companies generate their 'free' access statistics by processing these log files to work out what is being accessed on the site and when.

Log Files

Web Server Log Files

A web site host runs a special service that handles HTTP. HTTP is the standard way that HTML pages are transferred to browsers over the Internet. Each HTTP request received by the server is typically a request for the contents of a particular web page or graphics file. The server logs every request, writing each new request or 'hit' as a separate one-line record. The server log file is the source of information used to generate the web site statistics offered by most web hosting companies. It gives information about the date and time, source IP address, data requested, the referring web page and browser. The referral data give vital information about the links which people are using to reach a web site. The data requested is the URL of the resource on the site. The referrer is the full URL of the page that linked to the site; when that page is a search engine results page, the URL quite often includes the keywords that the visitor searched for.

Here is a sample line from a log file :


69.125.123.175 - - [29/Jan/2006:01:32:03 -0600] "GET /africa.htm HTTP/1.1" 200 5456 "http://search.yahoo.com/search?_adv_prop=web&x=op&ei=UTF-8&prev_vm=p&va=scams&va_vt=any&vp=south+africa&vp_vt=any&vo=sigcau&vo_vt=any" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"

The meaning of each of these fields is as follows :

69.125.123.175	The IP address of the person or robot requesting the data
-	The identity of the client as reported by an ident lookup (in this case none was available)
-	The user name given to access the resource (in this case none was given); passwords are not written to the log
[29/Jan/2006:01:32:03 -0600]	The date and time that the request occurred, including the offset from UTC (6 hours behind in this case)
GET	The HTTP access method for the resource, GET means read the whole resource
/africa.htm	Identifies the resource to be fetched, in this case it is an HTML page called africa.htm located in the root folder
HTTP/1.1	The protocol used to fetch the resource, in this case it is version 1.1 of HTTP
200	HTTP status code returned, see errors for details of these
5456	The size of the data returned in bytes
http://search.yahoo.com/...	The page which linked to the resource. This is the referrer field, and in this case is from the Yahoo! search engine ➚.
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)	Identifies the browser or robot that accessed the page. In this case it is Internet Explorer® running on Windows® XP.
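
The field breakdown above can be turned into a small parser. The following is a minimal sketch, assuming the log uses the layout of the sample line; the pattern, the function name parse_log_line and the shortened referrer in the sample are illustrative choices, not part of any server's API.

```python
import re
from datetime import datetime

# Pattern for the layout of the sample line above (an assumption: other
# servers and configurations may order or quote the fields differently).
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_log_line(line):
    """Return a dict of fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["status"] = int(fields["status"])
    # A "-" size means no body was returned.
    fields["size"] = 0 if fields["size"] == "-" else int(fields["size"])
    # %z keeps the time zone offset, here -0600.
    fields["time"] = datetime.strptime(fields["time"], "%d/%b/%Y:%H:%M:%S %z")
    return fields

# The sample line, with the referrer query string shortened for readability.
sample = (
    '69.125.123.175 - - [29/Jan/2006:01:32:03 -0600] '
    '"GET /africa.htm HTTP/1.1" 200 5456 '
    '"http://search.yahoo.com/search?va=scams&vp=south+africa" '
    '"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"'
)
record = parse_log_line(sample)
print(record["ip"], record["path"], record["status"], record["size"])
```

Returning None for lines that do not match lets a caller skip damaged records rather than stop processing the whole file.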

Log File Formats

Logging accesses to Web sites

Web servers normally store information about the accesses made to the site in a log file. This enables a web master to investigate problems, and allows statistics to be generated showing what is being accessed on the site and when.

Different servers (Microsoft® IIS, Apache®, ...) use different log formats, with two main formats in use. However, they are all text based, and each access to a resource (HTML page or graphics file) is recorded as a single line of information (a hit). A web site administrator can usually control how much information goes into the log file, as the full set of information is rather large. Each log file record may contain the following fields of information :

Date and Time	Date and time of the record, possibly including time zone information
Client IP	The IP address of the requesting client agent (may not be the same as used by previous accesses)
User	The authenticated name used to access the site (typical access is anonymous)
Server Site	The web site name of the web server
Host Site	The domain name of the site being hosted
Computer	The name of the computer running the web server (for large sites multiple computers may service the same web address)
Server IP	The IP address of the web server
Method	The type of request - typically a GET to fetch information for display
URL	The URL address of the information being requested from the web site
Query	Any query string associated with the request (typically follows a ? in the URL)
Status	Server's response code that is sent back to client (200 for success, 404 not found etc.)
Req Size	Size of the request issued by the client in bytes
Resp Size	Size of the response (typically a file) sent to the client as a response to the request
Resp Time	How long the server took to process the request
Port	The machine port address used to access the server, typically 80 for HTTP
Protocol	The protocol used by the client, typically HTTP/1.0 or HTTP/1.1
Browser	A long string describing the type of browser the client used to issue request, often includes platform information
Cookie	Any associated cookie value submitted by client
Referrer	Where the client's request came from, an external site or an internal page reference.

Some servers will archive log files in compressed format (e.g. ZIP, GZ or CAB).
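
As an illustration of working with archived logs, here is a minimal sketch that opens a plain or GZ-compressed log and counts the Status field. It assumes the status code is the first token after the quoted request, as in the sample line earlier; the helper names open_log and count_status_codes, and the three log lines, are made up for the example.

```python
import gzip
import io
from collections import Counter

def open_log(path):
    """Open a plain or gzip-compressed log file as text."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt", encoding="utf-8", errors="replace")
    return open(path, "r", encoding="utf-8", errors="replace")

def count_status_codes(lines):
    """Count hits per status code.

    Assumes the status code is the first token after the quoted request.
    """
    counts = Counter()
    for line in lines:
        parts = line.split('"')
        if len(parts) < 3:
            continue  # not a recognisable record
        tokens = parts[2].split()
        if tokens:
            counts[tokens[0]] += 1
    return counts

# In-memory demonstration with invented records, so no real file is needed.
raw = (
    b'192.0.2.1 - - [29/Jan/2006:01:32:03 -0600] "GET /africa.htm HTTP/1.1" 200 5456\n'
    b'192.0.2.2 - - [29/Jan/2006:01:32:09 -0600] "GET /missing.htm HTTP/1.1" 404 312\n'
    b'192.0.2.1 - - [29/Jan/2006:01:33:41 -0600] "GET /index.htm HTTP/1.1" 200 2048\n'
)
compressed = gzip.compress(raw)
with gzip.open(io.BytesIO(compressed), "rt", encoding="utf-8") as log:
    status_counts = count_status_codes(log)
print(dict(status_counts))
```

For a log on disk, count_status_codes(open_log(path)) would do the same job; ZIP and CAB archives would need different tools.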

Log file format specification ➚
A log file fragment explained ➚

MIME

Specifying information format

The HTTP protocol transfers information as a stream of bytes, so it is up to the client and server to negotiate so that the client (typically a browser) is only sent information that it can understand. This negotiation is carried out using MIME (Multipurpose Internet Mail Extensions) types.

As the acronym suggests this was originally developed to describe the content of email messages but is now widely used within HTTP. It uses a simple two-part text description to describe the content format, consisting of a type and a subtype. So text/html indicates that it is basically text but the text is in HTML format, text/plain is for raw untagged text (as in a .txt file) and image/jpeg indicates a graphics file in JPEG image format.

When a browser requests data it states the MIME types it is willing to receive as a response (in the Accept request header); the server then chooses an available format for the response from these types.
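
The choice the server makes can be sketched as follows. This is a simplified illustration, assuming the browser sends its list in an Accept header with optional q (quality) values; it ignores the finer precedence rules of the HTTP specification, and the function names are hypothetical.

```python
def parse_accept(header):
    """Split an Accept header into (mime_type, quality) pairs, best first."""
    entries = []
    for part in header.split(","):
        pieces = [p.strip() for p in part.split(";")]
        mime_type = pieces[0]
        if not mime_type:
            continue
        quality = 1.0  # default when no q value is given
        for param in pieces[1:]:
            if param.startswith("q="):
                try:
                    quality = float(param[2:])
                except ValueError:
                    quality = 0.0
        entries.append((mime_type, quality))
    # sorted() is stable, so equal qualities keep their header order.
    return sorted(entries, key=lambda entry: entry[1], reverse=True)

def matches(pattern, mime_type):
    """True if a pattern such as text/* or */* covers the given type."""
    if pattern == "*/*":
        return True
    p_type, _, p_sub = pattern.partition("/")
    m_type, _, m_sub = mime_type.partition("/")
    return p_type == m_type and p_sub in ("*", m_sub)

def choose_type(accept_header, available):
    """Pick the first available type the client is willing to receive."""
    for pattern, quality in parse_accept(accept_header):
        if quality <= 0:
            continue
        for mime_type in available:
            if matches(pattern, mime_type):
                return mime_type
    return None

print(choose_type("text/html,application/xml;q=0.9,*/*;q=0.8",
                  ["image/jpeg", "text/html"]))
```

Returning None when nothing matches leaves it to the caller to decide how to respond to a client that cannot accept any available format.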

The MIME Information Page ➚
Media Types ➚
RFC2046 : Multipurpose Internet Mail Extensions 2 : Media Types ➚

Ping

Using Ping to check a Server is working

One important facet of site monitoring is knowing as soon as possible that servers have failed or are not accessible. A web server provides multiple services not just HTTP. Just because a server is not responding to HTTP requests does not imply it is not functioning at all.

The simplest means of establishing whether a site or server is alive is to use the Internet Control Message Protocol (ICMP) to Ping a server. This is a much simpler request than fetching an HTML page in terms of the communication overheads. It runs over the IP protocol and so checks that the IP part of TCP/IP is functioning OK. This protocol is also used by the tracert command line utility to find the route that communication is taking to a server.

The Ping connectivity check works well in an office Intranet situation too. It can regularly monitor whether the key servers and workstations are responding properly to IP traffic within an office local area network.

The protocol supports a number of message types but the ECHO request is the one of interest for Ping monitoring. It asks routers to pass the message over IP to a particular destination IP address, which sends an ECHO REPLY back. Measuring the time between issuing the ECHO and receiving the ECHO REPLY determines the responsiveness of the remote server. The packet carrying the echo reply includes a Time to Live (TTL) value. The sender sets an initial TTL (255 in many implementations, though some systems start lower) and each router the packet passes through decrements the value by one, so the difference between the initial and received values gives the number of router hops. If the number of hops is erratic or suddenly becomes large this indicates a router problem.
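
The hop count can be estimated from the TTL seen in a reply. The sketch below assumes the remote system started from one of a few commonly used initial TTL values (64, 128 or 255); that list is an assumption of this example, so the result is only an estimate.

```python
# Commonly used initial TTL values (an assumption: the true starting value
# depends on the operating system of the machine that sent the reply).
COMMON_INITIAL_TTLS = (64, 128, 255)

def estimate_hops(received_ttl):
    """Estimate router hops as (assumed initial TTL) - (received TTL)."""
    if not 0 < received_ttl <= 255:
        raise ValueError("TTL must be between 1 and 255")
    for initial in COMMON_INITIAL_TTLS:
        if received_ttl <= initial:
            # Each router on the path decremented the TTL by one.
            return initial - received_ttl
    raise ValueError("unreachable: 255 is the largest possible TTL")

for ttl in (243, 117, 52):
    print(ttl, "->", estimate_hops(ttl), "hops (estimated)")
```

Picking the smallest initial value that is not below the received TTL works because paths with very large hop counts are rare, but a reply that started at 255 and crossed an unusually long path would be misjudged.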

The same ECHO command can be used to trace a route over the Internet (as used in the tracert program ➚). In this case the TTL field is used to limit how many hops between routers the request can make before it fails. When the limit is reached a failure response is returned, carrying the IP address of the router at which the request expired. By increasing the TTL value one step at a time until the destination server is reached, all the routers along the path can be identified. Inspecting the time delay to each router along the communication path then helps to locate bottlenecks.

Port

Connecting to the correct Service Port

Each IP address can be accessed on a range of numeric port numbers. The port number requested is part of a client connect request and can be specified as part of a URL. When a URL omits the port number the default port number for that service is assumed (80 for HTTP web service). Ports map onto inter-communicating sockets. When a server socket is set up it chooses a port number on which to listen for requests (as part of the bind socket API call), and the client issues a connect to the server giving an IP address and a port.
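
The default port rule can be shown with a short sketch. It uses Python's standard urllib.parse module; the DEFAULT_PORTS table and the function name effective_port are choices made for this example, and www.example.com is a placeholder host.

```python
from urllib.parse import urlsplit

# Default ports for a few URL schemes (a small illustrative table).
DEFAULT_PORTS = {"http": 80, "https": 443, "ftp": 21}

def effective_port(url):
    """Return the port a client would connect to for the given URL."""
    parts = urlsplit(url)
    if parts.port is not None:
        return parts.port  # port given explicitly in the URL
    return DEFAULT_PORTS.get(parts.scheme)  # None if the scheme is unknown

print(effective_port("http://www.example.com/index.htm"))
print(effective_port("http://www.example.com:8080/index.htm"))
```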

In most cases the port number is assigned to a particular service, so the number is really acting as a name for the service that is required. For most users the only ports of interest on the Internet are the ones used for HTTP and FTP.

A more comprehensive list of standard ports is as follows :

Port	Name	Description
7	echo	simple echo service
13	daytime	find out server's clock setting
21	ftp	file transfer protocol
22	ssh	secure shell (also carries SFTP, the secure file transfer protocol)
23	telnet	terminal access service
25	smtp	email service
42	nameserver	host name server (an older service that predates DNS)
43	whois	who is service
53	DNS	domain name lookup service
70	gopher	predecessor to http
80	WWW	world wide web (HTTP)

Proxy

Indirect proxy access to the Internet

Originally each computer wishing to use the Internet had to connect to it directly. This is OK for servers or a home user dialing up for a connection but not convenient for an office environment where hundreds of PCs may want to use the Internet all at the same time. To solve this problem Proxy servers are used. These servers have a dedicated Internet connection and make requests on behalf of (as a proxy for) all the computers wishing to access the Internet through them.

Most browsers have connection settings that allow you to configure the IP address and port that is then used to communicate with a Proxy Server. HTTP requests are then sent using TCP/IP to the Proxy Server, which in turn sends them out onto the Internet. A proxy server may run on a separate machine (often in conjunction with a firewall) or as an ordinary program running on a PC. It needs to keep track of all client requests so it can route each response sent to it from a server back to the browser that requested the information.
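
Programs other than browsers can be pointed at a proxy in the same way. The following is a minimal sketch using Python's standard urllib.request module; the proxy address and port are hypothetical (192.0.2.10 is an address reserved for documentation), and no request is actually sent.

```python
import urllib.request

# Hypothetical proxy address and port, for illustration only.
PROXY_URL = "http://192.0.2.10:8080"

# Route HTTP requests made through this opener via the proxy.
proxy_handler = urllib.request.ProxyHandler({"http": PROXY_URL})
opener = urllib.request.build_opener(proxy_handler)

# A request made through the opener would go to the proxy first, e.g.:
# response = opener.open("http://www.example.com/")
print("HTTP requests would be sent via", proxy_handler.proxies["http"])
```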

Query

Getting information from a user

Queries are an important part of web requests. They are used to pass additional information to a server about the data requested. The most widespread usage is when an HTML Form is submitted (as an HTTP GET) and the various values entered on the form are sent as a query string tagged onto the end of the URL after a ?. A form submitted as an HTTP POST sends its values in the body of the request instead. Search engines such as Google ➚ use this mechanism to send the search phrase or keywords that the user typed in when the Search button is clicked. Each web site is free to use whichever parameter names it likes in the query string; there are few constraints it has to follow.
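
A query string can be taken apart and built with Python's standard urllib.parse module. This is a small sketch; the URL and the parameter names q and lang are made up for the example.

```python
from urllib.parse import urlsplit, parse_qs, urlencode

# Hypothetical URL; the parameter names are chosen by the web site.
url = "http://www.example.com/search?q=log+file+formats&lang=en"

# Everything after the ? is the query string.
query = urlsplit(url).query
params = parse_qs(query)  # each value is a list, as names may repeat
print(params)

# Going the other way: encode form values into a query string.
encoded = urlencode({"q": "log file formats", "lang": "en"})
print(encoded)
```

Spaces are sent as + signs in the query string, which is why the sample log line earlier shows south+africa in its referrer.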

Rank

Web Site Ranking

There are a number of Internet services that attempt to rank web sites in some sort of popularity order. As there are about 500 million web sites this is not an easy task and relative ranking scores are not to be totally trusted.

For example on this particular day the top six web sites according to Alexa ➚ are Yahoo!, MSN, Google, Passport, eBay, Microsoft.

It is not possible to look at individual web site traffic statistics in order to gauge popularity, because the server logs are not publicly accessible. Visitors who use the Google or Alexa Toolbars to reach web sites give these services one way to build up statistics in order to rank sites: each time the toolbar is used, the click is recorded and added to their database. Because only toolbar users are counted, all ranking measures are rather inaccurate, and it is best to treat them as a very rough indication of relative popularity only.

Google ➚ also includes a Page Rank figure as a rough estimate (score out of ten) of the importance of the page. Most web sites manage to score between 3 and 6 out of ten. Only very large and very popular web sites score 8 or more (currently CNN ➚ scores 8/10 and BBC News ➚ scores 9/10). View with suspicion any page with a rank below 3.

Referrals

Looking at how visitors find your site

Do you know how people are reaching your web site?

When a web site has a set of links to it, you get a referral when the user follows the link. This is usually by a person clicking on a link in a browser or else by an automated scanning engine or robot.

It is important to monitor the number of referrals coming to a site to be able to adjust the site content, and therefore the keywords, to attract more visitors. It may be that you are getting people to come to the site for the wrong reason, and they are not going to stay or come back again.

When a user follows an HTML link to get from one web page to another, the web server typically stores the page (local or remote) from which the new page was referred. Tracking how people get referred to a web site is crucial to measuring the effectiveness of a web site. Are people getting to the information quickly and easily? Are they only looking at one page and then leaving the web site? Which keywords and phrases are people using to reach your site?

Some firewalls give the option to remove the referral information in order to give the user more anonymity. If a web site needs to be sure it knows where it has been referenced from, the linking page needs to include this as part of the query part of the URL.

Referral monitoring allows the sources of external web traffic to be identified. Typically Google will be the largest source of referrals as most people find web sites from a search engine, and Google is by far the most popular one at present.

By using referrer information you can work out :

When another site has added a link to one of your pages.
When a site gets included in search engine databases.
Which keywords are being used to find your site on search engines.
When your site is mentioned in an online forum or blog.
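
The points above can be sketched as a small analysis of the referrer field. The example assumes the referrers have already been extracted from the log, and that search engines carry the search phrase in a query parameter named q or p; both the parameter list and the sample referrers are assumptions made for illustration.

```python
from collections import Counter
from urllib.parse import urlsplit, parse_qs

# Assumed parameter names that carry the search phrase; each search
# engine chooses its own, so a real list would need checking.
SEARCH_PARAMS = ("q", "p")

def referrer_hosts(referrers, own_host):
    """Count external referring hosts, skipping internal and empty ones."""
    counts = Counter()
    for ref in referrers:
        host = urlsplit(ref).hostname
        if host and host != own_host:
            counts[host] += 1
    return counts

def search_phrase(referrer):
    """Return the search phrase carried by a referrer URL, or None."""
    params = parse_qs(urlsplit(referrer).query)
    for name in SEARCH_PARAMS:
        if name in params:
            return params[name][0]
    return None

# Invented referrer values; "-" marks a request with no referrer.
referrers = [
    "http://www.google.com/search?q=web+log+analysis",
    "http://search.yahoo.com/search?p=log+files",
    "http://www.example.com/index.htm",
    "-",
    "http://www.google.com/search?q=ping",
]
hosts = referrer_hosts(referrers, "www.example.com")
phrases = [p for p in map(search_phrase, referrers) if p]
print(hosts.most_common())
print(phrases)
```

A host that appears in the counts for the first time is a hint that a new site, forum or search engine has started linking to your pages.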

RFC

Internet Standards and Proposals

The technical standards that govern how the various parts of the Internet function are documented as Request for Comments (RFC) documents. Although the name suggests that these are just early proposals, these documents include actual working standards for much of Internet technology. Some RFCs are experimental and some have been entirely superseded, so it is important to refer to the appropriate RFC.

They are co-ordinated by committees of Internet professionals. There are over three thousand RFCs in existence covering all aspects of the Internet. A good starting point is the Internet Engineering Task Force (IETF) web site www.ietf.org ➚. There are copies of the RFCs on different sites, including IETF and World Wide Web Consortium [W3C] ➚.

Here is a list of commonly referenced RFCs, but be warned that their technical nature can make them tough reading :

RFC777 : Internet Control Message Protocol ➚
RFC791 : Internet Protocol Specification ➚
RFC792 : Internet Control Message Protocol ➚
RFC959 : File Transfer Protocol [FTP] ➚
RFC1034 : Domain Names - Concepts and Facilities ➚
RFC1035 : Domain Names - Implementation and Specification ➚
RFC1180 : A TCP/IP Tutorial ➚
RFC2046 : Multipurpose Internet Mail Extensions 2 : Media Types ➚
RFC2133 : Basic Socket Interface Extensions for IPv6 ➚
RFC2616 : Hypertext Transfer Protocol HTTP/1.1 ➚
RFC2660 : The Secure HyperText Transfer Protocol ➚

Robot

Automated Internet Scan

On the Internet a robot is not some mechanical human-like servant but just a special type of computer program. Ever wondered how search engines build up their indices of web sites? Well, search engines, amongst other programs, use robots to continually trawl web sites, analyzing the contents as if they were human visitors using a browser. They use HTTP just like browsers in order to access information. A server can state which pages may be inspected by robots; the instructions are stored in the robots.txt file. Over time, search engines have grown much more sophisticated and the way that they scan sites is complex. Many will first scan the site's index page, coming back after weeks or months to drill down to scan the rest of the web site.
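
How a robot might check these instructions can be sketched with Python's standard urllib.robotparser module. The rules, the robot name ExampleBot and the /private/ folder are invented for the example; a real robot would fetch the robots.txt file from the site rather than use a string.

```python
from urllib.robotparser import RobotFileParser

# An invented robots.txt that keeps all robots out of one folder.
rules = """\
User-agent: *
Disallow: /private/
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)  # parse the rules directly instead of fetching them

# ExampleBot is a hypothetical robot name.
for url in ("http://www.example.com/africa.htm",
            "http://www.example.com/private/data.htm"):
    print(url, parser.can_fetch("ExampleBot", url))
```

The file is only a request to robots: a well-behaved robot checks it before each scan, but nothing in HTTP enforces it.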

The server log usually includes a browser field in each log record indicating the name of the robot. No well-behaved robot will flood a server with requests as this would affect the server's performance. They will spread their site scan over hours or days. Robots should include a contact URL or email address in the browser information included in the HTTP request header so that a web master can analyze the activity by robots.