Site Monitoring

Glossary F to K

Firewalls

Protecting the Server

Connecting a computer to the Internet exposes it to attack from anyone else on the Internet, as well as from viruses and other malicious programs. By default some computers will respond to connection requests on a range of ports, and the service accepting such a connection may allow software to be installed that gives backdoor access to resources on the computer.

Firewalls typically provide protection in both directions: they block unwanted access from outside and also stop sensitive information leaking out onto the Internet. One facility they often provide is to block referrer information from being sent with HTTP requests, as this reveals how a client reached a particular server page.

A firewall can also protect servers behind it against denial of service attacks: it can spot repeated access attempts from the same source and block them before the requests are passed on to the server. This is the main reason why some servers do not provide the Ping service.

You may need to configure a firewall to let Site Vigil access the Internet to perform its multifarious monitoring functions.


FTP

Copying files to a Web Server

The File Transfer Protocol (FTP) provides an alternative to HTTP for transferring data. It is typically used to move files between computers over the Internet, and especially to copy files to a web server so they can be viewed in a browser. It is designed to cope with large files and to allow the remote file system to be browsed. The key concept is that of a current local folder and a current remote folder between which files are transferred. It is still the most common way of uploading files to a web server.

The principal problem with FTP is that it treats ASCII text and binary files differently. If a binary file is transferred in text (ASCII) mode it will become garbled, because that mode performs newline translation and cannot safely carry the non 7-bit ASCII bytes contained in binary files. Binary mode therefore needs to be used for graphics files as well as executables.
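
As a rough sketch (the host, credentials and file names here are placeholders, not real values), Python's standard ftplib module makes the distinction explicit: storbinary transfers a file byte-for-byte, while storlines uses text mode and performs newline translation.

from ftplib import FTP

ftp = FTP("ftp.example.com")          # placeholder host
ftp.login("username", "password")     # placeholder credentials
# Binary mode: storbinary issues TYPE I, so the bytes arrive untouched -
# this is what graphics files and executables need.
with open("logo.gif", "rb") as f:
    ftp.storbinary("STOR logo.gif", f)
# Text mode: storlines issues TYPE A and performs newline translation,
# which would corrupt a binary file.
with open("readme.txt", "rb") as f:
    ftp.storlines("STOR readme.txt", f)
ftp.quit()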

FTP was designed to be used from a command line utility (in Windows® the ftp program provides access to the underlying FTP protocol commands). By default it uses port 21 for the command connection and port 20 for the data connection.

Some servers support restarting of failed FTP transfers. Download accelerator programs exploit this facility to speed up the download of large files by requesting multiple fragments of a file at the same time.

For a more secure alternative you may be able to use Secure FTP.

Passive mode FTP gives FTP a better chance of working through firewalls, and it also conserves socket resources. With the original 'active mode' FTP the server attempts to connect back to an arbitrary port on the client computer; some firewalls will block this connection because it looks like an unauthorised (possibly malicious) access to the computer.
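
By way of illustration, Python's ftplib uses passive mode by default; the sketch below (again with placeholder host and login details) shows how a client can switch between the two modes.

from ftplib import FTP

ftp = FTP("ftp.example.com")          # placeholder host
ftp.login("username", "password")     # placeholder credentials
ftp.set_pasv(True)     # passive mode (the default): the client opens the data connection
# ftp.set_pasv(False)  # active mode: the server connects back to the client
print(ftp.nlst())      # list the current remote folder over the data connection
ftp.quit()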

For a fuller explanation of passive mode see Active and Passive FTP mode
For the full protocol specification visit : RFC959 : The File Transfer Protocol

Site Vigil supports both the File Transfer Protocol and Secure version to allow access to web server log files.


Google

The most important search engine at present is Google®. Starting from nowhere, it now takes the lion's share of web searches (about 70-80%), so if you rely on people reaching your site through a search engine it is Google that should take most of your effort.

The key to getting a high position in the results list is to make sure keywords are used sensibly and appropriately. However it is rich content on the page that is the ultimate way to achieve a good placement.

For much more on Google and other search engines please refer to our companion search position reference pages.

Our Site Vigil product includes pioneering search engine monitoring. You can easily track how your site is performing on all search engines, including, of course, Google.


HTML

Displaying WWW Pages

The Hyper Text Markup Language (HTML) is the core language of the World Wide Web, although it was originally intended by Tim Berners-Lee at CERN simply as a text based way of describing documents.

The chief reason that HTML took off was its provision of hypertext links, which allow very large numbers of pages to be inter-linked in a very flexible manner. Secondly, HTML supported the display of graphics, although it took some time before display monitor technology and communication speeds made graphics a practical option.

With the high investment in billions of HTML web pages it is unlikely that any new technology will take over in the next few years. XML offers a few benefits and is not all that different. It is most likely that users will move to more sophisticated web design programs, which means they will rarely see, let alone edit, the HTML that the tool generates for their web pages.

Even though Hyper Text Mark-up Language (HTML) has been around for thirty years it is still the most common way of defining how to present data to a user. Originally it was intended that the mark-up tags (things like bold, underline, headings) would let each browser interpret them however it saw fit: the idea being that the user, not the data provider, decides how the information is displayed. But immense standardization pressure from web designers forced all browsers to display HTML in much the same way. A web page designer can now be confident that an HTML page will look more or less the same on a range of different browsers.

HTML is a text file format with its own MIME type; it does not define how the information is to be transferred. The HTTP protocol is the most common way of accessing HTML, but other protocols can be used too. For example, if web pages are held locally on your computer you can use the file protocol to access them directly.
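
As a small illustration (not tied to any particular product), Python's standard library shows both points: the mimetypes module maps an .html file name to the text/html MIME type, and pathlib can build the file: URL a browser would use to open the page locally.

import mimetypes
from pathlib import Path

print(mimetypes.guess_type("index.html")[0])     # text/html
print(Path("index.html").resolve().as_uri())     # e.g. file:///home/user/index.html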

For a good online reference to HTML refer to HTML Code Tutorial

Site Vigil supports automatic checks for HTML correctness of any page on the web. It can even scan a whole web site looking for problems with the HTML syntax. It'll spot broken links too.


HTTP

Getting web information

The HTTP protocol is the usual way that web servers provide web page information for display in a browser. The Hyper Text Transfer Protocol (not to be confused with HTML, which defines the contents of a document) is a mechanism for transferring data around.

Each HTTP request has a simple format, as it is typically requesting a file (HTML, GIF, JPG) to be sent over the Internet back to the program requesting it. A request has a standard text header that states what information is requested, and for each request the server sends back a response. The response has a multiple line text header terminated by a blank line; the data immediately follows this header. A TCP/IP socket connection is used for the communication, originally with a separate connection for each request, so to display a page containing references to graphics files (with HTML IMG tags) multiple HTTP request-response pairs are needed to display the complete page.

HTTP is a high-level communication protocol: it does not concern itself with such things as data packets and retries, which are handled by the TCP/IP communication layers. The standard has two main versions; HTTP/1.0 is now largely superseded by the significant enhancements added in HTTP/1.1.

HTTP defines the various status codes that can result from making a request to a server (for example the 404 - 'page not found' error code).

A simple browser request typically has the entries :

GET http://www.spot.com/index.html HTTP/1.1
Host: www.spot.com
User-Agent: Mozilla/6.0 (windows; U; XP+; en-us)
Accept: text/html

In the first line the action GET tells the server that it is fetching information; this is followed by the URL defining what is wanted, and then HTTP/1.1 defines the version of HTTP used by the client requesting the information. Other actions include HEAD, which returns information about the resource but not the actual data, and POST, which sends the server data entered in a form.

The second line tells the server which host the information is wanted from (this is an HTTP/1.1 feature that allows shared hosting).

Next it tells the server about the software requesting the data; this often includes the browser type, the operating system platform and the language locale. The last line tells the server the MIME file types that the browser is willing to accept as a response. This may be a list of types in preference order so that the user gets the best possible match to the format they would like. In this example the client is only willing to accept HTML (the text/html type).

The HTTP response header coming back from the server when it processes the request is along the lines of :

HTTP/1.1 200 OK
Date: Wed, 10 Jul 2002 14:14:25 GMT
Server: Apache/1.3.26 (Unix) mod_throttle/3.1.2 PHP/4.1.2 mod_ssl/2.8.10 OpenSSL/0.9.6b
Last-Modified: Sat, 23 Feb 2002 23:14:11 GMT
ETag: "80314-16d4-3c782243"
Accept-Ranges: bytes
Content-Length: 5844
Content-Type: text/html
Age: 1974
Connection: close

The first line tells the client browser the version of HTTP in use, followed by the status code. 200 is the normal success HTTP code; see our Status Code page for details of these codes. This is followed by the date and time the request was received and details about the server, including the software packages it is running. The Last-Modified timestamp for the information is sent next; a browser can use this date and time to check whether its copy of the information is still up-to-date. The Age entry is used to implement caching: if it is present it indicates that the information is not fresh from the original source but has come from a cache somewhere along the route. A web page can modify the values of some of these response headers using the HTTP-EQUIV <META> tag.
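
A minimal sketch of the same exchange using Python's standard http.client module is shown below; www.example.com stands in for the placeholder host used in the examples above, and the User-Agent string is simply illustrative.

import http.client

conn = http.client.HTTPConnection("www.example.com", 80)
conn.request("GET", "/index.html", headers={
    "User-Agent": "ExampleClient/1.0",   # hypothetical client name
    "Accept": "text/html",
})                                       # the Host header is added automatically
resp = conn.getresponse()
print(resp.status, resp.reason)          # e.g. 200 OK, or 404 Not Found
for name, value in resp.getheaders():    # Date, Server, Content-Type and so on
    print(name + ": " + value)
body = resp.read()                       # the HTML document itself
conn.close()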

See also RFC2616 : Hypertext Transfer Protocol HTTP/1.1
HTTP Request Headers
How HTTP works

Site Vigil uses the HTTP protocol to monitor access to pages on web sites. There is a wide range of monitoring options available.


HTTPS

Secure access to information

HTTPS is the secure version of HTTP. As far as the requesting program is concerned the protocol works the same as HTTP except that the information is passed over the Internet in a secure manner.

All communication goes through a Secure Sockets Layer (SSL) rather than being sent directly as plain unencrypted text. The encryption uses public/private key pairs (PKI: Public Key Infrastructure), which makes the messages hard (but not impossible) to decipher. Different countries may impose limits on how long these encryption keys are allowed to be.

The security works both ways - information passed to the server is encrypted and therefore hard to eavesdrop or decode, and information sent back from the server is equally secure. This allows for credit card information to be typed in and sent with high confidence that it won't fall into the wrong hands. When the protocol is used the URL just uses https: rather than http: as the protocol name. Most browsers will indicate the secure status by a padlock in the status bar.

For a web server to support HTTPS it must possess an SSL certificate. This is unique to the web server and authenticates that the computer being communicated with is the intended server. It typically has to be renewed on an annual basis through a trusted certificate authority such as Thawte. There is also a mechanism to revoke certificates, so all in all it is a fairly secure system.
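
As a sketch of what inspecting a server's certificate might look like, the following uses Python's standard ssl module; www.example.com is just an illustrative host.

import socket
import ssl

context = ssl.create_default_context()    # verifies the certificate chain and hostname
with socket.create_connection(("www.example.com", 443)) as sock:
    with context.wrap_socket(sock, server_hostname="www.example.com") as tls:
        cert = tls.getpeercert()
        print(cert["subject"])             # who the certificate was issued to
        print(cert["notAfter"])            # the expiry date, when renewal is due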

See also RFC2818 : HTTP Over TLS
SSL and Certificates

Site Vigil can check that web pages are accessible using HTTPS as well as the more normal HTTP.


Hyperlinks

If there is one feature that made the Internet take off, it must be the simple way that HTML lets you link to other pages and sites. A good web site will have attracted a large number of links to it over the time it has been online. If the sites making those links are themselves prestigious (according to their site ranking) then that adds to the significance of the web site and how high up the search engine result lists it appears.

Many excellent web sites lie unvisited because no one has chosen to link to them. The Internet needs web sites to include a range of links to related or interesting sites, as that is how good sites get more and more noticed.

In our companion guide to search engine position, we describe how search engines currently rank web pages.

Site Vigil will check all the internal and/or external links of a web site, automatically on a regular basis so you can be sure the links are all pointing somewhere sensible.


IP Address

Your unique access IP Address

The familiar way to identify a computer with which you want to communicate is a domain name, which is passed to DNS to obtain the four part IP address. Each part is represented as a number in the range 0 to 255. The communication system will then typically use sockets to communicate with the server at this address. In broad terms every system connected to the Internet needs a unique IP address: each client and server connection to the Internet must have a unique IP address to define its endpoints at any one time. When IP addresses were invented it was thought that the system had plenty of unique addresses (2^32 = 256^4, or 4,294,967,296 addresses), but the shortage of addresses is now a major headache.

The Internet currently uses IP version 4; there are now moves to adopt version 6 (IPv6), which has 2^128 addresses, far more than will ever be needed (trillions for every person).
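
The sketch below uses Python's standard socket module to show the name-to-address lookup in practice; www.example.com is a placeholder host.

import socket

print(socket.gethostbyname("www.example.com"))       # a single IPv4 address
for family, _, _, _, sockaddr in socket.getaddrinfo("www.example.com", 80):
    print(family.name, sockaddr[0])                  # AF_INET / AF_INET6 and the address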

One way to avoid allocating IP addresses that are not in use at a particular time is to use dynamic IP addresses. Here the IP address used by a client may change from one connection session to the next; an ISP allocates them from its pool of unused addresses. Dial-up connections typically use this scheme, as there may be many thousands of potential users but only a small number of active connection points onto the Internet that require unique addresses.

Using Proxy servers also helps reduce the number of required addresses as a single server acts on behalf of a number of clients so that they don't each need their own address. The Proxy looks after getting the response back to the computer that requested the data.

Under HTTP/1.0 a server needed a unique IP address for each domain it was hosting, which was inconvenient for servers hosting several sites with a limited number of IP addresses. With HTTP/1.1 several domain names can share the same IP address; the HTTP Host field provides sufficient data to distinguish which domain the client actually wants to access.

IP addresses are used for local area networks as well as for the Internet, and an office will typically set up a range of IP addresses that are not accessible from the Internet. Special IP address ranges have been reserved for this purpose; of these, addresses beginning 192.168 (class C), 172.16 (class B) and 10 (class A) are the most widely used. In the local area situation IP subnet masks are used to divide a large LAN into sub-networks.
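
A short sketch with Python's standard ipaddress module shows the reserved private ranges and a subnet mask in action.

import ipaddress

print(ipaddress.ip_address("192.168.1.10").is_private)   # True - reserved for local networks
print(ipaddress.ip_address("10.0.0.5").is_private)       # True
print(ipaddress.ip_address("8.8.8.8").is_private)        # False - a public Internet address
lan = ipaddress.ip_network("192.168.1.0/24")             # subnet mask 255.255.255.0
print(lan.netmask, lan.num_addresses)                    # 255.255.255.0 256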

A very popular web site may not be able to handle all the requests on a single server computer, so incoming requests are farmed out to a set of servers. However, those incoming requests will first be mapped by DNS onto a single IP address, so this can become a bottleneck.

Different authorities are responsible for allocating IP addresses under the central control of IANA (Internet Assigned Numbers Authority). In America this is ARIN (American Registry for Internet Numbers), whilst in Europe this is RIPE (Réseaux IP Européens).

You can use their services to find out about a particular IP address - perhaps one that is making a virus or denial of service attack.

See also : RFC791 : Internet Protocol Specification
RFC2133: Basic Socket Interface Extensions for IPv6
IP Address Version 6

You can get Site Vigil to inform you when the IP address for any domain name has changed. This is particularly useful when you want to track the transfer of a domain from one web hosting company to another.


ISP

An Internet Service Provider (ISP) provides communication between computers over the Internet. Traditionally an ISP provided banks of modems that connected customers' phone lines to the Internet. As broadband becomes more common, ISPs now offer a range of different types of high speed connection. To make best use of the available IP addresses an ISP may switch the address used during a session, or for higher speed connections nominate a fixed address.

Many ISPs also provide web hosting facilities, as they usually have a good quality connection to the Internet.


Keywords

Web page keywords are the words and phrases that epitomise the product or service that a web site is offering. Originally the term keyword had a narrower meaning, referring to a particular HTML META tag in the page header. Now that search engines use sophisticated analysis techniques to determine what each web site is about, it is both more important and more difficult to get the appropriate keywords registered for each separate web page.

Each page needs to be carefully constructed to place keywords in all the appropriate places, in the title, description, keywords, headers and text. This is all explained in our companion guide on Search Engine algorithms.
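
As a rough sketch of how the keywords registered in a page can be inspected (using only Python's standard library and a made-up page), the parser below pulls out the title and the keywords and description META tags.

from html.parser import HTMLParser

class KeywordParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self.in_title = True
        elif tag == "meta" and attrs.get("name") in ("keywords", "description"):
            self.meta[attrs["name"]] = attrs.get("content", "")

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data

parser = KeywordParser()
parser.feed('<html><head><title>Widget World</title>'
            '<meta name="keywords" content="widgets, gadgets"></head><body></body></html>')
print(parser.title)   # Widget World
print(parser.meta)    # {'keywords': 'widgets, gadgets'}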

You can get Site Vigil to analyse and display keywords for a web site you have chosen to monitor. This lets you work out how best to increase traffic to your web site.