Search Engine Site Scan
To keep an up-to-date directory of web sites by keyword, the search engines need to build up a huge (and I mean huge) database of information. Web sites are scanned at regular intervals, usually in two phases. The first scan just looks at the index page and checks its validity. Some time later the second phase does a spider scan of all the pages it can find on the web site by following links from the index page. The frequency of further visits is tuned to how often the pages change (there is no point rescanning static data again and again) and to the importance of the web site.
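As a rough illustration of the two phases, here is a minimal Python sketch. It is not how any particular search engine is implemented: the `SITE` dictionary is a made-up stand-in for real page downloads, and the names `extract_links` and `spider` are hypothetical. Phase one only checks that the index page exists; phase two follows links outward from it.

```python
from collections import deque
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html):
    """Return the link targets found in a piece of HTML."""
    parser = LinkExtractor()
    parser.feed(html)
    return parser.links


def spider(site, start="/index.html"):
    """Return the pages reachable from the index page, in visiting order.

    `site` is a hypothetical dict mapping a path to its HTML; it stands
    in for downloading pages over HTTP.
    """
    # Phase one: only check that the index page is there at all.
    if start not in site:
        return []
    # Phase two: follow links breadth-first from the index page.
    visited = [start]
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for link in extract_links(site[page]):
            if link in site and link not in visited:
                visited.append(link)
                queue.append(link)
    return visited


# Made-up example site (assumption, for illustration only).
SITE = {
    "/index.html": '<a href="/about.html">About</a> <a href="/products.html">Products</a>',
    "/about.html": '<a href="/index.html">Home</a>',
    "/products.html": '<a href="/widget.html">Widget</a>',
    "/widget.html": "<p>No links here.</p>",
    "/orphan.html": "<p>Nothing links to this page.</p>",
}

print(spider(SITE))
```

Note that `/orphan.html` is never visited because no page links to it. A spider can only find what it can reach by following links.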
If no other site on the Internet links to your site, you need to submit the site to the search engines. There are tools to do this, but it is best done manually - at least for the most important engines. This is because the system gets abused by automated tools that continually resubmit the same sites. Using these tools may even get you banned from an engine's results. To check whether you are listed, it's usually possible to enter the site's domain name as the keyphrase and see if any results are returned. Once people start adding links to your site there is really no need to submit or re-submit; doing so might even cause an engine to refuse to list the site.
The search engine robot will download each page (but not the graphics) and analyze it. The analysis algorithm takes into account the different sources of information within the page. The way this works is specific to each search engine and changes frequently. A page optimized for one search engine will not necessarily be optimal for others. Fortunately, as Google is by far the leading engine at present, it makes sense to optimize for Google first. Because the data and algorithms change from day to day, your position in the search results may change dramatically over a short space of time. This 'erratic' behavior has been termed the Google Dance.
You can control which pages are included in or excluded from the search engines, as well as which links get followed, with special META directives in the header part of a page.
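The best known of these directives is the robots META tag, whose content lists values such as `noindex` (leave this page out of the index) and `nofollow` (do not follow the links on this page). The sketch below shows how a robot might read that tag. The class and function names are hypothetical, and the default of "index and follow" when no tag is present is the usual convention rather than a guarantee about any particular engine.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives of every <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() == "robots":
            content = attributes.get("content") or ""
            for item in content.split(","):
                item = item.strip().lower()
                if item:
                    self.directives.add(item)


def robots_policy(html):
    """Return whether a robot may index the page and follow its links."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return {
        # Assumed default: allowed unless the page says otherwise.
        "index": "noindex" not in parser.directives,
        "follow": "nofollow" not in parser.directives,
    }


# Made-up example page (assumption, for illustration only).
PAGE = (
    "<html><head>"
    '<meta name="robots" content="noindex, follow">'
    "</head><body>Private notes</body></html>"
)

print(robots_policy(PAGE))
```

For the example page this reports that the page should not be indexed but that its links may still be followed.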
How Search Engines work now
As with much of the Internet, the world had to move on from blind trust in the META keywords placed in the header section of pages. Instead of thousands of web sites on the Internet there are now millions, and search engine results for popular keywords are now numbered in millions too. The wheat and chaff have to be separated. Search engines have taken different approaches to achieve this, but what they have in common is that they do not rely on the specified META keywords. Indeed, Google® was the first to ignore the META keywords entirely, on the basis that they were too heavily abused. Search engines no longer rely on a single element in a page; instead they build up a relevance score for each potential keyword. Our page elements section outlines the key parts of a page as far as the search engine is concerned.
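To make the idea of a relevance score concrete, here is a toy Python sketch. The real scoring formulas are not public and change frequently, so everything here is an assumption: the element names, the weights in `WEIGHTS`, and the function `relevance_scores` are all invented for illustration. The only point it demonstrates is that a word scores higher when it appears in several prominent parts of a page, and that an untrusted element such as the META keywords can simply be given no weight.

```python
import re

# Hypothetical weights per page element; real engines do not publish theirs.
# Elements not listed here (such as "meta_keywords") are ignored.
WEIGHTS = {"title": 5.0, "heading": 3.0, "body": 1.0}


def tokenize(text):
    """Split text into lowercase words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def relevance_scores(page):
    """Add up a weighted count for every word found in the page.

    `page` is a dict mapping an element name to the text found in it.
    """
    scores = {}
    for element, text in page.items():
        weight = WEIGHTS.get(element, 0.0)
        if weight == 0.0:
            continue  # untrusted element: contributes nothing
        for word in tokenize(text):
            scores[word] = scores.get(word, 0.0) + weight
    return scores


# Made-up example page (assumption, for illustration only).
PAGE = {
    "title": "Mole traps",
    "heading": "Choosing a mole trap",
    "body": "A trap should be set in the run the mole uses most.",
    "meta_keywords": "cheap flights, casino",
}

scores = relevance_scores(PAGE)
print(scores["mole"], scores["trap"], "casino" in scores)
```

Here 'mole' appears in the title, a heading and the body, so it outscores 'trap', while the abusive META keywords never enter the index at all.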
What's in a Word?
Originally search engines just took the literal META keywords and built an index based on the text. This is unfortunate because there are ever so many English words that have multiple meanings, such as mole, which can be either a burrowing animal or a skin spot. If you just type mole into a search engine, should it give results based on just one or the other meaning, or a mixture? If the search engine has nothing else to go on, perhaps it should give a mixture or else choose the most common usage of the word.
However, if you put in a search engine keyphrase of mole trap then the context indicates that it's the animal you mean, and in this case the results should not mention skin blemishes at all. Conversely, mole itch would be expected to give skin disease matches. This all requires the search engine to be much cleverer than just keeping track of individual words: it must keep track of context and work out from it which meaning of a word a web page is using. In broad terms this means the analysis algorithm has to understand the nuances of the meanings of words, not just the sequence of characters representing the word. So if a web page mentions both 'nut' and 'bolt' then in that context 'bolt' is to do with mechanical engineering and the two words mutually reinforce the context - nothing whatsoever to do with lightning.
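A very simple way to picture this is to give each meaning of a word a set of cue words and pick the meaning whose cues overlap most with the rest of the keyphrase. The Python sketch below does exactly that. It is only a toy: the `SENSES` table and the function `guess_sense` are invented for illustration, and real search engines use far more sophisticated methods.

```python
# Hypothetical table of word meanings and the cue words that suggest them.
SENSES = {
    "mole": {
        "animal": {"trap", "burrow", "lawn", "tunnel", "garden"},
        "skin": {"itch", "skin", "blemish", "doctor", "spot"},
    },
    "bolt": {
        "fastener": {"nut", "washer", "thread", "spanner"},
        "lightning": {"thunder", "storm", "lightning", "sky"},
    },
}


def guess_sense(word, keyphrase):
    """Pick the meaning of `word` best supported by the other words.

    Returns None when the word is unknown or the context gives no clue,
    in which case a mixture of results may be the best answer.
    """
    senses = SENSES.get(word)
    if senses is None:
        return None
    context = set(keyphrase.lower().split()) - {word}
    best_sense, best_overlap = None, 0
    for sense, cues in senses.items():
        overlap = len(context & cues)
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense


print(guess_sense("mole", "mole trap"))
print(guess_sense("mole", "mole itch"))
print(guess_sense("bolt", "nut and bolt"))
print(guess_sense("mole", "mole"))
```

The first three calls pick the animal, skin and fastener meanings respectively, and the last returns `None` because a bare 'mole' gives the algorithm nothing to go on.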
Google is Best
How did Google® come to dominate the search engine market? They weren't there to start with. Basically they got the design right. Three main design aims were met: make the results more accurate than the competitors', make it fast, and use an uncluttered interface. At the time that Google was released, the opposition had interfaces cluttered with lots of links and garish banner advertisements. Which brings me to the final and probably the most significant reason for Google's success: their business model was better. They designed in from the outset that advertising would be targeted according to what the user was searching for. An advertiser buys space knowing that the user is really looking for items under a particular search term. This makes advertising much more affordable. The model of the competition was to put banner adverts on absolutely every screen, which makes them much more prominent but beyond the budget of most firms and in most cases completely inappropriate to what a user was looking for. The lesson to learn is that Google understood what the user wanted from a search engine and stuck to it, even if in the early days they received no income from it.
Click the link to learn more about page elements that search engines analyse.
For more detailed information on search engine algorithms please visit: How Search Engines Rank Web Pages or How a large Web Search Engine works.
Our SiteVigil product (produced by Silurian Software) is designed to help you monitor the position of web pages on search engines without burying you in all the associated technical jargon. It will show you the keywords used on any page using its built-in Analyzer utility. It also gives you access to ownership details for the domain and web site.