The main protocols used on the Internet have no built-in search functions. HTTP is suited to navigation only, and the same is true of FTP, which is even simpler. The volume of information available on the Internet is growing rapidly, and the navigation features of these protocols are no longer sufficient for finding it. Search engine strategies were developed to close this gap: they structure the information on the Internet and make it easy to find.
Each search system consists of three components (a minimal sketch of how they fit together follows this list):
- An agent (a search engine spider or crawler) that moves across the Internet and gathers information;
- A database that stores all the information gathered by the agent;
- A search mechanism that people use as an interface for interacting with the database.
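As a rough illustration of how these three components relate, here is a minimal Python sketch that models the database as an in-memory inverted index. Every name in it (SearchSystem, crawl, search, fetch) is an illustrative assumption, not the architecture of any particular engine.

```python
class SearchSystem:
    def __init__(self):
        # Component 2: the database, modeled as an inverted index
        # mapping each word to the set of URLs that contain it.
        self.index = {}

    def crawl(self, url, fetch):
        # Component 1: the agent; `fetch` is an assumed helper that
        # stands in for downloading the page over HTTP.
        text = fetch(url)
        for word in text.lower().split():
            self.index.setdefault(word, set()).add(url)

    def search(self, query):
        # Component 3: the search mechanism, the user's interface
        # to the database; returns URLs containing every query word.
        results = [self.index.get(w, set()) for w in query.lower().split()]
        return set.intersection(*results) if results else set()
```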
Search mechanisms are tools for finding and structuring information. Search engine strategies, implemented as various agents, spiders, crawlers, and robots, gather information about documents located on the Internet. These tools are specialized programs that find pages on the Internet, extract the hyperlinks from them, and index the information they find to build the database. Each search mechanism has its own set of rules that determines its algorithm for gathering documents.
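The hyperlink-extraction step can be sketched with Python's standard html.parser module; this illustrates the idea only, and is not the parser any particular engine actually uses.

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects the href targets of all <a> tags found in a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

extractor = LinkExtractor()
extractor.feed('<a href="http://example.com/a">A</a> <a href="/b">B</a>')
print(extractor.links)  # ['http://example.com/a', '/b']
```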
There are, however, some notable differences between the known search engine strategies.
Agents are the most intelligent search tools, and they can do more than simply search: some advanced agents can even perform transactions on your behalf. Agents today can search for sites on specific themes and return a list of those sites sorted by traffic. They can also be programmed to extract information from existing databases, and they can find and index many types of web resources, not only pages.
Search engine spiders perform a general search for information on the Internet. A spider reports information about each document it finds, indexes it, and extracts the resulting information. Spiders also examine headers and some of the links in order to index the information.
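One way a spider might pull out the header information it indexes is sketched below with the same standard-library parser; treating <title> and <h1> through <h6> as the "headers" is an assumption made for illustration.

```python
from html.parser import HTMLParser

class HeaderExtractor(HTMLParser):
    """Records the text of <title> and <h1>-<h6> elements."""
    def __init__(self):
        super().__init__()
        self.headers = []
        self._current = None  # header tag we are currently inside

    def handle_starttag(self, tag, attrs):
        if tag in {"title", "h1", "h2", "h3", "h4", "h5", "h6"}:
            self._current = tag

    def handle_endtag(self, tag):
        if tag == self._current:
            self._current = None

    def handle_data(self, data):
        if self._current and data.strip():
            self.headers.append((self._current, data.strip()))

extractor = HeaderExtractor()
extractor.feed("<title>Example</title><h1>Search</h1><p>body</p>")
print(extractor.headers)  # [('title', 'Example'), ('h1', 'Search')]
```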
Crawlers are the simplest representatives of the search engine strategies: they examine the headers and return only the first link.
Robots are programmed to follow chains of links, indexing documents and checking their links as they go. By their nature, robots can get stuck in loops, which is why they can consume a significant amount of network resources; a robot therefore needs a rule for avoiding pages it has already visited, as in the sketch below.
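A standard way to keep a robot out of loops is to remember every URL it has visited. The sketch below assumes a get_links(url) helper that fetches a page and returns the URLs it links to; that helper, and the max_pages cap, are illustrative assumptions.

```python
from collections import deque

def crawl(start_url, get_links, max_pages=1000):
    """Breadth-first traversal that never revisits a URL, so cycles
    of links between pages cannot trap the robot in a loop."""
    visited = set()
    queue = deque([start_url])
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        if url in visited:
            continue  # already seen; skipping here breaks any cycle
        visited.add(url)
        for link in get_links(url):
            if link not in visited:
                queue.append(link)
    return visited
```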
Agents extract and index different types of information. Some index every word in a document, while others index only the hundred most significant words together with the document's size, its word count, its name, its headers, and so on.
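The second indexing style might look like the sketch below, which uses the N most frequent words as a stand-in for the "most significant" ones; real engines use more elaborate significance measures, so this choice is an assumption.

```python
from collections import Counter

def build_entry(name, text, top_n=100):
    """Index a document by its N most frequent words plus basic
    metadata, rather than by every word it contains."""
    words = text.lower().split()
    return {
        "name": name,
        "size_bytes": len(text.encode("utf-8")),
        "word_count": len(words),
        "significant_words": [w for w, _ in Counter(words).most_common(top_n)],
    }
```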
Agents move across the Internet, finding information and placing it into the database of the search mechanism. The administrators of a search engine can determine which sites, or which types of sites, the agents should visit and index.
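Such administrator control can be as simple as a host allowlist consulted before each fetch; the hosts and the helper name below are invented for illustration.

```python
from urllib.parse import urlparse

# Hosts the administrator has approved for crawling (example values).
ALLOWED_HOSTS = {"example.com", "example.org"}

def should_visit(url):
    """Return True only for URLs on an administrator-approved host
    or one of its subdomains."""
    host = urlparse(url).hostname or ""
    return host in ALLOWED_HOSTS or host.endswith(
        tuple("." + h for h in ALLOWED_HOSTS))

print(should_visit("http://www.example.com/page"))  # True
print(should_visit("http://other.net/page"))        # False
```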
It is also possible to register a site with the search systems manually, which is a good option for new commercial sites. By filling out a special form, people can place information about a site into the index; this data goes directly into the database.
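On the back end, such a submission can simply bypass the agent and write to the database; the field names in this sketch are invented for illustration.

```python
def register_site(database, url, description, keywords):
    """Manual registration: form data goes straight into the
    database without waiting for an agent to discover the site."""
    database[url] = {"description": description, "keywords": keywords}
```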
When someone wants to find information on the Internet, they visit the search system's page and fill in a form describing the information they need, adding keywords, dates, and other criteria.
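Behind that form, the criteria ultimately become a query against the database. A minimal keyword-and-date query over an inverted index might look like the sketch below; the "date" metadata field is an assumed attribute, not something every engine records.

```python
def query(index, docs, keywords, after=None):
    """Return URLs containing every keyword, optionally restricted
    to documents dated on or after `after`. `index` maps each word
    to a set of URLs; `docs` maps each URL to its metadata."""
    matches = None
    for word in keywords:
        urls = index.get(word.lower(), set())
        matches = urls if matches is None else matches & urls
    matches = matches or set()
    if after is not None:
        matches = {u for u in matches if docs[u]["date"] >= after}
    return matches
```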