Basic principles of search engines: what a search engine is and how Internet search works

By definition, an Internet search engine is an information retrieval system that helps us find information on the World Wide Web, facilitating the global exchange of information. But the Internet is an unstructured database, growing exponentially into a huge repository of information. Finding information on the Internet is therefore a difficult task, and there is a need for a tool to manage, filter, and retrieve this ocean of information. The search engine serves this purpose.

How does a search engine work?

Internet search engines are engines that search for and retrieve information on the Internet. Most of them use a crawler-indexer architecture and depend on their crawler modules. Crawlers, also called spiders, are small programs that browse web pages.

Crawlers visit an initial set of URLs. They mine the URLs that appear on the crawled pages and send this information to the crawler control module, which decides which pages to visit next and hands those URLs back to the crawlers.
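To make the crawler / control-module interplay concrete, here is a minimal sketch in Python using only the standard library. It is an illustration, not a production crawler: the frontier is an in-memory queue, the `LinkExtractor` and `fetch_links` names are invented for this example, and real crawlers add politeness delays, robots.txt checks, and large-scale deduplication.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collects the href targets of <a> tags on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def fetch_links(url):
    """Download a page and return the absolute URLs it links to."""
    with urlopen(url, timeout=10) as resp:
        html = resp.read().decode("utf-8", errors="replace")
    parser = LinkExtractor()
    parser.feed(html)
    return [urljoin(url, href) for href in parser.links]

def crawl(seed_urls, max_pages=50):
    """Control loop: visit the initial URLs, mine new URLs from each
    crawled page, and schedule them for later visits."""
    frontier = deque(seed_urls)  # URLs the control module has scheduled
    visited = set()
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        if url in visited:
            continue
        visited.add(url)
        try:
            for link in fetch_links(url):  # mine URLs from the page
                if link not in visited:
                    frontier.append(link)  # hand back to the control module
        except (OSError, ValueError):
            pass  # unreachable or malformed page: skip and move on
    return visited
```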

The topics covered by different search engines vary depending on the algorithms they use. Some search engines are programmed to search sites on a specific topic, while the crawlers of others may visit as many pages as possible.

The indexing module extracts information from each page it visits and enters the URL into the database. This results in a huge lookup table with a list of URLs pointing to pages of information. The table shows the pages that were covered during the crawl.

The analysis module is another important part of the search engine architecture. It creates a utility index. The utility index can provide access, for example, to pages of a given length or pages containing a certain number of pictures.

During the crawling and indexing process, the search engine stores the pages it retrieves. They are temporarily stored in page storage. Search engines maintain a cache of the pages they visit to speed up the retrieval of pages that have already been visited.

The search engine query module receives search queries from users in the form of keywords. The ranking module sorts the results.

The crawler-indexer architecture has many variations. One of them is the distributed search engine architecture, which consists of collectors and brokers. Collectors gather indexing information from web servers, while brokers provide the indexing engine and the query interface. Brokers update the index based on information received from collectors and other brokers, and they can filter information. Many search engines today use this type of architecture.

Search engines and page ranking

When we create a query in a search engine, the results are displayed in a certain order. Most of us tend to visit the top pages and ignore the bottom ones. This is because we believe that the top few pages are more relevant to our query. So everyone is interested in having their pages rank in the top ten search engine results.

The words entered into the search engine's query interface are the keywords being searched for; the results are a list of pages relevant to those keywords. During this process, search engines retrieve pages with frequent occurrences of the keywords and look for relationships between them. The placement of keywords also counts, as does the ranking of the pages containing them. Keywords that appear in page titles or URLs are given more weight. Links pointing to a page make it even more popular: if many other sites link to a page, it is seen as valuable and more relevant.
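As a toy illustration of this weighting idea, the sketch below scores a page for a keyword, counting body occurrences and giving assumed extra weight to title and URL matches and to incoming links. The weights and the page structure are invented for the example; real ranking formulas are far more elaborate.

```python
def keyword_score(keyword, page):
    """Toy scoring: body occurrences count once, while hits in the
    title or URL get extra weight, and incoming links add popularity.
    All weights here are invented for illustration."""
    kw = keyword.lower()
    score = page["body"].lower().count(kw) * 1.0
    if kw in page["title"].lower():
        score += 5.0   # title match: assumed extra weight
    if kw in page["url"].lower():
        score += 3.0   # URL match: assumed extra weight
    score += 0.5 * page["inlinks"]  # pages linked to by many sites rank higher
    return score

page = {"url": "https://example.com/refrigerators",
        "title": "How to choose a refrigerator",
        "body": "A short guide to choosing a refrigerator for your kitchen.",
        "inlinks": 12}
print(keyword_score("refrigerator", page))
```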

Every search engine uses a ranking algorithm: a computerized formula designed to deliver pages relevant to a user's request. Each search engine may have a different ranking algorithm that analyzes the pages in the engine's database to determine relevant responses to search queries. Search engines also index information differently, which means a given query posed to two different search engines may return pages in different orders or retrieve entirely different pages. The popularity of a website is one factor that determines relevance. A site's click-through popularity, a measure of how often it is visited, is another factor that determines its rank.

Webmasters try to trick search engine algorithms to raise their site's position in search results, for example by stuffing pages with keywords or abusing meta tags to cheat the ranking strategy. But search engines are smart enough! They keep improving their algorithms so that the machinations of webmasters do not affect search results.

You need to understand that even the pages after the first few in the list may contain exactly the information you were looking for. But rest assured that good search engines will always bring you highly relevant pages in the top order!


From this article you will learn:

  • How the first search engines worked
  • How modern search engines work
  • What principles underlie the work of any search engine
  • What formulas search engines use in their work
  • How sites are ranked
  • How the Yandex search engine works

Before engaging in SEO promotion of a website, it is important to study the principles of search engines in order to achieve the desired results. This knowledge will be useful for developing an individual strategy for optimizing an Internet resource for certain keywords and will help bring it to the TOP of search results.

What principles underlay the work of the first search engines?


At the dawn of the Internet, only a small number of users were able to connect to it. The amount of available information was also small. At that time, it was mainly employees of research organizations who worked on the Internet. Searching for information on the Internet was not as popular as it is now.

The first attempt to organize access to electronic data via the Internet was made by Yahoo!, which appeared in 1994. The company's developers created an open catalog of sites, with links grouped by topic. As the number of resources in the database grew, it became necessary to add a catalog search option. It was not yet a search engine in the form we are accustomed to, since it searched an internal database of sites rather than all existing Internet resources.

Such link directories used to be very popular, but today they have lost their relevance, since the number of sites is constantly increasing. For example, the largest modern directory on the Internet, DMOZ (another name is the Open Directory Project), includes about 5 million sites, while the Google search engine database contains more than 8 billion links.

In 1994, the first real search engine, WebCrawler, appeared.


In 1995, two more search engines were created: Lycos and AltaVista. The latter search engine has long held a leading position in the field of online information search.

In 1997, Stanford University students Sergey Brin and Larry Page developed the Google search engine, which became the most popular in the world.

Also in 1997, the Yandex search engine, which is popular on the Runet, began operating.

How do search engines work today?

If you are not a programmer and your profession has nothing to do with IT, why would you need to understand the principles of search engines? The fact is that so-called organic traffic comes to a company's website through search engines: users who found your Internet resource themselves using keywords in Yandex or Google. Organic traffic is a tasty slice of the target-audience pie: the higher its level, the greater the site's conversion and sales.

In order for users to easily find your Internet resource, it is important to fill it with the right content. Search engines rank sites based on the quality of content, which affects their place in search results. It turns out that, knowing how search engines index Internet resources, you can optimize their content and promote it to the TOP.

On the other hand, search engines let you analyze user actions: study what people are looking for and which information, products, or services are currently relevant to them. Yandex statistics give a picture of the actions of Runet users, while Google statistics cover the worldwide Internet.


The main concept with which any search engine works is the search index - a certain data structure that reflects information about documents and the location of keywords in them.

The operating principles of many search engines are very similar. The main difference is the approach to site ranking (the method of organizing resources in search results).


Every day, a huge number of users search for various information using search engines.

For example, the following search queries are popular:

“Write an abstract”:


"Buy":


To increase the speed of the search engine, the search architecture consists of two elements:

  • basic search;
  • metasearch.

Basic search: a program that searches its own part of the index and gives the user all the links that match the search query.

Metasearch: a program that, when processing a request, determines the user's location and serves ready-made search results if the keyword is popular. If there have been no similar requests before, basic search is activated; using machine learning, it processes the links in the database and returns the list to the user.
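A minimal sketch of this two-tier arrangement might look as follows. The cache of popular queries and the list of index shards are invented stand-ins for real infrastructure; the point is only the control flow: serve ready-made results when possible, otherwise fan out to basic search.

```python
def metasearch(query, popular_cache, index_shards):
    """Serve a popular query from precomputed results; otherwise fall
    back to basic search across index shards. The cache and shard
    structures are invented for illustration."""
    if query in popular_cache:               # popular keyword: ready-made results
        return popular_cache[query]
    results = []
    for shard in index_shards:               # basic search: each program
        results.extend(shard.get(query, []))  # searches its own part of the index
    return results

popular_cache = {"buy": ["shop-a.example", "shop-b.example"]}
index_shards = [{"write an abstract": ["edu.example/abstracts"]},
                {"buy": ["shop-c.example"]}]
print(metasearch("buy", popular_cache, index_shards))
print(metasearch("write an abstract", popular_cache, index_shards))
```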

The search engine simultaneously analyzes the user and the search query itself according to the following criteria:

  • length;
  • definition;
  • popularity;
  • competitiveness;
  • syntax;
  • geography.

The following request types are distinguished:

  • navigation;
  • informational;
  • transactional;
  • multimedia;
  • general;
  • official.

After the search query has been sorted by parameters and classified into one of the listed types, the search engine selects a ranking function.


Search engines do not disclose information about the ranking of search queries, so the example in the figure above is just an assumption by SEO specialists.

Knowing the types of requests is necessary for choosing a website promotion strategy. For example, if a user enters a general query, the search engine will give him links of different types (commercial, multimedia, informational, etc.). If you are promoting your commercial website for a general query and want to bring it into the top ten results, then with high probability you will not reach the TOP but will only compete for the limited number of slots the ranking formula allots to commercial Internet resources. It turns out that promoting a website to the first lines of search results for general queries is much more difficult than for other types.

The Yandex search engine has been using machine learning since 2009: MatrixNet, a special algorithm that ranks sites for certain queries.


The basic operating principle of this algorithm is as follows: the assessor department collects primary information to assess the effectiveness of the ranking formula. Employees of this department, based on an experimental formula, evaluate a sample of Internet resources according to certain parameters:

1. Vital: the official Internet resource of the company, or not. This could be a website, a social network page, or information on reputable resources.

2. Useful (score 5): a site that provides all the information necessary for the search query. For example, a user enters "banner fabric" in the search bar. For the algorithm to rate a resource as useful, it must contain the following information:

  • what is banner fabric;
  • specifications;
  • photos;
  • types;
  • price list;
  • additional information.

Examples of queries that make it to the TOP of search results:


3. Relevant+ (score 4): this score indicates that the information on the site corresponds to the search query.

4. Relevant- (score 3): the site does not fully correspond to the search query. For example, for the query "Guardians of the Galaxy showtimes", the search engine shows links to pages about the film, but without a show schedule or with an outdated one.

5. Irrelevant (score 2): the site does not respond to the search query. For example, a user is looking for information about one hostel, and the search engine gives him the page of a completely different one.

To promote a site for general or informational search queries, you need to optimize its content in such a way that the search engine assigns a rating of “useful” during ranking.

What is the operating principle of any search engine based on?


The operating principles of a search engine are based on the interaction of three main elements. First, a search is carried out based on the keywords or phrases the user enters; then, in the process of mathematical processing, the results are grouped by links and sites; finally, a search robot or other tools are used to read information from the selected Internet resources. The main components currently in use:

  • Crawler (also called a "spider"): a program that "walks" the Internet. It visits only those sites where it finds at least a minimal match to what is specified in the search query. Its work begins with a list of addresses from available databases or an index.
  • Index: the crawler passes everything it collects to the search index, so the index always has up-to-date information about the sites and web pages found. If updates appear on a resource or page, this information is refreshed in the index.
  • Search engine (server): special software whose main function is to analyze the information collected in the search index and form the final search results. The search engine itself decides how to distribute pages in the output.

Any search engine aims to provide the user with the most relevant and useful sites that match the search query. In technical terms, this is called “response relevance.” For example, to promote an online store, it is of great importance that the content posted on it matches the needs of users. Optimizing your site will improve its position in search results.

Let's consider the main characteristics of Internet search engines and the principles of their operation:

  • Completeness: a key characteristic of a search engine. It is calculated as the ratio of the number of documents returned for the user's request to the total number of documents on the Internet that match the search query. For example, if there are 200 pages on the Internet that use the phrase "how to choose a refrigerator" and the search engine returned only 40 of them, the completeness of the search is 0.2. The higher the completeness score, the more likely the user will find what he was looking for (assuming this information is available on the Web); a worked calculation follows this list.
  • Accuracy: the second, but no less important, characteristic of search engine operation. It shows how well the found documents correspond to the user's search query. Say in our example 200 pages are found for the query "how to choose a refrigerator", 80 of them contain the exact phrase, and the rest simply contain the individual words (for example, "how to ergonomically place a kitchen set and choose a place for a refrigerator"). In this case the search accuracy equals 80 / 200 = 0.4. The higher the accuracy, the faster the user finds what he needs and the less "spam" he encounters along the way.
  • Relevance (freshness): another important parameter of the search engine. It reflects the time elapsed between the publication of material on the Internet and its entry into the search engine's index database. For example, a few hours after a plane crash, a large number of users searched the Internet for information about the incident. Although little time had passed since the first reports on the topic were published, the search engines had already managed to index them, and users were able to learn the details of the disaster.
  • Speed: the performance of a search engine directly depends on its resistance to loads. For example, according to Rambler Internet Holding LLC, its search engine processes about 60 search queries per second. Such speed is achieved by reducing the processing time of each individual request.
  • Visibility: a clear presentation of results makes working with the search engine user-friendly. A search engine can find hundreds or even thousands of sites for a query. If the query is not composed quite correctly, even the first page of results may contain pages that do not quite match what the user wants to find, forcing him to filter information within the list. Individual elements of the results page help the user navigate the output. Detailed explanations of the search results page on Yandex, for example, can be found at http://help.yandex.ru/search/?id=481937.
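For the completeness and accuracy bullets above, the article's two worked numbers can be reproduced directly; this is plain arithmetic, shown in Python only for concreteness.

```python
# Completeness: the engine returned 40 of the 200 matching pages on the Web.
relevant_on_web = 200
returned = 40
completeness = returned / relevant_on_web
print(completeness)   # 0.2

# Accuracy: of the 200 pages found, 80 contain the exact phrase.
found = 200
exact_matches = 80
accuracy = exact_matches / found
print(accuracy)       # 0.4
```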

In order for a site to be on the first page of search results with a high probability, you need to:

  1. Use anchor links that redirect users from thematic Internet sites to your company’s website. This increases the visibility of an Internet resource for a search engine, since not only the text with a link to the site, but also its URL can be included in the results.
  2. Use meta tags along with the right keywords. This will make the website summary more unique and effective.
  3. Use the Title tag.
  4. Correctly compose the semantic core of the site. It is not enough to distribute keywords throughout the site’s content; it is important to do it carefully and unobtrusively. You should not insert keywords in every sentence. A search engine may rate this as spam.
  5. Use the site URL (the address of its location on the Internet). Correct spelling of an address affects its ranking by the search engine.

About the principles of operation of all search engines in simple words

A search engine is a special program with a convenient web interface that lets users quickly and easily find the necessary information on the Internet. Let's consider the operating principle of any search engine without going into details and technical terminology.

In order for a search engine to provide the user with a list of page links that contain information on a search query, it must know the content of all sections of each site. How does the search engine collect this data?

The Internet is a special network consisting of individual pages that link to each other. The search engine does not have to go to each of them; it is enough to have information about the sites and directories with the highest ratings in order to accumulate data about the pages for subsequent downloading.

The search engine, in essence, creates a subject index in which the addresses of all Internet pages are grouped in a special way. If a new site has appeared on the Internet that is not linked to by other resources, then it will be difficult for a search engine to find it in order to index it and add it to its database.

After the search engine has generated a list of pages, the indexing process begins (the work of downloading all the data from them). Using programs specially created for these purposes, the search system records new information or overwrites old information, deleting irrelevant information. Work on indexing sites on the Internet is ongoing.

Programs collect data into temporary storage. There they accumulate up to a certain amount, after which the update process is launched, the main principle of which is updating the information in the main database of the search engine.

There are a huge number of pages in the search engine index. When a user enters a query in the search bar, the search engine selects relevant links from its database. In other words, it compiles a list of sites whose pages mention keywords specified by the user.

Since the internal database of the search engine is huge, search results can contain dozens of pages. How does a search engine rank them? By what principle does it determine which pages to show to the user first? All pages are sorted by content matching the search query. The higher the level of completeness of information contained on a page, the closer the page will be to the top of the list.

Nowadays, search engines use machine learning to rank pages in search results. The principle of operation of this process can be seen using an abstract example.

Let's say we need to train a robot to distinguish ripe apples from unripe ones. A program for determining the properties of a fruit is based on its characteristics:

  • color;
  • size;
  • hardness;
  • sugar content;
  • acid content.


For analysis, the robot is given two apples: one ripe and one unripe. It compares them. Then we train the machine: we show it which characteristics of an apple are positive and which are negative, and we explain by what set of parameters a fruit can be classified as ripe or not.

Thus, we have an algorithm for analyzing apples with which the robot can sort them independently. Now it can be given not two fruits but many more; the machine itself will divide them into ripe and unripe.
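A minimal sketch of such a classifier, assuming invented 0-to-1 feature scales and a nearest-example rule: given the two training apples, a new apple is labeled by whichever example it sits closer to in feature space. Real ranking models like MatrixNet learn from many more examples and features.

```python
# Each apple is a feature vector: (color, size, hardness, sugar, acid),
# all on invented 0..1 scales.
RIPE_EXAMPLE = (0.9, 0.8, 0.3, 0.9, 0.2)    # the ripe training apple
UNRIPE_EXAMPLE = (0.2, 0.5, 0.9, 0.3, 0.8)  # the unripe training apple

def distance(a, b):
    """Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def classify(apple):
    """Label a new apple by which training example it is closer to."""
    if distance(apple, RIPE_EXAMPLE) < distance(apple, UNRIPE_EXAMPLE):
        return "ripe"
    return "unripe"

basket = [(0.8, 0.7, 0.4, 0.8, 0.3), (0.3, 0.6, 0.8, 0.2, 0.7)]
print([classify(a) for a in basket])  # ['ripe', 'unripe']
```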

A similar principle underlies the operation of a search engine. There are specialists who train the machines' content-sorting algorithms. First, they analyze the pages that appear in the search results for a query themselves, dividing them into relevant and irrelevant. Then the robot is trained to sort the pages the same way.

Page relevance is a fractional number. Each page is assigned a different relevance value. Then all resources are sorted in descending order of this indicator. The most relevant sites appear in the TOP of search results.

In addition to the main sorting algorithm, search engines use various additional ones, which can also affect search results. For example, with their help you can filter unscrupulous sites that use various “gray” schemes for promotion.

Basic principles of search engines: formulas


Each search engine uses its own unique algorithms for searching and ranking pages and sites, but the operating principles of all search engines are the same.

The process of finding information that matches a user's request consists of several stages: collecting data on the Internet, indexing sites, searching by keywords and ranking the results. Let's take a closer look at each stage.

  1. Data collection.

After the site is ready, you need to make sure that search engine robots know about its appearance. You can place external links to your Internet resource or use other methods. As soon as the robot enters the site, it will begin to collect data on each page. This process is called crawling. The collection of information from a site occurs not only after its creation. The robot will periodically scan the Internet resource to check the relevance of the information and update the available data.

For both you and the bot (robot), such interaction should be mutually beneficial and comfortable. You, as the site owner, are interested in the bot doing its job quickly, without overloading the server, while collecting data from all pages as completely as possible. It is also important for the bot to do everything as quickly as possible so that it can move on to collecting data from the next site on its list. For your part, you can check that the site is working, there are no problems with navigation, there are no pages giving a 404 error, etc.

  2. Indexing.

Even if a robot has visited your site more than once, this does not mean that the Internet resource will instantly become visible to the search engine and appear in the results. After data collection, the next stage of site processing is indexing: creating an inverted index file for each page. The index is needed for quick searching and, as a rule, consists of a list of words from the text and information about them (positions in the text, weight, etc.).
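A toy inverted index can be built in a few lines; here word information is reduced to raw positions, and the page texts are invented. Real indexes also store weights, fields, and compression, but the word-to-pages mapping is the core idea.

```python
from collections import defaultdict

def build_inverted_index(pages):
    """Map each word to the pages it occurs on, with positions.
    Real engines store richer per-word information (weight, field)."""
    index = defaultdict(dict)   # word -> {url: [positions]}
    for url, text in pages.items():
        for pos, word in enumerate(text.lower().split()):
            index[word].setdefault(url, []).append(pos)
    return index

pages = {
    "site-a.example": "how to choose a refrigerator",
    "site-b.example": "refrigerator repair and refrigerator parts",
}
index = build_inverted_index(pages)
print(index["refrigerator"])
# {'site-a.example': [4], 'site-b.example': [0, 3]}
```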

After indexing is completed, the site and individual pages appear in the search engine results for user search queries. Typically the indexing process does not take much time.

  3. Search for information.

At this stage, information is directly searched based on user search queries. First, the search engine analyzes the request and determines the weight of each keyword. Then it searches for matches using inverted indexes, selecting all documents in the search engine database that meet the search query.

The correspondence of a document to a request is determined using a special formula:

similarity(Q, D) = SUM_k (w_qk * w_dk),

where similarity(Q, D) is the similarity of query Q to document D; w_qk is the weight of the k-th word in the query; w_dk is the weight of the k-th word in the document.

Documents that are most similar to the user's request are reflected in the search results.
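The formula translates almost directly into code. In this hedged sketch the query and document weights are made-up numbers; what matters is that the score sums w_qk * w_dk over shared words and that documents are then sorted by it.

```python
def similarity(query_weights, doc_weights):
    """similarity(Q, D) = SUM over k of w_qk * w_dk,
    summed over the words the query and document share."""
    return sum(w_q * doc_weights.get(word, 0.0)
               for word, w_q in query_weights.items())

query = {"choose": 0.4, "refrigerator": 0.6}
docs = {
    "site-a.example": {"choose": 0.5, "refrigerator": 0.7, "guide": 0.2},
    "site-b.example": {"refrigerator": 0.9, "repair": 0.8},
}
ranked = sorted(docs, key=lambda d: similarity(query, docs[d]), reverse=True)
print(ranked)  # most similar document first: ['site-a.example', 'site-b.example']
```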

  4. Ranking.

At the last stage, the search engine groups the results so that the user first sees links to the most relevant pages. Each search engine has its own unique ranking formula, which takes into account the influence of the following parameters:

  • page weight (citation index, PageRank);
  • domain authority;
  • relevance of the text to the request;
  • relevance of external link texts to the query;
  • as well as many other ranking factors.

For example, consider a simplified ranking formula:

Ra(x) = (m * Ta(x) + p * La(x)) * F(PRa),

where Ra(x) is the final relevance of document a to query x; Ta(x) is the relevance of the text (code) of document a to query x; La(x) is the relevance of the text of links from other documents to document a for query x; PRa is the authority indicator of page a, a constant with respect to x; F(PRa) is a monotonically non-decreasing function with F(0) = 1 (we can assume F(PRa) = 1 + q * PRa); m, p, q are coefficients.
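Plugging invented numbers into this formula shows how the pieces combine; the coefficients m, p, q below are arbitrary choices for illustration, since each engine tunes its own.

```python
def rank_score(t, l, pr, m=1.0, p=0.5, q=0.2):
    """Ra(x) = (m*Ta(x) + p*La(x)) * F(PRa), with F(PRa) = 1 + q*PRa.
    The coefficients are arbitrary here; each engine tunes its own."""
    return (m * t + p * l) * (1.0 + q * pr)

# Text relevance, link-text relevance, and page authority for document a:
print(rank_score(t=0.62, l=0.40, pr=3.0))  # (0.62 + 0.20) * 1.6 = 1.312
```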

Thus, the place of a page in search results is influenced by various factors that are both related to the search query and have nothing to do with it.

How information retrieval systems work: ranking criteria

If you want your Internet resource to be among the top three or at least ten in search results, you need to know the principles of search engines and ranking criteria in order to constantly optimize the site to meet their requirements. There are two main groups of such criteria:

  • Search engine text criteria.

The search engine in this case ranks pages based on the quality of their text content. Optimizing this component of the site involves working with the semantic core at the stage of creating and filling the Internet resource.

The search engine, processing the user's request, will show the most relevant results on the first page of the search results. In the process of searching for documents, the search engine analyzes the correctness of filling out the title phrase (title), page description (description) and the presence of a key query in the headings (H1, H2, etc.).

  • Non-text search engine criteria.

The search engine analyzes these criteria after the site has been published and indexed. The main principle of ranking by this group of criteria is to evaluate not the quality of a site's content but its external link profile.

The search engine analyzes the number of links to the site from other Internet resources, evaluates their authority, and looks at registrations in directories. If we draw an analogy, then a search engine, like a bank that decides to issue a loan to a company, collects reviews about it from counterparties, suppliers, and other creditors.

Knowing how search engines work will help you create and optimize websites that will easily rank first in search results and stay there for a long time, as they correspond to user search queries.

How the Yandex search engine works

The operation of such large and well-known search engines as Google and Yandex is based on a system of clusters. They group all information by certain areas, tying it to one or another cluster. Special robot scanners are used to index websites and individual pages and collect data from them. They come in two types: the main robot scanner, designed to collect data from regularly updated Internet resources, and a fast robot scanner, needed to update the list of indexed sites and their indexes in the shortest possible time. So that the Yandex search engine can collect information on the Internet as completely as possible, both the search database and the program code are regularly updated:

  1. The search information database is updated several times a month, and users receive updated data from Internet resources when entering queries in the search bar. This data is added by the main robot scanner.
  2. Updating the program code or, as programmers call it, the “engine” is intended to find and eliminate deficiencies in the operation of the algorithms involved in ranking pages in search results. Yandex usually warns users about upcoming changes.

The main advantage of the Yandex search engine, which explains its popularity on the Runet, is its ability to find different word forms, taking into account the morphological features of the Russian language. Geotargeting and the search formula make the results as precise as possible. Yandex also has its own unique algorithm for ranking pages and sites. The system's indisputable advantages are the speed of processing user search requests and the stable operation of its servers.

Note that when indexing resources, the search engine also pays attention to dynamic links, whose presence may cause the bot to refuse to index the site.

The operating principle of Yandex is based on the analysis of text content in documents with various extensions (.pdf, .rtf, .doc, .xls, .ppt, etc.).

In the process of indexing an Internet resource, the search engine takes data from the robots.txt file; the Allow directive and some meta tags are supported, but the Revisit-After and Keywords meta tags are not taken into account.

Snippets (short descriptions of text documents) consist of phrases from the page found, so it is not strictly necessary to write a Description tag, although one can be added if needed.
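As an illustration of building a snippet from the page's own phrases, here is a crude sketch that cuts a window of text around the first occurrence of a query word. Real snippet builders pick whole sentences and highlight all query terms; the function and its radius parameter are invented for this example.

```python
def make_snippet(text, keyword, radius=30):
    """Build a short description from the page text itself, centered
    on the first occurrence of the query word. Much cruder than a
    real engine's snippet builder."""
    low = text.lower()
    hit = low.find(keyword.lower())
    if hit == -1:
        return text[:2 * radius] + "..."   # no match: fall back to the opening
    start = max(0, hit - radius)
    end = min(len(text), hit + len(keyword) + radius)
    prefix = "..." if start > 0 else ""
    suffix = "..." if end < len(text) else ""
    return prefix + text[start:end] + suffix

page_text = ("Banner fabric is a durable PVC material. Banner fabric "
             "is sold by the meter and used for outdoor advertising.")
print(make_snippet(page_text, "sold"))
```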

According to many developers, the encoding of indexed documents is detected automatically, so the charset meta tag does not play a big role.

Yandex pays great attention to the last-modification indicator (Last-Modified). If the server stops transmitting this data to the search engine, the site will be indexed much less frequently.

If an Internet resource has its own "mirrors" (for example, http://www.site.ru, http://site.ru, https://www.site.ru), you need to make sure that the search engine does not index them all as separate sites. If this cannot be avoided, such mirrors can be merged by making the appropriate changes to the robots.txt file.

After an Internet resource gets into Yandex.Catalog, the search engine classifies it as a site deserving special attention, which helps its promotion. This also simplifies determining the topic of the site, and, as an undoubted plus, the resource gains a significant external link.

Yandex developers do not disclose the IP addresses of their robots. However, in the log files on different sites you can find text marks that belong to the robots of this search engine.

The most important of all search robots is the main one. The significance of the site for Yandex depends on the results of its work on indexing the page.

Each robot has its own schedule for indexing Internet resources. The working time of different robots with each of the sites in the search engine database may not coincide.

In addition to the main robots, the search engine has additional ones that regularly visit the pages of Internet resources to check their availability. For example, these are robots from Yandex.Catalog and the Yandex advertising network.

The Yandex search engine focuses on the following key indicators of external optimization:

  1. TCI (public thematic citation index): shows the number of links pointing to the site. It does not directly affect ranking results; it is used to determine positions within the thematic groups of Yandex.Catalog and in promoting Internet resources.
  2. VIC (weighted citation index): a special algorithm for counting external links to a site that also takes their quality into account. It is of paramount importance when the search engine ranks pages.
  3. Presence of the site in Yandex.Catalog.
  4. The total number of site pages that have been indexed.
  5. Frequency of indexing of the Internet resource's content.
  6. The presence or absence of outgoing links from the site, and whether the site is under search filters.

The citation index underlies both the thematic and the weighted citation indexes.

Citation index (CI): an indicator of the number of citations (links to the original source); it helps determine which recently created documents refer to earlier publications. The CI is used to analyze both articles and authors (for example, in the scientific community).

In Yandex, as in other search engines, the citation index is considered as the number of backlinks without taking into account links from the following types of sites: unmoderated directories, message boards, network conferences, server statistics pages, XSS links, etc., the number of which can constantly increase without the participation of the resource owner.

It should be clarified that in the Aport catalog the CI is calculated as a weighted citation index.

To calculate this index, a link graph is used: if sites are the vertices of the graph, and links to other sites are connections between the graph vertices or edges, then the link graph appears in the form of a diagram shown in the figure:


Where A, B, ..., F are certain sites in the Yandex search engine index, and the arrows indicate the directions of connections between them (one-way or two-way).

The citation index plays a big role in the ranking of documents by a search engine, but the final results depend not only on this indicator.

It is believed that the citation index characterizes the significance of a publication, but it does not reflect the structure of the links themselves; as a result, resources whose referring pages differ greatly in quality can receive equal indexes.

To eliminate this shortcoming, a weighted citation index is used, which characterizes not only the quantity but also the quality of the referring resources. The use of link analysis and static link popularity makes the search engines' work easier, protecting them from various kinds of text spam. In the Google search engine, the PageRank indicator, similar to the weighted citation index, is used.

To calculate the VIC, as well as other factors influencing ranking, the link graph is used. A site owner can roughly estimate the VIC of his Internet resource by checking its PageRank value with any of the available online services. But keep in mind that the Yandex index contains mostly documents in Russian, plus only a few popular foreign ones, so the Yandex VIC value will differ from Google PageRank.
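The idea that a page's weight should depend on the weight of the pages linking to it, rather than just their count, is exactly what PageRank-style algorithms compute on the link graph. Below is a standard power-iteration sketch over a toy graph in the spirit of the A..F diagram above; the damping factor and iteration count are conventional defaults, not anything Yandex or Google discloses.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Power iteration over a link graph: a page's weight depends not
    only on how many pages link to it, but on the weight of those
    pages, which is the idea behind the weighted citation index."""
    n = len(graph)
    rank = {node: 1.0 / n for node in graph}
    for _ in range(iterations):
        new_rank = {node: (1.0 - damping) / n for node in graph}
        for node, outlinks in graph.items():
            if not outlinks:  # dangling page: spread its weight evenly
                for other in graph:
                    new_rank[other] += damping * rank[node] / n
            else:
                share = damping * rank[node] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
        rank = new_rank
    return rank

# A toy link graph in the spirit of the A..F diagram:
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"],
         "D": ["C"], "E": ["C", "A"], "F": []}
for page, score in sorted(pagerank(graph).items(), key=lambda kv: -kv[1]):
    print(page, round(score, 3))
```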



Search engines are one of the main and most important Internet services.

With the help of search engines, billions of Internet users find the information they need.

What is a search engine?

A search engine is a software and hardware complex that uses special algorithms to process a huge amount of information about a wide variety of sites, their content, down to each page.

From the point of view of ordinary visitors, a search engine is a smart site that stores a lot of information and provides answers to any user query.

Internet users use different search engines in different countries. In the English-speaking segment of the Internet, the most popular search engine is Google.

Search engines in RuNet

In Russia, more than half of users prefer the Yandex search engine, and Google accounts for about 35% of queries. Other users use Rambler, Mail.ru, Nigma and other services.

In Ukraine, about 60% of users use Google, Yandex accounts for slightly more than 25% of processed requests.

Therefore, when promoting sites on the Runet, specialists try to promote the site, focusing on the search engines Yandex and Google.

Search engine tasks

In order to answer visitors' questions as accurately as possible, search engines must perform the following tasks:

  1. Quickly and efficiently collect information about various pages of different sites.
  2. Process information about these pages and determine which query or queries they correspond to.
  3. Generate and provide search results in response to user requests.

Components of search engines

Search engines are a complex software complex that consists of the following main blocks:

  1. Data collection.
  2. Indexing.
  3. Calculation.
  4. Ranking.

This division is conditional, since the work of different search engines is somewhat different from each other.

1. Data collection

At this stage, the task is to find new documents, make a plan for visiting and scanning them.

Webmasters need to let search engines know about the appearance of new materials, either by submitting the page address through the search engine's "Add URL" form or by announcing the page on social networks.

Personally, I use the latter method and think it is quite enough.

A comment. I'll digress a little and talk about how effectively posting announcements on social networks speeds up the indexing of new website pages.

I use the text.ru service to control and record the uniqueness of text on the pages of my website.

It qualitatively checks uniqueness, records it and makes it possible to place a uniqueness banner on the pages of your website.

But sometimes there is a long queue for processing on this service. I have had several cases where I did not wait for the uniqueness check, posted an article on the site and circulated it on social networks.

If the uniqueness check was delayed for about an hour or more, then the uniqueness percentage was always 0%. This means that in less than an hour after posting, the page was already indexed and entered into the search engine database.

2. Indexing

Search engines, having collected data about new web pages, place them in their database. In this case, an index is formed, that is, a key for quick access to data about this page, if such a need arises.

3. Calculation

After entering the database, the pages of our sites go through the stage of calculating various parameters and indicators.

No one except the developers of search engine algorithms themselves can say exactly how many of these indicators are and how they are calculated.

4. Ranking

Then, based on the calculated parameters and indicators, the relevance of the page to certain queries is determined and the page is ranked.

This will be important for the quick and high-quality generation of search results pages for these queries.

Search engines compose answers to user queries and present them in the form of a search results page.

It should be noted that algorithms for processing page data, generating indicators and ranking methods are constantly being improved. The priorities by which ranking occurs change.
Search engines strive to answer user requests as accurately as possible, trying to take into account the nature of the request, the interests of a particular user, his place of residence, age, gender, habits, and inclinations.

Many of us use search engines such as Google, Yandex, Yahoo, etc., however, does everyone understand how the search engine mechanism works? Despite the fact that each search engine has its own characteristics in search algorithms and ranking results, the principles of operation of all search engines are common.

If we consider the process of searching for information on the Internet, it can be divided into the following stages: collecting information from the pages of sites on the Internet, indexing sites, searching for a query and ranking the results. Let's consider each stage separately.

Data collection

As soon as you have launched your site and let the robot of some search engine know that a new resource has appeared (using external links to your site or other methods), the robot comes to you and begins to walk through the pages, collecting data about them (text content, pictures, videos, and other files). This process is called crawling, and it happens not only when the site is launched. The robot draws up a schedule for the site: when it should visit next, check old information, and add new pages, if any.

It is important that the communication between your site and the bot is pleasant for both parties. It is in your interest that the bot does not linger on the site, so as not to overload the server, while still correctly collecting all the data from all the necessary pages. It is in the robot's interest to collect data quickly enough to move on to the next site in its schedule. To help it, make sure the site is accessible, that there are no problems with navigation (flash and JavaScript menus are still poorly recognized by robots), that there are no broken pages (returning 404 errors), and that the bot is not forced through pages available only to registered users, and so on. Also remember that web spiders have a limit on crawl depth (nesting level) and on the maximum size of scanned text (usually 256 KB).

You can control the search robot's access to different resources using the robots.txt file. A sitemap.xml can also help the robot if for some reason it has difficulty navigating the site.
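Python's standard library happens to include a robots.txt parser, which shows how a crawler can check permissions before fetching a page. The domain and paths below are placeholders.

```python
from urllib import robotparser

# Read the site's robots.txt and ask whether given URLs may be crawled.
# example.com and the paths are placeholders for this sketch.
rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

for url in ("https://example.com/articles/seo.html",
            "https://example.com/admin/settings"):
    if rp.can_fetch("*", url):   # "*" = any user agent
        print("allowed:", url)
    else:
        print("blocked:", url)
```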

Useful link: About the search robot
http://ru.wikipedia.org/wiki/Search_robot

Indexing

A robot can walk around your site for a long time, but this does not mean that the site will immediately appear in search results. The pages first need to go through the next stage, indexing: compiling an inverted index file for each page. The index is used for fast searching and usually consists of a list of words from the text and information about them (positions in the text, weight, etc.).

After the site or individual pages have been indexed, they appear in the main search engine results and can be found using keywords present in the text. The indexing process usually happens quite quickly after the robot pulls information from your site.

You can also read: How search engines work
http://download.yandex.ru/company/iworld-3.pdf
Inverted file
http://wiki.liveinternet.ru/IR/InvertirovannyjjFajjl

Search for information

When searching, the query entered by the user is analyzed first (the query is preprocessed), and as a result a weight is calculated for each of the words (so-called query re-weighting in Yandex).

Next, a search is carried out over the inverted indexes: all documents in the collection (the search engine's database) that best match the request are found. In other words, the similarity of a document to a query is calculated using approximately the following formula:

similarity(Q, D) = SUM_k (w_qk * w_dk),

where similarity(Q, D) is the similarity of query Q to document D;
w_qk is the weight of the k-th word in the query;
w_dk is the weight of the k-th word in the document.

Documents that are most similar to the query are included in the search results.

Useful material: AUTOMATIC TEXT ANALYSIS
http://www.dcs.gla.ac.uk/Keith/Chapter.2/Ch.2.html
Explanation of the Yandex algorithm from Minych
http://www.minich.ru/business/seo/

Ranking

Once the most similar documents have been selected from the core collection, they should be ranked so that the top results reflect the resources most useful to the user. For this, a special ranking formula is used, which has a different form for different search engines, but for all of them the main ranking factors are:

  • page weight (citation index, PageRank);
  • domain authority;
  • relevance of the text to the request;
  • relevance of external link texts to the query;
  • as well as many other ranking factors.

There is a simplified ranking formula that can be found in some optimizer articles:

Ra(x) = (m * Ta(x) + p * La(x)) * F(PRa),

where:
Ra(x) is the final relevance of document a to query x;
Ta(x) is the relevance of the text (code) of document a to query x;
La(x) is the relevance of the text of links from other documents to document a for query x;
PRa is the authority indicator of page a, a constant with respect to x;
F(PRa) is a monotonically non-decreasing function with F(0) = 1; we can assume F(PRa) = 1 + q * PRa;
m, p, q are some coefficients.

That is, when ranking documents both internal and external factors are used. They can also be divided into query-dependent factors (relevance of the document's text or links) and query-independent factors. Of course, this formula gives only a very general idea of the algorithms for ranking documents in search results.

For a more detailed understanding of the principles of operation of search engines, I advise you to read the materials on the links provided on this page.

Useful links about ranking: ROMIP 2004
http://company.yandex.ru/articles/romip2004.xml
Yandex text ranking algorithm at ROMIP-2006
http://download.yandex.ru/company/03_yandex.pdf
Key Factors Affecting Relevance