Want to Know How Google “Reads” the Internet?
The two primary functions of Google, Bing, or any search engine are to:
1) Build an index of the interlinked websites on the web, using automated robots (bots) known as crawlers; and
2) Provide searchers (end users) with a relevant, ranked list of pages that the search engine’s algorithm determines to be most pertinent to a given keyword or search query.
Search Engine Crawling and Indexing
The index of any search engine is comparable to the index of a physical book, albeit with billions upon billions of keywords and phrases listed next to the locations where they appear.
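To make the analogy concrete, here is a minimal sketch in Python of what such an index looks like as a data structure: each keyword points to the pages where it appears, just as a book’s index points each term to its page numbers. The pages and URLs below are invented examples, not real data.

```python
# A book's index maps a term to page numbers; a search index maps a term to
# the pages (URLs) that contain it. The pages below are invented examples.
pages = {
    "example.com/espresso": "espresso is coffee brewed under pressure",
    "example.com/coldbrew": "cold brew is coffee steeped in cold water",
}

index = {}
for url, text in pages.items():
    for word in text.split():
        index.setdefault(word, set()).add(url)

print(sorted(index["coffee"]))
# ['example.com/coldbrew', 'example.com/espresso']
```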
A page or website is indexed after it has been crawled by a search engine bot. In the years following the advent of search engines, bots relied on metadata such as keywords, meta descriptions, title tags, and other structured data elements to determine the content of a web page.
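As a rough sketch of the kind of metadata those early bots leaned on, the snippet below pulls the title tag and meta description out of a page using Python’s built-in HTMLParser. The HTML is an invented example, and real crawlers are of course far more involved.

```python
from html.parser import HTMLParser

# Sketch: extract the two pieces of metadata early crawlers leaned on most,
# the <title> tag and the meta description. The HTML is an invented example.
class MetadataParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""
        self.description = ""

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self.in_title = True
        elif tag == "meta" and attrs.get("name") == "description":
            self.description = attrs.get("content", "")

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data

html = """<html><head><title>Home Espresso Guide</title>
<meta name="description" content="Step-by-step espresso brewing at home.">
</head><body>...</body></html>"""

parser = MetadataParser()
parser.feed(html)
print(parser.title, "|", parser.description)
# Home Espresso Guide | Step-by-step espresso brewing at home.
```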
As search technology has advanced, search engines (currently led, unquestionably, by Google) have come to rely on natural language processing to index and analyze all of the body copy on a given web page, rather than just the associated metadata.
Improvements in natural language processing allow search engines to better contextualize the content of websites and provide the most relevant result in response to an end user’s query. Search engines continue to increase the relevancy of pages on Search Engine Results Pages (SERPs) by deploying NLP tools with semantic analysis capabilities. Semantic analysis is particularly beneficial in languages like English, where many words share a spelling but carry several different meanings.
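As a loose illustration of why that matters, the toy example below picks a sense of the word “bass” by checking which sense’s signature words overlap most with the surrounding text. Modern NLP models are vastly more sophisticated than this; the word lists and sample queries are invented purely to show the idea.

```python
# Toy word-sense disambiguation: pick the sense of "bass" whose signature
# words overlap most with the surrounding text (a crude, Lesk-style idea).
SENSES = {
    "bass (fish)":       {"fishing", "lake", "river", "catch", "lure"},
    "bass (instrument)": {"guitar", "band", "amp", "strings", "music"},
}

def disambiguate(text):
    words = set(text.lower().split())
    return max(SENSES, key=lambda sense: len(SENSES[sense] & words))

print(disambiguate("best lure for catching bass on the lake"))  # bass (fish)
print(disambiguate("how to tune a bass guitar for a band"))     # bass (instrument)
```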
Once a search engine bot locates a new website or web page, its crawlers process the source code and store selected pieces in databases so that they can be recalled rapidly when a user enters a relevant search query. Failure to load relevant content in under a second can lead to user dissatisfaction and faltering user growth. Even a one- or two-second delay may result in thousands of users abandoning your website for a site with faster servers.
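In spirit, the storage step might look something like the sketch below, where selected fields from a parsed page are written to a store that can be queried later. Production systems use massive distributed databases rather than a single SQLite file, and the page data here is invented.

```python
import sqlite3

# Sketch: store selected fields from a crawled page so they can be recalled
# at query time. Real engines use huge distributed stores, not SQLite, and
# the page data below is an invented example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pages (url TEXT PRIMARY KEY, title TEXT, body TEXT)")
conn.execute(
    "INSERT INTO pages VALUES (?, ?, ?)",
    ("example.com/espresso", "Home Espresso Guide",
     "espresso is coffee brewed under pressure"),
)

# At query time, matching pages are pulled back out of the store.
row = conn.execute(
    "SELECT url, title FROM pages WHERE body LIKE ?", ("%espresso%",)
).fetchone()
print(row)  # ('example.com/espresso', 'Home Espresso Guide')
```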
A search engine index is similar in concept, at least, to a physical book’s index. In this case, however, it is the search engine rather than the user that “scans” the index and provides a curated summary of relevant websites that meet a specific and complex set of requirements, including the ratio of target keywords to total word count. These so-called “ranking factors” are used by a search engine’s algorithm to determine where a given website ranks for a particular keyword in the SERP hierarchy.
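One of the simpler requirements mentioned above, the ratio of target keywords to total words, is easy to picture in code. The page text and keyword in this sketch are invented examples.

```python
# Sketch: one simple ranking signal, the ratio of target-keyword occurrences
# to total word count on a page. The text and keyword are invented examples.
def keyword_ratio(text, keyword):
    words = text.lower().split()
    return words.count(keyword.lower()) / len(words) if words else 0.0

body = "espresso is coffee brewed under pressure unlike slow drip coffee"
print(round(keyword_ratio(body, "coffee"), 2))  # 0.2  (2 of 10 words)
```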
How Do Search Engine Bots Crawl Pages?
The backbone of the indexed Internet is the vast link structure that binds and connects websites and pages to one another. Much as a city’s metro or underground spreads out beneath the city, connecting many different neighborhoods, this system of links provides the “routes” bots use to index content.
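In code, that “metro map” of links is simply a graph, and a bot can walk it with a plain breadth-first traversal, as in the sketch below. The link graph is invented, and real crawlers also juggle robots.txt rules, politeness delays, deduplication, and far more.

```python
from collections import deque

# Sketch: a crawler walking the link graph breadth-first from a seed page.
# The "which page links to which" graph below is an invented example.
LINKS = {
    "news.example":          ["news.example/politics", "news.example/sports"],
    "news.example/politics": ["news.example", "gov.example"],
    "news.example/sports":   ["scores.example"],
}

def crawl(seed):
    seen, frontier = {seed}, deque([seed])
    while frontier:
        url = frontier.popleft()
        yield url                          # here a real bot would fetch and index
        for link in LINKS.get(url, []):
            if link not in seen:
                seen.add(link)
                frontier.append(link)

print(list(crawl("news.example")))
# ['news.example', 'news.example/politics', 'news.example/sports',
#  'gov.example', 'scores.example']
```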
Determining Relevancy and the Popularity Paradox
Remember the good old days of AskJeeves.com? The fundamental principle behind a search engine has not changed. Google… Yahoo!… Bing… Baidu… All of these search engines are designed to provide users with answers to their questions, provided the search engine has indexed relevant content.
When a user queries a particular search engine, the search engine (for simplicity’s sake, we’ll say Google) scans its index in an attempt to match the content of the search query with a document or website in that index. Any articles or content that appear on the SERP are first and foremost relevant to the search query, and are then ranked according to the popularity of the websites or pages serving the information. Search engine algorithms determine rank through a complex combination of topical relevance and audience popularity.
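In rough outline, that two-step process (match on relevance first, then order by popularity) might look like the sketch below. The index, popularity scores, and scoring rule are invented stand-ins for what is, in reality, an enormously complex system.

```python
# Sketch of the two-step process: keep only pages relevant to the query,
# then order them by popularity. All data and scoring here are invented.
INDEX = {
    "coffee":  {"a.example", "b.example", "c.example"},
    "grinder": {"b.example", "c.example"},
}
POPULARITY = {"a.example": 0.9, "b.example": 0.4, "c.example": 0.7}

def serve(query):
    terms = query.lower().split()
    relevant = set.intersection(*(INDEX.get(t, set()) for t in terms))
    return sorted(relevant, key=POPULARITY.get, reverse=True)

print(serve("coffee grinder"))  # ['c.example', 'b.example']
```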
How does a search engine even determine relevancy or popularity for a given web page or website?
First, content relevancy means much more than merely returning a page on the SERP with a high percentage of keywords matching the search query. In the early days, search engines did little more than keyword matching; the technology just wasn’t advanced enough.
Over the years, however, computer engineers and technologists have devised increasingly better ways to match results with searchers’ queries. Today, there are hundreds, if not thousands, of factors that influence relevance.
Second, the majority of search engines operate under the assumption that the more popular a site, page, app, or document is, the more valuable and sought after its content must be.
Search engines use algorithms (complex mathematical equations) first to filter out irrelevant content and then to rank what remains according to its perceived quality (measured by popularity) for a given query.
It sounds simple, but the equations that determine “popularity” and “relevance” are made up of hundreds of ranking factors, including user experience, semantic intent, and bounce rate.
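Purely as an illustration of the idea, and emphatically not Google’s actual formula, a composite score can be pictured as a weighted blend of a few such signals; the factor names and weights below are invented.

```python
# Illustration only: a composite score as a weighted blend of a few signals.
# The factor names and weights are invented; real algorithms weigh hundreds.
WEIGHTS = {"relevance": 0.5, "popularity": 0.3, "user_experience": 0.2}

def score(signals):
    return sum(WEIGHTS[name] * signals.get(name, 0.0) for name in WEIGHTS)

page = {"relevance": 0.8, "popularity": 0.6, "user_experience": 0.9}
print(round(score(page), 2))  # 0.76
```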