The SEO industry has been thriving for a number of years, thanks to the fact that search ranking algorithms are constantly evolving and Google’s results ranking system is probably one of the internet’s most closely guarded secrets.
Any insight into what Google considers to be an innovative approach to search ranking, and the direction that Google is taking with regard to how it indexes pages, is a gem in the rubble of obscure guesswork that most of us are forced to undertake. That’s why a recent publication titled ‘Indexing the World Wide Web: The Journey So Far’ makes for some of the most interesting reading in the SEO world to date.
In the article, Google engineers explain some of the techniques that can be used to improve search result relevancy and the trade-offs that must be accounted for in terms of the machine resources required to implement them. Right up front, in the introduction, Google presents a fantastic picture of all of the major innovations that it considers to have changed the way that search engines have functioned since their inception in 1994.
Google gives a nod in the direction of Cuil and even to Bing, one of its major competitors. The latest innovation that the paper identifies, however, is “Realtime and Social Search,” which it credits partly to Facebook and Twitter, while also naming Bing and Google as major players in this new arena.
While most of these innovations are public knowledge, it is useful to be able to pin down the developments that Google considers to have shifted the way that search engines function. The search monolith is more than likely attempting to incorporate as many of these innovations as possible into its own approach.
The article quickly jumps into a more technical analysis of how machine resources are used to handle indexing and the resolving of user queries. While much of this information seems overly complex, it is clear that the authors consider something that they call an ‘inverted index’ to be the most efficient method of storing an index structure.
In effect, an inverted index is a dictionary: each word or term maps to a list of all of the documents that contain it, along with the number of times the term is used within each document. Searched words and terms are then matched against this dictionary.
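To make the idea concrete, here is a minimal sketch of an inverted index in Python. It is only an illustration of the general data structure, not Google's implementation: the function names `build_inverted_index` and `lookup`, the sample documents, and the naive whitespace tokenization are all assumptions made for the example.

```python
from collections import Counter, defaultdict


def build_inverted_index(documents):
    """Map each term to a dict of {doc_id: number of occurrences}."""
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        # Assumption: lowercase and split on whitespace. A real engine
        # would use far more careful tokenization.
        counts = Counter(text.lower().split())
        for term, freq in counts.items():
            index[term][doc_id] = freq
    return index


def lookup(index, term):
    """Return the documents (and counts) recorded for a single term."""
    return index.get(term.lower(), {})


# Hypothetical sample documents.
docs = {
    "doc1": "search engines index the web",
    "doc2": "the web is large and the web is multilingual",
}
index = build_inverted_index(docs)
print(lookup(index, "web"))
```

Resolving a query term then becomes a single dictionary lookup rather than a scan of every document, which is the efficiency gain the structure is valued for.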
The authors go on to mention some of the shortcomings of this approach, such as the fact that the Internet is multilingual and that words have multiple forms or variations. The document then goes on to describe some of the techniques put in place to handle these problems.
Another point of interest is how Google’s research handles user intent. In order to return more relevant results, the authors stress the importance of term proximity, meaning that a search engine will favor pages where the query terms appear closer together within the document.
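One simple way to picture a proximity measure is to find the smallest stretch of a document that contains every query term. The sketch below does exactly that; the function name `min_window` and the sample sentences are hypothetical, and the paper may well score proximity differently.

```python
def min_window(tokens, query_terms):
    """Length of the shortest token span containing every query term.

    Returns None when at least one query term is missing from the document.
    """
    needed = set(query_terms)
    last_seen = {}  # most recent position of each query term
    best = None
    for pos, token in enumerate(tokens):
        if token in needed:
            last_seen[token] = pos
            if len(last_seen) == len(needed):
                # The shortest window ending here starts at the oldest
                # of the most recent positions.
                span = pos - min(last_seen.values()) + 1
                if best is None or span < best:
                    best = span
    return best


query = ["cheap", "paris"]
page_a = "cheap flights to paris".split()
page_b = "paris is lovely but hotels are not cheap".split()

print(min_window(page_a, query))  # smaller window: terms sit close together
print(min_window(page_b, query))  # larger window: terms are far apart
```

A ranking function could then treat a smaller window as a sign that the page matches the searcher's intent more closely. Note that computing this requires knowing where each term occurs in a document, not just how often.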
In order to avoid the huge processing and storage costs involved in indexing pages in this way, much research has gone into actually indexing whole phrases and their relationships. The paper states that Google has been experimenting with this in its TeraGoogle project, and lists some of the advantages and disadvantages of the approach.
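As a rough illustration of the idea of indexing phrases rather than single words, the sketch below indexes every two-word phrase directly. This is a deliberately naive assumption on my part: the names `build_phrase_index` and `n`, and the choice of fixed-length word pairs, are made up for the example and are not taken from the paper or from TeraGoogle.

```python
from collections import defaultdict


def build_phrase_index(documents, n=2):
    """Map every n-word phrase to the set of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        tokens = text.lower().split()
        for i in range(len(tokens) - n + 1):
            phrase = " ".join(tokens[i:i + n])
            index[phrase].add(doc_id)
    return index


# Hypothetical sample documents.
docs = {
    "doc1": "new york pizza",
    "doc2": "a new pizza place in york",
}
phrase_index = build_phrase_index(docs)
print(sorted(phrase_index.get("new york", set())))
```

Both documents contain the words "new" and "york", but only the first contains them as a phrase, and the phrase index answers that with one lookup. The trade-off is easy to see even here: the number of distinct phrases grows much faster than the number of distinct words, which hints at why such an index is harder to build and manage.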
It is clear that Google loves this indexing technique, with the only listed disadvantages being that it is difficult to implement and manage. Google is not known for shying away from anything difficult, so we can be pretty certain that phrase-based indexing will be the way forward.
The most interesting part of the publication is toward the end, where the authors start exploring how social media can be mined to help improve the relevance of search rankings. By building graphs of user followings and user influence, they suggest adding UserRank and UserTopicRank as additional features to start understanding how important links and information presented in social media actually are.
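The article does not spell out how UserRank would be calculated, so the following is purely a guess at the flavor of the idea: a PageRank-style calculation run over a graph of who follows whom. The function name `user_rank`, the damping factor of 0.85, the iteration count, and the sample users are all assumptions made for illustration.

```python
def user_rank(follows, damping=0.85, iterations=50):
    """PageRank-style influence score over a follower graph.

    `follows` maps each user to the list of users they follow, so
    following someone passes a share of your own score on to them.
    """
    users = set(follows)
    for followed in follows.values():
        users.update(followed)
    n = len(users)
    rank = {user: 1.0 / n for user in users}

    for _ in range(iterations):
        new_rank = {user: (1.0 - damping) / n for user in users}
        for user in users:
            targets = follows.get(user, [])
            if targets:
                share = damping * rank[user] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A user who follows nobody spreads their score evenly.
                share = damping * rank[user] / n
                for target in users:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical follower graph: alice and bob follow carol, carol follows alice.
follows = {"alice": ["carol"], "bob": ["carol"], "carol": ["alice"]}
ranks = user_rank(follows)
for user in sorted(ranks, key=ranks.get, reverse=True):
    print(user, round(ranks[user], 3))
```

In this toy graph carol comes out on top because two users follow her, and alice outranks bob because she is followed by the most influential user. A link shared by a high-scoring user could then be weighted more heavily than one shared by a low-scoring user.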
Furthermore, by using real-time data, the search engine can perform something called topic clustering, so that search result relevance can be improved through awareness of the topics that seem to have a lot of social media ‘buzz’.
Finally, by using natural language processing, the paper suggests that it may be possible to work out user sentiment from social network postings. This means that your result relevance may be skewed by a majority sentiment, or by the sentiment of people within your own social networks.
It is rare to stumble across a document that presents such a complete overview of the search industry and the general direction it is taking, especially when that document comes from one of the biggest players in the market.
While I have tried to cover some of the major points in the document that really stood out to me, the document is packed with information. If you’re interested in SEO or simply in the technology that Google employs, I highly recommend that you download the article and take a look yourself.
Pic: Robert Scoble