En Español

PageRank Algorithm.

Text taken from www.thesmokesellers.com under licencia de Creative Commons.

Method for node ranking in a linked database


The patent most famous Google is one of the main competitive advantages that allowed this company to crush its competitors in the field of Internet search and made ​​the giant it is today *. The Page Rank, as we all know, is a great idea to find the value or "importance" has a particular Web page. This "importance" is then used to show higher quality results when performing a Google search. The quality of Google results using this method (combined, of course, with other algorithms) is what made ​​us all leave our old search engines (Altavista, Metacrawler) and spend the seeker of Larry and Sergei. In this post we will explain the algorithm until the end attempting employ mathematical minimum amount possible.

If you ever interested in the topic, you've read that:

1. The "importance" of a site depends only on web pages that link.

If you have a website and this is linked from important pages (high Page Rank, say www.microsiervos.com ) You will receive a portion of that importance. All pages linked from your website (this blog to your colleague with only two posts, for example) will, in turn, part of the importance of your site. To be precise:

2. A website shared equally between all your important pages that link.

In other words: If you link an important page that links to 3 or 4 pages of yours is much better than if you link a page just as important as link 30 or 40 (plays more Page Rank to distribute).

Also have heard of the Spiders (spiders). This is not only fast automatic programs that are covering the Internet as if they were a human user, pressing all possible links, thus extending the "red" (hence the name) and creating a map of it. So we have:

3. Google The Spiders provide a map of the network where you can see what page points to page.

This does not mean that we know as the Page Rank. In fact, this is all very nice but ... as calculate the Page Rank?. Why start page?. Assuming that we started one, if we have the Page Rank of the pages linking to it, as we can calculate something?. And what's worse: On the Internet there venticincomil million pages by joining together (number rising rapidly), how to create an algorithm that is capable of dealing with such brutality links. In the worst case all pages are pointing to each other and the total number of links is venticincomil million squared!.

Here is where you really get the artillery mathematics. We promise that if you know what a matrix as add and multiply like (and have a little faith) and you can understand the algorithm of Larry and Sergei to the end.

Matrix H PageRank distribution

Okay, do not know what the page rank of any site before, but if there's one thing we know: The PageRank of your unknown distributed a page that links between pages. As stated in (2), if a page uses 5 pages transmit a fifth of its Page Rank to each. Due to (3) the number of pages that links each page know. Moreover, we can construct a table H of twenty-five billion rows by columns twenty-five billion (no, not fit on an A4), which contains all possible links. For any two pages (one as a linker and the other as linked) have a box in the table indicates that the proportion PageRank transferred to the bound linker. To guide a bit: The diagonal of this table represent what the page is transmitted to itself (if enlazase). Any box below the diagonal and above the symmetrical diagonal indicate respectively the two pages are transmitted when one acts as a linker and the other as linked and vice versa. If a page is not linked to another, put a 0 in the box (not logically nothing can pass Page Rank).

Matrix (Vector) Invariant I

What comes next is no idea of ​​Larry Page and Sergey Brin, a century ago is known, but if it requires little faith book you ordered. This table (read matrix), which was created with the help of information provided by the Spiders, actually represents the difficulty (or ease) for the "flow" of PageRank from one page to another. We can see the water flow as happens with more or less difficulty from one page to another according to the value corresponding to Table H box. This water / transfer PageRank would flow from page to page through your links steadily and could eventually reach an equilibrium (if it would not have reached any PageRank). Well mathematics, namely the theorem Ruelle-Perron-Frobenius assures us that:

4. Under certain conditions, we will see, it will eventually reach that balance. Not that Frobenius knew what a website in 1900, if the problem is mathematically identical to a known problem of system dynamics. Then there are people who say that Larry and Sergei are philosophy graduates.

5. The balance is represented by the vector I invariant. This is a single-column table (an array, specifically vector) of values ​​venticincomil million, which holds that when multiplied by the matrix H cast gives herself again (I). What would express:


This vector invariant I venticincomil million values ​​that chance, one for each web page is the Page Rank. Lack refining, scaling it from 1 to 10, and not of discretizarlo for intermediate values. I sense that the discretized value (1-10 without decimal), shown in the google toolbar, it is only with the public and internally employ decimals also leaving.

Yes, but very cool and the 25 billion pages?

True, true. People who have suffered algebra I first have recognized as an eigenvector of eigenvalue 1 of the matrix H. And surely remember with horror that must be solved to calculate a polynomial which in this case would grade 25 billion. We do not calculate or blas well. Fortunately, especially for the people above them that sounded like Chinese, there is a method to calculate I iteratively (in successive steps) and very simple. So simple that is that we make up a table of 25 billion PageRank values ​​broadcasting (I0 vector randomly generated), we multiply by H and the result will be another table of values ​​I1 25 billion but closer to the correct value I invariant vector. Repeat this a bunch of times until the result of multiplying by H and no change and no longer is. Now we have the vector invariant. This algorithm, called the method of powers, be expressed mathematically as:


Where k is no more than the rate that indicates how many times we have multiplied by the matrix H. The first vector, we would create a would boleo k = 0, the second from H would be multiplied by k = 1, etc. To express generally that each term is obtained by a previous transformation rates are employed k +1 and k. Keep in mind that iterative methods have the advantage that they need not accumulate too many values, which reduces the amount of memory that need to compute PageRank and accelerates the whole process of calculation. They are still a bit silly numbers but at least feasible.

Big Problem

That easy, no?. Obviously misses something and that something is the point (4). It is not the conditions of convergence of Theorem Ruelle-Perron-Frobenius. This means that using the method explained above there is no guarantee that we get the vector invariant. Not go into details, it does not. Using the analogy of the "flow" of PageRank can be understood well that is what is wrong and how it can be fixed.

Sump page: 

sumideroWhat happens when the flow of Page Rank reaches a page like the two that has no links to any site?. Well just not out there. That page becomes a sink PageRank algorithm and give incorrect results. As we solve it?. If we make the page 2 link all the web pages alike (imagine millions of tiny arrows leaving 2 to all pages), it outputs the flow of PageRank but the influence on the results is minimal, since each page receives only 1/ Pagerank of 2. Mathematically, this is equivalent to adding to H a matrix A that has all 0s fewer columns in the pages that will sink the entire column filled 1/ Thus instead of the array would use the matrix H = H S + A in the method of the powers.



A similar case is that of sub-networks within the network pages, such as 5-7-6-8, which have no back links. These networks become-sink networks. The problem is that these pages if other pages link and we can not just charge us that information and link all web pages from them. To output flow of Page Rank, we will resort to a solution-style "engineer".

Great solution

We need to ensure the outflow of Page Rank of any page or sub-network, ie, that every page pointing to another page. We better to create a link to any page to bowling because (besides being falsifying the Page Rank), if it results in a closed network like we have not solved anything 5-7-6-8. Now, imagine an ideal case where all apuntasen pages to all pages. There would always have the Page Rank any link where to escape, even the sub-networks, and the algorithm would work. But of course, you lose all the hierarchy giving the links, would cast matrix whose elements are all 1/ and all pages have the same Page Rank.

Nothing, high matrix actual distribution, calculated from the information of the Spiders with the ideal in which all pages are pointed to each other and divided by two. The resulting matrix will always saliedo links on each page and have PageRank flow guaranteed. That mixing matrix equally real and I get great results too uncertain? (Under the influence of ideal). Well, instead of half and half mix them with 85% of the real matrix and 15% of the ideal track. And this gentlemen. With you the famous Google matrix:


Where recall that S = H + A is the real matrix with individual sinks problem solved, 1 / n × 1, n = 25 billion is the ideal matrix and α = 0.85 gives the said mixture to 85%. Any astute reader may say: But ... the ahi poke 15% of randomness, not somehow distorts the Page Rank? ... Welcome to the world of engineering!.

Finally if we use G instead of H in the method of the powers and played a bit with the terms we obtain the formula appearing at the top of the post. Using the right formula to get each new vector equal Ik +1 in each iteration.


Final Notes

This item is made ​​from this wonderful page (which also took "borrowed" images) and the indispensable help of Wikipedia. The page also includes links to the original pdf Larry and Sergei as well as a book on the subject. If someone takes the time to check it out, here are some clarifications:

- The optimal value of the parameter α is determined experimentalemente and also regulates the rate of convergence of the power method , a higher percentage of real matrix, lower convergence speed. Google says that with just k = 50-100 iterations for calculating the Page Rank, which takes several days. Imagine working with several computers in parallel. This is known as Google Dance and TSS does us no grace.

 - Mathematically, the condition of convergence of the algorithm used by Google is that all array elements sharing PageRank estictamente be greater than 0, which satisfies the tinkerer we just saw. This is not a condition of convergence of the power method if not a condition for the existence of invariant vector by Theorem of Ruelle-Perron-Frobenius.


Want a FREE button with your PageRank?

Go to our section of buttons, choose the one you like, you stick it on your website and get your PageRank each time you load your site.