Google defined an algorithm that ranks web pages while crawling the web more efficiently than other search engines. Ranking hundreds of millions of pages is a challenging task. Lawrence Page and Sergey Brin defined Google's PageRank algorithm in their Stanford University paper.
Web search engines needed a change
Search engine technology changed drastically between 1994 and 2000. In 1994, search engines indexed about 110,000 web documents; by November 1997, they listed millions of web documents to be indexed during a search. The number of user queries also grew tremendously, from roughly 1,500 per day in 1994 to 20 million per day by 1997, and was estimated to reach hundreds of millions per day by 2000. The WWW needed an efficient algorithm for a search engine to function accurately. Google had invested properly in hardware performance and storage space, but it still needed an algorithm to use those resources to their maximum potential.
The Algorithm
The concept of anchoring, the link structure between pages, is at the root of the algorithm. The linking of pages on the web is used to prioritize pages, so the algorithm is not restricted to categorizing only text-based web documents.
PgRk(A) = (1 - d) + d * ( PgRk(K1)/N1 + … + PgRk(Kn)/Nn )
The formula is not as complicated as it looks. Remember that a page is not ranked manually, nor by the total number of earlier visits by surfers. The formula above looks recursive. Yes, it is! So one can say, just by looking at the formula, that the PageRank of a page is decided by the PageRanks of some other pages, namely the pages that contain a link to page A.
( Notation: | A – a random web page | Ki – the i-th page that links to A | d – damping factor | PgRk – PageRank | Ni – number of outbound links on page Ki | )
The next important thing is that the pages linking to page A differ in how much they contribute to improving A's rank:
1. The PageRank of A relates directly to the PageRanks of the pages Ki that link to it.
2. The PageRank contributed by each Ki is divided by the number of outbound links on that page.
We infer that when a page of higher PageRank, say Facebook, links to your web page, your page's rank improves considerably. When a less recognized website links to your web page, your rank still increases, but not by much. What interferes again is the number of outbound links present on the linking site. When a site with a good PageRank links to, say, 100 sites, its PageRank contribution is split into 100 parts, making its contribution to each site much less effective. So just because a highly ranked web page links to your website does not mean your page will also be ranked highly.
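To make this splitting of contributions concrete, here is a minimal Python sketch of a single PageRank evaluation following the formula above. The page names, ranks, and outbound-link counts are made up purely for illustration:

```python
# One PageRank evaluation per the formula above; all data is hypothetical.
def page_rank_of(inbound, ranks, out_counts, d=0.85):
    """PgRk(A) = (1 - d) + d * sum of PgRk(Ki)/Ni over pages Ki linking to A."""
    return (1 - d) + d * sum(ranks[k] / out_counts[k] for k in inbound)

ranks      = {"big_site": 9.0, "small_blog": 1.2}   # hypothetical PageRanks
out_counts = {"big_site": 100, "small_blog": 1}     # outbound links per page

# The big site's rank is split 100 ways, so its share is tiny;
# the small blog passes its whole (smaller) rank along its one link.
print(page_rank_of(["big_site"], ranks, out_counts))    # 0.2265
print(page_rank_of(["small_blog"], ranks, out_counts))  # 1.17
```

Note how the nominally weaker page ends up contributing more, exactly because its rank is not diluted across many outbound links.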
Justification for the algorithm
Lawrence Page and Sergey Brin gave a proper justification for the algorithm by considering the importance of a page from the view of a random surfer. "Random surfer" means that the clicking of links has no regard for the content of the page, so in this case, just forget about the content. The probability that the surfer visits a page depends on the PageRank of the page, and the probability that the surfer clicks a particular link on a page depends on the number of links present on that page. This is why the PageRank of the parent page is divided by the total number of links on that page when contributing to its child page. Thus a page's rank is the sum of the probabilities of the random surfer arriving at it via the links on the pages that point to it.
Now comes the reason we include a damping factor that further reduces the parent page's contribution. Obviously, in a real-world situation a surfer does not keep on clicking links continuously, one after the other. When a surfer has had enough of surfing, or becomes interested in some other page at random, he will stop clicking links. We introduce the constant d, which is the probability that a surfer continues clicking links (it lies between 0 and 1). This is why the constant d is multiplied with the link term in the formula. We can also conclude that a page not only receives a PageRank contribution from the links of parent pages, but also from the probability that the surfer lands on it at random without using links (remember the case where the surfer stops clicking links continuously?). So every page has some minimum PageRank even if no parent page links to it.
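This random-surfer reading can be checked empirically. Below is a small simulation sketch on a made-up three-page web (the same A, B, C structure used in the model later on): with probability d the surfer follows a link, otherwise he restarts at a random page. The visit frequencies it prints approximate the normalized PageRank values:

```python
import random

# Toy random-surfer simulation on a hypothetical three-page web.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
pages = list(links)
d = 0.85

visits = {p: 0 for p in pages}
page = random.choice(pages)
for _ in range(1_000_000):
    visits[page] += 1
    if random.random() < d:
        page = random.choice(links[page])   # keep clicking links
    else:
        page = random.choice(pages)         # bored: restart anywhere

for p in pages:                             # frequencies ≈ normalized PageRank
    print(p, visits[p] / 1_000_000)
```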
Next version of Page Rank Algorithm
In another version of the PageRank algorithm, presented by Lawrence and Sergey in a different paper, the algorithm is put as:
PgRk(A) = (1 - d)/N + d * ( PgRk(K1)/N1 + … + PgRk(Kn)/Nn )
N – the number of pages in the web
Except for the new notation N, all the other variables have the same effect in the formula. N is included because the probability that the random surfer, once bored of clicking links, jumps to a random page is divided equally among all pages in the web. In this version, the PageRank of a page is the actual probability that the surfer reaches it when he restarts his random walk, so the ranks of all pages sum to 1. Under the first version, by contrast, the ranks sum to the number of pages: in a web comprising 5 pages, a rank of 3 denotes that the page is reached 3 times for every 5 restarts of the random walk.
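A sketch of this second version, again on the made-up three-page web from the simulation above; note that the ranks it prints behave like probabilities and sum to 1:

```python
# Second version of PageRank on a hypothetical three-page web.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
d, N = 0.85, len(links)

pr = {p: 1 / N for p in links}       # start from the uniform distribution
for _ in range(100):                 # plain fixed-point iteration
    pr = {p: (1 - d) / N
             + d * sum(pr[k] / len(links[k]) for k in links if p in links[k])
          for p in links}

print(pr)                            # per-page probabilities
print(sum(pr.values()))              # ≈ 1.0
```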
Model of Page Rank Algorithm
A model will clearly explain the functioning of the algorithm. Consider 3 pages, page A, page B, and page C, where A links to both B and C, B links to C, and C links back to A. The damping factor is usually set to 0.85 according to Lawrence and Sergey, but for ease of calculation let us make it 0.5. This does not affect the fundamental principles of the algorithm, though the values will certainly differ in real-world cases. We use the first version of the algorithm, as it is easier for calculation both in this case and in the real-world situation.
Let x, y, and z denote PgRk(A), PgRk(B), and PgRk(C):
x = 0.5 + 0.5*( z/1 )
y = 0.5 + 0.5*( x/2 )
z = 0.5 + 0.5*( x/2 + y/1 )
These are 3 equations in 3 variables, so the system can be solved:
x = 1.07692308, y = 0.76923077, z = 1.15384615
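As a quick exact check of this solution, the substitution can be mechanized with rational arithmetic. Substituting y = 1/2 + x/4 into z = 1/2 + x/4 + y/2 gives z = 3/4 + 3x/8, and x = 1/2 + z/2 then yields x = 14/13:

```python
from fractions import Fraction

# Exact solution of the three equations above (d = 1/2).
x = Fraction(14, 13)
y = Fraction(1, 2) + x / 4
z = Fraction(3, 4) + 3 * x / 8
assert x == Fraction(1, 2) + z / 2    # x's own equation holds too
print(float(x), float(y), float(z))   # 1.07692308 0.76923077 1.15384615
```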
As this is just a three-variable system, it can be solved easily. But in the real web situation, there are billions of web pages, leading to a billion-variable problem to solve… This leaves our mouths wide open!! Are the Google servers set up to calculate such a problem?? Even if they can, can they keep the web search fast?? And even if one request is served quickly, can the same speed be given to the thousands of requests reaching the server every second??
Approximated computation of Page Rank
The answers to all of the above questions are obviously no. The Google search engine uses an approximate, iterative computation to find PageRank. Each page is assigned an initial PageRank that serves as the starting point for the iteration. Again considering the same example of three pages, where each page is initially assigned a PageRank of 1:
Iteration | x          | y          | z
0         | 1          | 1          | 1
1         | 1          | 0.75       | 1.125
2         | 1.0625     | 0.765625   | 1.1484375
3         | 1.07421875 | 0.76855469 | 1.15283203
4         | 1.07641602 | 0.76910400 | 1.15365601
5         | 1.07682800 | 0.76920700 | 1.15381050
6         | 1.07690525 | 0.76922631 | 1.15383947
After 6 iterations the approximated result agrees with the exact computation above to at least 3 decimal places; an approximation to 8 places is achieved in about 12 iterations. According to Lawrence and Sergey, a good approximation of PageRank for the real web can be obtained after about 100 iterations. Note how, during the iteration, the PageRank of highly linked pages converges upward from 1 while the PageRank of sparsely linked pages converges downward from 1.
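Here is a sketch of this iterative computation in Python, reproducing the table above. The update order within each pass matters: each freshly computed value is used immediately in the same pass, which is how the table's numbers arise:

```python
# Iterative PageRank for the three-page example, d = 0.5.
d = 0.5
x = y = z = 1.0                      # initial ranks for A, B, C
for i in range(7):
    print(i, round(x, 8), round(y, 8), round(z, 8))
    x = (1 - d) + d * z              # only C links to A
    y = (1 - d) + d * (x / 2)        # A links to B; A has 2 outbound links
    z = (1 - d) + d * (x / 2 + y)    # both A and B link to C
```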
Results of the Algorithm
The most important result of this algorithm is the improvement in the quality of search. The best and most common example was "bill clinton" as a search query. Earlier search engines returned pages containing "bill clinton sucks" and "bill clinton joke of the day" among their highly indexed top results. But this should not have been a standard result for such a query; the query wanted pages with general information about Bill Clinton. This was very well rectified by the new search engine, where priority is given according to the PageRank of the page hosting the information: the top results were pages of the whitehouse.gov site. Thus quality of search was achieved.
More mathematical conclusions and facts
- The sum of all PageRanks is still (approximately, at least) equal to the number of web pages indexed, which means the average PageRank is 1. (This and the next fact are checked in the sketch below this list.)
- Next, from the algorithm formula, the minimum PageRank is obtained when
PgRk(K1)/N1 + … + PgRk(Kn)/Nn = 0
which happens when no pages link to page A. But the page still possesses a minimum PageRank of (1 - d).
- Thirdly, the maximum PageRank is obtained only when
PgRk(K1)/N1 + … + PgRk(Kn)/Nn = N
which happens only when each term in the summation equals 1. Hence this is theoretically possible only when all pages in the web link solely to one page and this page also links to itself (practically impossible).
- The difference in value from PgRk 1 to PgRk 10 is not constant; researchers claim the scale of increase is logarithmic. Hence it is possible that a page of PgRk 8 with a lot of outbound links still contributes more than a page of PgRk 4 with fewer outbound links. It is also true that it is harder for a page to advance to the next PageRank level than it was to reach its current one.
- Also note that however many child pages a page contributes to, increasing their PageRanks, the PageRank of the contributor is not reduced. In short, it is 'sharing without losing'.
*surfer – denotes 'random surfer' throughout
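As promised above, a quick check of the first two facts on the three-page example worked out earlier (d = 0.5):

```python
# Verifying two facts on the three-page example (d = 0.5).
x, y, z = 1.07692308, 0.76923077, 1.15384615
print(x + y + z)   # ≈ 3.0: the ranks sum to the number of pages
print(1 - 0.5)     # 0.5: the minimum rank (1 - d) a page with no in-links keeps
```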
Contributors of the Algorithm
- Lawrence Page
- Sergey Brin