The links between web pages can be seen as a directed graph. A page expresses a "vote" for the importance of the pages it links to, so a page that receives more inbound links earns a higher rank, while a page with few or no inbound links, or with links only from low-ranked pages, receives a low PageRank (PR) value; the higher a page's PR value, the more important the page is considered.
PageRank is a static algorithm that is independent of the query, so the PageRank values of all web pages can be computed offline. This reduces the sorting work needed at retrieval time and greatly shortens the query response time. But PageRank has two drawbacks. First, it seriously discriminates against newly added web pages, because a new page usually has very few inbound and outbound links and therefore a very low PageRank value. Second, PageRank ranks pages only by the number and importance of external links and ignores whether the linking pages share the query's topic, so pages unrelated to the topic (such as advertising pages) can obtain large PageRank values, which hurts the precision of the search results.
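The computation described above can be illustrated with a small power-iteration sketch. This is only a minimal illustration of the idea, not Google's production implementation; the toy link graph, the damping factor of 0.85, and the convergence tolerance are assumptions.

```python
# Minimal PageRank power iteration over a toy link graph.
# The graph, damping factor, and tolerance are illustrative assumptions.

def pagerank(links, damping=0.85, tol=1e-8, max_iter=100):
    """links: dict mapping each page to the list of pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}

    for _ in range(max_iter):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in links.items():
            if outlinks:
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # Dangling page: spread its rank uniformly.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        if sum(abs(new_rank[p] - rank[p]) for p in pages) < tol:
            rank = new_rank
            break
        rank = new_rank
    return rank

if __name__ == "__main__":
    toy_graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
    for page, score in sorted(pagerank(toy_graph).items()):
        print(page, round(score, 4))
```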
Because the original PageRank algorithm does not consider topic-related factors, Taher Haveliwala of the Department of Computer Science at Stanford proposed the Topic-Sensitive PageRank algorithm, which addresses the problem of "topic drift". The algorithm recognizes that a page may be considered important in some topic areas but unimportant in others.
A link from page A to page B can be regarded as A casting a vote for B. If page A belongs to the same topic as page B, A's vote for B can be considered more reliable, because A and B can be seen as peers, and peers tend to know more about their peers than outsiders do, so a peer's vote tends to be more trustworthy. Unfortunately, Topic-Sensitive PageRank (TSPR) does not use topic relevance to improve the accuracy of link scoring.
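One way to picture how a topic-sensitive variant counters topic drift is to bias the random-jump step of PageRank toward pages known to belong to a topic, so that rank mass concentrates around that topic. The sketch below is only an illustration under that assumption; the toy graph, topic set, and damping factor are invented, and it is not Haveliwala's exact formulation.

```python
# Topic-biased PageRank sketch: the random jump lands only on pages belonging
# to a chosen topic set, so rank mass concentrates around that topic.
# Graph, topic set, and damping factor are illustrative assumptions.

def topic_sensitive_pagerank(links, topic_pages, damping=0.85, iters=50):
    pages = list(links)
    topic = set(topic_pages) & set(pages)
    bias = {p: (1.0 / len(topic) if p in topic else 0.0) for p in pages}
    rank = dict(bias)  # start from the biased distribution

    for _ in range(iters):
        new_rank = {p: (1.0 - damping) * bias[p] for p in pages}
        for page, outlinks in links.items():
            if outlinks:
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # Dangling page: redistribute its mass along the topic bias.
                for p in pages:
                    new_rank[p] += damping * rank[page] * bias[p]
        rank = new_rank
    return rank

if __name__ == "__main__":
    graph = {"sports1": ["sports2", "news1"], "sports2": ["sports1"],
             "news1": ["news2"], "news2": ["sports1", "news1"]}
    print(topic_sensitive_pagerank(graph, topic_pages=["sports1", "sports2"]))
```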
As search engines calculate web rankings, they rely heavily on links, so the quality of those links becomes increasingly important and the quality of a link's source site must be judged. More importantly, the old way of determining rankings from links and relevance has been met with all kinds of cheating and spam, which forced Google to find new anti-cheating mechanisms to ensure that high-quality sites receive the search engine's attention. Sandbox and TrustRank were introduced in this context, with the intent of ensuring that good sites obtain better search performance and of strengthening site review. Google's own early discussion of TrustRank also mentions this.
Although TrustRank was originally proposed as a method for detecting spam, its concept is now used much more widely in search engine ranking algorithms and often affects the overall ranking of most websites, so TrustRank is genuinely important and worth attention.
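TrustRank's core idea, propagating trust outward from a manually reviewed seed set of good pages along their outlinks, can be sketched roughly as follows. The seed list, decay factor, number of hops, and toy graph are assumptions for illustration, not the published algorithm's exact details.

```python
# TrustRank-style trust propagation sketch: trust starts at manually reviewed
# seed pages and is split among outlinks with a decay factor at each hop.
# Seed list, decay value, hop count, and graph are illustrative assumptions.

def propagate_trust(links, seeds, decay=0.85, hops=3):
    trust = {page: 0.0 for page in links}
    for s in seeds:
        if s in trust:
            trust[s] = 1.0

    for _ in range(hops):
        received = {page: 0.0 for page in links}
        for page, outlinks in links.items():
            if trust[page] > 0 and outlinks:
                share = decay * trust[page] / len(outlinks)
                for target in outlinks:
                    if target in received:
                        received[target] += share
        # A page keeps the highest trust it has seen so far.
        for page in trust:
            trust[page] = max(trust[page], received[page])
    return trust

if __name__ == "__main__":
    web = {"seed.org": ["good.com"], "good.com": ["other.net"],
           "other.net": [], "spam.biz": ["spam.biz"]}
    print(propagate_trust(web, seeds=["seed.org"]))
```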
HillTop is covered by a patent that Google engineer Bharat obtained in 2001. It is a query-dependent link analysis algorithm, which overcomes PageRank's shortcoming of being independent of the query. The HillTop algorithm holds that links from relevant documents on the same topic are of greater value to the searcher. HillTop considers only "expert" pages, pages whose purpose is to point people to resources. When HillTop receives a query, it first computes a list of expert pages relevant to the query topic, and then ranks each target page according to the number of non-affiliated experts pointing to it and its relevance to the query.
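The two-phase process just described can be sketched roughly as follows: select expert pages relevant to the query, then score each target page by the number of non-affiliated experts linking to it. The expert-relevance test, the affiliation key, and the data below are simplified assumptions, not Bharat's full patented method.

```python
# Rough HillTop-style sketch: (1) pick expert pages relevant to the query,
# (2) score each target by how many non-affiliated experts link to it.
# Expert detection, affiliation test, and data are simplified assumptions.

def hilltop_scores(experts, query_terms, affiliation_key):
    """experts: list of dicts with 'url', 'text', and 'outlinks' keys.
    affiliation_key: function mapping a URL to an affiliation id (e.g. domain)."""
    query = set(query_terms)

    # Phase 1: keep experts whose text mentions every query term.
    relevant = [e for e in experts if query <= set(e["text"].lower().split())]

    # Phase 2: count distinct, non-affiliated experts pointing at each target.
    backers = {}
    for expert in relevant:
        for target in expert["outlinks"]:
            if affiliation_key(expert["url"]) != affiliation_key(target):
                backers.setdefault(target, set()).add(affiliation_key(expert["url"]))

    # Require at least two independent experts, as HillTop does.
    return {t: len(keys) for t, keys in backers.items() if len(keys) >= 2}

if __name__ == "__main__":
    def domain(url):
        return url.split("/")[2]

    experts = [
        {"url": "http://dir-a.org/ski", "text": "ski resorts guide",
         "outlinks": ["http://resort.com/home", "http://snow.net/ski"]},
        {"url": "http://dir-b.edu/ski", "text": "best ski resorts list",
         "outlinks": ["http://resort.com/home"]},
    ]
    print(hilltop_scores(experts, ["ski", "resorts"], domain))
```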
Combined with the basic matching of web pages against the search keywords, the HillTop algorithm replaces the earlier over-reliance on PageRank values to find authoritative pages, and it defeats many attempts to raise a page's PageRank by adding large numbers of invalid links. HillTop also ties the evaluation results to the keywords themselves, weighing the relevance of the expert pages to the query, the positions in which the keywords appear, and the number of matching phrases, which helps prevent rankings from being won simply by repeating keywords.
However, the search for and selection of expert pages are critical to the algorithm, the quality of the expert pages plays a decisive role in its accuracy, and the influence of the great majority of non-expert pages is ignored. Expert pages account for a very low proportion of pages on the Internet (about 1.79%) and cannot represent all web pages, which limits HillTop. At the same time, unlike PageRank, the HillTop algorithm runs online, which puts great pressure on the system's response time.
The biggest difficulty of the HillTop algorithm, which is built on "expert" documents, is identifying those expert documents in the first place. From current observation, Google apparently gives high priority to education (.edu), government (.gov), and non-profit organization (.org) sites. At run time, Google keeps the index for the most frequently searched keywords in large amounts of memory, so that searchers who keep querying the same keyword phrases in the short term are served quickly. High-frequency keywords play another role that many people noticed before the "Florida" update: sites containing a surge of searched keywords are re-indexed more frequently. For a term like "SARS", which receives millions of searches per day, Google gives priority to updating the sites related to that topic.
Looking back at each month's "Google Dance", one can also draw the following conclusion: Google evidently assigns each keyword a dynamic "weight", discovers popular keywords from query statistics, then uses the HillTop algorithm to find topical pages containing those keywords, treats those pages as the "expert" documents for the relevant keywords, and keeps the update frequency of these entry points high. This is clearly very effective for handling breaking events. Pages for keywords with lower query frequencies may be updated only once a month. Simply put, Google dynamically adjusts how intensively it indexes the corresponding sites according to how hot a topic is. The proportion of Chinese-language pages in the index is also related, to some extent, to the proportion of Chinese users among Google's total users.
In fact, the guiding idea behind HillTop is consistent with PageRank: determine the ranking weight of search results by the number and quality of links. But HillTop holds that only links coming from relevant documents on the same topic are of greater value to the searcher; that is, links between pages on the same topic contribute more to the weight than links from unrelated pages. For example, if a site about "clothing" has ten links from websites related to "dress", those ten links contribute more than ten links from websites related to "appliances". When Bharat and other Google developers created the algorithm in 1999 and 2000, Bharat called the influential documents on a subject "expert" documents, and the links from these expert documents to a target page determine the main part of the target page's "weighted score".
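As a toy illustration of this weighting idea, the snippet below sums inbound-link contributions scaled by an assumed topic-similarity weight; the similarity values are invented.

```python
# Toy illustration of topic-weighted link contributions, assuming each inbound
# link carries a topic-similarity weight between 0 and 1 (values are invented).

def weighted_link_score(inbound_links):
    """inbound_links: list of (source_topic_similarity, base_value) pairs."""
    return sum(similarity * value for similarity, value in inbound_links)

# Ten links from clothing-related sites (high similarity) versus ten from
# appliance-related sites (low similarity), each with the same base value.
clothing_links = [(0.9, 1.0)] * 10
appliance_links = [(0.2, 1.0)] * 10
print(weighted_link_score(clothing_links))   # about 9.0
print(weighted_link_score(appliance_links))  # about 2.0
```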
Combining the HillTop algorithm with PageRank to determine the basic ordering of web pages by how well they match the search terms replaces the earlier method of relying too heavily on PageRank values to find authoritative pages. The HillTop algorithm is especially important when ordering two pages that share the same topic and have similar PR values, and it also removes the temptation to raise a page's PageRank by adding many invalid links.
Google first used the HillTop algorithm to define what counts as a related website, that is, one website's relationship to another. In practice, Google applies HillTop as a technique for recognizing cross-site link-exchange interference (spam) and similar schemes. The HillTop algorithm requires that if two or more websites related to the query topic link to your website, your site has a greater chance in the search results; if HillTop cannot find at least two such related websites, the score it returns for the target is zero.
The HillTop algorithm in effect rejects sites that try to disrupt Google's ranking rules and obtain good rankings by exchanging links at random. The HillTop paper also describes several designs for recognizing link-exchange alliances, such as inferring affiliation from hosts that share the first three octets of their IPv4 address or from domain aliases.
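Those affiliation heuristics can be sketched roughly as follows; the exact rules in the paper differ in detail, and the two-label host-suffix test and example addresses here are assumptions.

```python
# Rough affiliation check in the spirit of HillTop's link-alliance detection:
# two hosts are treated as affiliated if they share the first three octets of
# their IPv4 address or share a host-name suffix. A simplified assumption.

def ip_prefix(ip: str) -> str:
    """Return the first three octets of a dotted IPv4 address."""
    return ".".join(ip.split(".")[:3])

def host_suffix(host: str, parts: int = 2) -> str:
    """Return the last `parts` labels of a host name, e.g. example.com."""
    return ".".join(host.lower().split(".")[-parts:])

def affiliated(host_a, ip_a, host_b, ip_b) -> bool:
    return ip_prefix(ip_a) == ip_prefix(ip_b) or host_suffix(host_a) == host_suffix(host_b)

if __name__ == "__main__":
    print(affiliated("shop.example.com", "203.0.113.10",
                     "blog.example.com", "198.51.100.7"))   # True: same suffix
    print(affiliated("a.example.com", "203.0.113.10",
                     "b.other.net", "203.0.113.99"))        # True: same /24
    print(affiliated("a.example.com", "203.0.113.10",
                     "b.other.net", "198.51.100.7"))        # False
```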
The PR value does not play a large role in matching search keywords, because many websites with high PR values contain the corresponding keywords but cover unrelated topics. This is exactly what Google tries to avoid with the HillTop algorithm: it should do everything possible to list results that are actually related to the search terms.
In general, from the past to today, many search engines have abandoned the practice of determining rankings from a single signal, such as the meta keyword tag. That was only the beginning; Google now ignores the meta tags in the HTML header entirely. Compared with the invisible meta tags, the visible part of a website lends itself less to such manipulation, because the visible part ultimately has to face real visitors.
Google's server architecture is a distributed network of tens of thousands of Pentium-class servers. Once one understands the Hilltop algorithm, it is hard to believe such servers can deliver the required processing capacity: imagine first finding the "expert" documents among thousands of thematic documents, then collecting the target pages those expert documents link to, then handing the results back to Google's other ranking systems for further numerical processing, all within roughly 0.07 seconds, the search speed for which Google is world famous. It is remarkable.
The search for and selection of expert pages play a key role in the algorithm, and the quality of the expert pages determines its accuracy. However, the quality and fairness of expert pages cannot be fully guaranteed, and Hilltop ignores the influence of the great majority of non-expert pages.
In Hilltop's prototype system, expert pages make up only 1.79% of all pages, which does not fully reflect public opinion.
For queries where it cannot find a sufficient subset of expert pages (fewer than two), Hilltop returns no result, which means it is best suited to queries with good expert coverage rather than to arbitrary queries. In other words, Hilltop works better combined with some other page-ranking algorithm to improve accuracy than as an independent page-ranking algorithm.
Selecting the subset of expert pages that matches the query topic from the full expert collection is also done online, which affects the query response time, as with the HITS algorithm. As the collection of expert pages grows, the algorithm's scalability is lacking.
Google obtained the patent as early as February 2003, but before putting it into actual use it needed to ensure that the new algorithm was fully compatible with Google's PageRank and the relevance systems already in use. That requires extensive compatibility testing, then evaluation of the results after the algorithm is integrated, fine-tuning, and then further, more complicated testing. All of this takes a great deal of time.
The HITS (Hyperlink-Induced Topic Search) algorithm, proposed by Kleinberg in 1998, is one of the most famous hyperlink-analysis ranking algorithms. Based on the direction of hyperlinks, it divides pages into two types: Authority pages and Hub pages. An Authority page is a page most relevant to a query keyword or combination of keywords; a Hub page, also called a directory page, mainly contains a large number of links to Authority pages, and its main function is to bring Authority pages together. For an Authority page P, the more high-quality Hub pages point to P, the greater P's Authority value; for a Hub page H, the more high-quality Authority pages H points to, the greater H's Hub value. Across the whole Web, Authorities and Hubs are interdependent and mutually reinforcing, and this mutually reinforcing relationship between Authority and Hub is the basis of the HITS algorithm.
HITS's basic idea is to measure the importance of a web page from its in-links (the hyperlinks pointing to the page) and its out-links (the hyperlinks from the page to other pages). An adjacency matrix is built from the pages' in- and out-links, and the Authority and Hub vectors are updated by iterative computation on this matrix, with a convergence threshold defining when to stop, until the two vectors converge.
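The iterative update just described can be sketched as follows; the toy graph, fixed iteration count, and normalization choice are illustrative assumptions.

```python
# Minimal HITS iteration: authority scores are updated from the hub scores of
# pages linking in, hub scores from the authority scores of pages linked to,
# with normalization after each pass. The toy graph is an assumption.
import math

def hits(links, iters=50):
    pages = set(links) | {t for out in links.values() for t in out}
    auth = {p: 1.0 for p in pages}
    hub = {p: 1.0 for p in pages}

    for _ in range(iters):
        # Authority update: sum of hub scores of pages that link to the page.
        new_auth = {p: 0.0 for p in pages}
        for page, outlinks in links.items():
            for target in outlinks:
                new_auth[target] += hub[page]
        norm = math.sqrt(sum(v * v for v in new_auth.values())) or 1.0
        auth = {p: v / norm for p, v in new_auth.items()}

        # Hub update: sum of authority scores of the pages it links to.
        new_hub = {p: 0.0 for p in pages}
        for page, outlinks in links.items():
            for target in outlinks:
                new_hub[page] += auth[target]
        norm = math.sqrt(sum(v * v for v in new_hub.values())) or 1.0
        hub = {p: v / norm for p, v in new_hub.items()}
    return auth, hub

if __name__ == "__main__":
    graph = {"hub1": ["auth1", "auth2"], "hub2": ["auth1"],
             "auth1": [], "auth2": []}
    print(hits(graph))
```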
Experimental data show that HITS achieves higher ranking accuracy than PageRank. The HITS algorithm is designed around the common standard web users apply when judging the quality of network resources, so it allows information retrieval tools to better serve users accessing Internet resources.
However, it has the following defects. First, the HITS algorithm computes only the principal eigenvector and cannot handle the problem of topic drift. Second, topic generalization can occur when the query topic is narrow. Third, HITS must run at query time: only after the retrieval system has performed content retrieval can scores be computed from the result pages and the pages directly linked to them. Although some attempts, such as improving the algorithm and building a Connectivity Server for the link structure, can achieve a degree of online real-time computation, the computational cost is still unacceptable.
Ranking algorithms are particularly important in search engines, and many search engines are now exploring new ranking methods to improve user satisfaction. However, second-generation search engines have two deficiencies, described below, and it is in this context that the third-generation search engine, based on intelligent ranking, was born.
1) Relevance
Relevance refers to the degree of correlation between the search terms and the page. Because of the complexity of language, judging this correlation only through link analysis and surface features of the web page is one-sided. For example, for the query "rice blast", a page might describe rice diseases and pests without ever containing the phrase "rice blast", and the search engine would fail to retrieve it at all. For the same reason, a large amount of search engine cheating cannot be handled. The way to solve the relevance problem is to add semantic understanding and analyze how relevant the keywords are to the page; the more accurate the relevance analysis, the better the user's search results. At the same time, pages with low relevance can be removed, which effectively helps prevent search engine cheating. Computing the relevance between keywords and pages online puts great time pressure on the system; a distributed architecture can improve the system's scale and performance.
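As a very rough illustration of going beyond exact keyword matching, in the spirit of the "rice blast" example, the sketch below expands the query with related terms before scoring a page. The synonym table and scoring rule are invented for illustration and do not describe any particular engine.

```python
# Toy relevance sketch: expand the query with related terms, then score a page
# by its overlap with the expanded term set. The synonym table is invented.

RELATED_TERMS = {
    "rice blast": ["rice disease", "magnaporthe", "rice pest control"],
}

def expand_query(query):
    terms = {query.lower()}
    terms.update(t.lower() for t in RELATED_TERMS.get(query.lower(), []))
    return terms

def relevance(query, page_text):
    text = page_text.lower()
    terms = expand_query(query)
    return sum(1 for t in terms if t in text) / len(terms)

if __name__ == "__main__":
    page = "This page describes rice disease and pest control methods."
    # Scores above zero even though the exact phrase "rice blast" is absent.
    print(relevance("rice blast", page))
```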
2) Uniformity of search results
On a search engine, anyone searching for the same word gets the same results, which does not satisfy users' needs, because different users have different requirements for the results. For example, an ordinary farmer searching for "rice blast" just wants information about the disease and how to prevent and treat it, whereas an agricultural expert or researcher may want papers related to rice blast.
One way to address the uniformity of search results is to provide personalized, intelligent search services. Through Web data mining, user models (covering, for example, the user's background, interests, behavior, and style) are established to provide personalized service.
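As a rough illustration of this idea, the sketch below re-ranks results using a simple user interest profile; the profile fields, tags, and weighting are assumptions, not a description of any deployed system.

```python
# Toy personalized re-ranking: boost results whose tags overlap with the
# user's interest profile. Profile contents and weights are invented.

def personalize(results, user_interests, boost=0.5):
    """results: list of (title, base_score, tags); returns results re-sorted."""
    interests = set(user_interests)

    def score(item):
        title, base, tags = item
        overlap = len(interests & set(tags))
        return base + boost * overlap

    return sorted(results, key=score, reverse=True)

if __name__ == "__main__":
    results = [
        ("Rice blast research paper", 0.6, ["research", "plant pathology"]),
        ("How to prevent rice blast on your farm", 0.5, ["farming", "prevention"]),
    ]
    # A farmer's profile favors the practical prevention page.
    print(personalize(results, ["farming", "prevention"]))
```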