The links between web pages can be seen as a directed graph. A page expresses a "vote" for the importance of the pages it links to, so a page that receives more inbound links earns a higher rank, while a page with few or no inbound links, or with links only from low-ranked pages, receives a low PageRank (PR) value; the higher a page's PR value, the more important the page is considered.
PageRank is a static algorithm that is independent of the query, so the PageRank values of all web pages can be computed offline. This reduces the sorting work needed at retrieval time and greatly shortens the query response time. But PageRank has two drawbacks. First, it seriously discriminates against newly added web pages, because a new page usually has very few inbound and outbound links and therefore a very low PageRank value. Second, PageRank ranks pages only by the number and importance of external links and ignores whether the linking pages share the query's topic, so pages unrelated to the topic (such as advertising pages) can obtain large PageRank values, which hurts the precision of the search results.
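The computation described above can be illustrated with a small power-iteration sketch. This is only a minimal illustration of the idea, not Google's production implementation; the toy link graph, the damping factor of 0.85, and the convergence tolerance are assumptions.

```python
# Minimal PageRank power iteration over a toy link graph.
# The graph, damping factor, and tolerance are illustrative assumptions.

def pagerank(links, damping=0.85, tol=1e-8, max_iter=100):
    """links: dict mapping each page to the list of pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}

    for _ in range(max_iter):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in links.items():
            if outlinks:
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # Dangling page: spread its rank uniformly.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        if sum(abs(new_rank[p] - rank[p]) for p in pages) < tol:
            rank = new_rank
            break
        rank = new_rank
    return rank

if __name__ == "__main__":
    toy_graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
    for page, score in sorted(pagerank(toy_graph).items()):
        print(page, round(score, 4))
```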
Because the original PageRank algorithm does not consider topic-related factors, Taher Haveliwala of the Department of Computer Science at Stanford proposed the Topic-Sensitive PageRank algorithm, which addresses the problem of "topic drift". The algorithm recognizes that a page may be considered important in some topic areas but unimportant in others.
A link from page A to page B can be regarded as A casting a vote for B. If page A belongs to the same topic as page B, A's vote for B can be considered more reliable, because A and B can be seen as peers, and peers tend to know more about their peers than outsiders do, so a peer's vote tends to be more trustworthy. Unfortunately, Topic-Sensitive PageRank (TSPR) does not use topic relevance to improve the accuracy of link scoring.
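One way to picture how a topic-sensitive variant counters topic drift is to bias the random-jump step of PageRank toward pages known to belong to a topic, so that rank mass concentrates around that topic. The sketch below is only an illustration under that assumption; the toy graph, topic set, and damping factor are invented, and it is not Haveliwala's exact formulation.

```python
# Topic-biased PageRank sketch: the random jump lands only on pages belonging
# to a chosen topic set, so rank mass concentrates around that topic.
# Graph, topic set, and damping factor are illustrative assumptions.

def topic_sensitive_pagerank(links, topic_pages, damping=0.85, iters=50):
    pages = list(links)
    topic = set(topic_pages) & set(pages)
    bias = {p: (1.0 / len(topic) if p in topic else 0.0) for p in pages}
    rank = dict(bias)  # start from the biased distribution

    for _ in range(iters):
        new_rank = {p: (1.0 - damping) * bias[p] for p in pages}
        for page, outlinks in links.items():
            if outlinks:
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # Dangling page: redistribute its mass along the topic bias.
                for p in pages:
                    new_rank[p] += damping * rank[page] * bias[p]
        rank = new_rank
    return rank

if __name__ == "__main__":
    graph = {"sports1": ["sports2", "news1"], "sports2": ["sports1"],
             "news1": ["news2"], "news2": ["sports1", "news1"]}
    print(topic_sensitive_pagerank(graph, topic_pages=["sports1", "sports2"]))
```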
As search engines calculate web rankings, they rely heavily on links, so the quality of those links becomes increasingly important and the quality of a link's source site must be judged. More importantly, the old way of determining rankings from links and relevance has been met with all kinds of cheating and spam, which forced Google to find new anti-cheating mechanisms to ensure that high-quality sites receive the search engine's attention. Sandbox and TrustRank were introduced in this context, with the intent of ensuring that good sites obtain better search performance and of strengthening site review. Google's own early discussion of TrustRank also mentions this.
Although TrustRank was originally proposed as a method for detecting spam, its concept is now used much more widely in search engine ranking algorithms and often affects the overall ranking of most websites, so TrustRank is genuinely important and worth attention.
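TrustRank's core idea, propagating trust outward from a manually reviewed seed set of good pages along their outlinks, can be sketched roughly as follows. The seed list, decay factor, number of hops, and toy graph are assumptions for illustration, not the published algorithm's exact details.

```python
# TrustRank-style trust propagation sketch: trust starts at manually reviewed
# seed pages and is split among outlinks with a decay factor at each hop.
# Seed list, decay value, hop count, and graph are illustrative assumptions.

def propagate_trust(links, seeds, decay=0.85, hops=3):
    trust = {page: 0.0 for page in links}
    for s in seeds:
        if s in trust:
            trust[s] = 1.0

    for _ in range(hops):
        received = {page: 0.0 for page in links}
        for page, outlinks in links.items():
            if trust[page] > 0 and outlinks:
                share = decay * trust[page] / len(outlinks)
                for target in outlinks:
                    if target in received:
                        received[target] += share
        # A page keeps the highest trust it has seen so far.
        for page in trust:
            trust[page] = max(trust[page], received[page])
    return trust

if __name__ == "__main__":
    web = {"seed.org": ["good.com"], "good.com": ["other.net"],
           "other.net": [], "spam.biz": ["spam.biz"]}
    print(propagate_trust(web, seeds=["seed.org"]))
```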
HillTop is covered by a patent that Google engineer Bharat obtained in 2001. It is a query-dependent link analysis algorithm, which overcomes PageRank's shortcoming of being independent of the query. The HillTop algorithm holds that links from relevant documents on the same topic are of greater value to the searcher. HillTop considers only "expert" pages, pages whose purpose is to point people to resources. When HillTop receives a query, it first computes a list of expert pages relevant to the query topic, and then ranks each target page according to the number of non-affiliated experts pointing to it and its relevance to the query.
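The two-phase process just described can be sketched roughly as follows: select expert pages relevant to the query, then score each target page by the number of non-affiliated experts linking to it. The expert-relevance test, the affiliation key, and the data below are simplified assumptions, not Bharat's full patented method.

```python
# Rough HillTop-style sketch: (1) pick expert pages relevant to the query,
# (2) score each target by how many non-affiliated experts link to it.
# Expert detection, affiliation test, and data are simplified assumptions.

def hilltop_scores(experts, query_terms, affiliation_key):
    """experts: list of dicts with 'url', 'text', and 'outlinks' keys.
    affiliation_key: function mapping a URL to an affiliation id (e.g. domain)."""
    query = set(query_terms)

    # Phase 1: keep experts whose text mentions every query term.
    relevant = [e for e in experts if query <= set(e["text"].lower().split())]

    # Phase 2: count distinct, non-affiliated experts pointing at each target.
    backers = {}
    for expert in relevant:
        for target in expert["outlinks"]:
            if affiliation_key(expert["url"]) != affiliation_key(target):
                backers.setdefault(target, set()).add(affiliation_key(expert["url"]))

    # Require at least two independent experts, as HillTop does.
    return {t: len(keys) for t, keys in backers.items() if len(keys) >= 2}

if __name__ == "__main__":
    def domain(url):
        return url.split("/")[2]

    experts = [
        {"url": "http://dir-a.org/ski", "text": "ski resorts guide",
         "outlinks": ["http://resort.com/home", "http://snow.net/ski"]},
        {"url": "http://dir-b.edu/ski", "text": "best ski resorts list",
         "outlinks": ["http://resort.com/home"]},
    ]
    print(hilltop_scores(experts, ["ski", "resorts"], domain))
```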
Combined with the basic matching of web pages against the search keywords, the HillTop algorithm replaces the earlier over-reliance on PageRank values to find authoritative pages, and it defeats many attempts to raise a page's PageRank by adding large numbers of invalid links. HillTop also ties the evaluation results to the keywords themselves, weighing the relevance of the expert pages to the query, the positions in which the keywords appear, and the number of matching phrases, which helps prevent rankings from being won simply by repeating keywords.
However, the search for and selection of expert pages are critical to the algorithm, the quality of the expert pages plays a decisive role in its accuracy, and the influence of the great majority of non-expert pages is ignored. Expert pages account for a very low proportion of pages on the Internet (about 1.79%) and cannot represent all web pages, which limits HillTop. At the same time, unlike PageRank, the HillTop algorithm runs online, which puts great pressure on the system's response time.
The biggest difficulty of the HillTop algorithm, which is built on "expert" documents, is identifying those expert documents in the first place. From current observation, Google apparently gives high priority to education (.edu), government (.gov), and non-profit organization (.org) sites. At run time, Google keeps the index for the most frequently searched keywords in large amounts of memory, so that searchers who keep querying the same keyword phrases in the short term are served quickly. High-frequency keywords play another role that many people noticed before the "Florida" update: sites containing a surge of searched keywords are re-indexed more frequently. For a term like "SARS", which receives millions of searches per day, Google gives priority to updating the sites related to that topic.
Looking back at each month's "Google Dance", one can also draw the following conclusion: Google evidently assigns each keyword a dynamic "weight", discovers popular keywords from query statistics, then uses the HillTop algorithm to find topical pages containing those keywords, treats those pages as the "expert" documents for the relevant keywords, and keeps the update frequency of these entry points high. This is clearly very effective for handling breaking events. Pages for keywords with lower query frequencies may be updated only once a month. Simply put, Google dynamically adjusts how intensively it indexes the corresponding sites according to how hot a topic is. The proportion of Chinese-language pages in the index is also related, to some extent, to the proportion of Chinese users among Google's total users.
In fact, the guiding idea behind HillTop is consistent with PageRank: determine the ranking weight of search results by the number and quality of links. But HillTop holds that only links coming from relevant documents on the same topic are of greater value to the searcher; that is, links between pages on the same topic contribute more to the weight than links from unrelated pages. For example, if a site about "clothing" has ten links from websites related to "dress", those ten links contribute more than ten links from websites related to "appliances". When Bharat and other Google developers created the algorithm in 1999 and 2000, Bharat called the influential documents on a subject "expert" documents, and the links from these expert documents to a target page determine the main part of the target page's "weighted score".
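As a toy illustration of this weighting idea, the snippet below sums inbound-link contributions scaled by an assumed topic-similarity weight; the similarity values are invented.

```python
# Toy illustration of topic-weighted link contributions, assuming each inbound
# link carries a topic-similarity weight between 0 and 1 (values are invented).

def weighted_link_score(inbound_links):
    """inbound_links: list of (source_topic_similarity, base_value) pairs."""
    return sum(similarity * value for similarity, value in inbound_links)

# Ten links from clothing-related sites (high similarity) versus ten from
# appliance-related sites (low similarity), each with the same base value.
clothing_links = [(0.9, 1.0)] * 10
appliance_links = [(0.2, 1.0)] * 10
print(weighted_link_score(clothing_links))   # about 9.0
print(weighted_link_score(appliance_links))  # about 2.0
```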
Combining the HillTop algorithm with PageRank to determine the basic ordering of web pages by how well they match the search terms replaces the earlier method of relying too heavily on PageRank values to find authoritative pages. The HillTop algorithm is especially important when ordering two pages that share the same topic and have similar PR values, and it also removes the temptation to raise a page's PageRank by adding many invalid links.
Google first used the HillTop algorithm to define what counts as a related website, that is, one website's relationship to another. In practice, Google applies HillTop as a technique for recognizing cross-site link-exchange interference (spam) and similar schemes. The HillTop algorithm requires that if two or more websites related to the query topic link to your website, your site has a greater chance in the search results; if HillTop cannot find at least two such related websites, the score it returns for the target is zero.
The HillTop algorithm in effect rejects sites that try to disrupt Google's ranking rules and obtain good rankings by exchanging links at random. The HillTop paper also describes several designs for recognizing link-exchange alliances, such as inferring affiliation from hosts that share the first three octets of their IPv4 address or from domain aliases.
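Those affiliation heuristics can be sketched roughly as follows; the exact rules in the paper differ in detail, and the two-label host-suffix test and example addresses here are assumptions.

```python
# Rough affiliation check in the spirit of HillTop's link-alliance detection:
# two hosts are treated as affiliated if they share the first three octets of
# their IPv4 address or share a host-name suffix. A simplified assumption.

def ip_prefix(ip: str) -> str:
    """Return the first three octets of a dotted IPv4 address."""
    return ".".join(ip.split(".")[:3])

def host_suffix(host: str, parts: int = 2) -> str:
    """Return the last `parts` labels of a host name, e.g. example.com."""
    return ".".join(host.lower().split(".")[-parts:])

def affiliated(host_a, ip_a, host_b, ip_b) -> bool:
    return ip_prefix(ip_a) == ip_prefix(ip_b) or host_suffix(host_a) == host_suffix(host_b)

if __name__ == "__main__":
    print(affiliated("shop.example.com", "203.0.113.10",
                     "blog.example.com", "198.51.100.7"))   # True: same suffix
    print(affiliated("a.example.com", "203.0.113.10",
                     "b.other.net", "203.0.113.99"))        # True: same /24
    print(affiliated("a.example.com", "203.0.113.10",
                     "b.other.net", "198.51.100.7"))        # False
```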
The PR value does not play a large role in matching search keywords, because many websites with high PR values contain the corresponding keywords but cover unrelated topics. This is exactly what Google tries to avoid with the HillTop algorithm: it should do everything possible to list results that are actually related to the search terms.
In general, from the past to today, many search engines have abandoned the practice of determining rankings from a single signal, such as the meta keyword tag. That was only the beginning; Google now ignores the meta tags in the HTML header entirely. Compared with the invisible meta tags, the visible part of a website lends itself less to such manipulation, because the visible part ultimately has to face real visitors.
Google's server architecture is a distributed network of tens of thousands of Pentium-class servers. Once one understands the Hilltop algorithm, it is hard to believe such servers can deliver the required processing capacity: imagine first finding the "expert" documents among thousands of thematic documents, then collecting the target pages those expert documents link to, then handing the results back to Google's other ranking systems for further numerical processing, all within roughly 0.07 seconds, the search speed for which Google is world famous. It is remarkable.
The search for and selection of expert pages play a key role in the algorithm, and the quality of the expert pages determines its accuracy. However, the quality and fairness of expert pages cannot be fully guaranteed, and Hilltop ignores the influence of the great majority of non-expert pages.
In Hilltop's prototype system, expert pages make up only 1.79% of all pages, which does not fully reflect public opinion.
For queries where it cannot find a sufficient subset of expert pages (fewer than two), Hilltop returns no result, which means it is best suited to queries with good expert coverage rather than to arbitrary queries. In other words, Hilltop works better combined with some other page-ranking algorithm to improve accuracy than as an independent page-ranking algorithm.
Selecting the subset of expert pages that matches the query topic from the full expert collection is also done online, which affects the query response time, as with the HITS algorithm. As the collection of expert pages grows, the algorithm's scalability is lacking.
Google obtained the patent as early as February 2003, but before putting it into actual use it needed to ensure that the new algorithm was fully compatible with Google's PageRank and the relevance systems already in use. That requires extensive compatibility testing, then evaluation of the results after the algorithm is integrated, fine-tuning, and then further, more complicated testing. All of this takes a great deal of time.
The HITS (Hyperlink-Induced Topic Search) algorithm, proposed by Kleinberg in 1998, is one of the most famous hyperlink-analysis ranking algorithms. Based on the direction of hyperlinks, it divides pages into two types: Authority pages and Hub pages. An Authority page is a page most relevant to a query keyword or combination of keywords; a Hub page, also called a directory page, mainly contains a large number of links to Authority pages, and its main function is to bring Authority pages together. For an Authority page P, the more high-quality Hub pages point to P, the greater P's Authority value; for a Hub page H, the more high-quality Authority pages H points to, the greater H's Hub value. Across the whole Web, Authorities and Hubs are interdependent and mutually reinforcing, and this mutually reinforcing relationship between Authority and Hub is the basis of the HITS algorithm.
HITS's basic idea is to measure the importance of a web page from its in-links (the hyperlinks pointing to the page) and its out-links (the hyperlinks from the page to other pages). An adjacency matrix is built from the pages' in- and out-links, and the Authority and Hub vectors are updated by iterative computation on this matrix, with a convergence threshold defining when to stop, until the two vectors converge.
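The iterative update just described can be sketched as follows; the toy graph, fixed iteration count, and normalization choice are illustrative assumptions.

```python
# Minimal HITS iteration: authority scores are updated from the hub scores of
# pages linking in, hub scores from the authority scores of pages linked to,
# with normalization after each pass. The toy graph is an assumption.
import math

def hits(links, iters=50):
    pages = set(links) | {t for out in links.values() for t in out}
    auth = {p: 1.0 for p in pages}
    hub = {p: 1.0 for p in pages}

    for _ in range(iters):
        # Authority update: sum of hub scores of pages that link to the page.
        new_auth = {p: 0.0 for p in pages}
        for page, outlinks in links.items():
            for target in outlinks:
                new_auth[target] += hub[page]
        norm = math.sqrt(sum(v * v for v in new_auth.values())) or 1.0
        auth = {p: v / norm for p, v in new_auth.items()}

        # Hub update: sum of authority scores of the pages it links to.
        new_hub = {p: 0.0 for p in pages}
        for page, outlinks in links.items():
            for target in outlinks:
                new_hub[page] += auth[target]
        norm = math.sqrt(sum(v * v for v in new_hub.values())) or 1.0
        hub = {p: v / norm for p, v in new_hub.items()}
    return auth, hub

if __name__ == "__main__":
    graph = {"hub1": ["auth1", "auth2"], "hub2": ["auth1"],
             "auth1": [], "auth2": []}
    print(hits(graph))
```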
Experimental data show that HITS achieves higher ranking accuracy than PageRank. The HITS algorithm is designed around the common standard web users apply when judging the quality of network resources, so it allows information retrieval tools to better serve users accessing Internet resources.
However, it has the following defects. First, the HITS algorithm computes only the principal eigenvector and cannot handle the problem of topic drift. Second, topic generalization can occur when the query topic is narrow. Third, HITS must run at query time: only after the retrieval system has performed content retrieval can scores be computed from the result pages and the pages directly linked to them. Although some attempts, such as improving the algorithm and building a Connectivity Server for the link structure, can achieve a degree of online real-time computation, the computational cost is still unacceptable.
Ranking algorithms are particularly important in search engines, and many search engines are now exploring new ranking methods to improve user satisfaction. However, second-generation search engines have two deficiencies, described below, and it is in this context that the third-generation search engine, based on intelligent ranking, was born.
1) Relevance
Relevance refers to the degree of correlation between the search terms and the page. Because of the complexity of language, judging this correlation only through link analysis and surface features of the web page is one-sided. For example, for the query "rice blast", a page might describe rice diseases and pests without ever containing the phrase "rice blast", and the search engine would fail to retrieve it at all. For the same reason, a large amount of search engine cheating cannot be handled. The way to solve the relevance problem is to add semantic understanding and analyze how relevant the keywords are to the page; the more accurate the relevance analysis, the better the user's search results. At the same time, pages with low relevance can be removed, which effectively helps prevent search engine cheating. Computing the relevance between keywords and pages online puts great time pressure on the system; a distributed architecture can improve the system's scale and performance.
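As a very rough illustration of going beyond exact keyword matching, in the spirit of the "rice blast" example, the sketch below expands the query with related terms before scoring a page. The synonym table and scoring rule are invented for illustration and do not describe any particular engine.

```python
# Toy relevance sketch: expand the query with related terms, then score a page
# by its overlap with the expanded term set. The synonym table is invented.

RELATED_TERMS = {
    "rice blast": ["rice disease", "magnaporthe", "rice pest control"],
}

def expand_query(query):
    terms = {query.lower()}
    terms.update(t.lower() for t in RELATED_TERMS.get(query.lower(), []))
    return terms

def relevance(query, page_text):
    text = page_text.lower()
    terms = expand_query(query)
    return sum(1 for t in terms if t in text) / len(terms)

if __name__ == "__main__":
    page = "This page describes rice disease and pest control methods."
    # Scores above zero even though the exact phrase "rice blast" is absent.
    print(relevance("rice blast", page))
```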
2) Uniformity of search results
On a search engine, anyone searching for the same word gets the same results, which does not satisfy users' needs, because different users have different requirements for the results. For example, an ordinary farmer searching for "rice blast" just wants information about the disease and how to prevent and treat it, whereas an agricultural expert or researcher may want papers related to rice blast.
One way to address the uniformity of search results is to provide personalized, intelligent search services. Through Web data mining, user models (covering, for example, the user's background, interests, behavior, and style) are established to provide personalized service.
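As a rough illustration of this idea, the sketch below re-ranks results using a simple user interest profile; the profile fields, tags, and weighting are assumptions, not a description of any deployed system.

```python
# Toy personalized re-ranking: boost results whose tags overlap with the
# user's interest profile. Profile contents and weights are invented.

def personalize(results, user_interests, boost=0.5):
    """results: list of (title, base_score, tags); returns results re-sorted."""
    interests = set(user_interests)

    def score(item):
        title, base, tags = item
        overlap = len(interests & set(tags))
        return base + boost * overlap

    return sorted(results, key=score, reverse=True)

if __name__ == "__main__":
    results = [
        ("Rice blast research paper", 0.6, ["research", "plant pathology"]),
        ("How to prevent rice blast on your farm", 0.5, ["farming", "prevention"]),
    ]
    # A farmer's profile favors the practical prevention page.
    print(personalize(results, ["farming", "prevention"]))
```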