Have you ever wondered why 404s, rel=canonicals, noindex, nofollow, and robots.txt work the way they do? Or have you never been quite clear on how they all work? To help you understand, here is a very basic interpretation of how search engines crawl pages and add links to the link graph.

The search engine crawler (let's make it a spider for fun) visits a site. The first thing it collects is the robots.txt file. Let's assume that file either doesn't exist or says it's okay to crawl the whole site. The crawler collects information about all of those pages and feeds it back into a database. Strictly speaking, it's a crawl scheduling system that de-duplicates pages and shuffles them by priority for later indexing. While it's there, it collects a list of all the pages each page links to. If they're internal links, the crawler will probably follow them to other pages. If they're external, they get put into a database for later.

Later on, when the link graph gets processed, the search engine pulls all of those links out of the database and connects them, assigning relative values to them. The values may be positive, or they may be negative. Let's imagine, for example, that one of the pages is spamming. If that page is linking to other pages, it may be passing some bad link value on to those pages. Picture a super simple link graph in which each page is marked with G's (good links) and S's (spammy links). The page on the top right has more G's than S's. Therefore, it would earn a fairly good score. A page with only G's would earn a better score. If the S's outweighed the G's, the page would earn a fairly poor score. Add to that the complication that some S's and some G's are worth more than others, and you have a very simplified view of how the link graph works.

Now suppose the robots.txt file had told the search engine not to access one of those pages. That means that while the search engine was crawling through the pages and making lists of links, it wouldn't have any data about the page blocked by robots.txt. Go back to that super simple link graph example, and suppose that the page on the top right was the one blocked by robots.txt. The search engine is still going to take all of the links to that page and count them. It won't be able to see what pages that page links to, but it will be able to add link value metrics for the page, which affects the domain as a whole.

Next, let's assume that instead of blocking that page with robots.txt, we simply removed it. The search engine would try to access it, but get a clear message that it's not there anymore. This means that when the link graph is processed, links to that page just go away. They get stored for later use in case that page comes back.
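To make the crawl steps above concrete, here is a minimal Python sketch of a spider that checks robots.txt before fetching a page, collects outgoing links, and notes removed pages. Everything here is a hypothetical illustration: the LinkCollector helper, the link_db dictionary standing in for the crawl database, and the example URLs are stand-ins, not how any real crawler is built.

```python
import urllib.error
import urllib.request
import urllib.robotparser
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collects href values from <a> tags on a fetched page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(site, pages):
    # The first thing the spider collects is robots.txt.
    robots = urllib.robotparser.RobotFileParser(site + "/robots.txt")
    robots.read()

    link_db = {}   # page -> links found on it (the crawl "database")
    gone = set()   # pages that returned a 404
    for page in pages:
        url = site + page
        if not robots.can_fetch("*", url):
            # Blocked by robots.txt: we never see this page's content
            # or its outgoing links, but links *to* it from other
            # pages are already in link_db and still get counted.
            continue
        try:
            html = urllib.request.urlopen(url).read().decode("utf-8")
        except urllib.error.HTTPError as err:
            if err.code == 404:
                # Removed page: remember it so the link-graph step can
                # drop links pointing at it (and restore them later if
                # the page ever comes back).
                gone.add(page)
            continue
        collector = LinkCollector()
        collector.feed(html)
        link_db[page] = collector.links
    return link_db, gone

# Hypothetical usage:
# link_db, gone = crawl("https://example.com", ["/", "/blue-widgets", "/old-page"])
```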
At some other point (and likely by a different set of servers!), priority pages that have been crawled get assigned to an index. The index identifies words and elements on a page that match with words and elements in the database.

Now do a search for “blue widgets.” The search engine uses the database to find pages that are related to blue, widgets, and blue widgets. If the search engine also considers widget (singular) and cornflower (a type of blue) to be synonyms, it may evaluate pages with those words on them as well. The search engine uses its algorithm to determine which pages in the index have those words assigned to them, evaluates the links pointing to the page and the domain, and processes dozens of other known and unknown metrics to arrive at a value. If the site is being filtered for poor behavior by something like Panda or Penguin, that is also taken into account. The overall value then determines where in the results the page will appear.

This is further complicated by things webmasters might do to manipulate values. For example, if two pages are very similar, a webmaster may decide to use rel=canonical to signal to the search engine that only one of those pages has value. If the “cornflower widget” page is rel=canonical-ed to the “blue widgets” page, but the cornflower widget page has more valuable links pointing to it, the search engine may choose to use the cornflower widget page instead. If the canonical is accepted, the values of both the elements on the pages and the links pointing to the pages are combined.

A noindex directive works similarly to robots.txt, except that instead of being prevented from crawling the page, the search engine is able to access it but is then told to go away. The search engine will still collect the links on the page to add to the database (unless a directive on the page also indicates not to follow them, i.e. nofollow), and it will still assign value to links pointing to that page. However, it will not consolidate value with any other pages, and it will not stop value from flowing through the page.
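To make the differences between these directives concrete, here is a toy Python sketch of tallying link value under the simplified model above. The tally_link_value function, its inputs, and the page names are invented for illustration; this is not any search engine's actual logic.

```python
def tally_link_value(links, canonicals, removed):
    """Toy link-value tally.

    links: (source, target, nofollow) triples from the crawl database.
    canonicals: page -> canonical target (rel=canonical, if accepted).
    removed: pages that now return a 404.
    """
    value = {}
    for source, target, nofollow in links:
        if nofollow:
            continue  # nofollow: the link isn't counted in the graph
        if target in removed:
            continue  # 404: links to the page just go away
        # An accepted rel=canonical consolidates value onto its target.
        # A noindex page would still be tallied here: noindex neither
        # consolidates value nor stops value flowing through the page.
        target = canonicals.get(target, target)
        value[target] = value.get(target, 0) + 1
    return value

links = [
    ("a", "cornflower-widget", False),
    ("b", "cornflower-widget", False),
    ("c", "blue-widgets", False),
    ("d", "blue-widgets", True),   # nofollow link: ignored
]
print(tally_link_value(links, {"cornflower-widget": "blue-widgets"}, set()))
# {'blue-widgets': 3} -- both pages' link value combined on the canonical
```

Note how the nofollow link and any removed page contribute nothing, while the accepted canonical combines the cornflower widget page's link value with the blue widgets page's.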
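Finally, going back to the “blue widgets” search, here is an equally simplified sketch of an inverted index with synonym expansion, where a single stand-in value score decides the ordering. The SYNONYMS table, the sample pages, and the scoring are assumptions for the example, not how a production engine matches or ranks.

```python
from collections import defaultdict

# Hypothetical synonym table: widgets/widget, blue/cornflower.
SYNONYMS = {"widgets": {"widget"}, "blue": {"cornflower"}}

def build_index(pages):
    """pages: name -> text. Returns an inverted index: word -> pages."""
    index = defaultdict(set)
    for name, text in pages.items():
        for word in text.lower().split():
            index[word].add(name)
    return index

def search(index, query, value):
    """Expand the query with synonyms, collect matching pages, and
    order them by a stand-in per-page 'value' score."""
    terms = set()
    for word in query.lower().split():
        terms.add(word)
        terms |= SYNONYMS.get(word, set())
    matches = set()
    for term in terms:
        matches |= index.get(term, set())
    return sorted(matches, key=lambda page: value.get(page, 0), reverse=True)

pages = {
    "blue-widgets": "blue widgets for sale",
    "cornflower-widget": "a cornflower widget",
}
index = build_index(pages)
print(search(index, "blue widgets", {"blue-widgets": 5, "cornflower-widget": 2}))
# ['blue-widgets', 'cornflower-widget']
```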