
How Seed Sites Are Chosen for Search Engine Crawlers


The web changes at an incredible pace.

Every day new pages appear and old ones are deleted. News sites and blogs are updated, online stores add new products, new sites launch, and old ones disappear.

Search engines try to keep their index current, so their spiders check for updates every day. When a search robot starts a crawl, it begins from so-called seed sites, whose links the spider follows first. But how do search engines choose seed sites? The answer matters to anyone doing site promotion.
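To make the role of a seed concrete, here is a minimal sketch of a breadth-first crawl over a toy link graph. The graph, the host names, and the crawl function are all made up for illustration. A real crawler fetches pages over the network and is far more involved.

```python
from collections import deque

# Hypothetical toy link graph: page -> pages it links to.
LINKS = {
    "seed.example": ["a.example", "b.example"],
    "a.example": ["c.example"],
    "b.example": ["c.example", "d.example"],
    "c.example": [],
    "d.example": ["seed.example"],
}


def crawl(seeds, links, max_pages=100):
    """Visit pages breadth-first, starting from the seed pages."""
    unique_seeds = list(dict.fromkeys(seeds))  # drop duplicate seeds, keep order
    frontier = deque(unique_seeds)
    seen = set(unique_seeds)
    order = []
    while frontier and len(order) < max_pages:
        page = frontier.popleft()
        order.append(page)
        # Queue every link that has not been discovered yet.
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                frontier.append(target)
    return order


if __name__ == "__main__":
    print(crawl(["seed.example"], LINKS))  # a well-linked seed reaches every page
    print(crawl(["c.example"], LINKS))     # a dead-end seed discovers nothing new
```

In this toy graph, starting from seed.example reaches all five pages, while starting from c.example reaches only c.example itself. That difference is the reason seed choice matters.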

Which sites make the best starting points for a crawl? Facebook or Twitter? Yahoo Directory or DMOZ? Or perhaps Wikipedia?

Choosing seed sites matters because it significantly affects search engine quality and the diversity of pages in the index, both by topic and by geography. Poorly chosen seed sites reduce search quality and relevance.

A Yahoo patent describes the process by which crawlers choose seed sites to discover other page addresses. A choice of seed sites is considered good if it leads to many new links, gets more important documents crawled, and spreads coverage across markets or categories.

Most discussions of web crawling cite Yahoo Directory or DMOZ as examples of entry points and ways to discover new pages. But are they always good enough for crawling? Could other seed sites be used?

Seed site selection relies on a host-based algorithm. It identifies a subset of hosts for the crawler to visit, based on their importance, quality, and potential return.

Site importance is determined by the “host trust” level or by other parameters that reflect the host’s popularity, reliability, and quality. One such indicator could be PageRank, a link-based measure of importance.

A site’s quality (or lack of it) as a potential seed is judged by the number of outgoing links, the presence of pornographic content, links to spam pages, and whether the site is itself spam. To produce quality results, the crawler needs to start from high-quality sites.

Potential return, meaning the potential for discovering new addresses (the document yield), is estimated from previous passes through the site.
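The three criteria above can be combined into a single score per host. The sketch below is only an illustration: the field names, the weights, and the formulas are assumptions made for this example and are not taken from the patent.

```python
from dataclasses import dataclass


@dataclass
class HostStats:
    # All fields are hypothetical inputs chosen for this example.
    host: str
    trust: float                   # importance signal normalized to [0, 1]
    out_links: int                 # number of outgoing links
    spam_links: int                # outgoing links pointing to known spam
    has_adult_content: bool
    new_urls_last_crawl: int       # URLs first discovered via this host last pass
    pages_fetched_last_crawl: int  # pages fetched from this host last pass


def quality(h):
    """Return 0.0 for unusable hosts, otherwise the share of non-spam links."""
    if h.has_adult_content or h.out_links == 0:
        return 0.0
    return 1.0 - h.spam_links / h.out_links


def potential_return(h):
    """New URLs discovered per page fetched on the previous pass."""
    if h.pages_fetched_last_crawl == 0:
        return 0.0
    return h.new_urls_last_crawl / h.pages_fetched_last_crawl


def seed_score(h, w_trust=0.4, w_quality=0.3, w_return=0.3):
    """Weighted sum of the three criteria; the weights are arbitrary."""
    q = quality(h)
    if q == 0.0:
        return 0.0  # a host that fails the quality check is never a seed
    capped_return = min(potential_return(h), 1.0)
    return w_trust * h.trust + w_quality * q + w_return * capped_return


def pick_seeds(hosts, k):
    """Return the names of the k best-scoring hosts, skipping zero scores."""
    ranked = sorted(hosts, key=seed_score, reverse=True)
    return [h.host for h in ranked[:k] if seed_score(h) > 0]


if __name__ == "__main__":
    hosts = [
        HostStats("directory.example", 0.9, 1000, 10, False, 500, 1000),
        HostStats("spamhub.example", 0.2, 100, 100, False, 50, 100),
        HostStats("blog.example", 0.5, 50, 0, False, 10, 100),
    ]
    print(pick_seeds(hosts, 2))
```

A weighted sum is just one simple way to combine the signals. Treating the quality check as a hard filter, as done here, keeps a high-trust host from becoming a seed if it links only to spam.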

The patent also notes that seed site selection usually varies by country and region, since each region has its own specifics. In addition, some markets contain fewer hosts overall and fewer important hosts, so to keep dominant markets from crowding out everything else, part of the crawl is reserved for those smaller markets.
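One way to picture this reservation is a budget split in which every market gets a guaranteed minimum and the rest is shared in proportion to market size. The function below is a sketch under that assumption. The floor share, the market codes, and the host counts are invented for the example, and the patent may allocate the crawl differently.

```python
def allocate_budget(host_counts, total, floor_share=0.05):
    """Split `total` seed slots across markets.

    Each market first receives a fixed floor. The remaining slots are
    divided in proportion to the number of hosts in each market.
    """
    markets = sorted(host_counts)
    floor = int(total * floor_share)
    if floor * len(markets) > total:
        raise ValueError("floor_share is too large for this many markets")
    total_hosts = sum(host_counts.values())
    if total_hosts == 0:
        raise ValueError("at least one market must contain hosts")

    remaining = total - floor * len(markets)
    budget = {
        m: floor + remaining * host_counts[m] // total_hosts for m in markets
    }

    # Integer division can leave a few slots unassigned; give them to the
    # largest markets first, one slot each.
    leftover = total - sum(budget.values())
    largest_first = sorted(markets, key=lambda m: host_counts[m], reverse=True)
    for m in largest_first[:leftover]:
        budget[m] += 1
    return budget


if __name__ == "__main__":
    # Hypothetical host counts per market.
    print(allocate_budget({"us": 9000, "ee": 900, "is": 100}, total=100))
```

Without the floor, a market holding 1% of all hosts would get about one slot out of a hundred. With it, the small market keeps a meaningful share while the large markets still receive most of the budget.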

GoodWeb blog author.