Spider

From SiteRay wiki

A spider (also known as a crawler) is a program that browses webpages automatically. Search engines such as Google use spidering to discover the webpages they can search, and SiteRay spiders sites in order to test them.

A website is said to be spiderable if it can be spidered successfully, i.e. if the pages in the website can be discovered through automated means.

How a spider works

The basic principle is simple:

  1. Download our first page
  2. Find any links on that page
  3. Repeat for each new link we find
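
The three steps above can be sketched as a short loop. This is only a minimal illustration, not the SiteRay implementation: `crawl`, `get_links` and the in-memory `SITE` dictionary are hypothetical names standing in for real downloads, so that the example runs without a network.

```python
from collections import deque


def crawl(start_url, get_links):
    """Return every URL reachable from start_url, in discovery order.

    get_links is a hypothetical callable: given a URL, it "downloads" the
    page and returns the URLs linked from it.
    """
    seen = {start_url}
    queue = deque([start_url])
    order = []
    while queue:
        url = queue.popleft()        # step 1: take the next page to download
        order.append(url)
        for link in get_links(url):  # step 2: find the links on that page
            if link not in seen:     # step 3: repeat only for new links
                seen.add(link)
                queue.append(link)
    return order


# A toy site (assumption for illustration): each page maps to the links on it.
SITE = {
    "/": ["/about", "/products"],
    "/about": ["/"],
    "/products": ["/products/1", "/about"],
    "/products/1": [],
    "/orphan": ["/"],  # no page links here, so the spider never finds it
}

print(crawl("/", lambda url: SITE.get(url, [])))
# prints ['/', '/about', '/products', '/products/1']
```

Note that `/orphan` is never visited: nothing links to it, so it cannot be discovered through automated means and is not spiderable in the sense described above.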

For the majority of spiders, step 2 consists of finding the HTML links and any meta-refresh tags on a webpage. As a result, links that exist only in technologies such as Flash, JavaScript or Java are not found. Similarly, content that can only be reached by submitting a form is typically unspiderable.
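
As a rough sketch of what such link finding might look like, the example below uses Python's standard `html.parser` module to collect `<a href>` links and meta-refresh targets. The names `LinkFinder`, `find_links` and `PAGE`, and the `example.com` URLs, are assumptions made for illustration; real spiders handle many more cases.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkFinder(HTMLParser):
    """Collects HTML links and meta-refresh targets as absolute URLs."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            # Ordinary HTML link
            self.links.append(urljoin(self.base_url, attrs["href"]))
        elif tag == "meta" and (attrs.get("http-equiv") or "").lower() == "refresh":
            # Meta refresh: content looks like "5; url=/moved"
            _, _, target = (attrs.get("content") or "").partition(";")
            target = target.strip()
            if target.lower().startswith("url="):
                url = target[4:].strip().strip("'\"")
                self.links.append(urljoin(self.base_url, url))


def find_links(html, base_url):
    finder = LinkFinder(base_url)
    finder.feed(html)
    return finder.links


PAGE = """
<html><head><meta http-equiv="refresh" content="5; url=/moved"></head>
<body>
<a href="/about">About</a>
<a href="contact.html">Contact</a>
<span onclick="location.href='/hidden'">Hidden</span>
</body></html>
"""

for link in find_links(PAGE, "http://example.com/dir/index.html"):
    print(link)
# prints:
# http://example.com/moved
# http://example.com/about
# http://example.com/dir/contact.html
```

The `/hidden` page is reachable only through a JavaScript `onclick` handler, so this kind of spider never sees it.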

Why it matters

SEO

Pages which are non-spiderable typically won't be discovered by search engines and as a result won't positively influence search engine rankings. One of the most basic requirements of SEO is that as much content as possible should be visible - and hence spiderable - to search engines. A site that cannot be spidered at all is effectively invisible to search engines.

See SEO.

Accessibility

Pages which can't be spidered usually depend on technologies which are fundamentally inaccessible, or not guaranteed to be accessible. As a result, a non-spiderable site often suffers from accessibility problems.

See Accessibility.

SiteRay spider

The SiteRay spider is relatively advanced and is specialised for testing websites, so it spiders more aggressively than the spiders used by search engines. In particular, it can spider aspects of a site traditionally considered non-spiderable, such as Flash and JavaScript.

See SiteRay spider.

In SiteRay

SiteRay tests spiderability via the Spiderability test.
