Web Crawler

A web crawler is a program that systematically visits web pages — following links, loading content, and optionally capturing screenshots or extracting data from each page along the way.

How web crawlers work

A web crawler starts with one or more seed URLs. It fetches the HTML of each seed page, parses it for hyperlinks, and adds those links to a queue. The crawler then visits each queued URL, discovers new links, and repeats the process — building a map of the site's structure as it goes.
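
To make the loop concrete, here is a minimal sketch in Python using only the standard library. The page cap, error handling, and link filtering are illustrative simplifications rather than the behavior of any particular crawler.

    # Minimal breadth-first crawl: fetch a page, collect its links,
    # and enqueue the ones we have not seen yet.
    from collections import deque
    from html.parser import HTMLParser
    from urllib.parse import urljoin
    from urllib.request import urlopen

    class LinkParser(HTMLParser):
        """Collects href values from anchor tags."""
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def crawl(seed_url, max_pages=50):
        queue = deque([seed_url])
        visited = set()
        while queue and len(visited) < max_pages:
            url = queue.popleft()
            if url in visited:
                continue
            visited.add(url)
            try:
                html = urlopen(url).read().decode("utf-8", errors="replace")
            except OSError:
                continue  # skip pages that fail to load
            parser = LinkParser()
            parser.feed(html)
            for href in parser.links:
                absolute = urljoin(url, href)  # resolve relative links
                if absolute.startswith("http") and absolute not in visited:
                    queue.append(absolute)
        return visited

Everything that follows in this entry, from scope controls to robots.txt handling, is layered on top of this basic loop.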

This process is governed by several controls. Depth limits determine how many link levels the crawler follows from the seed URL. URL filters restrict crawling to specific domains, subdomains, or path patterns. Rate limiting ensures the crawler does not overwhelm the target server with too many simultaneous requests.
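
As a sketch of how those three controls might plug into the loop above: the fetch_links callback, the domain, and the delay value below are assumptions chosen for illustration.

    import time
    from collections import deque
    from urllib.parse import urlparse

    MAX_DEPTH = 3        # follow links at most three levels from the seed
    ALLOWED_HOST = "example.com"
    REQUEST_DELAY = 1.0  # seconds between fetches (rate limiting)

    def in_scope(url):
        host = urlparse(url).hostname or ""
        return host == ALLOWED_HOST or host.endswith("." + ALLOWED_HOST)

    def crawl_with_controls(seed_url, fetch_links):
        # fetch_links(url) -> list of absolute URLs found on that page
        queue = deque([(seed_url, 0)])  # track (url, depth) pairs
        visited = set()
        while queue:
            url, depth = queue.popleft()
            if url in visited or depth > MAX_DEPTH or not in_scope(url):
                continue
            visited.add(url)
            for link in fetch_links(url):
                queue.append((link, depth + 1))
            time.sleep(REQUEST_DELAY)  # pause so we don't hammer the server
        return visited

Tracking depth alongside each URL in the queue is what makes the depth limit cheap to enforce: it is checked at dequeue time rather than by re-walking the link graph.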

Simple crawlers use HTTP clients to fetch raw HTML. More advanced crawlers run a headless browser to render JavaScript, wait for dynamic content to load, and then process the fully rendered page. This distinction matters for modern single-page applications where much of the content is loaded after the initial HTML response.
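
One common way to get the rendered page from Python is a headless browser library such as Playwright. The sketch below assumes playwright is installed (pip install playwright, then playwright install chromium); waiting for network activity to settle is one heuristic for "dynamic content has loaded", not a universal rule.

    # Fetch the fully rendered HTML of a JavaScript-heavy page.
    from playwright.sync_api import sync_playwright

    def fetch_rendered_html(url):
        with sync_playwright() as p:
            browser = p.chromium.launch(headless=True)
            page = browser.new_page()
            # Wait until network activity settles so dynamic content can load.
            page.goto(url, wait_until="networkidle")
            html = page.content()  # the DOM after JavaScript has run
            browser.close()
        return html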

The crawler maintains a visited set to avoid processing the same URL twice and typically respects the robots.txt file, which tells automated agents which paths they may or may not access.
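
Python's standard library ships a robots.txt parser, so a politeness check can be a few lines. A sketch, with an illustrative user-agent name:

    from urllib.parse import urljoin
    from urllib.robotparser import RobotFileParser

    USER_AGENT = "ExampleCrawler"  # illustrative agent name

    def allowed_by_robots(url):
        parser = RobotFileParser()
        parser.set_url(urljoin(url, "/robots.txt"))
        parser.read()  # fetch and parse the file
        return parser.can_fetch(USER_AGENT, url)

In a real crawler the parsed file would be cached per host rather than re-fetched for every URL.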

Crawling vs scraping

Crawling and scraping are related but distinct activities. A crawler's primary job is discovery — finding pages and mapping the structure of a website. A scraper's job is extraction — pulling specific data (text, prices, images, metadata) from individual pages.

In practice, the two are often combined. A crawler discovers all product pages on an e-commerce site, and a scraper extracts the name, price, and availability from each page. Similarly, a crawler can discover pages across a marketing site while a screenshot engine captures a visual snapshot of each one.
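
A sketch of the extraction side might use BeautifulSoup (pip install beautifulsoup4). The CSS selectors here are hypothetical; a real scraper needs selectors matched to the target site's actual markup.

    from bs4 import BeautifulSoup

    def scrape_product(html):
        soup = BeautifulSoup(html, "html.parser")
        name = soup.select_one(".product-name")  # hypothetical selectors
        price = soup.select_one(".price")
        stock = soup.select_one(".availability")
        return {
            "name": name.get_text(strip=True) if name else None,
            "price": price.get_text(strip=True) if price else None,
            "availability": stock.get_text(strip=True) if stock else None,
        }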

The key difference is purpose: crawling answers "what pages exist?" while scraping answers "what is on each page?"

In screenshot tooling, crawling is usually the discovery layer rather than the end product. The crawler finds the URLs worth covering, and the capture system turns that URL list into visual evidence, audits, or archives.
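
Continuing the Playwright assumption from earlier, the capture step might look like the sketch below; the output directory and filename scheme are illustrative choices.

    import os
    from urllib.parse import urlparse
    from playwright.sync_api import sync_playwright

    def capture_all(urls, out_dir="shots"):
        os.makedirs(out_dir, exist_ok=True)
        with sync_playwright() as p:
            browser = p.chromium.launch(headless=True)
            page = browser.new_page()
            for i, url in enumerate(urls):
                page.goto(url, wait_until="networkidle")
                # Derive a filename from the URL path; the index avoids collisions.
                slug = urlparse(url).path.strip("/").replace("/", "-") or "home"
                page.screenshot(path=os.path.join(out_dir, f"{i:03d}-{slug}.png"),
                                full_page=True)
            browser.close()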

Where web crawlers are used

  • Search engines — Google, Bing, and other search engines use massive crawlers (Googlebot, Bingbot) to discover and index web pages across the internet.
  • Site audits and SEO — crawlers check for broken links, missing meta tags, redirect chains, and accessibility issues across an entire website.
  • Visual monitoring — some screenshot tools include a built-in web crawler that captures pages from a list of URLs, making it possible to visually audit hundreds of pages without manual navigation.
  • Archival and compliance — organizations crawl their own sites or third-party pages to create visual and data records for regulatory or legal purposes.
  • Competitive intelligence — crawlers monitor competitor websites for pricing changes, new product launches, and content updates.

Common mistakes

  • Ignoring robots.txt. The robots.txt file specifies which paths a crawler should avoid. Ignoring it can result in IP bans, legal issues, and unnecessary server load. Always check and respect these directives.
  • Crawling without rate limits. Sending hundreds of concurrent requests can overwhelm a server, trigger anti-bot defenses, and get your IP blocked. Introduce delays between requests and limit concurrent connections.
  • Not handling duplicate URLs. The same page can be reached through multiple URL variations (with or without trailing slashes, query parameters, or fragment identifiers). Normalize URLs and maintain a visited set to avoid processing duplicates (see the normalization sketch after this list).
  • Skipping JavaScript rendering. Modern websites load content dynamically. A crawler that only fetches raw HTML will miss content rendered by JavaScript frameworks like React, Vue, or Angular. Use a headless browser for JavaScript-heavy sites.
  • Failing to set a crawl scope. Without depth limits or URL filters, a crawler can spiral into infinite pagination, external domains, or dynamically generated URLs. Define clear boundaries before starting the crawl.
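
As referenced in the duplicate-URLs item above, one normalization scheme can be sketched with the standard library. Which transformations are safe (for example, whether trailing slashes or query-parameter order change the page) varies by site, so treat these rules as assumptions to adjust per target.

    # Normalize URL variants so the visited set treats them as one page.
    from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

    def normalize(url):
        parts = urlsplit(url)
        query = urlencode(sorted(parse_qsl(parts.query)))  # stable param order
        path = parts.path.rstrip("/") or "/"               # unify trailing slash
        return urlunsplit((
            parts.scheme.lower(),
            parts.netloc.lower(),
            path,
            query,
            "",  # drop the fragment; it never reaches the server
        ))

Under these rules, https://Example.com/docs/?b=2&a=1#intro and https://example.com/docs?a=1&b=2 both normalize to the same key.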

Common questions

What is the difference between a web crawler and a web scraper?

A crawler discovers pages by following links — its job is to find URLs. A scraper extracts specific data from those pages — its job is to pull content. Many tools combine both: the crawler finds pages, and the scraper (or screenshot engine) processes each one.

Do web crawlers execute JavaScript?

Traditional crawlers fetch raw HTML without executing JavaScript. Modern crawlers can use a headless browser to render JavaScript-heavy pages, which is necessary for single-page applications and dynamically loaded content.

How do I control which pages a crawler visits?

Use depth limits to restrict how many links deep the crawler follows. URL pattern filters (allow/deny lists) let you include or exclude specific paths. Respecting the site's robots.txt file is both a technical and ethical best practice.

Can a web crawler capture screenshots of every page it visits?

Yes. Some screenshot tools include a built-in crawler that visits each discovered page and captures a screenshot automatically. This is useful for visual audits, archival, and monitoring across an entire site.

Is web crawling legal?

Crawling publicly accessible pages is generally permissible, but you should respect robots.txt directives, avoid overloading servers with rapid requests, and review the site's terms of service. Crawling behind authentication or collecting personal data raises additional legal considerations.
