Of course! Building a web crawler (or spider) in Python is a fantastic project. I'll guide you through the entire process, from the basic concepts to a complete, production-ready code example.

We'll cover:
- Core Concepts: What a crawler does and the challenges it faces.
- Essential Libraries: The best tools for the job in Python.
- A Simple Crawler: A step-by-step guide to building a basic, single-threaded crawler.
- A Robust, Production-Ready Crawler: A more advanced, multi-threaded crawler with rate-limiting, politeness, and data extraction.
- Legal and Ethical Considerations: The most important part!
Core Concepts: What is a Web Crawler?
A web crawler is a program that automatically browses the web, typically for the purpose of web indexing (like Google or Bing) or web scraping (extracting data).
The basic workflow is a cycle (a minimal code sketch follows this list):
- Start with a list of URLs called the "frontier."
- Take a URL from the frontier.
- Fetch the content of that web page (the HTML).
- Parse the HTML to find new links and extract any desired data.
- Add the new, unique links back to the frontier.
- Save the extracted data.
- Repeat until the frontier is empty or a stopping condition is met.
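In code, that cycle is little more than a loop over a queue of URLs. Here is a compressed sketch of exactly that loop, using the same `requests` and Beautiful Soup libraries introduced below (the start URL and the five-page limit are arbitrary placeholders for this sketch):

```python
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

frontier = ["https://example.com/"]        # 1. the "frontier" of URLs to visit
seen = set(frontier)

while frontier and len(seen) < 5:          # stop after 5 pages for this sketch
    url = frontier.pop(0)                  # 2. take a URL from the frontier
    html = requests.get(url, timeout=5).text               # 3. fetch the page
    soup = BeautifulSoup(html, "html.parser")              # 4. parse the HTML
    print(url, "->", soup.title.string if soup.title else "?")   # 6. "save" the data
    for a in soup.find_all("a", href=True):                # 5. queue new, unique links
        link = urljoin(url, a["href"])
        if link not in seen:
            seen.add(link)
            frontier.append(link)
```

The rest of this guide builds this same loop out properly, with error handling, politeness, and concurrency.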
Key Challenges:
- Politeness: Don't overload a server with too many requests. This is crucial to avoid getting your IP address blocked.
- Scalability: How do you handle millions of URLs? A simple list won't work.
- Data Extraction: How do you reliably find the data you want within the messy structure of HTML?
- Handling Dynamic Content: Modern websites often load content using JavaScript, which basic crawlers can't see.
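On that last point: the rest of this guide sticks to static HTML, but one common workaround worth knowing is to let a headless browser render the page first and only then parse the result. A minimal sketch using Playwright, which is not used elsewhere in this guide and assumes you have run `pip install playwright` followed by `playwright install`:

```python
from bs4 import BeautifulSoup
from playwright.sync_api import sync_playwright

# Render a JavaScript-heavy page in a headless browser, then parse the final HTML.
with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://example.com")   # placeholder; use a JS-heavy page here
    html = page.content()              # the HTML *after* scripts have run
    browser.close()

soup = BeautifulSoup(html, "html.parser")
print(soup.title.string if soup.title else "No title found")
```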
Essential Python Libraries
You don't need to reinvent the wheel. Here are the most popular and effective libraries:

| Library | Purpose | Installation |
|---|---|---|
| `requests` | The standard for making HTTP requests. It's simple and powerful. | `pip install requests` |
| `Beautiful Soup` | A library for parsing HTML and XML. It creates a parse tree from page content, making it easy to navigate and search. | `pip install beautifulsoup4` |
| `lxml` | A high-performance parser that can be used as an alternative to Beautiful Soup's default parser. It's much faster. | `pip install lxml` |
| `Scrapy` | A full-fledged web crawling framework. It handles asynchronous requests, data pipelines, scheduling, and more. It's overkill for simple tasks but excellent for large-scale projects. | `pip install scrapy` |
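To give a sense of what the Scrapy end of that spectrum looks like, here is a minimal spider sketch (the class name and settings are illustrative, not part of this guide's code; you could run it with `scrapy runspider wiki_spider.py -o titles.json`):

```python
import scrapy

class WikiSpider(scrapy.Spider):
    """A minimal spider that follows Wikipedia article links and yields page titles."""
    name = "wiki"
    start_urls = ["https://en.wikipedia.org/wiki/Web_crawler"]
    custom_settings = {
        "DOWNLOAD_DELAY": 1,           # politeness: one second between requests
        "CLOSESPIDER_PAGECOUNT": 10,   # stop after 10 pages for this example
    }

    def parse(self, response):
        # Yield one item per page; Scrapy's feed exports handle storage.
        yield {"url": response.url, "title": response.css("title::text").get()}
        # Follow in-article links; Scrapy de-duplicates requests automatically.
        for href in response.css("a[href^='/wiki/']::attr(href)").getall():
            yield response.follow(href, callback=self.parse)
```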
For this guide, we'll use requests for fetching and Beautiful Soup for parsing, as they are the perfect combination for learning and building most custom crawlers.
A Simple Crawler: Step-by-Step
Let's build a crawler that starts on a Wikipedia page and follows all the links to other Wikipedia pages, printing the titles as it goes.
Step 1: Fetch and Parse a Page
First, let's fetch the content of a page and parse it.
```python
import requests
from bs4 import BeautifulSoup

# The URL we want to start with
url = 'https://en.wikipedia.org/wiki/Web_crawler'

try:
    # 1. Fetch the content
    response = requests.get(url)
    # Raise an exception if the request was unsuccessful (e.g., 404, 500)
    response.raise_for_status()

    # 2. Parse the HTML content
    soup = BeautifulSoup(response.text, 'lxml')

    # 3. Extract the page title
    title = soup.find('title').text
    print(f"Successfully fetched: {title}")

    # 4. Find all links on the page
    # We look for <a> tags with an 'href' attribute
    for link in soup.find_all('a', href=True):
        href = link['href']
        print(f"Found link: {href}")

except requests.exceptions.RequestException as e:
    print(f"Error fetching the URL: {e}")
```
Step 2: Add Logic to Follow Links
Now, let's make it a real crawler by adding a "frontier" and a way to track visited URLs to avoid infinite loops.

```python
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin, urlparse

# --- Configuration ---
START_URL = 'https://en.wikipedia.org/wiki/Web_crawler'
MAX_PAGES = 10  # Limit the number of pages to crawl for this example

# --- Frontier and Visited Sets ---
frontier = [START_URL]
visited_urls = set()

# --- Crawler Loop ---
while frontier and len(visited_urls) < MAX_PAGES:
    # Get the next URL from the frontier
    current_url = frontier.pop(0)

    # Skip if we've already visited this URL
    if current_url in visited_urls:
        continue

    print(f"Crawling: {current_url}")
    visited_urls.add(current_url)

    try:
        response = requests.get(current_url, timeout=5)
        response.raise_for_status()
        soup = BeautifulSoup(response.text, 'lxml')

        # Find all links that point to other Wikipedia pages
        for link in soup.find_all('a', href=True):
            href = link['href']
            # Convert relative URLs to absolute URLs
            absolute_url = urljoin(current_url, href)

            # Check if the link is another Wikipedia article
            # We use a simple domain check to keep it focused
            if 'en.wikipedia.org' in absolute_url and '/wiki/' in absolute_url:
                # Normalize the URL to avoid duplicates (e.g., /wiki/Page and /wiki/Page#section)
                parsed_url = urlparse(absolute_url)
                clean_url = parsed_url._replace(fragment="", params="").geturl()

                if clean_url not in visited_urls and clean_url not in frontier:
                    frontier.append(clean_url)

    except requests.exceptions.RequestException as e:
        print(f"Failed to crawl {current_url}: {e}")

print("\nCrawling finished.")
print(f"Total pages visited: {len(visited_urls)}")
```
This simple crawler has a major flaw: it's slow because it processes one URL at a time. For a real project, you'd use multi-threading.
A Robust, Production-Ready Crawler
This version is more advanced. It uses a thread pool for concurrency, implements rate-limiting, and separates data extraction from the crawling logic.
Key Features:
- Concurrency: Uses `concurrent.futures.ThreadPoolExecutor` to crawl multiple pages simultaneously.
- Rate Limiting: A `time.sleep()` call prevents overwhelming servers.
- Politeness: Respects `robots.txt` (a file that tells crawlers which parts of a site they can't access).
- Modular Code: Clear separation between the crawler and the data extractor.
- Robust Error Handling: Catches and logs various network and parsing errors.
```python
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin, urlparse
import time
import threading
import concurrent.futures

# --- Configuration ---
START_URL = 'https://en.wikipedia.org/wiki/Web_crawler'
MAX_THREADS = 5     # Number of concurrent threads
MAX_PAGES = 50      # Total number of pages to crawl
REQUEST_DELAY = 1   # Seconds to wait between requests (politeness)
OUTPUT_FILE = 'crawled_data.txt'

# --- Frontier and Visited Sets ---
frontier = [START_URL]
visited_urls = set()

# --- Data Storage ---
crawled_data = []

# --- Robots.txt Check ---
def is_allowed(url):
    try:
        # robots.txt is usually at the root of the domain
        base_url = f"{urlparse(url).scheme}://{urlparse(url).netloc}"
        robots_url = urljoin(base_url, 'robots.txt')
        response = requests.get(robots_url, timeout=5)
        if response.status_code == 200:
            # A simple check for a blanket "Disallow: /" rule
            # In a real crawler, you'd use a proper parser like `urllib.robotparser`
            for line in response.text.splitlines():
                if line.strip() == 'Disallow: /':
                    return False
        return True
    except requests.exceptions.RequestException:
        # If we can't fetch robots.txt, we'll be cautious and disallow
        return False

# --- Data Extraction Logic ---
def extract_data(soup, url):
    """Extracts the title and all paragraph text from a page."""
    title_tag = soup.find('title')
    title = title_tag.text if title_tag else "No Title Found"

    paragraphs = []
    for p in soup.find_all('p'):
        # Clean up the text a bit
        text = p.get_text(strip=True)
        if text:
            paragraphs.append(text)

    return {
        'url': url,
        'title': title,
        'paragraphs': paragraphs
    }

# --- The Crawler Worker Function ---
def crawl_worker(url):
    """A single worker that fetches, parses, and extracts data from a URL."""
    thread_name = threading.current_thread().name

    if url in visited_urls:
        return

    print(f"{thread_name}: Crawling {url}")
    visited_urls.add(url)

    try:
        # Be polite!
        time.sleep(REQUEST_DELAY)
        response = requests.get(url, timeout=10, headers={'User-Agent': 'MySimpleCrawler/1.0'})
        response.raise_for_status()
        soup = BeautifulSoup(response.text, 'lxml')

        # 1. Extract data
        data = extract_data(soup, url)
        crawled_data.append(data)
        print(f"{thread_name}: Extracted data from {data['title']}")

        # 2. Find new links
        for link in soup.find_all('a', href=True):
            href = link['href']
            absolute_url = urljoin(url, href)

            # Filter for Wikipedia articles
            if 'en.wikipedia.org' in absolute_url and '/wiki/' in absolute_url:
                parsed_url = urlparse(absolute_url)
                clean_url = parsed_url._replace(fragment="", params="").geturl()

                if clean_url not in visited_urls and clean_url not in frontier:
                    frontier.append(clean_url)

    except requests.exceptions.RequestException as e:
        print(f"{thread_name}: Error crawling {url}: {e}")
    except Exception as e:
        print(f"{thread_name}: An unexpected error occurred with {url}: {e}")

# --- Main Execution ---
if __name__ == "__main__":
    print("Starting web crawler...")

    # Check if we are allowed to crawl the start URL
    if not is_allowed(START_URL):
        print(f"Crawling {START_URL} is disallowed by robots.txt. Exiting.")
    else:
        with concurrent.futures.ThreadPoolExecutor(max_workers=MAX_THREADS) as executor:
            while frontier and len(visited_urls) < MAX_PAGES:
                # Submit a batch of URLs to the thread pool
                urls_to_crawl = []
                for _ in range(min(MAX_THREADS, len(frontier))):
                    if frontier:
                        urls_to_crawl.append(frontier.pop(0))

                # Map the worker function to the URLs; consuming the iterator
                # blocks until the whole batch has finished, so the frontier
                # is refilled before the loop condition is checked again
                list(executor.map(crawl_worker, urls_to_crawl))

                # Optional: Add a small delay between batches
                # time.sleep(0.1)

        print("\nCrawling finished.")
        print(f"Total pages visited: {len(visited_urls)}")

        # Save the extracted data
        with open(OUTPUT_FILE, 'w', encoding='utf-8') as f:
            for data in crawled_data:
                f.write(f"URL: {data['url']}\n")
                f.write(f"Title: {data['title']}\n")
                f.write("Content:\n")
                f.write("\n".join(data['paragraphs']))
                f.write("\n" + "="*80 + "\n")

        print(f"Data saved to {OUTPUT_FILE}")
```
Legal and Ethical Considerations (CRITICAL!)
Before you crawl any website, you must follow these rules to avoid legal trouble and be a good internet citizen.
- Check `robots.txt`: Every website should have a file at `www.example.com/robots.txt`. This file tells you which parts of the site the owner does not want you to crawl. Always respect it. Our robust crawler has a basic check for this.
- Read the Terms of Service (ToS): The website's ToS will almost certainly have a clause about automated access. Violating it can get you banned or lead to legal action.
- Be Polite:
  - Identify Your Bot: Use a descriptive `User-Agent` string in your request headers (e.g., `MyCoolProject-Crawler/1.0; +http://mywebsite.com`). This allows site administrators to contact you if there's a problem.
  - Limit Your Request Rate: Do not send requests too quickly. Our crawler uses `time.sleep()` for this. A good rule of thumb is one request per second, but check the site's `robots.txt` for a specific `Crawl-delay` instruction. (Both of these points are shown in a short sketch after this list.)
  - Don't Hit Sensitive Endpoints: Avoid logging in, submitting forms, or accessing pages that are clearly for user interaction.
- Don't Re-scrape Unnecessarily: If you only need data that doesn't change often, download it once and store it. Don't re-crawl the same site every day.
- Respect Copyright: The data you scrape is likely owned by someone. Use it responsibly and don't republish it in a way that infringes on copyright.
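As referenced above, the first two politeness rules translate directly into a few lines of code: a reusable `requests.Session` carries the identifying `User-Agent` on every request, and a fixed delay spaces the requests out (the contact URL and crawl delay here are placeholders you'd adjust):

```python
import time
import requests

# Identify the bot on every request; the project URL is a placeholder.
session = requests.Session()
session.headers.update({"User-Agent": "MyCoolProject-Crawler/1.0; +http://mywebsite.com"})

CRAWL_DELAY = 1.0   # seconds; raise this if the site's robots.txt sets a Crawl-delay

for url in ["https://en.wikipedia.org/wiki/Web_crawler",
            "https://en.wikipedia.org/wiki/Web_scraping"]:
    response = session.get(url, timeout=10)
    print(url, response.status_code)
    time.sleep(CRAWL_DELAY)   # keep the request rate to roughly one per second
```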
For large-scale, commercial crawling, consider using official APIs if they are available, as they are the most reliable and legal way to get data.
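For Wikipedia in particular, that official route is the MediaWiki API. As a rough illustration (the `extracts` property below relies on the TextExtracts extension that en.wikipedia.org enables; consult the API documentation of whatever site you target):

```python
import requests

# Ask the MediaWiki API for a plain-text extract instead of scraping the HTML.
params = {
    "action": "query",
    "format": "json",
    "titles": "Web crawler",
    "prop": "extracts",   # provided by the TextExtracts extension
    "explaintext": 1,     # plain text rather than HTML
}
response = requests.get(
    "https://en.wikipedia.org/w/api.php",
    params=params,
    headers={"User-Agent": "MyCoolProject-Crawler/1.0"},
    timeout=10,
)
for page in response.json()["query"]["pages"].values():
    print(page["title"])
    print(page.get("extract", "")[:200])   # first 200 characters of the article
```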
