Crawler
URL frontier (priority queue) for pyfetcher.crawler.
- Purpose:
Manage the URL crawl queue backed by Postgres. Implements the dual-queue pattern: priority-based selection with per-host politeness enforcement.
- class pyfetcher.crawler.frontier.Frontier(deduplicator=None)
Postgres-backed URL frontier with dedup and priority.
Combines job creation, dedup checking, and priority management into a single interface for the crawl stage.
- Parameters:
deduplicator (URLDeduplicator | None) – URL dedup checker.
- async add_url(session, url, *, priority=0, parent_job_id=None)
Add a URL to the frontier if not already seen.
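The dedup-then-enqueue behaviour described above can be sketched in memory. This is not the Postgres-backed implementation: `InMemoryFrontier`, its `pop` method, and the convention that larger priority values are served first are assumptions made for this illustration.

```python
import heapq
import itertools


class InMemoryFrontier:
    """Hypothetical in-memory stand-in for a URL frontier (not pyfetcher's class)."""

    def __init__(self):
        self._seen = set()               # URLs already added (dedup)
        self._heap = []                  # entries: (-priority, insertion order, url)
        self._order = itertools.count()  # tie-breaker keeps FIFO order per priority

    def add_url(self, url, *, priority=0):
        """Queue the URL unless it was seen before; return True if queued."""
        if url in self._seen:
            return False
        self._seen.add(url)
        # Negate priority so the largest value pops first (assumed convention).
        heapq.heappush(self._heap, (-priority, next(self._order), url))
        return True

    def pop(self):
        """Return the highest-priority URL, or None when the queue is empty."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]
```

A real frontier would also consult per-host politeness state before handing out a URL; that part is omitted here.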
Spider and router for pyfetcher.crawler.
- Purpose:
Provide a base spider class with URL pattern routing for handling different page types during crawling.
- class pyfetcher.crawler.spider.SpiderResult(discovered_urls=<factory>, items=<factory>, media_urls=<factory>)
Result of processing a crawled page.
- class pyfetcher.crawler.spider.Router
URL pattern router for spider handlers.
Maps URL regex patterns to async handler functions. The first matching pattern wins.
- add(pattern, handler)
Register a handler for a URL pattern.
- Parameters:
pattern (str) – Regex pattern to match URLs against.
handler (Callable[[str, FetchResponse], Coroutine[Any, Any, SpiderResult]]) – Async function handling matching URLs.
- Return type:
None
- default(handler)
Set the default handler for unmatched URLs.
- Parameters:
handler (Callable[[str, FetchResponse], Coroutine[Any, Any, SpiderResult]]) – Async function for URLs matching no pattern.
- Return type:
None
- resolve(url)
Find the handler for a URL.
- Parameters:
url (str) – The URL to route.
- Returns:
The matching handler, or the default handler, or None.
- Return type:
Callable[[str, FetchResponse], Coroutine[Any, Any, SpiderResult]] | None
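To illustrate the first-match-wins routing described for `Router`, here is a small standalone sketch. `SketchRouter` and the two handlers are hypothetical, the handlers return plain dicts instead of `SpiderResult`, and the use of `re.search` (match anywhere in the URL) is an assumption, since the matching mode is not stated above.

```python
import asyncio
import re


class SketchRouter:
    """Hypothetical first-match-wins router (illustration only)."""

    def __init__(self):
        self._routes = []      # (compiled pattern, handler), in registration order
        self._default = None

    def add(self, pattern, handler):
        self._routes.append((re.compile(pattern), handler))

    def default(self, handler):
        self._default = handler

    def resolve(self, url):
        for regex, handler in self._routes:
            # Assumption: a pattern may match anywhere in the URL.
            if regex.search(url):
                return handler
        # Falls back to the default handler, which may be None.
        return self._default


async def handle_article(url, response):
    return {"kind": "article", "url": url}


async def handle_other(url, response):
    return {"kind": "other", "url": url}


router = SketchRouter()
router.add(r"/articles/\d+", handle_article)
router.default(handle_other)

handler = router.resolve("https://example.com/articles/42")
result = asyncio.run(handler("https://example.com/articles/42", None))
```

Because patterns are tried in registration order, more specific patterns should be registered before broader ones.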
- class pyfetcher.crawler.spider.Spider(name='default')
Base spider with URL routing.
Provides a router for dispatching URLs to handler functions that extract data and discover new URLs.
- Parameters:
name (str) – Spider name for logging/identification.
- async handle(url, response)
Route a URL to its handler and return the result.
- Parameters:
url (str) – The crawled URL.
response (FetchResponse) – The fetch response.
- Returns:
A SpiderResult with discovered URLs and items.
- Return type:
SpiderResult
URL deduplication for pyfetcher.crawler.
- Purpose:
Normalize URLs and check/record seen status using xxhash64 for fast Postgres-backed deduplication.
- pyfetcher.crawler.dedup.normalize_url(url)
Normalize a URL for deduplication.
Strips fragments, sorts query params, lowercases scheme/host, removes trailing slashes on paths, and removes default ports.
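The normalization steps listed above could look roughly like the following, using only `urllib.parse`. `normalize_url_sketch` and `DEFAULT_PORTS` are names invented for this example, and the library's exact rules (for instance how it treats the root path, userinfo, or IPv6 hosts) may differ.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed default ports; only http and https are covered in this sketch.
DEFAULT_PORTS = {"http": 80, "https": 443}


def normalize_url_sketch(url):
    """Illustrative normalization; pyfetcher's exact rules may differ."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()  # drops any userinfo (simplification)
    port = parts.port
    # Keep the port only when it is not the scheme's default.
    if port is None or DEFAULT_PORTS.get(scheme) == port:
        netloc = host
    else:
        netloc = f"{host}:{port}"
    # Drop trailing slashes from the path (the root path becomes empty here).
    path = parts.path.rstrip("/")
    # Sort query parameters by key, then value.
    query = urlencode(sorted(parse_qsl(parts.query, keep_blank_values=True)))
    # Passing an empty fragment strips it.
    return urlunsplit((scheme, netloc, path, query, ""))
```

With this sketch, `HTTP://Example.COM:80/a/b/?z=1&a=2#frag` and `http://example.com/a/b?a=2&z=1` normalize to the same string, so they would be treated as one URL.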
- pyfetcher.crawler.dedup.url_hash(url)
Compute a hash for a normalized URL.
Uses SHA-256 truncated to 8 bytes (64 bits) for a BigInteger-compatible hash suitable for Postgres primary keys.
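A minimal sketch of the truncated SHA-256 approach described above. The function name `url_hash_sketch`, the big-endian byte order, and the signed interpretation are assumptions; the signed form is chosen here because a Postgres `BIGINT` holds values from -2^63 to 2^63 - 1.

```python
import hashlib


def url_hash_sketch(normalized_url):
    """SHA-256 truncated to 8 bytes, read as a signed 64-bit integer."""
    digest = hashlib.sha256(normalized_url.encode("utf-8")).digest()
    # Assumption: signed big-endian, so the value fits a BIGINT column.
    # The library's byte order and signedness may differ.
    return int.from_bytes(digest[:8], byteorder="big", signed=True)
```

Truncating to 64 bits keeps keys compact at the cost of a small collision probability, which grows with the number of stored URLs.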
- class pyfetcher.crawler.dedup.URLDeduplicator
URL deduplication checker backed by Postgres.
Normalizes URLs, hashes them, and checks/records them in the seen_urls table via the repository layer.
Politeness enforcement for pyfetcher.crawler.
- Purpose:
Enforce per-host crawl delays using robots.txt directives and configurable minimum request intervals.
- class pyfetcher.crawler.politeness.PolitenessEnforcer(default_delay_seconds=1.0)
Enforces crawl politeness per-host.
Checks robots.txt rules and enforces minimum delays between requests to the same host.
- Parameters:
default_delay_seconds (float) – Default delay when no crawl-delay directive exists.
- check_robots(robots_txt, path, *, user_agent='*')
Check if a path is allowed by robots.txt.
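The two politeness checks described above (robots.txt rules and a minimum delay per host) can be sketched with the standard library. `check_robots_sketch` and `HostDelaySketch` are hypothetical names. Using `urllib.robotparser` is an assumption about one possible approach, not a statement of how `PolitenessEnforcer` is implemented.

```python
import time
from urllib.robotparser import RobotFileParser


def check_robots_sketch(robots_txt, path, *, user_agent="*"):
    """Return True if `path` is allowed for `user_agent` (illustration only)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)


class HostDelaySketch:
    """Hypothetical per-host minimum-interval tracker."""

    def __init__(self, default_delay_seconds=1.0):
        self.default_delay_seconds = default_delay_seconds
        self._last_request = {}  # host -> timestamp of the last request

    def seconds_until_allowed(self, host, now=None):
        """Seconds to wait before the next request to `host` (0.0 if none)."""
        now = time.monotonic() if now is None else now
        last = self._last_request.get(host)
        if last is None:
            return 0.0
        return max(0.0, self.default_delay_seconds - (now - last))

    def record_request(self, host, now=None):
        """Remember when `host` was last contacted."""
        self._last_request[host] = time.monotonic() if now is None else now
```

The optional `now` argument exists only to make the sketch easy to test; a crawler would normally rely on the monotonic clock.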
RSS/Atom feed monitor for pyfetcher.crawler.
- Purpose:
Monitor RSS/Atom feeds for new entries with adaptive polling intervals based on publication frequency.
- class pyfetcher.crawler.feeds.FeedEntry(url, title=None, published=None, summary=None)
A single feed entry.
- class pyfetcher.crawler.feeds.FeedPollResult(new_entries=<factory>, latest_entry_hash=None, suggested_interval_minutes=60)
Result of polling a feed.
- pyfetcher.crawler.feeds.compute_entry_hash(entry)
Compute a hash for feed entry change detection.
- pyfetcher.crawler.feeds.calculate_poll_interval(entry_count, *, current_interval=60, min_interval=10, max_interval=1440)
Calculate an adaptive polling interval based on new entry count.
More new entries shorten the interval; no new entries lengthen it.
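The actual formula is not given above, so the following is only one plausible policy with the same signature: halve the interval when new entries appear, double it when none do, and clamp to the configured bounds. The name `calculate_poll_interval_sketch` and the halve/double rule are assumptions, and intervals are taken to be in minutes.

```python
def calculate_poll_interval_sketch(entry_count, *, current_interval=60,
                                   min_interval=10, max_interval=1440):
    """Illustrative adaptive policy; the library's real formula may differ."""
    if entry_count > 0:
        proposed = current_interval // 2  # new entries: poll sooner
    else:
        proposed = current_interval * 2   # nothing new: back off
    # Clamp the proposal to [min_interval, max_interval].
    return max(min_interval, min(max_interval, proposed))
```

Under this policy a feed that stays quiet drifts toward `max_interval` (one day at the default of 1440), while an active feed settles at `min_interval`.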
URL discovery (sitemaps + seeds) for pyfetcher.crawler.
- Purpose:
Discover URLs from sitemaps, robots.txt sitemap directives, and seed URL lists for populating the crawl frontier.
- pyfetcher.crawler.discovery.discover_sitemaps_from_robots(robots_txt)
Extract sitemap URLs from robots.txt content.
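As a rough illustration, sitemap URLs can be pulled from the `Sitemap:` lines of a robots.txt body as shown below. `sitemaps_from_robots_sketch` is a hypothetical name, and stripping `#` comments is a simplification that would also truncate a URL containing a literal `#`.

```python
def sitemaps_from_robots_sketch(robots_txt):
    """Collect URLs from `Sitemap:` lines (field name is case-insensitive)."""
    urls = []
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        # Split only on the first colon so the URL's own colons survive.
        field, sep, value = line.partition(":")
        if sep and field.strip().lower() == "sitemap" and value.strip():
            urls.append(value.strip())
    return urls
```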
- pyfetcher.crawler.discovery.discover_urls_from_sitemap(sitemap_xml)
Extract URLs from a sitemap XML document.
Handles both URL sitemaps and sitemap index files. For index files, returns the child sitemap URLs (not final page URLs).
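A minimal sketch of this behaviour with the standard-library XML parser follows. `urls_from_sitemap_sketch` is a hypothetical name. Collecting every `<loc>` element regardless of namespace is a simplification that would also pick up `<loc>` elements from sitemap extensions.

```python
import xml.etree.ElementTree as ET


def urls_from_sitemap_sketch(sitemap_xml):
    """Return every <loc> value from a <urlset> or <sitemapindex> document."""
    root = ET.fromstring(sitemap_xml)
    urls = []
    for element in root.iter():
        # Compare local names so the sitemap namespace does not matter.
        if element.tag.rsplit("}", 1)[-1] == "loc" and element.text:
            urls.append(element.text.strip())
    return urls
```

For an index file the returned values are child sitemap URLs, so a caller would fetch each one and run the same function again to reach page URLs.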