Crawler

URL frontier (priority queue) for pyfetcher.crawler.

Purpose: Manage the URL crawl queue backed by Postgres. Implements the dual-queue pattern: priority-based selection with per-host politeness enforcement.

class pyfetcher.crawler.frontier.Frontier(deduplicator=None)

Postgres-backed URL frontier with dedup and priority.

Combines job creation, dedup checking, and priority management into a single interface for the crawl stage.

Parameters: deduplicator (URLDeduplicator | None) – URL dedup checker.

async add_url(session, url, *, priority=0, parent_job_id=None)

Add a URL to the frontier if not already seen.

Parameters:

session (object) – Async database session.
url (str) – The URL to add.
priority (int) – Crawl priority (higher = more urgent).
parent_job_id (UUID | None) – Optional parent job for traceability.

Returns:

The new job UUID, or None if the URL was already seen.

Return type:

UUID | None

async add_urls(session, urls, *, priority=0, parent_job_id=None)

Add multiple URLs, skipping duplicates.

Parameters:

session (object) – Async database session.
urls (list[str]) – URLs to add.
priority (int) – Crawl priority.
parent_job_id (UUID | None) – Optional parent job.

Returns:

List of created job UUIDs (duplicates are excluded).

Return type:

list[UUID]
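
The dedup-then-enqueue behaviour described above can be illustrated with a small in-memory sketch. This is not the Postgres-backed implementation: `InMemoryFrontier` and its `pop` method are hypothetical names, a plain `set` stands in for the deduplicator, and the sketch is synchronous and takes no session.

```python
from __future__ import annotations

import heapq
import itertools
import uuid


class InMemoryFrontier:
    """Hypothetical in-memory stand-in for a URL frontier (not pyfetcher's class)."""

    def __init__(self) -> None:
        self._seen: set[str] = set()
        self._heap: list[tuple[int, int, str, uuid.UUID]] = []
        # Monotonic counter: breaks ties so equal priorities pop in insertion order.
        self._counter = itertools.count()

    def add_url(self, url: str, *, priority: int = 0) -> uuid.UUID | None:
        """Queue a URL and return its job ID, or None if it was already seen."""
        if url in self._seen:
            return None
        self._seen.add(url)
        job_id = uuid.uuid4()
        # heapq is a min-heap, so the priority is negated to pop the highest first.
        heapq.heappush(self._heap, (-priority, next(self._counter), url, job_id))
        return job_id

    def add_urls(self, urls: list[str], *, priority: int = 0) -> list[uuid.UUID]:
        """Queue several URLs and return job IDs for the new ones only."""
        job_ids = [self.add_url(url, priority=priority) for url in urls]
        return [job_id for job_id in job_ids if job_id is not None]

    def pop(self) -> str | None:
        """Return the most urgent queued URL, or None when the queue is empty."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]
```

A real frontier would also normalize each URL before the seen check and apply per-host politeness when selecting the next job; both are left out here to keep the sketch short.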

Spider and router for pyfetcher.crawler.

Purpose: Provide a base spider class with URL pattern routing for handling different page types during crawling.

class pyfetcher.crawler.spider.SpiderResult(discovered_urls=<factory>, items=<factory>, media_urls=<factory>)

Result of processing a crawled page.

Parameters:

discovered_urls (list[str]) – New URLs found on the page.
items (list[dict[str, Any]]) – Extracted structured data items.
media_urls (list[str]) – Media URLs found for downloading.

class pyfetcher.crawler.spider.Router

URL pattern router for spider handlers.

Maps URL regex patterns to async handler functions. The first matching pattern wins.

add(pattern, handler)

Register a handler for URLs matching a regex pattern.

Parameters:

pattern (str) – Regex pattern to match URLs against.
handler (Callable[[str, FetchResponse], Coroutine[Any, Any, SpiderResult]]) – Async function handling matching URLs.

Return type:

None

default(handler)

Set the default handler for unmatched URLs.

Parameters: handler (Callable[[str, FetchResponse], Coroutine[Any, Any, SpiderResult]]) – Async function for URLs matching no pattern.
Return type: None

resolve(url)

Find the handler for a URL.

Parameters: url (str) – The URL to route.
Returns: The matching handler, or the default handler, or None.
Return type: Callable[[str, FetchResponse], Coroutine[Any, Any, SpiderResult]] | None
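
The first-match-wins rule can be sketched with a list of compiled patterns checked in registration order. `MiniRouter` is a hypothetical stand-in, not the library's Router, and matching with `re.search` rather than `re.fullmatch` is an assumption.

```python
from __future__ import annotations

import re
from typing import Any, Awaitable, Callable

# Simplified handler type; the real handlers take a FetchResponse
# and return a SpiderResult.
Handler = Callable[[str, Any], Awaitable[Any]]


class MiniRouter:
    """Hypothetical first-match-wins router (not pyfetcher's Router)."""

    def __init__(self) -> None:
        self._routes: list[tuple[re.Pattern[str], Handler]] = []
        self._default: Handler | None = None

    def add(self, pattern: str, handler: Handler) -> None:
        # Patterns are compiled once and kept in registration order.
        self._routes.append((re.compile(pattern), handler))

    def default(self, handler: Handler) -> None:
        self._default = handler

    def resolve(self, url: str) -> Handler | None:
        for regex, handler in self._routes:
            if regex.search(url):  # assumption: search, not fullmatch
                return handler
        # No pattern matched: fall back to the default handler, which may be None.
        return self._default
```

Because the first matching pattern wins, specific patterns such as `/products/\d+` should be registered before broader ones such as `/products/`.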

class pyfetcher.crawler.spider.Spider(name='default')

Base spider with URL routing.

Provides a router for dispatching URLs to handler functions that extract data and discover new URLs.

Parameters: name (str) – Spider name for logging/identification.

async handle(url, response)

Route a URL to its handler and return the result.

Parameters:

url (str) – The crawled URL.
response (FetchResponse) – The fetch response.

Returns:

A SpiderResult with discovered URLs and items.

Return type:

SpiderResult

URL deduplication for pyfetcher.crawler.

Purpose: Normalize URLs and check/record seen status using a 64-bit hash for fast Postgres-backed deduplication.

pyfetcher.crawler.dedup.normalize_url(url)

Normalize a URL for deduplication.

Strips fragments, sorts query params, lowercases scheme/host, removes trailing slashes on paths, and removes default ports.

Parameters: url (str) – The URL to normalize.
Returns: The normalized URL string.
Return type: str
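
The normalization steps listed above can be approximated with the standard library. This is a minimal sketch, not the library's code: `normalize_url_sketch` is a hypothetical name, and edge-case handling (userinfo, IPv6 hosts, percent-encoding) is simplified.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def normalize_url_sketch(url: str) -> str:
    """Approximate the listed steps; not pyfetcher's normalize_url."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    # .hostname is already lowercased and drops userinfo and IPv6 brackets,
    # which is a simplification in this sketch.
    host = (parts.hostname or "").lower()
    if parts.port is not None and parts.port != DEFAULT_PORTS.get(scheme):
        host = f"{host}:{parts.port}"  # keep only non-default ports
    path = parts.path.rstrip("/")  # remove trailing slashes
    # Sort query parameters so that ?b=1&a=2 and ?a=2&b=1 compare equal.
    query = urlencode(sorted(parse_qsl(parts.query, keep_blank_values=True)))
    # An empty final component drops the fragment.
    return urlunsplit((scheme, host, path, query, ""))


print(normalize_url_sketch("HTTP://Example.COM:80/a/b/?z=1&a=2#frag"))
# prints http://example.com/a/b?a=2&z=1
```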

pyfetcher.crawler.dedup.url_hash(url)

Compute a hash for a normalized URL.

Uses SHA-256 truncated to 8 bytes (64 bits) for a BigInteger-compatible hash suitable for Postgres primary keys.

Parameters: url (str) – The URL to hash (should be pre-normalized).
Returns: A 64-bit integer hash.
Return type: int
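
A truncated SHA-256 hash of the kind described can be sketched as follows. The function name is hypothetical, and the byte order and signed interpretation are assumptions, chosen so the value fits a signed 64-bit Postgres BIGINT column.

```python
import hashlib


def url_hash_sketch(url: str) -> int:
    """Hash a URL to a 64-bit integer; not pyfetcher's url_hash."""
    digest = hashlib.sha256(url.encode("utf-8")).digest()
    # Keep the first 8 bytes. Assumption: big-endian and signed, so the
    # result lies in the range of a signed 64-bit integer.
    return int.from_bytes(digest[:8], byteorder="big", signed=True)
```

With 64 bits, accidental collisions become plausible once the table holds on the order of billions of URLs, so a very large crawl might also store the normalized URL alongside the hash.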

class pyfetcher.crawler.dedup.URLDeduplicator

URL deduplication checker backed by Postgres.

Normalizes URLs, hashes them, and checks/records them in the seen_urls table via the repository layer.

async is_seen(session, url)

Check if a URL has been seen before.

Parameters:

session (object) – Async database session.
url (str) – The URL to check.

Returns:

True if the URL has been seen.

Return type:

bool

async mark_seen(session, url)

Mark a URL as seen.

Parameters:

session (object) – Async database session.
url (str) – The URL to mark.

Return type:

None

Politeness enforcement for pyfetcher.crawler.

Purpose: Enforce per-host crawl delays using robots.txt directives and configurable minimum request intervals.

class pyfetcher.crawler.politeness.PolitenessEnforcer(default_delay_seconds=1.0)

Enforces crawl politeness per-host.

Checks robots.txt rules and enforces minimum delays between requests to the same host.

Parameters: default_delay_seconds (float) – Default delay when no crawl-delay directive exists.

extract_hostname(url)

Extract hostname from a URL.

Parameters: url (str) – The URL.
Returns: The hostname string.
Return type: str

check_robots(robots_txt, path, *, user_agent='*')

Check if a path is allowed by robots.txt.

Parameters:

robots_txt (str | None) – Raw robots.txt content (None means allowed).
path (str) – The URL path to check.
user_agent (str) – User-agent string.

Returns:

True if allowed.

Return type:

bool

get_crawl_delay(robots_txt)

Get the crawl delay from robots.txt or use default.

Parameters: robots_txt (str | None) – Raw robots.txt content.
Returns: Delay in seconds.
Return type: float
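
The robots.txt checks above can be approximated with the standard library's `urllib.robotparser`. This is a sketch under assumptions: the function names are hypothetical, the `default` and `user_agent` keyword arguments on the delay helper are additions for illustration, and the library may parse robots.txt differently.

```python
from __future__ import annotations

from urllib.robotparser import RobotFileParser


def _parse(robots_txt: str) -> RobotFileParser:
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def check_robots_sketch(robots_txt: str | None, path: str, *, user_agent: str = "*") -> bool:
    """Return True if the path may be fetched; missing robots.txt means allowed."""
    if robots_txt is None:
        return True
    return _parse(robots_txt).can_fetch(user_agent, path)


def get_crawl_delay_sketch(
    robots_txt: str | None, *, default: float = 1.0, user_agent: str = "*"
) -> float:
    """Return the Crawl-delay value, or the default when none is declared."""
    if robots_txt is None:
        return default
    # The stdlib parser only accepts whole-number Crawl-delay values;
    # anything else is reported as None and falls through to the default.
    delay = _parse(robots_txt).crawl_delay(user_agent)
    return float(delay) if delay is not None else default
```

For example, with the content `User-agent: *`, `Disallow: /private/`, `Crawl-delay: 5`, the first helper rejects `/private/x` and allows `/public`, and the second returns 5.0.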

async wait_for_host(hostname, delay_seconds)

Wait until it’s safe to fetch from a host.

Parameters:

hostname (str) – The target hostname.
delay_seconds (float) – Minimum delay between requests.

Return type:

None
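
One way to enforce a minimum gap between requests to the same host is to record, per hostname, the earliest time the next request may start. The sketch below assumes a single process and a single event loop; `HostThrottle` is a hypothetical name, and the library may track this state differently.

```python
import asyncio
import time


class HostThrottle:
    """Hypothetical per-host delay tracker (not pyfetcher's PolitenessEnforcer)."""

    def __init__(self) -> None:
        self._next_allowed: dict[str, float] = {}
        self._locks: dict[str, asyncio.Lock] = {}

    async def wait_for_host(self, hostname: str, delay_seconds: float) -> None:
        # One lock per host serializes callers targeting the same host
        # while leaving other hosts unaffected.
        lock = self._locks.setdefault(hostname, asyncio.Lock())
        async with lock:
            remaining = self._next_allowed.get(hostname, 0.0) - time.monotonic()
            if remaining > 0:
                await asyncio.sleep(remaining)
            # Reserve the next slot before releasing the lock.
            self._next_allowed[hostname] = time.monotonic() + delay_seconds
```

The monotonic clock is used so that system clock adjustments cannot shorten or lengthen the enforced delay.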

RSS/Atom feed monitor for pyfetcher.crawler.

Purpose: Monitor RSS/Atom feeds for new entries with adaptive polling intervals based on publication frequency.

class pyfetcher.crawler.feeds.FeedEntry(url, title=None, published=None, summary=None)

A single feed entry.

Parameters:

url (str)
title (str | None)
published (str | None)
summary (str | None)

class pyfetcher.crawler.feeds.FeedPollResult(new_entries=<factory>, latest_entry_hash=None, suggested_interval_minutes=60)

Result of polling a feed.

Parameters:

new_entries (list[FeedEntry])
latest_entry_hash (str | None)
suggested_interval_minutes (int)

pyfetcher.crawler.feeds.parse_feed(content)

Parse RSS/Atom feed content into entries.

Parameters: content (str) – Raw feed XML/content.
Returns: A list of FeedEntry objects.
Return type: list[FeedEntry]
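
As an illustration of what parsing both formats involves, the sketch below reads RSS 2.0 `<item>` elements and Atom `<entry>` elements with `xml.etree.ElementTree`. The `Entry` dataclass is a hypothetical mirror of the FeedEntry fields, and the real parser may rely on a dedicated feed library and handle more variants.

```python
from __future__ import annotations

import xml.etree.ElementTree as ET
from dataclasses import dataclass

ATOM = "{http://www.w3.org/2005/Atom}"


@dataclass
class Entry:
    """Hypothetical mirror of FeedEntry's fields."""

    url: str
    title: str | None = None
    published: str | None = None
    summary: str | None = None


def parse_feed_sketch(content: str) -> list[Entry]:
    # Note: ElementTree is not hardened against malicious XML.
    root = ET.fromstring(content)
    entries: list[Entry] = []
    # RSS 2.0: <rss><channel><item> with plain child elements.
    for item in root.iter("item"):
        link = item.findtext("link")
        if link and link.strip():
            entries.append(Entry(link.strip(), item.findtext("title"),
                                 item.findtext("pubDate"), item.findtext("description")))
    # Atom: namespaced <entry> elements; the URL is the href of the first <link>.
    for node in root.iter(f"{ATOM}entry"):
        link_el = node.find(f"{ATOM}link")
        href = link_el.get("href") if link_el is not None else None
        if href:
            published = node.findtext(f"{ATOM}published") or node.findtext(f"{ATOM}updated")
            entries.append(Entry(href, node.findtext(f"{ATOM}title"),
                                 published, node.findtext(f"{ATOM}summary")))
    return entries
```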

pyfetcher.crawler.feeds.compute_entry_hash(entry)

Compute a hash for feed entry change detection.

Parameters: entry (FeedEntry) – The feed entry.
Returns: A hex digest string.
Return type: str

pyfetcher.crawler.feeds.calculate_poll_interval(entry_count, *, current_interval=60, min_interval=10, max_interval=1440)

Calculate an adaptive polling interval based on new entry count.

The more new entries a poll finds, the shorter the next interval; a poll that finds none lengthens it.

Parameters:

entry_count (int) – Number of new entries found.
current_interval (int) – Current polling interval in minutes.
min_interval (int) – Minimum interval in minutes.
max_interval (int) – Maximum interval in minutes.

Returns:

Suggested interval in minutes.

Return type:

int
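
The exact formula is not given here. One plausible rule, shown purely as an assumption, divides the current interval by the number of new entries plus one, doubles it when nothing new appeared, and clamps the result to the configured bounds. The function name is hypothetical.

```python
def poll_interval_sketch(
    entry_count: int,
    *,
    current_interval: int = 60,
    min_interval: int = 10,
    max_interval: int = 1440,
) -> int:
    """Illustrative rule only; not pyfetcher's calculate_poll_interval."""
    if entry_count > 0:
        # More new entries shrink the interval faster (assumed rule).
        proposed = current_interval // (entry_count + 1)
    else:
        # Nothing new: back off by doubling (assumed rule).
        proposed = current_interval * 2
    # Clamp to the [min_interval, max_interval] range.
    return max(min_interval, min(max_interval, proposed))


print(poll_interval_sketch(0))  # prints 120
print(poll_interval_sketch(1))  # prints 30
print(poll_interval_sketch(5))  # prints 10
```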

URL discovery (sitemaps + seeds) for pyfetcher.crawler.

Purpose: Discover URLs from sitemaps, robots.txt sitemap directives, and seed URL lists for populating the crawl frontier.

pyfetcher.crawler.discovery.discover_sitemaps_from_robots(robots_txt)

Extract sitemap URLs from robots.txt content.

Parameters: robots_txt (str) – Raw robots.txt content.
Returns: A list of sitemap URLs.
Return type: list[str]
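
Sitemap directives are plain `Sitemap: <url>` lines, so extraction can be sketched with simple line parsing. The function name is hypothetical, and treating `#` as the start of a comment is a simplification that would truncate a URL containing a fragment.

```python
def sitemaps_from_robots_sketch(robots_txt: str) -> list[str]:
    """Collect Sitemap directive values; not pyfetcher's implementation."""
    sitemaps: list[str] = []
    for raw_line in robots_txt.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = raw_line.split("#", 1)[0].strip()
        # Split on the first colon only, so the URL's own colons survive.
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "sitemap" and value.strip():
            sitemaps.append(value.strip())
    return sitemaps
```

The directive name is matched case-insensitively, and Sitemap lines apply regardless of which User-agent group they appear under.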

pyfetcher.crawler.discovery.discover_urls_from_sitemap(sitemap_xml)

Extract URLs from a sitemap XML document.

Handles both URL sitemaps and sitemap index files. For index files, returns the child sitemap URLs (not final page URLs).

Parameters: sitemap_xml (str) – Raw sitemap XML content.
Returns: A list of discovered URLs.
Return type: list[str]
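
Both sitemap flavours store their URLs in `<loc>` elements: a `<urlset>` lists pages and a `<sitemapindex>` lists child sitemaps. A minimal sketch, assuming the standard sitemaps.org namespace and a hypothetical function name, is:

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def urls_from_sitemap_sketch(sitemap_xml: str) -> list[str]:
    """Collect every <loc> value; not pyfetcher's implementation."""
    # Note: ElementTree is not hardened against malicious XML.
    root = ET.fromstring(sitemap_xml)
    # <urlset> nests <url><loc>; <sitemapindex> nests <sitemap><loc>.
    # Iterating over all <loc> elements covers both layouts.
    return [
        loc.text.strip()
        for loc in root.iter(f"{SITEMAP_NS}loc")
        if loc.text and loc.text.strip()
    ]
```

For an index file the returned URLs point at further sitemaps, so a caller would fetch each one and run the same extraction again to reach page URLs.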

pyfetcher.crawler.discovery.build_seed_urls(*, urls=None, robots_txt=None, sitemap_xml=None)

Build a combined list of seed URLs from multiple sources.

Parameters:

urls (list[str] | None) – Explicit seed URLs.
robots_txt (str | None) – robots.txt content (extracts sitemap URLs).
sitemap_xml (str | None) – Sitemap XML content (extracts page URLs).

Returns:

A deduplicated list of seed URLs.

Return type:

list[str]