Crawler

URL frontier (priority queue) for pyfetcher.crawler.

Purpose:

Manage the URL crawl queue backed by Postgres. Implements the dual-queue pattern: priority-based selection with per-host politeness enforcement.

class pyfetcher.crawler.frontier.Frontier(deduplicator=None)

Postgres-backed URL frontier with dedup and priority.

Combines job creation, dedup checking, and priority management into a single interface for the crawl stage.

Parameters:

deduplicator (URLDeduplicator | None) – URL dedup checker.

async add_url(session, url, *, priority=0, parent_job_id=None)

Add a URL to the frontier if not already seen.

Parameters:
  • session (object) – Async database session.

  • url (str) – The URL to add.

  • priority (int) – Crawl priority (higher = more urgent).

  • parent_job_id (UUID | None) – Optional parent job for traceability.

Returns:

The new job UUID, or None if the URL was already seen.

Return type:

UUID | None

async add_urls(session, urls, *, priority=0, parent_job_id=None)

Add multiple URLs, skipping duplicates.

Parameters:
  • session (object) – Async database session.

  • urls (list[str]) – URLs to add.

  • priority (int) – Crawl priority.

  • parent_job_id (UUID | None) – Optional parent job.

Returns:

List of created job UUIDs. Duplicate URLs are skipped and contribute no entry.

Return type:

list[UUID]
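The add-if-unseen and priority semantics described above can be illustrated with a small in-memory stand-in. This is only a sketch: `InMemoryFrontier` and its `pop` method are hypothetical names, it keeps state in a set and a heap rather than Postgres, and it ignores the session, parent job, and per-host politeness parts of the real interface.

```python
import heapq
import itertools
import uuid


class InMemoryFrontier:
    """Hypothetical in-memory stand-in; not pyfetcher's Frontier."""

    def __init__(self):
        self._seen = set()
        self._heap = []
        self._counter = itertools.count()  # tie-breaker: keeps insertion order

    def add_url(self, url, *, priority=0):
        """Queue a URL; return a new job id, or None if it was already seen."""
        if url in self._seen:
            return None
        self._seen.add(url)
        job_id = uuid.uuid4()
        # heapq is a min-heap, so negate priority to pop the most urgent first.
        heapq.heappush(self._heap, (-priority, next(self._counter), url, job_id))
        return job_id

    def add_urls(self, urls, *, priority=0):
        """Queue several URLs; return ids only for the ones actually added."""
        ids = (self.add_url(u, priority=priority) for u in urls)
        return [job_id for job_id in ids if job_id is not None]

    def pop(self):
        """Return the most urgent queued URL, or None when the queue is empty."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]
```

A second call with the same URL returns None, which mirrors the documented return value of add_url for already-seen URLs.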

Spider and router for pyfetcher.crawler.

Purpose:

Provide a base spider class with URL pattern routing for handling different page types during crawling.

class pyfetcher.crawler.spider.SpiderResult(discovered_urls=<factory>, items=<factory>, media_urls=<factory>)

Result of processing a crawled page.

Parameters:
  • discovered_urls (list[str]) – New URLs found on the page.

  • items (list[dict[str, Any]]) – Extracted structured data items.

  • media_urls (list[str]) – Media URLs found for downloading.

class pyfetcher.crawler.spider.Router

URL pattern router for spider handlers.

Maps URL regex patterns to async handler functions. The first matching pattern wins.

add(pattern, handler)

Register a handler for a URL pattern.

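Parameters:
  • pattern – URL regex pattern to match.

  • handler (Callable[[str, FetchResponse], Coroutine[Any, Any, SpiderResult]]) – Async function for URLs matching the pattern.

Return type:

None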

default(handler)

Set the default handler for unmatched URLs.

Parameters:

handler (Callable[[str, FetchResponse], Coroutine[Any, Any, SpiderResult]]) – Async function for URLs matching no pattern.

Return type:

None

resolve(url)

Find the handler for a URL.

Parameters:

url (str) – The URL to route.

Returns:

The handler of the first matching pattern; otherwise the default handler; None if no pattern matches and no default is set.

Return type:

Callable[[str, FetchResponse], Coroutine[Any, Any, SpiderResult]] | None
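First-match routing as described for Router can be sketched in a few lines. This is a hypothetical re-implementation for illustration, not the library's code: `SketchRouter` is an invented name, the use of `re.search` (rather than a full match) is an assumption, and the handlers return plain dicts instead of SpiderResult to keep the example self-contained.

```python
import asyncio
import re


class SketchRouter:
    """Hypothetical first-match router; not pyfetcher's Router."""

    def __init__(self):
        self._routes = []      # (compiled pattern, handler) in registration order
        self._default = None

    def add(self, pattern, handler):
        self._routes.append((re.compile(pattern), handler))

    def default(self, handler):
        self._default = handler

    def resolve(self, url):
        for regex, handler in self._routes:
            if regex.search(url):  # assumption: substring search, not fullmatch
                return handler
        return self._default       # None when no default was registered


async def product_page(url, response):
    return {"kind": "product", "url": url}


async def fallback(url, response):
    return {"kind": "other", "url": url}


router = SketchRouter()
router.add(r"/products/\d+", product_page)   # registered first, so it wins
router.add(r"/products/", fallback)
router.default(fallback)

handler = router.resolve("https://shop.example/products/42")
result = asyncio.run(handler("https://shop.example/products/42", None))
```

Because the first matching pattern wins, more specific patterns need to be registered before broader ones.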

class pyfetcher.crawler.spider.Spider(name='default')

Base spider with URL routing.

Provides a router for dispatching URLs to handler functions that extract data and discover new URLs.

Parameters:

name (str) – Spider name for logging/identification.

async handle(url, response)

Route a URL to its handler and return the result.

Parameters:
  • url (str) – The crawled URL.

  • response (FetchResponse) – The fetch response.

Returns:

A SpiderResult with discovered URLs and items.

Return type:

SpiderResult

URL deduplication for pyfetcher.crawler.

Purpose:

Normalize URLs and check/record seen status using a 64-bit URL hash (see url_hash) for fast Postgres-backed deduplication.

pyfetcher.crawler.dedup.normalize_url(url)

Normalize a URL for deduplication.

Strips fragments, sorts query params, lowercases scheme/host, removes trailing slashes on paths, and removes default ports.

Parameters:

url (str) – The URL to normalize.

Returns:

The normalized URL string.

Return type:

str
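The normalization steps listed above can be approximated with the standard library. The following is a sketch under assumptions, not the library's implementation: `normalize_url_sketch` is a hypothetical name, it drops any userinfo component, and it does not handle IPv6 literals.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def normalize_url_sketch(url):
    """Hypothetical approximation of the documented normalization steps."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    port = parts.port
    # Keep the port only when it differs from the scheme's default.
    if port is None or DEFAULT_PORTS.get(scheme) == port:
        netloc = host
    else:
        netloc = f"{host}:{port}"
    path = parts.path.rstrip("/")  # remove trailing slashes
    # Sort query parameters so equivalent URLs compare equal.
    query = urlencode(sorted(parse_qsl(parts.query, keep_blank_values=True)))
    # The empty last component strips the fragment.
    return urlunsplit((scheme, netloc, path, query, ""))


print(normalize_url_sketch("HTTP://Example.COM:80/a/b/?z=1&a=2#frag"))
# prints: http://example.com/a/b?a=2&z=1
```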

pyfetcher.crawler.dedup.url_hash(url)

Compute a hash for a normalized URL.

Uses SHA-256 truncated to 8 bytes (64 bits) for a BigInteger-compatible hash suitable for Postgres primary keys.

Parameters:

url (str) – The URL to hash (should be pre-normalized).

Returns:

A 64-bit integer hash.

Return type:

int
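A truncated SHA-256 hash of this kind might look like the sketch below. The byte order and the signed interpretation are assumptions (a signed value fits the range of a Postgres BIGINT column); `url_hash_sketch` is a hypothetical name and the library may differ in both details.

```python
import hashlib


def url_hash_sketch(url):
    """Hypothetical sketch: first 8 bytes of SHA-256 as a 64-bit integer."""
    digest = hashlib.sha256(url.encode("utf-8")).digest()
    # Assumption: big-endian and signed, so the result lies in
    # [-2**63, 2**63 - 1], the range of a signed 64-bit column.
    return int.from_bytes(digest[:8], byteorder="big", signed=True)
```

With 64 bits, accidental collisions become likely only at very large URL counts (on the order of billions), which is a common trade-off for compact keys.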

class pyfetcher.crawler.dedup.URLDeduplicator

URL deduplication checker backed by Postgres.

Normalizes URLs, hashes them, and checks/records them in the seen_urls table via the repository layer.

async is_seen(session, url)

Check if a URL has been seen before.

Parameters:
  • session (object) – Async database session.

  • url (str) – The URL to check.

Returns:

True if the URL has been seen.

Return type:

bool

async mark_seen(session, url)

Mark a URL as seen.

Parameters:
  • session (object) – Async database session.

  • url (str) – The URL to mark.

Return type:

None

Politeness enforcement for pyfetcher.crawler.

Purpose:

Enforce per-host crawl delays using robots.txt directives and configurable minimum request intervals.

class pyfetcher.crawler.politeness.PolitenessEnforcer(default_delay_seconds=1.0)

Enforces crawl politeness per-host.

Checks robots.txt rules and enforces minimum delays between requests to the same host.

Parameters:

default_delay_seconds (float) – Default delay when no crawl-delay directive exists.

extract_hostname(url)

Extract hostname from a URL.

Parameters:

url (str) – The URL.

Returns:

The hostname string.

Return type:

str

check_robots(robots_txt, path, *, user_agent='*')

Check if a path is allowed by robots.txt.

Parameters:
  • robots_txt (str | None) – Raw robots.txt content (None means allowed).

  • path (str) – The URL path to check.

  • user_agent (str) – User-agent string.

Returns:

True if allowed.

Return type:

bool
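A robots.txt check with these semantics can be sketched with the standard library's `urllib.robotparser`. Whether the library uses that module is not stated here, so treat `check_robots_sketch` as a hypothetical equivalent rather than the actual implementation.

```python
from urllib.robotparser import RobotFileParser


def check_robots_sketch(robots_txt, path, *, user_agent="*"):
    """Hypothetical sketch: True if robots.txt allows fetching the path."""
    if robots_txt is None:
        return True  # no robots.txt content means everything is allowed
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)


ROBOTS = """\
User-agent: *
Disallow: /private/
"""

print(check_robots_sketch(ROBOTS, "/private/report"))  # prints: False
print(check_robots_sketch(ROBOTS, "/public"))          # prints: True
```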

get_crawl_delay(robots_txt)

Get the crawl delay from robots.txt or use default.

Parameters:

robots_txt (str | None) – Raw robots.txt content.

Returns:

Delay in seconds.

Return type:

float
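The fallback from a Crawl-delay directive to a default can be sketched as follows. `get_crawl_delay_sketch` and its `user_agent` parameter are hypothetical; the sketch relies on `urllib.robotparser`, which only recognizes integer Crawl-delay values, so fractional directives fall back to the default here.

```python
from urllib.robotparser import RobotFileParser


def get_crawl_delay_sketch(robots_txt, *, default_delay_seconds=1.0, user_agent="*"):
    """Hypothetical sketch: Crawl-delay from robots.txt, else the default."""
    if robots_txt is None:
        return default_delay_seconds
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # crawl_delay() returns None when no usable directive applies to the agent.
    delay = parser.crawl_delay(user_agent)
    return float(delay) if delay is not None else default_delay_seconds
```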

async wait_for_host(hostname, delay_seconds)

Wait until it’s safe to fetch from a host.

Parameters:
  • hostname (str) – The target hostname.

  • delay_seconds (float) – Minimum delay between requests.

Return type:

None
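One way to enforce a minimum gap between requests to the same host is to remember the last request time per hostname and sleep for the remainder. The sketch below assumes that approach; `HostThrottle` is a hypothetical class, and it is not safe for concurrent callers targeting the same host (a per-host lock would be needed for that).

```python
import asyncio
import time


class HostThrottle:
    """Hypothetical per-host delay tracker; not pyfetcher's PolitenessEnforcer."""

    def __init__(self):
        self._last_request = {}  # hostname -> monotonic timestamp of last fetch

    async def wait_for_host(self, hostname, delay_seconds):
        last = self._last_request.get(hostname)
        if last is not None:
            remaining = delay_seconds - (time.monotonic() - last)
            if remaining > 0:
                await asyncio.sleep(remaining)
        # Record the moment the caller is cleared to fetch.
        self._last_request[hostname] = time.monotonic()


async def demo():
    throttle = HostThrottle()
    start = time.monotonic()
    await throttle.wait_for_host("a.example", 0.1)   # first call: no wait
    first = time.monotonic() - start
    await throttle.wait_for_host("a.example", 0.1)   # second call: waits
    second = time.monotonic() - start
    return first, second


first_elapsed, second_elapsed = asyncio.run(demo())
```

Using a monotonic clock avoids surprises when the system wall clock is adjusted during a crawl.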

RSS/Atom feed monitor for pyfetcher.crawler.

Purpose:

Monitor RSS/Atom feeds for new entries with adaptive polling intervals based on publication frequency.

class pyfetcher.crawler.feeds.FeedEntry(url, title=None, published=None, summary=None)

A single feed entry.

Parameters:
  • url (str)

  • title (str | None)

  • published (str | None)

  • summary (str | None)

class pyfetcher.crawler.feeds.FeedPollResult(new_entries=<factory>, latest_entry_hash=None, suggested_interval_minutes=60)

Result of polling a feed.

Parameters:
  • new_entries (list[FeedEntry])

  • latest_entry_hash (str | None)

  • suggested_interval_minutes (int)

pyfetcher.crawler.feeds.parse_feed(content)

Parse RSS/Atom feed content into entries.

Parameters:

content (str) – Raw feed XML/content.

Returns:

A list of FeedEntry objects.

Return type:

list[FeedEntry]
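For orientation, a minimal parser covering RSS 2.0 items and Atom entries could look like the sketch below. It is not the library's parser: `Entry` is a hypothetical stand-in that mirrors FeedEntry's fields, the element mapping (pubDate, description, published/updated, summary) is an assumption, and untrusted feeds would warrant a hardened XML parser.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import Optional

ATOM = "{http://www.w3.org/2005/Atom}"


@dataclass
class Entry:
    """Hypothetical stand-in mirroring FeedEntry's fields."""
    url: str
    title: Optional[str] = None
    published: Optional[str] = None
    summary: Optional[str] = None


def parse_feed_sketch(content):
    """Hypothetical sketch: extract entries from RSS 2.0 or Atom XML."""
    root = ET.fromstring(content)
    entries = []
    # RSS 2.0: <item> elements with <link>, <title>, <pubDate>, <description>.
    for item in root.iter("item"):
        link = item.findtext("link")
        if link:
            entries.append(Entry(link.strip(), item.findtext("title"),
                                 item.findtext("pubDate"),
                                 item.findtext("description")))
    # Atom: namespaced <entry> elements with <link href="...">.
    for node in root.iter(f"{ATOM}entry"):
        link = node.find(f"{ATOM}link")
        href = link.get("href") if link is not None else None
        if href:
            published = (node.findtext(f"{ATOM}published")
                         or node.findtext(f"{ATOM}updated"))
            entries.append(Entry(href, node.findtext(f"{ATOM}title"),
                                 published, node.findtext(f"{ATOM}summary")))
    return entries
```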

pyfetcher.crawler.feeds.compute_entry_hash(entry)

Compute a hash for feed entry change detection.

Parameters:

entry (FeedEntry) – The feed entry.

Returns:

A hex digest string.

Return type:

str

pyfetcher.crawler.feeds.calculate_poll_interval(entry_count, *, current_interval=60, min_interval=10, max_interval=1440)

Calculate an adaptive polling interval based on new entry count.

More new entries shorten the interval; no new entries lengthen it.

Parameters:
  • entry_count (int) – Number of new entries found.

  • current_interval (int) – Current polling interval in minutes.

  • min_interval (int) – Minimum interval in minutes.

  • max_interval (int) – Maximum interval in minutes.

Returns:

Suggested interval in minutes.

Return type:

int
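The exact adjustment rule is not documented here. As one plausible illustration, the sketch below halves the interval when new entries appear and doubles it when none do, then clamps to the configured bounds. Both the halve/double rule and the name `poll_interval_sketch` are assumptions.

```python
def poll_interval_sketch(entry_count, *, current_interval=60,
                         min_interval=10, max_interval=1440):
    """Hypothetical adaptive rule: halve on new entries, double on none."""
    if entry_count > 0:
        proposed = current_interval // 2
    else:
        proposed = current_interval * 2
    # Clamp to [min_interval, max_interval] minutes.
    return max(min_interval, min(max_interval, proposed))


print(poll_interval_sketch(3))   # prints: 30
print(poll_interval_sketch(0))   # prints: 120
```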

URL discovery (sitemaps + seeds) for pyfetcher.crawler.

Purpose:

Discover URLs from sitemaps, robots.txt sitemap directives, and seed URL lists for populating the crawl frontier.

pyfetcher.crawler.discovery.discover_sitemaps_from_robots(robots_txt)

Extract sitemap URLs from robots.txt content.

Parameters:

robots_txt (str) – Raw robots.txt content.

Returns:

A list of sitemap URLs.

Return type:

list[str]
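Sitemap directives in robots.txt are plain `Sitemap: <url>` lines, so extraction can be sketched with simple line parsing. `sitemaps_from_robots_sketch` is a hypothetical name; case-insensitive matching and stripping of trailing `#` comments are assumptions (the latter would also truncate a URL containing a fragment).

```python
def sitemaps_from_robots_sketch(robots_txt):
    """Hypothetical sketch: collect URLs from 'Sitemap:' lines."""
    sitemaps = []
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        # Split on the first colon only, so the URL's own colons survive.
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "sitemap" and value.strip():
            sitemaps.append(value.strip())
    return sitemaps
```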

pyfetcher.crawler.discovery.discover_urls_from_sitemap(sitemap_xml)

Extract URLs from a sitemap XML document.

Handles both URL sitemaps and sitemap index files. For index files, returns the child sitemap URLs (not final page URLs).

Parameters:

sitemap_xml (str) – Raw sitemap XML content.

Returns:

A list of discovered URLs.

Return type:

list[str]
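Both sitemap flavors keep their URLs in `<loc>` elements one level below the root, which makes a single extraction loop sufficient. The sketch below assumes the standard sitemaps.org 0.9 namespace and uses a hypothetical function name; it is an illustration, not the library's parser.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def urls_from_sitemap_sketch(sitemap_xml):
    """Hypothetical sketch: return <loc> values from a urlset or sitemapindex."""
    root = ET.fromstring(sitemap_xml)
    urls = []
    # <urlset> holds <url><loc>; <sitemapindex> holds <sitemap><loc>.
    # Either way the wanted value is the <loc> child of each root child.
    for child in root:
        loc = child.findtext(f"{SITEMAP_NS}loc")
        if loc and loc.strip():
            urls.append(loc.strip())
    return urls
```

For an index file the returned values are child sitemap URLs, so a caller would fetch each one and run the same extraction again to reach page URLs.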

pyfetcher.crawler.discovery.build_seed_urls(*, urls=None, robots_txt=None, sitemap_xml=None)

Build a combined list of seed URLs from multiple sources.

Parameters:
  • urls (list[str] | None) – Explicit seed URLs.

  • robots_txt (str | None) – robots.txt content (extracts sitemap URLs).

  • sitemap_xml (str | None) – Sitemap XML content (extracts page URLs).

Returns:

A deduplicated list of seed URLs.

Return type:

list[str]
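Combining several URL sources into one deduplicated list can be sketched with an order-preserving merge. Whether the library preserves first-seen order is not stated, so that behavior, along with the helper name `merge_seed_urls`, is an assumption.

```python
def merge_seed_urls(*sources):
    """Hypothetical helper: concatenate URL lists, dropping repeats.

    None or empty sources are skipped; first-seen order is kept.
    """
    merged = [url for source in sources if source for url in source]
    # dict preserves insertion order, so this dedups without reordering.
    return list(dict.fromkeys(merged))


print(merge_seed_urls(["https://a.example/", "https://b.example/"],
                      None,
                      ["https://b.example/", "https://c.example/"]))
# prints: ['https://a.example/', 'https://b.example/', 'https://c.example/']
```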