Scrape

CSS selector-based extraction for pyfetcher.

Purpose:

Provide ergonomic functions for extracting data from HTML using CSS selectors via BeautifulSoup. Covers common patterns: selecting elements, extracting text, extracting attributes, and parsing HTML tables.

Examples

>>> html = "<div class='item'>Hello</div><div class='item'>World</div>"
>>> extract_text(html, ".item")
['Hello', 'World']

pyfetcher.scrape.selectors.select(html, selector)

Select all elements matching a CSS selector.

Parameters:
  • html (str) – Raw HTML string to parse.

  • selector (str) – CSS selector string.

Returns:

A list of matching bs4.Tag objects.

Return type:

list[Tag]

Examples

>>> html = "<ul><li>A</li><li>B</li></ul>"
>>> tags = select(html, "li")
>>> len(tags)
2

pyfetcher.scrape.selectors.select_one(html, selector)

Select the first element matching a CSS selector.

Parameters:
  • html (str) – Raw HTML string to parse.

  • selector (str) – CSS selector string.

Returns:

The first matching bs4.Tag, or None if not found.

Return type:

Tag | None

Examples

>>> html = "<h1>Title</h1><h1>Subtitle</h1>"
>>> tag = select_one(html, "h1")
>>> tag.get_text()
'Title'

pyfetcher.scrape.selectors.extract_text(html, selector, *, strip=True)

Extract text content from all elements matching a CSS selector.

Parameters:
  • html (str) – Raw HTML string to parse.

  • selector (str) – CSS selector string.

  • strip (bool) – Whether to strip whitespace from each text result.

Returns:

A list of text strings from matching elements.

Return type:

list[str]

Examples

>>> html = "<p>Hello</p><p>World</p>"
>>> extract_text(html, "p")
['Hello', 'World']

pyfetcher.scrape.selectors.extract_attrs(html, selector, *, attrs=None)

Extract attributes from all elements matching a CSS selector.

If attrs is not specified, all attributes of each element are returned. If attrs is a list of attribute names, only those attributes are included (with None for missing attributes).

Parameters:
  • html (str) – Raw HTML string to parse.

  • selector (str) – CSS selector string.

  • attrs (list[str] | None) – Optional list of attribute names to extract.

Returns:

A list of dictionaries mapping attribute names to values.

Return type:

list[dict[str, Any]]

Examples

>>> html = '<a href="/about">About</a><a href="/home">Home</a>'
>>> extract_attrs(html, "a", attrs=["href"])
[{'href': '/about'}, {'href': '/home'}]
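
The attrs filtering rule described above can be pictured with plain dictionaries. This is a minimal sketch, not pyfetcher's implementation: filter_attrs is a hypothetical helper, and the input dictionary stands in for the attributes of one parsed element.

```python
from __future__ import annotations

from typing import Any


def filter_attrs(tag_attrs: dict[str, Any], attrs: list[str] | None = None) -> dict[str, Any]:
    """Hypothetical helper mirroring the documented `attrs` behaviour."""
    if attrs is None:
        # No filter given: return every attribute the element carries.
        return dict(tag_attrs)
    # Filter given: keep only the requested names, with None for missing ones.
    return {name: tag_attrs.get(name) for name in attrs}


# Made-up attributes of a single <a> element.
link = {"href": "/about", "class": "nav"}
everything = filter_attrs(link)
selected = filter_attrs(link, ["href", "title"])
print(everything)
print(selected)
```

The second call requests a title attribute that the element does not have, so the sketch maps it to None rather than dropping the key.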

pyfetcher.scrape.selectors.extract_table(html, selector='table', *, include_headers=True)

Extract data from an HTML table as a list of rows.

Parses the first <table> element matching the selector. If include_headers is True, the first row will contain header cell (<th>) text. Subsequent rows contain data cell (<td>) text.

Parameters:
  • html (str) – Raw HTML string to parse.

  • selector (str) – CSS selector targeting the table element.

  • include_headers (bool) – Whether to include <th> cells as the first row.

Returns:

A list of rows, where each row is a list of cell text strings.

Return type:

list[list[str]]

Examples

>>> html = "<table><tr><th>Name</th></tr><tr><td>Alice</td></tr></table>"
>>> extract_table(html)
[['Name'], ['Alice']]
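
Because the header row comes first when include_headers is True, the returned list of rows can be turned into one dictionary per data row. The sketch below works on made-up sample rows of the documented shape rather than calling extract_table itself.

```python
# Made-up rows in the shape extract_table is documented to return
# when include_headers=True: header cells first, then data rows.
rows = [["Name", "Age"], ["Alice", "30"], ["Bob", "25"]]

# Split off the header row, then pair each header with the cell below it.
header, *data = rows
records = [dict(zip(header, row)) for row in data]
print(records)
```

Note that zip stops at the shorter sequence, so a row with fewer cells than the header simply yields fewer keys.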

Link extraction for pyfetcher.

Purpose:

Harvest and normalize links from HTML documents, supporting filtering by domain, scheme, and link attributes.

Examples

>>> html = '<a href="https://example.com">Example</a>'
>>> links = extract_links(html, base_url="https://example.com")
>>> links[0].url
'https://example.com'

class pyfetcher.scrape.links.LinkInfo(url, text, rel, is_external)

Extracted link information.

Parameters:
  • url (str) – The resolved absolute URL.

  • text (str) – The link’s visible text content.

  • rel (str | None) – The rel attribute value, if present.

  • is_external (bool) – Whether the link points to a different domain.

Examples

>>> link = LinkInfo(
...     url="https://example.com", text="Example",
...     rel=None, is_external=False,
... )
>>> link.url
'https://example.com'

Extract and normalize links from HTML.

Parses all <a> tags with href attributes and resolves relative URLs against base_url. Optionally filters to same-domain links only and controls whether fragment-only links are included.

Parameters:
  • html (str) – Raw HTML string to parse.

  • base_url (str | None) – Base URL for resolving relative hrefs. Required for accurate is_external detection and relative URL resolution.

  • same_domain_only (bool) – If True, only return links pointing to the same domain as base_url.

  • include_fragments (bool) – If True, include fragment-only links (e.g. #section). Defaults to False.

Returns:

A list of LinkInfo objects for each extracted link.

Return type:

list[LinkInfo]

Examples

>>> html = '<a href="/about">About</a><a href="https://other.com">Other</a>'
>>> links = extract_links(html, base_url="https://example.com")
>>> len(links)
2
>>> links = extract_links(html, base_url="https://example.com", same_domain_only=True)
>>> len(links)
1
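
Relative URL resolution and the is_external flag can be approximated with the standard library. This is a sketch under assumptions, not pyfetcher's code: resolve_link is a hypothetical helper, and comparing netloc values is only one possible definition of "different domain".

```python
from __future__ import annotations

from urllib.parse import urljoin, urlparse


def resolve_link(href: str, base_url: str) -> tuple[str, bool]:
    """Hypothetical helper: return (absolute_url, is_external)."""
    # urljoin leaves absolute hrefs untouched and resolves relative ones.
    absolute = urljoin(base_url, href)
    # Assumption: a link is external when its host differs from the base host.
    is_external = urlparse(absolute).netloc != urlparse(base_url).netloc
    return absolute, is_external


internal = resolve_link("/about", "https://example.com")
external = resolve_link("https://other.com", "https://example.com")
print(internal)
print(external)
```

A same_domain_only filter would then amount to keeping only the results whose second element is False.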

HTML form extraction for pyfetcher.

Purpose:

Parse <form> elements from HTML and extract their fields, making it easy to build form submission requests programmatically.

Examples

>>> html = '<form action="/login" method="post"><input name="user"/></form>'
>>> forms = extract_forms(html, base_url="https://example.com")
>>> forms[0].action
'https://example.com/login'

class pyfetcher.scrape.forms.FormField(name, type, value, options=<factory>)

A single form input field.

Parameters:
  • name (str) – The field’s name attribute.

  • type (str) – The field’s type attribute (e.g. 'text', 'hidden').

  • value (str) – The field’s default value attribute.

  • options (list[str]) – For <select> elements, the list of <option> values.

Examples

>>> field = FormField(name="user", type="text", value="")
>>> field.name
'user'

class pyfetcher.scrape.forms.FormData(action, method, fields, id=None, name=None)

Parsed HTML form.

Parameters:
  • action (str) – The resolved form action URL.

  • method (str) – The HTTP method (uppercased, e.g. 'GET', 'POST').

  • fields (list[FormField]) – List of form fields.

  • id (str | None) – The form’s id attribute, if present.

  • name (str | None) – The form’s name attribute, if present.

Examples

>>> form = FormData(action="https://example.com/login", method="POST", fields=[])
>>> form.method
'POST'

to_dict()

Convert form fields to a submission dictionary.

Returns a dictionary mapping field names to their default values, suitable for use as POST data or query parameters.

Returns:

A dictionary of field names to values.

Return type:

dict[str, str]

Examples

>>> form = FormData(
...     action="/submit", method="POST",
...     fields=[FormField(name="q", type="text", value="hello")],
... )
>>> form.to_dict()
{'q': 'hello'}
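
A dictionary of this shape can be encoded for submission with the standard library. The sketch below uses a made-up payload and URL in place of a real form; how the request is actually sent is left open.

```python
from urllib.parse import urlencode

# Made-up payload, shaped like the dict[str, str] that to_dict() returns.
payload = {"q": "hello world", "page": "2"}

# For a GET form the encoded pairs become the query string;
# for a POST form the same string can serve as the request body.
query = urlencode(payload)
get_url = "https://example.com/search?" + query
print(get_url)
```

urlencode applies form encoding, so the space in the value becomes a plus sign.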

pyfetcher.scrape.forms.extract_forms(html, *, base_url=None)

Extract all forms from HTML.

Parses <form> elements and their input fields (<input>, <textarea>, <select>) to produce structured form data.

Parameters:
  • html (str) – Raw HTML string to parse.

  • base_url (str | None) – Base URL for resolving relative form action URLs.

Returns:

A list of FormData objects.

Return type:

list[FormData]

Examples

>>> html = '<form action="/search"><input name="q" value=""/></form>'
>>> forms = extract_forms(html, base_url="https://example.com")
>>> forms[0].action
'https://example.com/search'

Robots.txt parser for pyfetcher.

Purpose:

Parse robots.txt files and check URL access permissions for a given user-agent. Supports Allow, Disallow, Crawl-delay, and Sitemap directives.

Examples

>>> txt = "User-agent: *\nDisallow: /admin"
>>> rules = parse_robots_txt(txt)
>>> is_allowed(rules, "/admin", user_agent="*")
False

class pyfetcher.scrape.robots.RobotsRules(rules=<factory>, sitemaps=<factory>, crawl_delays=<factory>)

Parsed robots.txt rules.

Parameters:
  • rules (dict[str, list[tuple[bool, str]]]) – Mapping of user-agent patterns to lists of (allow, path) tuples.

  • sitemaps (list[str]) – List of sitemap URLs found in the robots.txt.

  • crawl_delays (dict[str, float]) – Mapping of user-agent patterns to crawl delay seconds.

Examples

>>> rules = RobotsRules()
>>> rules.sitemaps
[]

pyfetcher.scrape.robots.parse_robots_txt(content)

Parse the content of a robots.txt file.

Extracts User-agent, Allow, Disallow, Crawl-delay, and Sitemap directives into a structured RobotsRules object.

Parameters:

content (str) – The raw text content of a robots.txt file.

Returns:

A RobotsRules object containing parsed directives.

Return type:

RobotsRules

Examples

>>> txt = "User-agent: *\nDisallow: /secret\nAllow: /public"
>>> rules = parse_robots_txt(txt)
>>> len(rules.rules.get("*", []))
2
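
To make the shape of the rules mapping concrete, here is a deliberately simplified parser. It is a sketch, not pyfetcher's implementation: parse_rules_sketch is a hypothetical name, only User-agent, Allow, and Disallow are handled, and lowercasing the agent name is an assumption.

```python
from __future__ import annotations


def parse_rules_sketch(content: str) -> dict[str, list[tuple[bool, str]]]:
    """Hypothetical parser producing {user_agent: [(allow, path), ...]}."""
    rules: dict[str, list[tuple[bool, str]]] = {}
    agent = None
    for raw in content.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = raw.split("#", 1)[0].strip()
        if ":" not in line:
            continue
        key, value = (part.strip() for part in line.split(":", 1))
        key = key.lower()
        if key == "user-agent":
            # Start (or resume) the rule group for this agent.
            agent = value.lower()
            rules.setdefault(agent, [])
        elif key in ("allow", "disallow") and agent is not None and value:
            # Store True for Allow and False for Disallow.
            rules[agent].append((key == "allow", value))
    return rules


parsed = parse_rules_sketch("User-agent: *\nDisallow: /secret\nAllow: /public")
print(parsed)
```

The sketch ignores empty Disallow values and does not merge consecutive User-agent lines into one group, both of which a complete parser would need to handle.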

pyfetcher.scrape.robots.is_allowed(rules, path, *, user_agent='*')

Check if a path is allowed for the given user-agent.

Evaluates the parsed robots.txt rules for the most specific matching user-agent. Among the rules whose path prefix matches, the longest prefix wins; when an Allow and a Disallow rule match with equal length, Allow takes precedence.

Parameters:
  • rules (RobotsRules) – Parsed robots.txt rules from parse_robots_txt().

  • path (str) – The URL path to check (e.g. '/admin/settings').

  • user_agent (str) – The user-agent string to check against. Defaults to '*' (wildcard).

Returns:

True if the path is allowed, False if disallowed.

Return type:

bool

Examples

>>> txt = "User-agent: *\nDisallow: /admin\nAllow: /admin/public"
>>> rules = parse_robots_txt(txt)
>>> is_allowed(rules, "/admin/settings")
False
>>> is_allowed(rules, "/admin/public")
True
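
The precedence rule can be sketched as a longest-prefix search over (allow, path) tuples. This is an illustrative re-implementation, not pyfetcher's code: is_allowed_sketch is a hypothetical name, user-agent selection is left out, and treating a path with no matching rule as allowed is an assumption based on common robots.txt practice.

```python
from __future__ import annotations


def is_allowed_sketch(rules: list[tuple[bool, str]], path: str) -> bool:
    """Hypothetical longest-prefix check over (allow, path_prefix) tuples."""
    # Assumption: when no rule matches, the path is allowed.
    best_len, allowed = -1, True
    for allow, prefix in rules:
        if not path.startswith(prefix):
            continue
        # A longer prefix is more specific; on a tie, Allow beats Disallow.
        if len(prefix) > best_len or (len(prefix) == best_len and allow):
            best_len, allowed = len(prefix), allow
    return allowed


star_rules = [(False, "/admin"), (True, "/admin/public")]
print(is_allowed_sketch(star_rules, "/admin/settings"))
print(is_allowed_sketch(star_rules, "/admin/public"))
```

Only the Disallow rule matches /admin/settings, while /admin/public matches both rules and the longer Allow prefix decides the outcome.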

Sitemap parser for pyfetcher.

Purpose:

Parse XML sitemaps (both sitemap index files and URL set files) and extract URL entries with their metadata.

Examples

>>> xml = '<?xml version="1.0"?><urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"><url><loc>https://example.com/</loc></url></urlset>'
>>> entries = parse_sitemap(xml)
>>> entries[0].loc
'https://example.com/'

class pyfetcher.scrape.sitemap.SitemapEntry(loc, lastmod=None, changefreq=None, priority=None, is_sitemap=False)

A single URL entry from a sitemap.

Parameters:
  • loc (str) – The URL location.

  • lastmod (str | None) – The last modification date string, if present.

  • changefreq (str | None) – The change frequency hint, if present.

  • priority (str | None) – The priority value as a string, if present.

  • is_sitemap (bool) – Whether this entry is a sitemap index reference.

Examples

>>> entry = SitemapEntry(loc="https://example.com/")
>>> entry.loc
'https://example.com/'

pyfetcher.scrape.sitemap.parse_sitemap(xml_content)

Parse an XML sitemap or sitemap index.

Handles both <urlset> (URL sitemaps) and <sitemapindex> (sitemap index files). Returns a flat list of entries with the is_sitemap flag set for index entries.

Parameters:

xml_content (str) – Raw XML string content of the sitemap.

Returns:

A list of SitemapEntry objects.

Return type:

list[SitemapEntry]

Examples

>>> xml = (
...     '<?xml version="1.0"?>'
...     '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
...     '<url><loc>https://example.com/</loc><priority>1.0</priority></url>'
...     '</urlset>'
... )
>>> entries = parse_sitemap(xml)
>>> entries[0].priority
'1.0'
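
The distinction between <urlset> and <sitemapindex> documents can be illustrated with xml.etree.ElementTree. This is a sketch of the general technique, not pyfetcher's parser; the sm prefix and the tuple output are choices made for this example only.

```python
import xml.etree.ElementTree as ET

# Namespace used by the sitemap protocol; "sm" is an arbitrary local prefix.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

xml = (
    '<?xml version="1.0"?>'
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    '<url><loc>https://example.com/</loc><priority>1.0</priority></url>'
    '</urlset>'
)

root = ET.fromstring(xml)
# The root tag tells the two document kinds apart.
is_index = root.tag.endswith("sitemapindex")
child = "sm:sitemap" if is_index else "sm:url"

# Collect (loc, priority) pairs; findtext returns None for absent elements.
entries = [
    (node.findtext("sm:loc", namespaces=NS), node.findtext("sm:priority", namespaces=NS))
    for node in root.findall(child, NS)
]
print(is_index, entries)
```

Values such as priority stay strings here, in line with the str | None types documented for SitemapEntry.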

Content extraction for pyfetcher.

Purpose:

Extract readable text content from HTML by stripping scripts, styles, and navigation elements to isolate the main body text.

Examples

>>> html = "<html><body><p>Hello World</p><script>var x=1;</script></body></html>"
>>> extract_readable_text(html)
'Hello World'

pyfetcher.scrape.content.extract_readable_text(html, *, strip_tags=None, selector=None)

Extract readable text content from HTML.

Removes scripts, styles, navigation, and other non-content elements from the HTML, then extracts and normalizes the text content. Optionally targets a specific element via CSS selector.

Parameters:
  • html (str) – Raw HTML string to process.

  • strip_tags (frozenset[str] | None) – Set of tag names to remove before text extraction. Defaults to script, style, noscript, iframe, svg, nav, footer, and header.

  • selector (str | None) – Optional CSS selector to narrow extraction to a specific element (e.g. 'article', 'main', '.content').

Returns:

Cleaned, readable text with normalized whitespace.

Return type:

str

Examples

>>> html = "<div><p>First.</p><p>Second.</p><script>x=1</script></div>"
>>> extract_readable_text(html)
'First.\nSecond.'
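
The strip-then-extract idea can be sketched with the standard library's html.parser in place of BeautifulSoup. This is an illustrative stand-in, not pyfetcher's implementation: TextSketch and SKIP are hypothetical names, and joining text chunks with newlines is a simplification that also splits text around inline tags.

```python
from html.parser import HTMLParser

# Tag names whose content is ignored, following the documented defaults.
SKIP = frozenset({"script", "style", "noscript", "iframe", "svg", "nav", "footer", "header"})


class TextSketch(HTMLParser):
    """Hypothetical collector of text found outside the skipped tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0  # number of skipped tags currently open

    def handle_starttag(self, tag, attrs):
        if tag in SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Collapse runs of whitespace, then keep non-empty visible text.
        text = " ".join(data.split())
        if text and not self._skip_depth:
            self.chunks.append(text)


parser = TextSketch()
parser.feed("<div><p>First.</p><p>Second.</p><script>x=1</script></div>")
parser.close()
result = "\n".join(parser.chunks)
print(repr(result))
```

The depth counter lets nested skipped elements, such as a script inside a nav, be ignored until the outermost one closes.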