Metadata

Basic HTML metadata extraction for pyfetcher.

Purpose:

Provide lightweight HTML parsing for titles, descriptions, canonical links, and icon links using BeautifulSoup.

Design:

Parsing uses bs4 for readability and robustness.
This module intentionally handles only the most common HTML-level fields.
Open Graph extraction is delegated to pyfetcher.metadata.opengraph.

Examples

>>> html = "<html><head><title>Example</title></head></html>"
>>> extract_basic_html_metadata(html).title
'Example'

pyfetcher.metadata.html.extract_basic_html_metadata(html, *, base_url=None)

Extract basic HTML page metadata.

Parses the <title>, <meta name="description">, <link rel="canonical">, and favicon <link> elements from the given HTML string. Relative URLs are resolved against base_url when provided.

Parameters:

html (str) – Raw HTML string to parse.
base_url (str | None) – Optional base URL for resolving relative link hrefs.

Returns:

A PageMetadata populated with the extracted fields.

Return type:

PageMetadata

Examples

>>> html = (
...     "<html><head><title>Example</title>"
...     "<meta name='description' content='Desc' />"
...     "<link rel='icon' href='/favicon.ico' />"
...     "</head></html>"
... )
>>> meta = extract_basic_html_metadata(html, base_url="https://example.com")
>>> meta.title
'Example'
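
The base_url resolution described above can be illustrated with the standard library alone. This is a minimal sketch that assumes the resolution behaves like urllib.parse.urljoin. The helper name resolve_href is hypothetical and is not part of pyfetcher, whose internals may differ.

```python
from urllib.parse import urljoin


def resolve_href(href, base_url=None):
    """Resolve href against base_url; return it unchanged when no base is given.

    Hypothetical helper for illustration only, not pyfetcher's implementation.
    """
    if base_url is None:
        return href
    return urljoin(base_url, href)


# A root-relative icon path becomes absolute once a base URL is supplied.
print(resolve_href("/favicon.ico", "https://example.com"))
# Without a base URL the href is passed through as-is.
print(resolve_href("/favicon.ico"))
```

Under this assumption, an href that is already absolute is returned unchanged, so passing base_url is harmless for pages that mix relative and absolute links.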

Open Graph metadata extraction for pyfetcher.

Purpose:

Extract common Open Graph fields from HTML <meta property="og:*"> tags.

Examples

>>> html = "<meta property='og:title' content='Example' />"
>>> extract_open_graph_metadata(html).title
'Example'

pyfetcher.metadata.opengraph.extract_open_graph_metadata(html)

Extract Open Graph metadata from HTML.

Parses og:title, og:description, og:image, og:site_name, og:url, and og:type meta tags from the provided HTML. Returns None if no Open Graph fields are found.

Parameters:

html (str) – Raw HTML string to parse.

Returns:

An OpenGraphMetadata instance, or None if no OG fields exist.

Return type:

OpenGraphMetadata | None

Examples

>>> html = "<html><head><meta property='og:title' content='Example' /></head></html>"
>>> extract_open_graph_metadata(html).title
'Example'
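
Because the function can return None, callers should check the result before reading attributes. The sketch below imitates the documented "None when no og:* tags exist" behavior using only the standard library html.parser. The names OGCollector and collect_og_fields are hypothetical, and the real function uses its own parser and returns a model rather than a dict.

```python
from html.parser import HTMLParser


class OGCollector(HTMLParser):
    """Collect og:* meta tags into a dict (illustrative stand-in only)."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        # Self-closing <meta ... /> tags are also routed here by HTMLParser.
        if tag != "meta":
            return
        attr_map = dict(attrs)
        prop = attr_map.get("property") or ""
        content = attr_map.get("content")
        if prop.startswith("og:") and content is not None:
            # Keep the first occurrence of each field.
            self.fields.setdefault(prop[3:], content)


def collect_og_fields(html):
    """Return a dict of Open Graph fields, or None when none are present."""
    parser = OGCollector()
    parser.feed(html)
    parser.close()
    return parser.fields or None


og = collect_og_fields("<title>No Open Graph here</title>")
# Guard against None before reading any field.
title = og["title"] if og is not None else None
print(title)
```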

Structured metadata extraction for pyfetcher.

Purpose:

Run extruct against HTML and combine it with lighter HTML/Open Graph parsing helpers for comprehensive metadata extraction.

Design:

extruct is imported lazily so users can keep it optional.
Relative URLs are resolved with w3lib.html.get_base_url.
The output is normalized into PageMetadata.

Examples

>>> html = "<html><head><title>Example</title></head></html>"
>>> meta = extract_extruct_metadata(html, page_url="https://example.com")
>>> meta.title
'Example'

pyfetcher.metadata.extruct.extract_extruct_metadata(html, *, page_url)

Extract combined page metadata using extruct plus HTML fallbacks.

Runs basic HTML metadata extraction and Open Graph parsing, then augments the result with structured data (JSON-LD, microdata, microformat, RDFa, Dublin Core, Open Graph) via extruct.

Parameters:

html (str) – Raw HTML string to parse.
page_url (str) – Page URL used as the base for resolving relative URLs.

Returns:

A PageMetadata with all available metadata fields populated.

Raises:

ImportError – If extruct or w3lib is not installed.

Return type:

PageMetadata

Examples

>>> meta = extract_extruct_metadata(
...     "<html><head><title>Example</title></head></html>",
...     page_url="https://example.com",
... )
>>> meta.title
'Example'
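
The lazy-import design noted above, where extruct stays optional and a missing package surfaces as ImportError at call time, follows a common pattern. The sketch below shows one way to write it. The helper require_optional and its message text are assumptions for illustration, not pyfetcher's actual code.

```python
import importlib


def require_optional(module_name, *, hint):
    """Import an optional dependency on first use.

    Hypothetical helper: re-raises ImportError with an actionable hint so the
    failure happens when the feature is called, not when the package is imported.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"{module_name} is required for this feature; {hint}"
        ) from exc


# A module that exists is returned like a normal import.
json_module = require_optional("json", hint="part of the standard library")
print(json_module.dumps({"ok": True}))
```

Callers that want to degrade gracefully can catch ImportError around the extruct-based function and fall back to the basic HTML and Open Graph helpers.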

Metadata models for pyfetcher.

Purpose:

Provide reusable Pydantic models for page-level metadata parsed from HTML, Open Graph tags, and structured metadata extraction.

Examples

>>> PageMetadata(title="Home").title
'Home'

class pyfetcher.metadata.models.FaviconLink(*, href, rel, sizes=None, mime_type=None)

Single favicon or related icon link.

Represents a <link> element from an HTML document that references an icon resource (favicon, apple-touch-icon, mask-icon).

Parameters:

href (str) – Resolved icon URL.
rel (str) – HTML rel attribute value (e.g. 'icon').
sizes (str | None) – Optional HTML sizes attribute value (e.g. '32x32').
mime_type (str | None) – Optional content type (e.g. 'image/png').

Examples

>>> FaviconLink(href="https://example.com/favicon.ico", rel="icon").rel
'icon'

model_config = {'extra': 'forbid', 'frozen': True}

Pydantic model configuration (a ConfigDict): unknown fields are rejected and instances are immutable.

class pyfetcher.metadata.models.OpenGraphMetadata(*, title=None, description=None, image=None, site_name=None, url=None, type=None)

Open Graph metadata model.

Captures the most common Open Graph (og:) meta tag values from an HTML document.

Parameters:

title (str | None) – og:title value.
description (str | None) – og:description value.
image (str | None) – og:image URL.
site_name (str | None) – og:site_name value.
url (str | None) – og:url canonical URL.
type (str | None) – og:type value (e.g. 'website', 'article').

Examples

>>> OpenGraphMetadata(title="Example").title
'Example'

model_config = {'extra': 'forbid', 'frozen': True}

Pydantic model configuration (a ConfigDict): unknown fields are rejected and instances are immutable.

class pyfetcher.metadata.models.PageMetadata(*, title=None, description=None, canonical_url=None, open_graph=None, favicons=<factory>, structured=None)

Combined page metadata model.

Aggregates metadata from multiple sources (HTML tags, Open Graph, structured data) into a single unified model.

Parameters:

title (str | None) – Best-effort page title from <title> or og:title.
description (str | None) – Best-effort description from <meta> or og:description.
canonical_url (str | None) – Canonical URL from <link rel="canonical">.
open_graph (OpenGraphMetadata | None) – Parsed Open Graph metadata, if present.
favicons (list[FaviconLink]) – Collected favicon/icon links.
structured (dict[str, object] | None) – Raw structured metadata payload from extruct.

Examples

>>> PageMetadata(title="Home").title
'Home'
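
The "best-effort" wording for title and description implies a fallback between sources. The sketch below shows one plausible rule: take the first non-empty candidate. The function first_non_empty and the precedence order (HTML tag first, then Open Graph) are assumptions for illustration. The source does not state which value wins when both are present.

```python
def first_non_empty(*candidates):
    """Return the first candidate that is not None or whitespace-only.

    Hypothetical fallback rule; pyfetcher's actual precedence is not documented here.
    """
    for value in candidates:
        if value is not None and value.strip():
            return value.strip()
    return None


html_title = None            # e.g. the page had no <title> element
og_title = "Example"         # e.g. the value of og:title
print(first_non_empty(html_title, og_title))
```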

model_config = {'extra': 'forbid', 'frozen': True}

Pydantic model configuration (a ConfigDict): unknown fields are rejected and instances are immutable.
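
The practical effect of 'frozen' and 'extra': 'forbid' can be shown with a standard-library analogy. The sketch below uses a frozen dataclass, not Pydantic, so the exception types differ from what the real models raise. IconLink is a hypothetical stand-in for FaviconLink.

```python
from dataclasses import FrozenInstanceError, dataclass
from typing import Optional


@dataclass(frozen=True)
class IconLink:
    """Stdlib stand-in for an immutable model with a fixed set of fields."""

    href: str
    rel: str
    sizes: Optional[str] = None


icon = IconLink(href="https://example.com/favicon.ico", rel="icon")

# Assignment after construction is rejected, similar in spirit to frozen=True.
try:
    icon.rel = "apple-touch-icon"
except FrozenInstanceError:
    print("instances are immutable")

# Unknown fields are rejected at construction, similar in spirit to extra='forbid'.
try:
    IconLink(href="https://example.com/favicon.ico", rel="icon", color="red")
except TypeError:
    print("unknown fields are rejected")
```

To change a value on an immutable instance, build a new one with the updated field instead of mutating in place.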