Metadata

Basic HTML metadata extraction for pyfetcher.

Purpose:

Provide lightweight HTML parsing for titles, descriptions, canonical links, and icon links using BeautifulSoup.

Design:
  • Parsing uses bs4 for readability and robustness.

  • This module intentionally handles only the most common HTML-level fields.

  • Open Graph extraction is delegated to pyfetcher.metadata.opengraph.

Examples

>>> html = "<html><head><title>Example</title></head></html>"
>>> extract_basic_html_metadata(html).title
'Example'

pyfetcher.metadata.html.extract_basic_html_metadata(html, *, base_url=None)

Extract basic HTML page metadata.

Parses the <title>, <meta name="description">, <link rel="canonical">, and favicon <link> elements from the given HTML string. Relative URLs are resolved against base_url when provided.

Parameters:
  • html (str) – Raw HTML string to parse.

  • base_url (str | None) – Optional base URL for resolving relative link hrefs.

Returns:

A PageMetadata populated with the extracted fields.

Return type:

PageMetadata

Examples

>>> html = (
...     "<html><head><title>Example</title>"
...     "<meta name='description' content='Desc' />"
...     "<link rel='icon' href='/favicon.ico' />"
...     "</head></html>"
... )
>>> meta = extract_basic_html_metadata(html, base_url="https://example.com")
>>> meta.title
'Example'
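
The relative-URL resolution described above can be sketched with the standard library alone. This is a minimal illustration, assuming resolution behaves like `urllib.parse.urljoin`; the helper name `resolve_href` is hypothetical and is not part of pyfetcher.

```python
from __future__ import annotations

from urllib.parse import urljoin


def resolve_href(href: str, base_url: str | None) -> str:
    """Resolve href against base_url; return href unchanged when no base is given.

    Hypothetical helper for illustration only, not pyfetcher's implementation.
    """
    if base_url is None:
        return href
    return urljoin(base_url, href)


# A root-relative icon path becomes absolute once a base URL is supplied.
print(resolve_href("/favicon.ico", "https://example.com"))
# Without a base URL the href is left as written in the document.
print(resolve_href("/favicon.ico", None))
```

Passing `base_url` is therefore worthwhile whenever the icon or canonical links will be fetched later, since relative hrefs are not usable on their own.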

Open Graph metadata extraction for pyfetcher.

Purpose:

Extract common Open Graph fields from HTML <meta property="og:*"> tags.

Examples

>>> html = "<meta property='og:title' content='Example' />"
>>> extract_open_graph_metadata(html).title
'Example'

pyfetcher.metadata.opengraph.extract_open_graph_metadata(html)

Extract Open Graph metadata from HTML.

Parses og:title, og:description, og:image, og:site_name, og:url, and og:type meta tags from the provided HTML. Returns None if no Open Graph fields are found.

Parameters:

html (str) – Raw HTML string to parse.

Returns:

An OpenGraphMetadata instance, or None if no OG fields exist.

Return type:

OpenGraphMetadata | None

Examples

>>> html = "<html><head><meta property='og:title' content='Example' /></head></html>"
>>> extract_open_graph_metadata(html).title
'Example'
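
Because the function may return None, callers need a guard before reading attributes. The sketch below illustrates the same contract using only the standard library's `html.parser`. The names `_OGCollector` and `collect_open_graph` are hypothetical, and this is not how pyfetcher parses the tags.

```python
from __future__ import annotations

from html.parser import HTMLParser


class _OGCollector(HTMLParser):
    """Collect <meta property="og:*" content="..."> pairs (illustrative only)."""

    def __init__(self) -> None:
        super().__init__()
        self.fields: dict[str, str] = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        prop = attr.get("property") or ""
        content = attr.get("content")
        if prop.startswith("og:") and content is not None:
            # Keep the first value seen for each field, keyed without the "og:" prefix.
            self.fields.setdefault(prop[3:], content)


def collect_open_graph(html: str) -> dict[str, str] | None:
    """Return the collected og:* fields, or None when the page has none."""
    parser = _OGCollector()
    parser.feed(html)
    return parser.fields or None


og = collect_open_graph("<meta property='og:title' content='Example' />")
title = og["title"] if og is not None else None  # guard against the None case
print(title)
```

The same guard applies to the real return value: check for None before accessing `.title` on pages that may lack Open Graph tags.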

Structured metadata extraction for pyfetcher.

Purpose:

Run extruct against an HTML document and combine its output with the lighter HTML and Open Graph parsing helpers to build a single, comprehensive metadata result.

Design:
  • extruct is imported lazily so users can keep it optional.

  • Relative URLs are resolved with w3lib.html.get_base_url.

  • The output is normalized into PageMetadata.

Examples

>>> html = "<html><head><title>Example</title></head></html>"
>>> meta = extract_extruct_metadata(html, page_url="https://example.com")
>>> meta.title
'Example'

pyfetcher.metadata.extruct.extract_extruct_metadata(html, *, page_url)

Extract combined page metadata using extruct plus HTML fallbacks.

Runs basic HTML metadata extraction and Open Graph parsing, then augments the result with structured data (JSON-LD, microdata, microformat, RDFa, Dublin Core, Open Graph) via extruct.

Parameters:
  • html (str) – Raw HTML string to parse.

  • page_url (str) – Page URL used as the base for resolving relative URLs.

Returns:

A PageMetadata with all available metadata fields populated.

Raises:

ImportError – If extruct or w3lib is not installed.

Return type:

PageMetadata

Examples

>>> meta = extract_extruct_metadata(
...     "<html><head><title>Example</title></head></html>",
...     page_url="https://example.com",
... )
>>> meta.title
'Example'
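
The lazy-import design and the documented ImportError can be pictured with a small sketch. It assumes the optional dependency is imported at call time rather than at module import time; the helper `require_optional` and its hint text are hypothetical, not pyfetcher's code.

```python
import importlib
from types import ModuleType


def require_optional(module_name: str, *, hint: str) -> ModuleType:
    """Import an optional dependency on demand, with an actionable error message.

    Hypothetical helper for illustration; pyfetcher's own wording may differ.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(f"{module_name} is required for this feature; {hint}") from exc


def extract_with_optional_dependency() -> str:
    # The import happens only when the function is called, so importing the
    # enclosing module never fails just because the extra is missing.
    json_module = require_optional("json", hint="it ships with Python")
    return json_module.dumps({"ok": True})


print(extract_with_optional_dependency())
```

A caller that wants to degrade gracefully can catch ImportError around the extruct-based function and fall back to the basic HTML extractor.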

Metadata models for pyfetcher.

Purpose:

Provide reusable Pydantic models for page-level metadata parsed from HTML, Open Graph tags, and structured metadata extraction.

Examples

>>> PageMetadata(title="Home").title
'Home'

Single favicon or related icon link.

Represents a <link> element from an HTML document that references an icon resource (favicon, apple-touch-icon, mask-icon).

Parameters:
  • href (str) – Resolved icon URL.

  • rel (str) – HTML rel attribute value (e.g. 'icon').

  • sizes (str | None) – Optional HTML sizes attribute value (e.g. '32x32').

  • mime_type (str | None) – Optional content type (e.g. 'image/png').

Examples

>>> FaviconLink(href="https://example.com/favicon.ico", rel="icon").rel
'icon'

model_config = {'extra': 'forbid', 'frozen': True}

Configuration for the model. It should be a dictionary conforming to pydantic.config.ConfigDict.
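
The practical effect of 'extra': 'forbid' and 'frozen': True is that unknown fields are rejected and instances are read-only. The sketch below shows the same two ideas with a standard-library frozen dataclass as a stand-in. `IconRecord` is hypothetical, and pydantic signals these cases with its own exception types rather than the ones shown here.

```python
from __future__ import annotations

from dataclasses import FrozenInstanceError, dataclass, replace


@dataclass(frozen=True)
class IconRecord:  # hypothetical stand-in, not pyfetcher's FaviconLink
    href: str
    rel: str
    sizes: str | None = None


icon = IconRecord(href="https://example.com/favicon.ico", rel="icon")

# Unknown fields are rejected at construction time (the 'extra': 'forbid' analogue).
try:
    IconRecord(href="/a.png", rel="icon", colour="red")
except TypeError as exc:
    print("rejected:", exc)

# Attribute assignment is rejected (the 'frozen': True analogue).
try:
    icon.rel = "apple-touch-icon"
except FrozenInstanceError:
    print("instances are read-only")

# To change a value, build a modified copy instead of mutating in place.
touch_icon = replace(icon, rel="apple-touch-icon")
print(touch_icon.rel)
```

Read-only instances can be shared safely between callers, which suits metadata that is parsed once and then passed around.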

class pyfetcher.metadata.models.OpenGraphMetadata(*, title=None, description=None, image=None, site_name=None, url=None, type=None)

Open Graph metadata model.

Captures the most common Open Graph (og:) meta tag values from an HTML document.

Parameters:
  • title (str | None) – og:title value.

  • description (str | None) – og:description value.

  • image (str | None) – og:image URL.

  • site_name (str | None) – og:site_name value.

  • url (str | None) – og:url canonical URL.

  • type (str | None) – og:type value (e.g. 'website', 'article').

Examples

>>> OpenGraphMetadata(title="Example").title
'Example'

model_config = {'extra': 'forbid', 'frozen': True}

Configuration for the model. It should be a dictionary conforming to pydantic.config.ConfigDict.

class pyfetcher.metadata.models.PageMetadata(*, title=None, description=None, canonical_url=None, open_graph=None, favicons=<factory>, structured=None)

Combined page metadata model.

Aggregates metadata from multiple sources (HTML tags, Open Graph, structured data) into a single unified model.

Parameters:
  • title (str | None) – Best-effort page title from <title> or og:title.

  • description (str | None) – Best-effort description from <meta> or og:description.

  • canonical_url (str | None) – Canonical URL from <link rel="canonical">.

  • open_graph (OpenGraphMetadata | None) – Parsed Open Graph metadata, if present.

  • favicons (list[FaviconLink]) – Collected favicon/icon links.

  • structured (dict[str, object] | None) – Raw structured metadata payload from extruct.

Examples

>>> PageMetadata(title="Home").title
'Home'

model_config = {'extra': 'forbid', 'frozen': True}

Configuration for the model. It should be a dictionary conforming to pydantic.config.ConfigDict.
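
The title and description fields of PageMetadata are documented as best-effort values drawn from either the HTML tag or the Open Graph tag. One way such a fallback could work is sketched below. The precedence (HTML value first, then Open Graph) and the helper name `first_non_empty` are assumptions for illustration; the source does not state the actual order.

```python
from __future__ import annotations


def first_non_empty(*candidates: str | None) -> str | None:
    """Return the first candidate that is non-empty after stripping whitespace."""
    for value in candidates:
        if value is not None and value.strip():
            return value.strip()
    return None


html_title = None        # e.g. the page had no <title> element
og_title = " Example "   # e.g. the value of og:title

# Assumed precedence: prefer the HTML value, fall back to Open Graph.
print(first_non_empty(html_title, og_title))
```

Treating whitespace-only values as missing prevents an empty <title> from hiding a usable og:title.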