Metadata
Basic HTML metadata extraction for pyfetcher.
- Purpose:
Provide lightweight HTML parsing for titles, descriptions, canonical links, and icon links using BeautifulSoup.
- Design:
Parsing uses bs4 for readability and robustness.
This module intentionally handles only the most common HTML-level fields.
Open Graph extraction is delegated to pyfetcher.metadata.opengraph.
Examples
>>> html = "<html><head><title>Example</title></head></html>"
>>> extract_basic_html_metadata(html).title
'Example'
- pyfetcher.metadata.html.extract_basic_html_metadata(html, *, base_url=None)
Extract basic HTML page metadata.
Parses the <title>, <meta name="description">, <link rel="canonical">, and favicon <link> elements from the given HTML string. Relative URLs are resolved against base_url when provided.
- Parameters:
- Returns:
A PageMetadata populated with the extracted fields.
- Return type:
Examples
>>> html = (
...     "<html><head><title>Example</title>"
...     "<meta name='description' content='Desc' />"
...     "<link rel='icon' href='/favicon.ico' />"
...     "</head></html>"
... )
>>> meta = extract_basic_html_metadata(html, base_url="https://example.com")
>>> meta.title
'Example'
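The base_url resolution described above can be illustrated with the standard library alone. The following is a minimal sketch, not pyfetcher's implementation (which parses with bs4): the names IconLinkParser and collect_icon_hrefs are hypothetical, and matching any rel token that ends in "icon" is an assumption made for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class IconLinkParser(HTMLParser):
    """Collect href values of <link> elements whose rel names an icon."""

    def __init__(self, base_url=None):
        super().__init__()
        self.base_url = base_url
        self.icon_hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr_map = dict(attrs)
        rel_tokens = (attr_map.get("rel") or "").lower().split()
        href = attr_map.get("href")
        # Assumption: "icon", "apple-touch-icon", and "mask-icon" all count.
        if href and any(token.endswith("icon") for token in rel_tokens):
            # Resolve relative hrefs only when a base URL was supplied.
            resolved = urljoin(self.base_url, href) if self.base_url else href
            self.icon_hrefs.append(resolved)


def collect_icon_hrefs(html, base_url=None):
    """Return icon hrefs found in html, resolved against base_url if given."""
    parser = IconLinkParser(base_url)
    parser.feed(html)
    parser.close()
    return parser.icon_hrefs
```

With base_url="https://example.com", the href '/favicon.ico' becomes 'https://example.com/favicon.ico'; without a base URL it is returned unchanged.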
Open Graph metadata extraction for pyfetcher.
- Purpose:
Extract common Open Graph fields from HTML <meta property="og:*"> tags.
Examples
>>> html = "<meta property='og:title' content='Example' />"
>>> extract_open_graph_metadata(html).title
'Example'
- pyfetcher.metadata.opengraph.extract_open_graph_metadata(html)
Extract Open Graph metadata from HTML.
Parses og:title, og:description, og:image, og:site_name, og:url, and og:type meta tags from the provided HTML. Returns None if no Open Graph fields are found.
- Parameters:
html (str) – Raw HTML string to parse.
- Returns:
An OpenGraphMetadata instance, or None if no OG fields exist.
- Return type:
OpenGraphMetadata | None
Examples
>>> html = "<html><head><meta property='og:title' content='Example' /></head></html>"
>>> extract_open_graph_metadata(html).title
'Example'
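The "return None when nothing is found" behaviour can be sketched with the standard library. This is an illustrative stand-in rather than pyfetcher's code: OpenGraphParser and collect_open_graph are hypothetical names, the result is a plain dict instead of an OpenGraphMetadata model, and keeping the first value seen for a repeated property is an assumption.

```python
from html.parser import HTMLParser


class OpenGraphParser(HTMLParser):
    """Collect <meta property="og:*" content="..."> pairs."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr_map = dict(attrs)
        prop = attr_map.get("property") or ""
        content = attr_map.get("content")
        if prop.startswith("og:") and content is not None:
            # Assumption: the first value seen for a property wins.
            self.fields.setdefault(prop[len("og:"):], content)


def collect_open_graph(html):
    """Return a dict of og:* values, or None when the page has none."""
    parser = OpenGraphParser()
    parser.feed(html)
    parser.close()
    return parser.fields or None
```

Because the function returns None for pages without Open Graph tags, callers should check the result before reading fields from it.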
Structured metadata extraction for pyfetcher.
- Purpose:
Run extruct against HTML and combine it with lighter HTML/Open Graph parsing helpers for comprehensive metadata extraction.
- Design:
extruct is imported lazily so users can keep it optional.
Relative URLs are resolved with w3lib.html.get_base_url.
The output is normalized into PageMetadata.
Examples
>>> html = "<html><head><title>Example</title></head></html>"
>>> meta = extract_extruct_metadata(html, page_url="https://example.com")
>>> meta.title
'Example'
- pyfetcher.metadata.extruct.extract_extruct_metadata(html, *, page_url)
Extract combined page metadata using extruct plus HTML fallbacks.
Runs basic HTML metadata extraction and Open Graph parsing, then augments the result with structured data (JSON-LD, microdata, microformat, RDFa, Dublin Core, Open Graph) via extruct.
- Parameters:
- Returns:
A PageMetadata with all available metadata fields populated.
- Raises:
ImportError – If extruct or w3lib is not installed.
- Return type:
Examples
>>> meta = extract_extruct_metadata(
...     "<html><head><title>Example</title></head></html>",
...     page_url="https://example.com",
... )
>>> meta.title
'Example'
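The lazy-import design and the documented ImportError can be illustrated with a generic pattern. This is a sketch of the general technique, not pyfetcher's code: load_optional and parse_json_lazily are hypothetical helpers, and the standard-library json module stands in for an optional dependency such as extruct so that the example runs anywhere.

```python
import importlib


def load_optional(module_name, install_hint):
    """Import a module on first use, with a clearer error when it is missing."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"{module_name!r} is required for this feature; {install_hint}"
        ) from exc


def parse_json_lazily(text):
    # The import happens inside the function, so importing the surrounding
    # module never requires the optional dependency to be installed.
    json_module = load_optional("json", "it ships with the standard library")
    return json_module.loads(text)
```

With this pattern, users who never call the structured-data helper do not need the optional packages, while users who do call it without them get an ImportError at call time.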
Metadata models for pyfetcher.
- Purpose:
Provide reusable Pydantic models for page-level metadata parsed from HTML, Open Graph tags, and structured metadata extraction.
Examples
>>> PageMetadata(title="Home").title
'Home'
- class pyfetcher.metadata.models.FaviconLink(*, href, rel, sizes=None, mime_type=None)
Single favicon or related icon link.
Represents a <link> element from an HTML document that references an icon resource (favicon, apple-touch-icon, mask-icon).
- Parameters:
Examples
>>> FaviconLink(href="https://example.com/favicon.ico", rel="icon").rel
'icon'
- model_config = {'extra': 'forbid', 'frozen': True}
Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.
- class pyfetcher.metadata.models.OpenGraphMetadata(*, title=None, description=None, image=None, site_name=None, url=None, type=None)
Open Graph metadata model.
Captures the most common Open Graph (og:) meta tag values from an HTML document.
- Parameters:
Examples
>>> OpenGraphMetadata(title="Example").title
'Example'
- model_config = {'extra': 'forbid', 'frozen': True}
Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.
- class pyfetcher.metadata.models.PageMetadata(*, title=None, description=None, canonical_url=None, open_graph=None, favicons=<factory>, structured=None)
Combined page metadata model.
Aggregates metadata from multiple sources (HTML tags, Open Graph, structured data) into a single unified model.
- Parameters:
title (str | None) – Best-effort page title from <title> or og:title.
description (str | None) – Best-effort description from <meta> or og:description.
canonical_url (str | None) – Canonical URL from <link rel="canonical">.
open_graph (OpenGraphMetadata | None) – Parsed Open Graph metadata, if present.
favicons (list[FaviconLink]) – Collected favicon/icon links.
structured (dict[str, object] | None) – Raw structured metadata payload from extruct.
Examples
>>> PageMetadata(title="Home").title
'Home'
- model_config = {'extra': 'forbid', 'frozen': True}
Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.
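All three models share the configuration {'extra': 'forbid', 'frozen': True}: unknown fields are rejected at construction and instances cannot be modified afterwards. The sketch below imitates those two guarantees with a standard-library frozen dataclass so that it runs without Pydantic installed. IconSketch is a hypothetical stand-in, its field types are assumptions (the page does not list them), and the exception types differ from the ones Pydantic raises.

```python
from dataclasses import FrozenInstanceError, dataclass
from typing import Optional


@dataclass(frozen=True)
class IconSketch:
    """Hypothetical stand-in using the field names of FaviconLink."""

    href: str
    rel: str
    sizes: Optional[str] = None      # assumed type
    mime_type: Optional[str] = None  # assumed type


icon = IconSketch(href="https://example.com/favicon.ico", rel="icon")

try:
    icon.rel = "apple-touch-icon"  # frozen: assignment after construction fails
except FrozenInstanceError:
    frozen_rejected = True
else:
    frozen_rejected = False

try:
    IconSketch(href="/x.ico", rel="icon", color="red")  # unknown field
except TypeError:
    extra_rejected = True
else:
    extra_rejected = False
```

To change a value on a frozen model, build a new instance instead of mutating the existing one.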