Extractors

Article content extraction with a fallback chain for pyfetcher.extractors.

Purpose:: Extract readable article text from HTML using trafilatura as the primary extractor with readability-lxml as fallback.
Design:: In benchmarks, trafilatura achieves the highest F1 score (0.945), while readability-lxml has the highest median reliability (0.970). We try trafilatura first and fall back to readability-lxml on failure.

pyfetcher.extractors.content.extract_article_text(html, *, url=None)

Extract the main article text from HTML.

Uses trafilatura as the primary extractor, with readability-lxml as the fallback. Returns None if both extractors fail.

Parameters:

html (str) – Raw HTML string.
url (str | None) – Optional page URL for better extraction context.

Returns:

Extracted article text, or None.

Return type:

str | None
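
The fallback behaviour described above can be sketched as a generic chain of extractors. This is a minimal illustration under stated assumptions, not pyfetcher's implementation: `first_successful`, `flaky_extractor`, and `naive_extractor` are hypothetical names, and the two stand-in extractors only mimic the roles that trafilatura and readability-lxml play in the real function.

```python
import re
from typing import Callable, Iterable, Optional

# Hypothetical type alias: an extractor maps raw HTML to text, or None on failure.
Extractor = Callable[[str], Optional[str]]


def first_successful(html: str, extractors: Iterable[Extractor]) -> Optional[str]:
    """Return the first non-empty result from a chain of extractors, else None."""
    for extract in extractors:
        try:
            text = extract(html)
        except Exception:
            # A failing extractor should not abort the chain; try the next one.
            continue
        if text and text.strip():
            return text.strip()
    return None


def flaky_extractor(html: str) -> Optional[str]:
    """Hypothetical primary extractor that gives up on short documents."""
    if len(html) < 200:
        raise ValueError("document too short")
    return html


def naive_extractor(html: str) -> Optional[str]:
    """Hypothetical fallback: strip tags with a regex and collapse whitespace."""
    text = re.sub(r"<[^>]+>", " ", html)
    text = re.sub(r"\s+", " ", text).strip()
    return text or None


result = first_successful(
    "<article><p>Hello, world.</p></article>",
    [flaky_extractor, naive_extractor],
)
```

Treating both an exception and an empty result as a failure means the caller only has to handle a single `None` case, which matches the documented return type of `str | None`.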

HTML conversion utilities for pyfetcher.extractors.

Purpose:: Convert HTML to Markdown (using markdownify) or plaintext (using html2text).

pyfetcher.extractors.convert.html_to_markdown(html)

Convert HTML to Markdown using markdownify.

Parameters:: html (str) – Raw HTML string.
Returns:: Markdown-formatted text.
Return type:: str

pyfetcher.extractors.convert.html_to_plaintext(html)

Convert HTML to plaintext using html2text.

Parameters:: html (str) – Raw HTML string.
Returns:: Plaintext with basic formatting preserved.
Return type:: str

Article metadata extraction for pyfetcher.extractors.

Purpose:: Extract metadata specific to news articles (authors, publish date, summary) using newspaper3k.

class pyfetcher.extractors.article.ArticleMeta(title=None, authors=<factory>, publish_date=None, summary=None, top_image=None, keywords=<factory>)

Extracted article metadata.

Parameters:

title (str | None)
authors (list[str])
publish_date (str | None)
summary (str | None)
top_image (str | None)
keywords (list[str])
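
The `<factory>` defaults in the signature suggest a dataclass whose list fields use `default_factory`, so that each instance gets its own list. The sketch below shows that pattern under this assumption; `ArticleMetaSketch` is a hypothetical stand-in, and the real class may be defined differently.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ArticleMetaSketch:
    """Hypothetical mirror of the documented fields and defaults."""

    title: Optional[str] = None
    # default_factory creates a fresh list per instance instead of sharing one.
    authors: list[str] = field(default_factory=list)
    publish_date: Optional[str] = None
    summary: Optional[str] = None
    top_image: Optional[str] = None
    keywords: list[str] = field(default_factory=list)


first = ArticleMetaSketch(title="Example headline")
second = ArticleMetaSketch()

# Mutating one instance's list must not leak into another instance.
first.authors.append("A. Writer")
```

With this layout, every field is optional at construction time, so a partially successful extraction can still return a usable object.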

pyfetcher.extractors.article.extract_article_metadata(html, *, url)

Extract article metadata using newspaper3k.

Parameters:

html (str) – Raw HTML string.
url (str) – The article URL (required by newspaper3k).

Returns:

An ArticleMeta with extracted fields.

Return type:

ArticleMeta

Media file metadata extraction for pyfetcher.extractors.

Purpose:: Extract metadata from media files: audio (mutagen), video (pymediainfo), images (exifread), and PDFs (pypdf). Returns a unified dict.

pyfetcher.extractors.media_meta.extract_media_metadata(file_path)

Extract metadata from a media file based on its type.

Dispatches to the appropriate library based on the file extension.

Parameters:: file_path (str | Path) – Path to the media file.
Returns:: A dictionary of extracted metadata.
Return type:: dict[str, Any]
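
Extension-based dispatch of this kind can be sketched with a lookup table from suffix to handler. Everything here is illustrative: `dispatch_by_extension`, `make_handler`, and the extension table are hypothetical, the handlers return placeholder dictionaries rather than calling mutagen, pymediainfo, exifread, or pypdf, and returning an empty dict for unknown extensions is an assumption about behaviour the documentation does not specify.

```python
from pathlib import Path
from typing import Any, Callable, Union

Handler = Callable[[Path], dict[str, Any]]


def make_handler(kind: str) -> Handler:
    """Build a placeholder handler; a real one would call a metadata library."""
    def handler(path: Path) -> dict[str, Any]:
        return {"kind": kind, "file": path.name}
    return handler


# Hypothetical extension table; the real mapping is not documented here.
HANDLERS: dict[str, Handler] = {
    ".mp3": make_handler("audio"),
    ".flac": make_handler("audio"),
    ".mp4": make_handler("video"),
    ".mkv": make_handler("video"),
    ".jpg": make_handler("image"),
    ".png": make_handler("image"),
    ".pdf": make_handler("pdf"),
}


def dispatch_by_extension(file_path: Union[str, Path]) -> dict[str, Any]:
    """Pick a handler from the lower-cased file suffix and run it."""
    path = Path(file_path)
    handler = HANDLERS.get(path.suffix.lower())
    if handler is None:
        # Assumption: unsupported types yield an empty result, not an error.
        return {}
    return handler(path)


audio_meta = dispatch_by_extension("song.MP3")
pdf_meta = dispatch_by_extension(Path("paper.pdf"))
unknown_meta = dispatch_by_extension("notes.txt")
```

Lower-casing the suffix before the lookup keeps files such as `song.MP3` from falling through to the unsupported case.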