Extractors

Article content extraction with fallback chain for pyfetcher.extractors.

Purpose:

Extract readable article text from HTML using trafilatura as the primary extractor with readability-lxml as fallback.

Design:

trafilatura achieves the highest F1 score (0.945) in benchmarks, while readability-lxml has the highest median reliability (0.970). The extractor therefore tries trafilatura first and falls back to readability-lxml on failure.
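The fallback chain described above can be sketched with the standard library alone. This is an illustration of the pattern, not the module's actual code: `extract_with_fallback`, `primary`, and `fallback` are hypothetical names, and the two stub extractors merely stand in for trafilatura and readability-lxml. It also assumes that an exception or an empty result counts as a failure.

```python
from typing import Callable, Iterable, Optional


def extract_with_fallback(
    html: str,
    extractors: Iterable[Callable[[str], Optional[str]]],
) -> Optional[str]:
    """Return the first non-empty result from the extractors, or None."""
    for extractor in extractors:
        try:
            text = extractor(html)
        except Exception:
            # Assumption: a crash in one extractor counts as a failure,
            # so the chain moves on to the next one.
            continue
        if text and text.strip():
            return text
    return None


# Hypothetical stand-ins for the primary and fallback extractors.
def primary(html: str) -> Optional[str]:
    return None  # simulate the primary extractor finding nothing


def fallback(html: str) -> Optional[str]:
    return "Hello, world." if "<p>" in html else None


result = extract_with_fallback("<p>Hello, world.</p>", [primary, fallback])
print(result)  # prints: Hello, world.
```

Ordering the list from most accurate to most tolerant keeps the common case fast while still returning something when the primary extractor comes back empty.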

pyfetcher.extractors.content.extract_article_text(html, *, url=None)

Extract the main article text from HTML.

Uses trafilatura as the primary extractor with readability-lxml as fallback. Returns None if both extractors fail.

Parameters:
  • html (str) – Raw HTML string.

  • url (str | None) – Optional page URL for better extraction context.

Returns:

Extracted article text, or None.

Return type:

str | None

HTML conversion utilities for pyfetcher.extractors.

Purpose:

Convert HTML to Markdown (using markdownify) or to plaintext (using html2text).

pyfetcher.extractors.convert.html_to_markdown(html)

Convert HTML to Markdown using markdownify.

Parameters:

html (str) – Raw HTML string.

Returns:

Markdown-formatted text.

Return type:

str

pyfetcher.extractors.convert.html_to_plaintext(html)

Convert HTML to plaintext using html2text.

Parameters:

html (str) – Raw HTML string.

Returns:

Plaintext with basic formatting preserved.

Return type:

str
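To make the str-in, str-out contract of these converters concrete, here is a minimal standard-library stand-in for HTML-to-plaintext conversion. It does not use html2text or markdownify and will not reproduce their output. `naive_plaintext` and the set of block-level tags are assumptions made purely for illustration.

```python
from html.parser import HTMLParser

# Assumption: a small set of tags treated as line-breaking block elements.
_BLOCK_TAGS = {"p", "div", "br", "li", "h1", "h2", "h3", "h4", "h5", "h6"}


class _TextCollector(HTMLParser):
    """Collect text nodes, skipping <script> and <style> contents."""

    def __init__(self) -> None:
        super().__init__()
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag in _BLOCK_TAGS:
            self._chunks.append("\n")

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            if self._skip_depth:
                self._skip_depth -= 1
        elif tag in _BLOCK_TAGS:
            self._chunks.append("\n")

    def handle_data(self, data):
        if not self._skip_depth:
            self._chunks.append(data)

    def text(self) -> str:
        # Collapse whitespace within each line and drop empty lines.
        lines = (" ".join(line.split()) for line in "".join(self._chunks).split("\n"))
        return "\n".join(line for line in lines if line)


def naive_plaintext(html: str) -> str:
    """Strip tags from an HTML string, keeping one line per block element."""
    parser = _TextCollector()
    parser.feed(html)
    parser.close()
    return parser.text()


sample = "<h1>Title</h1><p>Body <b>text</b>.</p><script>x()</script>"
print(naive_plaintext(sample))
```

A dedicated converter handles far more than this sketch does, such as links, lists, tables, and entities, which is why the module delegates to libraries instead of walking the markup itself.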

Article metadata extraction for pyfetcher.extractors.

Purpose:

Extract article-specific metadata (authors, publish date, summary) from news articles using newspaper3k.

class pyfetcher.extractors.article.ArticleMeta(title=None, authors=<factory>, publish_date=None, summary=None, top_image=None, keywords=<factory>)

Extracted article metadata.

Parameters:
  • title (str | None)

  • authors (list[str])

  • publish_date (str | None)

  • summary (str | None)

  • top_image (str | None)

  • keywords (list[str])
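The `<factory>` defaults in the signature suggest that `authors` and `keywords` are built by a default factory, so each instance gets its own list. The sketch below reconstructs the class as a standard dataclass under that assumption. The actual definition in pyfetcher may differ (for example, it could be frozen or use another model library).

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ArticleMeta:
    """Sketch of the container, inferred from the documented signature."""

    title: Optional[str] = None
    # default_factory gives every instance a fresh list instead of a shared one.
    authors: list[str] = field(default_factory=list)
    publish_date: Optional[str] = None
    summary: Optional[str] = None
    top_image: Optional[str] = None
    keywords: list[str] = field(default_factory=list)


a = ArticleMeta(title="First")
b = ArticleMeta(title="Second")
a.authors.append("Ada")
print(a.authors)  # prints: ['Ada']
print(b.authors)  # prints: []
```

Because every field has a default, callers can construct an empty `ArticleMeta()` and fill in only the fields that were actually found.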

pyfetcher.extractors.article.extract_article_metadata(html, *, url)

Extract article metadata using newspaper3k.

Parameters:
  • html (str) – Raw HTML string.

  • url (str) – The article URL (required by newspaper3k).

Returns:

An ArticleMeta with extracted fields.

Return type:

ArticleMeta

Media file metadata extraction for pyfetcher.extractors.

Purpose:

Extract metadata from media files: audio (mutagen), video (pymediainfo), images (exifread), and PDFs (pypdf). Returns a unified dict.

pyfetcher.extractors.media_meta.extract_media_metadata(file_path)

Extract metadata from a media file based on its type.

Dispatches to the appropriate library based on file extension.

Parameters:

file_path (str | Path) – Path to the media file.

Returns:

A dictionary of extracted metadata.

Return type:

dict[str, Any]
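Dispatching on file extension can be sketched as a lookup table from suffix to handler. Everything below is illustrative: the handler functions are hypothetical stubs that stand in for calls to mutagen, pymediainfo, exifread, and pypdf, the list of extensions is an assumption, and returning an empty dict for unknown types is a guess at one reasonable behaviour.

```python
from pathlib import Path
from typing import Any, Callable, Union


# Hypothetical handlers; real ones would call the corresponding library.
def _audio(path: Path) -> dict[str, Any]:
    return {"kind": "audio", "name": path.name}


def _video(path: Path) -> dict[str, Any]:
    return {"kind": "video", "name": path.name}


def _image(path: Path) -> dict[str, Any]:
    return {"kind": "image", "name": path.name}


def _pdf(path: Path) -> dict[str, Any]:
    return {"kind": "pdf", "name": path.name}


# Assumed extension table; the set actually supported is not documented here.
_HANDLERS: dict[str, Callable[[Path], dict[str, Any]]] = {
    ".mp3": _audio,
    ".flac": _audio,
    ".mp4": _video,
    ".mkv": _video,
    ".jpg": _image,
    ".jpeg": _image,
    ".png": _image,
    ".pdf": _pdf,
}


def dispatch_by_extension(file_path: Union[str, Path]) -> dict[str, Any]:
    """Pick a handler from the file suffix, case-insensitively."""
    path = Path(file_path)
    handler = _HANDLERS.get(path.suffix.lower())
    if handler is None:
        return {}  # assumption: unknown types yield an empty dict
    return handler(path)


print(dispatch_by_extension("song.MP3"))
```

A table keeps the dispatch logic in one place: supporting a new file type means adding one entry, without touching the lookup function.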