Extractors
Article content extraction with a fallback chain for pyfetcher.extractors.
- Purpose:
Extract readable article text from HTML using trafilatura as the primary extractor with readability-lxml as fallback.
- Design:
trafilatura achieves the highest F1 score (0.945) in benchmarks, while readability-lxml has the highest median reliability (0.970). We try trafilatura first and fall back to readability-lxml on failure.
- pyfetcher.extractors.content.extract_article_text(html, *, url=None)
Extract the main article text from HTML.
Uses trafilatura as the primary extractor with readability-lxml as fallback. Returns None if extraction fails entirely.
HTML conversion utilities for pyfetcher.extractors.
- Purpose:
Convert HTML to Markdown (using markdownify) or plaintext (using html2text).
- pyfetcher.extractors.convert.html_to_markdown(html)
Convert HTML to Markdown using markdownify.
- pyfetcher.extractors.convert.html_to_plaintext(html)
Convert HTML to plaintext using html2text.
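As a rough sketch of what the two conversions might look like when calling the underlying libraries directly: `to_markdown` and `to_plaintext` are hypothetical names, and the html2text options chosen here are assumptions rather than pyfetcher's actual settings. The imports are guarded because both packages are third-party.

```python
# Sketch only: an assumption about how the two libraries could be called,
# not pyfetcher's actual implementation.
try:
    import html2text
    from markdownify import markdownify
except ImportError:  # both are optional third-party packages
    html2text = None
    markdownify = None


def to_markdown(html: str) -> str:
    """Convert HTML to Markdown with markdownify."""
    if markdownify is None:
        raise RuntimeError("markdownify is not installed")
    return markdownify(html).strip()


def to_plaintext(html: str) -> str:
    """Convert HTML to text with html2text, dropping links and images."""
    if html2text is None:
        raise RuntimeError("html2text is not installed")
    converter = html2text.HTML2Text()
    converter.ignore_links = True    # keep link text, drop link targets
    converter.ignore_images = True   # drop image markup
    converter.body_width = 0         # disable hard line wrapping
    return converter.handle(html).strip()
```

html2text emits Markdown-flavoured text by default, so settings such as `ignore_links` are one way to get closer to plain text.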
Article metadata extraction for pyfetcher.extractors.
- Purpose:
Extract article-specific metadata (author, publish date, summary) using newspaper3k for news articles.
- class pyfetcher.extractors.article.ArticleMeta(title=None, authors=<factory>, publish_date=None, summary=None, top_image=None, keywords=<factory>)
Extracted article metadata.
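The `<factory>` defaults in the signature suggest a dataclass whose list fields use `default_factory`. A minimal sketch under that assumption follows. The class name `ArticleMetaSketch` and all field types are guesses for illustration, since the reference above lists only field names and defaults.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional


@dataclass
class ArticleMetaSketch:
    """Hypothetical stand-in mirroring the documented field names."""

    title: Optional[str] = None
    # default_factory gives every instance its own list instead of a shared one.
    authors: List[str] = field(default_factory=list)
    publish_date: Optional[datetime] = None  # type assumed
    summary: Optional[str] = None
    top_image: Optional[str] = None          # assumed to be an image URL
    keywords: List[str] = field(default_factory=list)


first = ArticleMetaSketch(title="Example")
second = ArticleMetaSketch()
first.authors.append("A. Writer")
```

Because each instance gets a fresh list, appending to `first.authors` leaves `second.authors` empty.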
- pyfetcher.extractors.article.extract_article_metadata(html, *, url)
Extract article metadata using newspaper3k.
- Parameters:
- Returns:
An ArticleMeta with extracted fields.
- Return type:
Media file metadata extraction for pyfetcher.extractors.
- Purpose:
Extract metadata from media files: audio (mutagen), video (pymediainfo), images (exifread), and PDFs (pypdf). Returns a unified dict.
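One way to produce a unified dict across file types is to dispatch on the guessed MIME type. The sketch below uses only the standard library, with stub handlers in place of the real parsers. `media_kind`, `extract_media_metadata_sketch`, and the shape of the returned dict are all hypothetical, not the module's actual API.

```python
import mimetypes
from typing import Any, Callable, Dict, Optional

Handler = Callable[[str], Dict[str, Any]]


def _stub(kind: str) -> Handler:
    """Build a placeholder handler; a real one would parse the file."""
    def handler(path: str) -> Dict[str, Any]:
        return {"source": kind}
    return handler


# Hypothetical mapping from media kind to handler. The comments name the
# library the documentation associates with each kind.
HANDLERS: Dict[str, Handler] = {
    "audio": _stub("audio"),  # mutagen
    "video": _stub("video"),  # pymediainfo
    "image": _stub("image"),  # exifread
    "pdf": _stub("pdf"),      # pypdf
}


def media_kind(path: str) -> Optional[str]:
    """Classify a path as audio, video, image, pdf, or None if unsupported."""
    mime, _encoding = mimetypes.guess_type(path)
    if mime is None:
        return None
    if mime == "application/pdf":
        return "pdf"
    major = mime.split("/", 1)[0]
    return major if major in HANDLERS else None


def extract_media_metadata_sketch(path: str) -> Dict[str, Any]:
    """Return one dict shape regardless of the media type."""
    kind = media_kind(path)
    fields = HANDLERS[kind](path) if kind is not None else {}
    return {"path": path, "kind": kind, "fields": fields}
```

Keeping the outer keys fixed and placing type-specific values under `fields` lets callers handle every media type the same way.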