fetchkit
========

**Agentic web infrastructure for autonomous fetching, scraping, and content acquisition.**

.. mermaid::

   flowchart LR
       Agent["🤖 AI Agent"] --> MCP["MCP Server<br/>16 tools · 4 resources · 4 prompts"]
       MCP --> Core

       subgraph Core["fetchkit Core"]
           direction TB
           Fetch["⚡ Fetch<br/>4 backends · headers · retry · rate limit"]
           Scrape["🔍 Scrape<br/>CSS · links · forms · tables · robots"]
           Extract["📄 Extract<br/>trafilatura · markdown · media meta"]
           Download["💾 Download<br/>yt-dlp · gallery-dl · HTTP streaming"]
       end

       Core --> Infra

       subgraph Infra["Infrastructure"]
           direction TB
           PG[("PostgreSQL<br/>jobs · pages · media")]
           MinIO[("MinIO<br/>object storage")]
       end

       style MCP fill:#E74C3C,color:#fff
       style Fetch fill:#4A90D9,color:#fff
       style Scrape fill:#7B68EE,color:#fff
       style Extract fill:#2ECC71,color:#fff
       style Download fill:#F39C12,color:#fff
       style PG fill:#336791,color:#fff
       style MinIO fill:#C72C48,color:#fff

|

.. code-block:: bash

   pip install fetchkit              # Core library
   pip install 'fetchkit[mcp]'       # + MCP server for AI agents
   pip install 'fetchkit[pipeline]'  # + Postgres + MinIO pipeline
   pip install 'fetchkit[full]'      # Everything

Import as ``import pyfetcher``.

Key Features
------------

.. grid:: 2

    .. grid-item-card:: 🤖 MCP Server
        :link: mcp
        :link-type: doc

        16 tools for AI agents. Fetch, scrape, extract, download -- all with
        structured Pydantic outputs. Works with Claude Desktop, Claude Code,
        and LangChain.

    .. grid-item-card:: 🔍 Web Scraping
        :link: scraping
        :link-type: doc

        CSS selectors, link harvesting, form parsing, table extraction,
        robots.txt, sitemap parsing, and readable text extraction.

    .. grid-item-card:: 🌐 Browser Headers
        :link: headers
        :link-type: doc

        11 browser profiles with consistent User-Agent, Client Hints, and
        Sec-Fetch-* headers. Market-share-weighted rotation. TLS
        fingerprinting via curl_cffi.

    .. grid-item-card:: ⚡ Pipeline
        :link: pipeline
        :link-type: doc

        Event-driven crawl → scrape → download stages via Postgres
        LISTEN/NOTIFY. URL frontier with dedup, politeness, and RSS/Atom
        feed monitoring.

    .. grid-item-card:: 📄 Content Extraction
        :link: api/extractors_api
        :link-type: doc

        trafilatura + readability-lxml fallback, HTML to markdown/plaintext,
        newspaper3k article metadata, audio/video/image/PDF metadata.

    .. grid-item-card:: 💾 Downloaders
        :link: api/downloaders
        :link-type: doc

        Deep yt-dlp integration with progress hooks. gallery-dl for 170+
        image sites. Direct HTTP streaming with SHA-256 checksums.

.. toctree::
   :maxdepth: 2
   :caption: Getting Started
   :hidden:

   quickstart
   mcp

.. toctree::
   :maxdepth: 2
   :caption: Core Features
   :hidden:

   headers
   scraping

.. toctree::
   :maxdepth: 2
   :caption: Infrastructure
   :hidden:

   pipeline
   infrastructure

.. toctree::
   :maxdepth: 2
   :caption: Interfaces
   :hidden:

   cli
   tui

.. toctree::
   :maxdepth: 2
   :caption: API Reference
   :hidden:

   api/index

Indices and tables
==================

* :ref:`genindex`
* :ref:`modindex`
* :ref:`search`