fetchkit

Agentic web infrastructure for autonomous fetching, scraping, and content acquisition.

flowchart LR
    Agent["🤖 AI Agent"] --> MCP["MCP Server<br/>16 tools · 4 resources · 4 prompts"]
    MCP --> Core

    subgraph Core["fetchkit Core"]
        direction TB
        Fetch["⚡ Fetch<br/>4 backends · headers · retry · rate limit"]
        Scrape["🔍 Scrape<br/>CSS · links · forms · tables · robots"]
        Extract["📄 Extract<br/>trafilatura · markdown · media meta"]
        Download["💾 Download<br/>yt-dlp · gallery-dl · HTTP streaming"]
    end

    Core --> Infra

    subgraph Infra["Infrastructure"]
        direction TB
        PG[("PostgreSQL<br/>jobs · pages · media")]
        MinIO[("MinIO<br/>object storage")]
    end

    style MCP fill:#E74C3C,color:#fff
    style Fetch fill:#4A90D9,color:#fff
    style Scrape fill:#7B68EE,color:#fff
    style Extract fill:#2ECC71,color:#fff
    style Download fill:#F39C12,color:#fff
    style PG fill:#336791,color:#fff
    style MinIO fill:#C72C48,color:#fff
    

pip install fetchkit              # Core library
pip install 'fetchkit[mcp]'       # + MCP server for AI agents
pip install 'fetchkit[pipeline]'  # + Postgres + MinIO pipeline
pip install 'fetchkit[full]'      # Everything

The import name differs from the package name: install fetchkit, then use import pyfetcher in code.

Key Features

🤖 MCP Server

16 tools for AI agents. Fetch, scrape, extract, download – all with structured Pydantic outputs. Works with Claude Desktop, Claude Code, and LangChain.

🔍 Web Scraping

CSS selectors, link harvesting, form parsing, table extraction, robots.txt, sitemap parsing, and readable text extraction.
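
As a rough illustration of two of the steps named above, link harvesting and a robots.txt check, here is a minimal sketch that uses only the Python standard library. It does not show fetchkit's own API; the names LinkHarvester, harvest_links, and allowed_by_robots are hypothetical and exist only for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.robotparser import RobotFileParser


class LinkHarvester(HTMLParser):
    """Collect absolute URLs from <a href="..."> attributes (hypothetical helper)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page URL.
                self.links.append(urljoin(self.base_url, value))


def harvest_links(html, base_url):
    """Return every link found in the HTML, in document order."""
    parser = LinkHarvester(base_url)
    parser.feed(html)
    parser.close()
    return parser.links


def allowed_by_robots(robots_txt, user_agent, url):
    """Check a URL against robots.txt content that has already been fetched."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules.can_fetch(user_agent, url)
```

A caller would typically filter the harvested links through the robots check before queueing them for fetching.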

🌐 Browser Headers

11 browser profiles with consistent User-Agent, Client Hints, and Sec-Fetch-* headers. Market-share-weighted rotation. TLS fingerprinting via curl_cffi.
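
The idea of weighted rotation with internally consistent headers can be sketched as follows. This is an assumption-laden illustration, not fetchkit's implementation: the two profiles, their weights, the header values, and the pick_headers function are all made up for the example, and TLS fingerprinting is out of scope here.

```python
import random

# Hypothetical profiles with illustrative weights (not real market-share data).
PROFILES = [
    {
        "name": "chrome-like",
        "weight": 0.65,
        "headers": {
            "User-Agent": (
                "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
                "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
            ),
            # Client Hints must agree with the User-Agent they accompany.
            "Sec-CH-UA": '"Chromium";v="120", "Google Chrome";v="120"',
            "Sec-CH-UA-Mobile": "?0",
            "Sec-CH-UA-Platform": '"Linux"',
        },
    },
    {
        "name": "firefox-like",
        "weight": 0.35,
        # Firefox does not send Sec-CH-UA headers, so none are listed here.
        "headers": {
            "User-Agent": (
                "Mozilla/5.0 (X11; Linux x86_64; rv:121.0) "
                "Gecko/20100101 Firefox/121.0"
            ),
        },
    },
]

# Sec-Fetch-* values for a top-level navigation typed into the address bar.
NAVIGATION_HEADERS = {
    "Sec-Fetch-Dest": "document",
    "Sec-Fetch-Mode": "navigate",
    "Sec-Fetch-Site": "none",
    "Sec-Fetch-User": "?1",
}


def pick_headers(rng=None):
    """Pick one whole profile by weight, so headers never mix browsers."""
    chooser = rng if rng is not None else random
    weights = [profile["weight"] for profile in PROFILES]
    profile = chooser.choices(PROFILES, weights=weights, k=1)[0]
    return {**profile["headers"], **NAVIGATION_HEADERS}
```

Selecting a complete profile, rather than rotating each header independently, is what keeps the User-Agent and Client Hints in agreement.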

⚡ Pipeline

Event-driven crawl → scrape → download stages via Postgres LISTEN/NOTIFY. URL frontier with dedup, politeness, and RSS/Atom feed monitoring.
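
To make "URL frontier with dedup and politeness" concrete, here is a small in-memory sketch. It is a simplification under stated assumptions: the Frontier class, its method names, and the fixed per-host delay are hypothetical, and a real pipeline would persist this state (for example in Postgres) rather than keep it in process memory.

```python
import heapq
import time
from urllib.parse import urlsplit, urlunsplit


def normalize(url):
    """Canonicalize a URL for dedup: lowercase scheme/host, drop the fragment."""
    parts = urlsplit(url)
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path or "/", parts.query, "")
    )


class Frontier:
    """Hypothetical URL frontier: each URL is queued once, hosts are rate limited."""

    def __init__(self, delay_seconds=1.0, clock=time.monotonic):
        self.delay = delay_seconds
        self.clock = clock
        self.seen = set()    # normalized URLs already queued
        self.heap = []       # entries are (ready_at, sequence, url)
        self.next_slot = {}  # host -> earliest time its next request may start
        self.sequence = 0    # tie-breaker that preserves insertion order

    def add(self, url):
        """Queue a URL; return False if it was already seen."""
        key = normalize(url)
        if key in self.seen:
            return False
        self.seen.add(key)
        host = urlsplit(key).netloc
        # Schedule after the host's previous slot to enforce the delay.
        ready_at = max(self.clock(), self.next_slot.get(host, 0.0))
        self.next_slot[host] = ready_at + self.delay
        heapq.heappush(self.heap, (ready_at, self.sequence, key))
        self.sequence += 1
        return True

    def pop_ready(self):
        """Return the next URL whose scheduled time has arrived, else None."""
        if self.heap and self.heap[0][0] <= self.clock():
            return heapq.heappop(self.heap)[2]
        return None
```

Assigning time slots when a URL is added keeps the sketch short; it means URLs for a busy host are spaced out in advance, while other hosts remain immediately available.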

📄 Content Extraction

trafilatura + readability-lxml fallback, HTML to markdown/plaintext, newspaper3k article metadata, audio/video/image/PDF metadata.
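
The fallback pattern described above (try the preferred extractor, fall back when it fails or returns nothing) can be sketched generically. The example below is standard-library only and does not call trafilatura or readability-lxml; extract_with_fallback and plain_text are hypothetical names, and plain_text stands in as a crude last-resort extractor.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Gather visible text, ignoring the contents of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def plain_text(html):
    """Last-resort extractor: all visible text joined by spaces, or None."""
    collector = _TextCollector()
    collector.feed(html)
    collector.close()
    return " ".join(collector.parts) or None


def extract_with_fallback(html, extractors):
    """Return the first non-empty result; a failing extractor is skipped."""
    for extractor in extractors:
        try:
            text = extractor(html)
        except Exception:
            text = None  # treat a crash like an empty result and move on
        if text:
            return text
    return None
```

In a real setup the list of extractors would presumably start with wrappers around the higher-quality libraries and end with something simple like plain_text.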

💾 Downloaders

Deep yt-dlp integration with progress hooks. gallery-dl for 170+ image sites. Direct HTTP streaming with SHA-256 checksums.
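
Streaming a download while computing its SHA-256 checksum can be done in a single pass, so the payload never has to fit in memory. The sketch below assumes any file-like source with a read method (such as the response object returned by urllib.request.urlopen); stream_to_file is a hypothetical helper, not a fetchkit function.

```python
import hashlib
import io


def stream_to_file(source, destination, chunk_size=64 * 1024):
    """Copy source to destination in chunks; return (byte_count, sha256_hex)."""
    digest = hashlib.sha256()
    total = 0
    while True:
        chunk = source.read(chunk_size)
        if not chunk:
            break  # an empty read signals end of stream
        destination.write(chunk)
        digest.update(chunk)  # hash exactly the bytes that were written
        total += len(chunk)
    return total, digest.hexdigest()


# Usage with in-memory streams standing in for a network response and a file.
payload = b"example payload " * 1000
size, checksum = stream_to_file(io.BytesIO(payload), io.BytesIO(), chunk_size=1024)
```

Comparing the returned checksum with a published digest, when one is available, confirms that the file arrived intact.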

