fetchkit¶
Agentic web infrastructure for autonomous fetching, scraping, and content acquisition.
flowchart LR
Agent["🤖 AI Agent"] --> MCP["MCP Server<br/>16 tools · 4 resources · 4 prompts"]
MCP --> Core
subgraph Core["fetchkit Core"]
direction TB
Fetch["⚡ Fetch<br/>4 backends · headers · retry · rate limit"]
Scrape["🔍 Scrape<br/>CSS · links · forms · tables · robots"]
Extract["📄 Extract<br/>trafilatura · markdown · media meta"]
Download["💾 Download<br/>yt-dlp · gallery-dl · HTTP streaming"]
end
Core --> Infra
subgraph Infra["Infrastructure"]
direction TB
PG[("PostgreSQL<br/>jobs · pages · media")]
MinIO[("MinIO<br/>object storage")]
end
style MCP fill:#E74C3C,color:#fff
style Fetch fill:#4A90D9,color:#fff
style Scrape fill:#7B68EE,color:#fff
style Extract fill:#2ECC71,color:#fff
style Download fill:#F39C12,color:#fff
style PG fill:#336791,color:#fff
style MinIO fill:#C72C48,color:#fff
pip install fetchkit # Core library
pip install 'fetchkit[mcp]' # + MCP server for AI agents
pip install 'fetchkit[pipeline]' # + Postgres + MinIO pipeline
pip install 'fetchkit[full]' # Everything
Import as import pyfetcher.
Key Features¶
16 tools for AI agents. Fetch, scrape, extract, download – all with structured Pydantic outputs. Works with Claude Desktop, Claude Code, and LangChain.
CSS selectors, link harvesting, form parsing, table extraction, robots.txt, sitemap parsing, and readable text extraction.
11 browser profiles with consistent User-Agent, Client Hints, and Sec-Fetch-* headers. Market-share-weighted rotation. TLS fingerprinting via curl_cffi.
Event-driven crawl → scrape → download stages via Postgres LISTEN/NOTIFY. URL frontier with dedup, politeness, and RSS/Atom feed monitoring.
trafilatura + readability-lxml fallback, HTML to markdown/plaintext, newspaper3k article metadata, audio/video/image/PDF metadata.
Deep yt-dlp integration with progress hooks. gallery-dl for 170+ image sites. Direct HTTP streaming with SHA-256 checksums.