Downloaders

Base downloader protocol for pyfetcher.downloaders.

Purpose:

Define the common interface for all downloader implementations.

class pyfetcher.downloaders.base.MediaInfo(url, title=None, description=None, duration_seconds=None, thumbnail_url=None, uploader=None, upload_date=None, file_size_bytes=None, mime_type=None, ext=None, extra=<factory>)

Extracted media metadata before download.

Parameters:
  • url (str)

  • title (str | None)

  • description (str | None)

  • duration_seconds (float | None)

  • thumbnail_url (str | None)

  • uploader (str | None)

  • upload_date (str | None)

  • file_size_bytes (int | None)

  • mime_type (str | None)

  • ext (str | None)

  • extra (dict[str, Any])
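
The `extra=<factory>` default in the signature above is consistent with a dataclass field built from a default factory. The sketch below works under that assumption; `MediaInfoSketch` is a hypothetical stand-in that mirrors only a few of the documented fields, not the library's actual definition. It shows why a factory default matters: each instance gets its own `extra` mapping.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any


@dataclass
class MediaInfoSketch:
    """Hypothetical stand-in mirroring part of the documented MediaInfo signature."""

    url: str
    title: str | None = None
    duration_seconds: float | None = None
    ext: str | None = None
    # A default factory creates a fresh dict for every instance.
    extra: dict[str, Any] = field(default_factory=dict)


first = MediaInfoSketch(url="https://example.com/a.mp4", title="Clip A", ext="mp4")
second = MediaInfoSketch(url="https://example.com/b.mp4")

# Mutating one instance's extra mapping does not affect the other.
first.extra["codec"] = "h264"
print(first.extra, second.extra)
```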

class pyfetcher.downloaders.base.DownloadResult(source_url, local_path=None, minio_key=None, minio_bucket=None, filename=None, file_size_bytes=None, mime_type=None, checksum_sha256=None, media_info=None, media_metadata=<factory>)

Result of a completed download.

Parameters:
  • source_url (str)

  • local_path (str | None)

  • minio_key (str | None)

  • minio_bucket (str | None)

  • filename (str | None)

  • file_size_bytes (int | None)

  • mime_type (str | None)

  • checksum_sha256 (str | None)

  • media_info (MediaInfo | None)

  • media_metadata (dict[str, Any])

class pyfetcher.downloaders.base.DownloadProgress(status, downloaded_bytes=0, total_bytes=None, speed_bytes_per_sec=None, eta_seconds=None, filename=None, percent=None)

Progress update during a download.

Parameters:
  • status (str)

  • downloaded_bytes (int)

  • total_bytes (int | None)

  • speed_bytes_per_sec (float | None)

  • eta_seconds (float | None)

  • filename (str | None)

  • percent (float | None)
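
Because `total_bytes` and `percent` may both be None, a progress consumer probably has to handle a missing total. The sketch below shows one way a callback might render an update. `ProgressSketch` and `format_progress` are hypothetical names, and the 0 to 100 scale for `percent` is an assumption; the reference above does not state the scale.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class ProgressSketch:
    """Hypothetical stand-in for a subset of the DownloadProgress fields."""

    status: str
    downloaded_bytes: int = 0
    total_bytes: int | None = None
    percent: float | None = None


def format_progress(progress: ProgressSketch) -> str:
    """Render one update, deriving a percentage when only byte counts exist."""
    percent = progress.percent
    if percent is None and progress.total_bytes:
        # Assumption: percent is expressed on a 0-100 scale.
        percent = 100.0 * progress.downloaded_bytes / progress.total_bytes
    if percent is None:
        # Total size unknown: fall back to the raw byte count.
        return f"{progress.status}: {progress.downloaded_bytes} bytes"
    return f"{progress.status}: {percent:.1f}%"


print(format_progress(ProgressSketch("downloading", 512, 2048)))
print(format_progress(ProgressSketch("downloading", 512)))
```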

class pyfetcher.downloaders.base.DownloaderProtocol(*args, **kwargs)

Protocol for downloader implementations.
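
The protocol's members are not listed here, but the three implementations documented in this section all expose `extract_info(url)` and `download(url, *, output_dir=None, progress_callback=None)`. Assuming the protocol declares those two coroutines, it might look roughly like the sketch below. `DownloaderSketch`, `EchoDownloader`, and `fetch_all` are hypothetical names used only for illustration.

```python
from __future__ import annotations

import asyncio
from typing import Any, Callable, Protocol, runtime_checkable


@runtime_checkable
class DownloaderSketch(Protocol):
    """Hypothetical shape inferred from the documented implementations."""

    async def extract_info(self, url: str) -> list[Any]: ...

    async def download(
        self,
        url: str,
        *,
        output_dir: str | None = None,
        progress_callback: Callable[[Any], None] | None = None,
    ) -> list[Any]: ...


class EchoDownloader:
    """Toy implementation that satisfies the protocol structurally."""

    async def extract_info(self, url):
        return [{"url": url}]

    async def download(self, url, *, output_dir=None, progress_callback=None):
        return [{"source_url": url, "local_path": None}]


async def fetch_all(downloader: DownloaderSketch, urls: list[str]) -> list[Any]:
    """Depend on the protocol only, not on a concrete downloader class."""
    results: list[Any] = []
    for url in urls:
        results.extend(await downloader.download(url))
    return results


results = asyncio.run(
    fetch_all(EchoDownloader(), ["https://example.com/a", "https://example.com/b"])
)
print(results)
```

Coding against the protocol rather than a concrete class keeps calling code indifferent to which backend handles a URL.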

yt-dlp deep integration for pyfetcher.downloaders.

Purpose:

Wrap yt-dlp’s YoutubeDL Python API with progress hooks, metadata extraction, and structured output for pipeline integration.

class pyfetcher.downloaders.ytdlp.YtdlpDownloader(*, format_spec='bestvideo[ext=mp4]+bestaudio[ext=m4a]/best[ext=mp4]/best', extra_opts=None)

Deep yt-dlp integration via the YoutubeDL Python API.

Hooks into progress_hooks for real-time download tracking and converts info_dict to structured MediaInfo/DownloadResult models.

Parameters:
  • format_spec (str) – yt-dlp format selection string.

  • extra_opts (dict[str, Any] | None) – Additional yt-dlp options dict.

async extract_info(url)

Extract metadata without downloading.

Parameters:
  • url (str) – The URL to extract info from.

Returns:

A list of MediaInfo objects (one per video/track).

Return type:

list[MediaInfo]
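
Returning one item per video or track suggests that playlist results are flattened before conversion. The sketch below illustrates that idea on plain dictionaries shaped like a yt-dlp info_dict, using keys such as `entries`, `webpage_url`, `duration`, `thumbnail`, and `filesize_approx`. The helper names and the exact field mapping are assumptions, not the library's implementation.

```python
from __future__ import annotations

from typing import Any


def flatten_entries(info: dict[str, Any]) -> list[dict[str, Any]]:
    """Return leaf entries, descending into nested playlists."""
    entries = info.get("entries")
    if entries is None:
        return [info]
    flat: list[dict[str, Any]] = []
    for entry in entries:
        if entry:  # defensively skip missing (None) entries
            flat.extend(flatten_entries(entry))
    return flat


def to_media_fields(entry: dict[str, Any]) -> dict[str, Any]:
    """Map info_dict keys onto the documented MediaInfo field names (assumed mapping)."""
    return {
        "url": entry.get("webpage_url") or entry.get("url"),
        "title": entry.get("title"),
        "description": entry.get("description"),
        "duration_seconds": entry.get("duration"),
        "thumbnail_url": entry.get("thumbnail"),
        "uploader": entry.get("uploader"),
        "upload_date": entry.get("upload_date"),
        "file_size_bytes": entry.get("filesize") or entry.get("filesize_approx"),
        "ext": entry.get("ext"),
    }


playlist = {
    "_type": "playlist",
    "entries": [
        {"webpage_url": "https://example.com/v/1", "title": "One",
         "duration": 61.5, "ext": "mp4", "filesize_approx": 2048},
        None,
        {"webpage_url": "https://example.com/v/2", "title": "Two"},
    ],
}
items = [to_media_fields(entry) for entry in flatten_entries(playlist)]
print(len(items), items[0]["title"])
```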

async download(url, *, output_dir=None, progress_callback=None)

Download media via yt-dlp.

Parameters:
  • url (str) – The URL to download from.

  • output_dir (str | None) – Directory for downloaded files. Uses temp dir if not provided.

  • progress_callback (Callable[[DownloadProgress], None] | None) – Optional callback for progress updates.

Returns:

A list of DownloadResult objects.

Return type:

list[DownloadResult]
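
yt-dlp calls each function in `progress_hooks` with a dictionary carrying keys such as `status`, `downloaded_bytes`, `total_bytes` or `total_bytes_estimate`, `speed`, `eta`, and `filename`. The sketch below shows how such a dictionary might be translated into the documented DownloadProgress field names before reaching `progress_callback`. The function names and the mapping are assumptions for illustration, not the class's actual code.

```python
from __future__ import annotations

from typing import Any, Callable


def hook_to_progress_fields(hook: dict[str, Any]) -> dict[str, Any]:
    """Translate a yt-dlp progress hook dict into DownloadProgress-style fields."""
    return {
        "status": hook.get("status", "unknown"),
        "downloaded_bytes": hook.get("downloaded_bytes") or 0,
        # The exact total is often unavailable; fall back to the estimate.
        "total_bytes": hook.get("total_bytes") or hook.get("total_bytes_estimate"),
        "speed_bytes_per_sec": hook.get("speed"),
        "eta_seconds": hook.get("eta"),
        "filename": hook.get("filename"),
    }


def make_hook(callback: Callable[[dict[str, Any]], None]) -> Callable[[dict[str, Any]], None]:
    """Build a hook that forwards translated updates to a user callback."""
    def hook(data: dict[str, Any]) -> None:
        callback(hook_to_progress_fields(data))
    return hook


updates: list[dict[str, Any]] = []
hook = make_hook(updates.append)

# Simulated hook payloads; a real run would receive these from yt-dlp.
hook({"status": "downloading", "downloaded_bytes": 1000,
      "total_bytes_estimate": 4000, "speed": 250.0, "eta": 12,
      "filename": "clip.mp4"})
hook({"status": "finished", "downloaded_bytes": 4000,
      "total_bytes": 4000, "filename": "clip.mp4"})
print(updates[-1]["status"])
```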

gallery-dl deep integration for pyfetcher.downloaders.

Purpose:

Wrap gallery-dl’s job/config API for programmatic downloading with metadata capture and file interception.

class pyfetcher.downloaders.gallerydl.GalleryDlDownloader(*, extra_config=None)

Deep gallery-dl integration via its Python API.

Uses gallery-dl’s configuration system and job runner to download images and galleries, capturing per-file metadata.

Parameters:
  • extra_config (dict[str, Any] | None) – Additional gallery-dl configuration dict.

async extract_info(url)

Extract metadata for all downloadable items without downloading.

Parameters:
  • url (str) – Gallery or image URL.

Returns:

A list of MediaInfo objects.

Return type:

list[MediaInfo]

async download(url, *, output_dir=None, progress_callback=None)

Download all items from a URL.

Parameters:
  • url (str) – Gallery or image URL.

  • output_dir (str | None) – Directory for downloaded files. Uses temp dir if not provided.

  • progress_callback (Callable[[DownloadProgress], None] | None) – Optional callback for progress updates.

Returns:

A list of DownloadResult objects.

Return type:

list[DownloadResult]
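
The methods above are coroutines, while a job runner of the kind described is typically a blocking call. One common way to bridge the two is to run the blocking work in a worker thread. Whether GalleryDlDownloader does this is not stated here, so the sketch below is only an illustration of the pattern; `run_blocking_job` is a hypothetical stand-in and does not call gallery-dl.

```python
from __future__ import annotations

import asyncio
import time


def run_blocking_job(url: str) -> list[str]:
    """Hypothetical stand-in for a blocking job runner."""
    time.sleep(0.01)  # simulate network and disk work
    return [f"{url}#item-{index}" for index in range(3)]


async def download_async(url: str) -> list[str]:
    """Run the blocking job in a worker thread so the event loop stays free."""
    return await asyncio.to_thread(run_blocking_job, url)


async def main() -> list[list[str]]:
    # Two galleries are processed concurrently, each in its own thread.
    return await asyncio.gather(
        download_async("https://example.com/gallery/1"),
        download_async("https://example.com/gallery/2"),
    )


batches = asyncio.run(main())
print(batches[0])
```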

Direct HTTP download with MinIO upload for pyfetcher.downloaders.

Purpose:

Provide direct HTTP file downloads using pyfetcher’s existing fetch infrastructure, with optional streaming to MinIO.

class pyfetcher.downloaders.direct.DirectDownloader(*, fetch_service=None)

Direct HTTP downloader using pyfetcher’s FetchService.

Streams files to disk using the existing streaming infrastructure, then optionally uploads to MinIO.

Parameters:
  • fetch_service (FetchService | None) – Optional FetchService instance.

async extract_info(url)

Extract info via HEAD request.

Parameters:
  • url (str) – File URL.

Returns:

A list with one MediaInfo.

Return type:

list[MediaInfo]
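
A HEAD response usually carries enough to populate a few MediaInfo fields: `Content-Length` for the size, `Content-Type` for the MIME type, and the URL path for a filename and extension. The sketch below shows that derivation on a plain header mapping. `info_from_head` is a hypothetical helper, and using the filename as the title is an assumption, not confirmed behaviour.

```python
from __future__ import annotations

import posixpath
from typing import Any
from urllib.parse import urlsplit


def info_from_head(url: str, headers: dict[str, str]) -> dict[str, Any]:
    """Derive MediaInfo-style fields from a URL and HEAD response headers."""
    lowered = {name.lower(): value for name, value in headers.items()}

    length = lowered.get("content-length")
    content_type = lowered.get("content-type")
    # Drop parameters such as "; charset=utf-8" from the content type.
    mime_type = content_type.split(";", 1)[0].strip() if content_type else None

    # The query string is ignored; only the path contributes a filename.
    filename = posixpath.basename(urlsplit(url).path) or None
    ext = posixpath.splitext(filename)[1].lstrip(".") if filename else ""

    return {
        "url": url,
        "title": filename,  # assumption: the filename doubles as the title
        "file_size_bytes": int(length) if length and length.isdigit() else None,
        "mime_type": mime_type,
        "ext": ext or None,
    }


info = info_from_head(
    "https://example.com/files/report.pdf?dl=1",
    {"Content-Length": "1024", "Content-Type": "application/pdf; charset=binary"},
)
print(info)
```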

async download(url, *, output_dir=None, progress_callback=None)

Download a file via HTTP streaming.

Parameters:
  • url (str) – File URL.

  • output_dir (str | None) – Output directory. Uses temp dir if not provided.

  • progress_callback (object | None) – Not used for direct downloads.

Returns:

A list with one DownloadResult.

Return type:

list[DownloadResult]
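
DownloadResult includes `file_size_bytes` and `checksum_sha256`, both of which can be computed while a stream is written, without re-reading the file. The sketch below shows that pattern over an in-memory iterable of chunks, including the temp-directory fallback described for `output_dir`. `write_chunks` is a hypothetical helper; the real downloader streams through FetchService, which is not modelled here.

```python
from __future__ import annotations

import hashlib
import os
import tempfile
from typing import Any, Iterable


def write_chunks(
    chunks: Iterable[bytes],
    filename: str,
    output_dir: str | None = None,
) -> dict[str, Any]:
    """Write chunks to disk, tracking size and SHA-256 in a single pass."""
    # Fall back to a fresh temporary directory when none is given.
    target_dir = output_dir or tempfile.mkdtemp(prefix="download-")
    path = os.path.join(target_dir, filename)

    digest = hashlib.sha256()
    size = 0
    with open(path, "wb") as handle:
        for chunk in chunks:
            handle.write(chunk)
            digest.update(chunk)
            size += len(chunk)

    return {
        "local_path": path,
        "filename": filename,
        "file_size_bytes": size,
        "checksum_sha256": digest.hexdigest(),
    }


result = write_chunks([b"hello ", b"world"], "greeting.txt")
print(result["file_size_bytes"], result["checksum_sha256"][:12])
```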