haive.core.engine.document.loaders.specific.web

Web Loaders for Document Engine.

This module implements specialized sources for loading web content: GitHub repositories, ArXiv papers, Wikipedia articles, and general web pages (simple HTML or JavaScript-heavy).

Classes

ArXivSource

ArXiv research paper source.

BasicWebSource

Basic web source for simple HTML pages.

GitHubSource

GitHub repository and content source.

PlaywrightWebSource

Advanced web source using Playwright for JavaScript-heavy sites.

WikipediaSource

Wikipedia article source.

Module Contents

class haive.core.engine.document.loaders.specific.web.ArXivSource(query=None, paper_id=None, max_results=10, **kwargs)

Bases: haive.core.engine.document.loaders.sources.implementation.WebUrlSource

ArXiv research paper source.

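Initialize the ArXiv source.

Parameters:
  • query (str | None) – Search query to run against ArXiv. Defaults to None.

  • paper_id (str | None) – Identifier of a specific ArXiv paper to load. Defaults to None.

  • max_results (int) – Maximum number of results to return. Defaults to 10.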

can_handle(path)

Check if this is an ArXiv identifier or URL.

Parameters:

path (str)

Return type:

bool
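The check described above can be illustrated with a standalone sketch. This is not the ArXivSource implementation: the function name `looks_like_arxiv` and both patterns are assumptions, covering only new-style identifiers (such as 1706.03762, optionally with a version suffix) and arxiv.org abs/pdf URLs.

```python
import re

# Hypothetical patterns for illustration; the real ArXivSource may accept
# other forms (for example old-style identifiers such as hep-th/9901001).
NEW_STYLE_ID = re.compile(r"\d{4}\.\d{4,5}(v\d+)?")
ARXIV_URL = re.compile(r"https?://(www\.)?arxiv\.org/(abs|pdf)/\S+")


def looks_like_arxiv(path: str) -> bool:
    """Return True for a bare new-style arXiv ID or an arxiv.org abs/pdf URL."""
    path = path.strip()
    # Accept the common "arXiv:<id>" prefix by stripping it first.
    if path.lower().startswith("arxiv:"):
        path = path[len("arxiv:"):]
    return bool(NEW_STYLE_ID.fullmatch(path) or ARXIV_URL.fullmatch(path))
```

For example, `looks_like_arxiv("arXiv:1706.03762v5")` is accepted, while a URL on another host is rejected.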

create_loader()

Create an ArXiv loader.

Return type:

langchain_core.document_loaders.base.BaseLoader | None

get_confidence_score(path)

Get confidence score for ArXiv sources.

Parameters:

path (str)

Return type:

float

class haive.core.engine.document.loaders.specific.web.BasicWebSource(web_paths, requests_kwargs=None, **kwargs)

Bases: haive.core.engine.document.loaders.sources.implementation.WebUrlSource

Basic web source for simple HTML pages.

Initialize the basic web source.

Parameters:
  • web_paths (list[str]) – URLs of the web pages to load.

  • requests_kwargs (dict[str, Any] | None) – Optional keyword arguments for the underlying HTTP requests. Defaults to None.

can_handle(path)

Check if this is a web URL.

Parameters:

path (str)

Return type:

bool
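As a rough illustration of what a web URL check can look like, the sketch below uses only the standard library. The helper name `is_web_url` is hypothetical, and the rule (an http or https scheme plus a host) is an assumption rather than the documented behavior of BasicWebSource.

```python
from urllib.parse import urlparse


def is_web_url(path: str) -> bool:
    """Return True when path has an http(s) scheme and a non-empty host."""
    parsed = urlparse(path.strip())
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)
```

Local paths and non-HTTP schemes such as ftp fall through to False under this rule.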

create_loader()

Create a basic web loader.

Return type:

langchain_core.document_loaders.base.BaseLoader | None

get_confidence_score(path)

Get confidence score for web URLs.

Parameters:

path (str)

Return type:

float

class haive.core.engine.document.loaders.specific.web.GitHubSource(repo_url, file_filter=None, include_issues=False, include_pull_requests=False, **kwargs)

Bases: haive.core.engine.document.loaders.sources.implementation.WebUrlSource

GitHub repository and content source.

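Initialize the GitHub source.

Parameters:
  • repo_url (str) – URL of the GitHub repository.

  • file_filter (list[str] | None) – Optional list of strings used to select which files are loaded. Defaults to None.

  • include_issues (bool) – Whether to include repository issues. Defaults to False.

  • include_pull_requests (bool) – Whether to include pull requests. Defaults to False.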

can_handle(path)

Check if this is a GitHub URL.

Parameters:

path (str)

Return type:

bool
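One way to recognize a GitHub URL is to parse it and pull out the owner and repository name. The sketch below is illustrative only: `parse_github_repo` is a hypothetical helper, and the accepted hosts and path layout are assumptions, not the rules GitHubSource applies.

```python
from typing import Optional, Tuple
from urllib.parse import urlparse


def parse_github_repo(url: str) -> Optional[Tuple[str, str]]:
    """Return (owner, repo) for a github.com URL, or None if it is not one."""
    parsed = urlparse(url.strip())
    if parsed.scheme not in ("http", "https"):
        return None
    if parsed.netloc.lower() not in ("github.com", "www.github.com"):
        return None
    # Drop empty segments produced by leading or trailing slashes.
    parts = [p for p in parsed.path.split("/") if p]
    if len(parts) < 2:
        return None
    owner, repo = parts[0], parts[1]
    # Clone URLs often end in ".git"; strip it to get the bare repo name.
    if repo.endswith(".git"):
        repo = repo[: -len(".git")]
    return owner, repo
```

A check in the spirit of can_handle could then be written as `parse_github_repo(path) is not None`.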

create_loader()

Create a GitHub loader.

Return type:

langchain_core.document_loaders.base.BaseLoader | None

get_confidence_score(path)

Get confidence score for GitHub URLs.

Parameters:

path (str)

Return type:

float

get_credential_requirements()

Return the credential types this source needs. GitHub needs an API token.

Return type:

list[haive.core.engine.document.loaders.sources.implementation.CredentialType]

requires_authentication()

Report whether authentication is required. GitHub may require authentication for private repositories.

Return type:

bool
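To make the token requirement concrete, the sketch below builds request headers for the GitHub REST API from an optional token. It is independent of haive: the helper name `github_auth_headers` and the `GITHUB_TOKEN` environment variable are assumptions, and how GitHubSource actually obtains its credentials is not shown in this reference.

```python
import os
from typing import Dict, Mapping, Optional


def github_auth_headers(env: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Build GitHub REST API headers, adding a bearer token when one is set."""
    # GITHUB_TOKEN is an assumed variable name, not one documented by haive.
    env = os.environ if env is None else env
    headers = {"Accept": "application/vnd.github+json"}
    token = env.get("GITHUB_TOKEN", "").strip()
    if token:
        headers["Authorization"] = f"Bearer {token}"
    return headers
```

Without a token the headers still work for public repositories; private repositories need the Authorization header.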

class haive.core.engine.document.loaders.specific.web.PlaywrightWebSource(urls, wait_until='networkidle', headless=True, **kwargs)

Bases: haive.core.engine.document.loaders.sources.implementation.WebUrlSource

Advanced web source using Playwright for JavaScript-heavy sites.

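Initialize the Playwright web source.

Parameters:
  • urls (list[str]) – URLs of the pages to load.

  • wait_until (str) – Page-load condition to wait for before reading content. Defaults to 'networkidle'.

  • headless (bool) – Whether to run the browser in headless mode. Defaults to True.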

can_handle(path)

Check if this is a web URL suitable for Playwright.

Parameters:

path (str)

Return type:

bool

create_loader()

Create a Playwright web loader.

Return type:

langchain_core.document_loaders.base.BaseLoader | None

get_confidence_score(path)

Get confidence score for web URLs (lower priority than basic web).

Parameters:

path (str)

Return type:

float
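The role of confidence scores as priorities can be sketched as follows, assuming a caller simply picks the source with the highest score for a path. The scorer functions, the numeric scores, and `pick_source` are all invented for illustration; the real scores and selection logic are not given in this reference.

```python
from typing import Callable, Dict, Optional


def _is_http(path: str) -> bool:
    return path.startswith(("http://", "https://"))


# Hypothetical stand-ins for each source's get_confidence_score; the numbers
# are made up, chosen only so that Playwright ranks below basic web.
SCORERS: Dict[str, Callable[[str], float]] = {
    "basic_web": lambda p: 0.5 if _is_http(p) else 0.0,
    "playwright_web": lambda p: 0.3 if _is_http(p) else 0.0,
    "github": lambda p: 0.9 if _is_http(p) and "github.com/" in p else 0.0,
}


def pick_source(
    path: str, scorers: Dict[str, Callable[[str], float]] = SCORERS
) -> Optional[str]:
    """Return the name of the highest-scoring source, or None if all score 0."""
    best_name: Optional[str] = None
    best_score = 0.0
    for name, scorer in scorers.items():
        score = scorer(path)
        if score > best_score:
            best_name, best_score = name, score
    return best_name
```

Under these assumed scores a generic URL resolves to the basic web source, and the Playwright source is chosen only when requested explicitly or when higher-priority sources are absent.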

class haive.core.engine.document.loaders.specific.web.WikipediaSource(query=None, page_title=None, lang='en', load_max_docs=1, **kwargs)

Bases: haive.core.engine.document.loaders.sources.implementation.WebUrlSource

Wikipedia article source.

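Initialize the Wikipedia source.

Parameters:
  • query (str | None) – Search query for Wikipedia articles. Defaults to None.

  • page_title (str | None) – Title of a specific Wikipedia page to load. Defaults to None.

  • lang (str) – Wikipedia language code. Defaults to 'en'.

  • load_max_docs (int) – Maximum number of documents to load. Defaults to 1.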

can_handle(path)

Check if this is a Wikipedia URL or identifier.

Parameters:

path (str)

Return type:

bool
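A Wikipedia URL carries both a language code (the subdomain) and a page title (the path), which map naturally onto the lang and page_title parameters above. The sketch below shows one way to extract them; `parse_wikipedia_url` is a hypothetical helper and the accepted URL shape is an assumption, not the check WikipediaSource performs.

```python
from typing import Optional, Tuple
from urllib.parse import unquote, urlparse


def parse_wikipedia_url(url: str) -> Optional[Tuple[str, str]]:
    """Return (lang, title) for a <lang>.wikipedia.org/wiki/<title> URL."""
    parsed = urlparse(url.strip())
    host = parsed.netloc.lower()
    if not host.endswith(".wikipedia.org"):
        return None
    prefix = "/wiki/"
    if not parsed.path.startswith(prefix):
        return None
    # The first host label is the language code, e.g. "en" or "fr".
    lang = host.split(".")[0]
    # Titles are percent-encoded and use underscores in place of spaces.
    title = unquote(parsed.path[len(prefix):]).replace("_", " ")
    if not title:
        return None
    return lang, title
```

For example, the URL for the English article on Alan Turing yields the language code "en" and the title "Alan Turing".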

create_loader()

Create a Wikipedia loader.

Return type:

langchain_core.document_loaders.base.BaseLoader | None

get_confidence_score(path)

Get confidence score for Wikipedia sources.

Parameters:

path (str)

Return type:

float