haive.core.engine.document.loaders.auto_loader¶
Ultimate Auto-Loader for Document Sources.
This module provides the ultimate auto-loader functionality that can automatically detect, instantiate, and load documents from any source type. It integrates with the enhanced registry and path analyzer to provide seamless document loading.
The AutoLoader is the main entry point for users who want to load documents without manually configuring source types and loaders.
Examples
Basic auto-loading:
from haive.core.engine.document.loaders import AutoLoader
# Auto-detect and load from any source
loader = AutoLoader()
documents = loader.load("https://example.com/docs")
With preferences:
# Prefer quality over speed
loader = AutoLoader(preference="quality")
documents = loader.load("s3://bucket/documents/")
Bulk loading:
# Load entire directory/bucket/site
loader = AutoLoader()
documents = loader.load_all("/path/to/documents")
Author: Claude (Haive Document Loader System)
Version: 1.0.0
Classes¶
AutoLoader: Ultimate automatic document loader with 230+ langchain_community integrations.
AutoLoaderConfig: Configuration model for the AutoLoader system.
BulkLoadingResult: Comprehensive result container for bulk document loading operations.
LoadingResult: Comprehensive result container for single-source document loading operations.
Functions¶
aload_document: Convenience function to load documents asynchronously.
load_document: Convenience function to load documents automatically.
load_documents_bulk: Convenience function to load multiple documents.
Module Contents¶
- class haive.core.engine.document.loaders.auto_loader.AutoLoader(config=None, registry=None, path_analyzer=None)[source]¶
Ultimate automatic document loader with 230+ langchain_community integrations.
The AutoLoader is the primary interface for loading documents from any source type. It automatically detects source types, selects optimal loaders, and provides comprehensive loading capabilities with enterprise-grade features.
This class implements the complete document loading pipeline including: source detection, loader selection, document loading, metadata enrichment, error handling, retry logic, caching, and concurrent processing.
- config¶
Configuration controlling loader behavior.
- Type:
AutoLoaderConfig
- registry¶
Registry of available document loaders.
- Type:
EnhancedRegistry
- path_analyzer¶
Component for analyzing and detecting source types.
- Type:
PathAnalyzer
- Supported Sources:
Local Files: PDF, DOCX, TXT, CSV, JSON, XML, code files, archives
Web Sources: HTML pages, APIs, documentation sites, social media
Databases: PostgreSQL, MySQL, MongoDB, Redis, Elasticsearch
Cloud Storage: S3, Google Cloud, Azure Blob, Google Drive, Dropbox
Business Platforms: Salesforce, HubSpot, Zendesk, Jira, Confluence
Communication: Slack, Discord, Teams, email systems
Specialized: Government data, healthcare, finance, education
- Key Methods:
load(): Load documents from a single source
load_documents(): Load from multiple sources (standard langchain method)
load_bulk(): Bulk loading with detailed results
load_all(): Recursive loading from directories/websites
aload(): Async loading for high-performance scenarios
Examples
Basic document loading:
loader = AutoLoader()
docs = loader.load("document.pdf")  # Single document
docs = loader.load_documents(["file1.pdf", "file2.txt"])  # Multiple
Advanced configuration:
config = AutoLoaderConfig(
    preference=LoaderPreference.QUALITY,
    max_concurrency=20,
    enable_caching=True,
    enable_metadata=True,
)
loader = AutoLoader(config)
docs = loader.load("https://complex-site.com")
Enterprise bulk loading:
sources = [
    "/shared/reports/quarterly.pdf",
    "s3://company-docs/policies/",
    "https://wiki.company.com/procedures",
    {"path": "salesforce://attachments", "auth": "token"},
]
result = loader.load_bulk(sources)
print(f"Loaded {result.total_documents} documents")
High-performance async loading:
async def process_sources():
    docs = await loader.aload_documents([
        "https://api.service.com/docs",
        "postgres://db/knowledge_base",
        "gs://bucket/research-papers/",
    ])
    return docs
Recursive directory processing:
# Load all documents from directory tree
docs = loader.load_all("/company/documents/")

# Scrape entire documentation site
docs = loader.load_all("https://docs.framework.com", max_depth=3)
- Performance Features:
Concurrent loading with configurable worker limits
Intelligent caching with TTL support
Adaptive retry logic with exponential backoff
Progress tracking for bulk operations
Memory-efficient streaming for large datasets
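The adaptive retry behavior listed above can be sketched with a small stand-alone helper. load_with_retry and flaky_loader below are illustrative assumptions, not the loader's actual implementation:

```python
import time

def load_with_retry(load_fn, source, attempts=3, base_delay=0.01):
    """Call load_fn(source), retrying with exponential backoff (hypothetical sketch)."""
    last_error = None
    for attempt in range(attempts):
        try:
            return load_fn(source)
        except Exception as exc:  # real code would catch narrower error types
            last_error = exc
            if attempt < attempts - 1:
                # delay doubles each attempt: 0.01s, 0.02s, 0.04s, ...
                time.sleep(base_delay * (2 ** attempt))
    raise last_error

calls = {"n": 0}

def flaky_loader(source):
    """Stub loader that fails twice, then succeeds."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient failure")
    return [f"document from {source}"]

docs = load_with_retry(flaky_loader, "https://example.com")
```

Real retry logic would typically also add jitter and distinguish retryable from fatal errors.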
- Error Handling:
Graceful degradation for unsupported sources
Detailed error reporting with source tracking
Automatic fallback to alternative loaders
Comprehensive logging for debugging
- Thread Safety:
This class is thread-safe and can be used safely in concurrent environments. Internal state is properly synchronized.
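Because the class is documented as thread-safe, a single instance can be shared across worker threads. A minimal sketch with concurrent.futures, using a stub load function in place of a real AutoLoader:

```python
from concurrent.futures import ThreadPoolExecutor

def load(source):
    # Stand-in for a shared AutoLoader().load(source) call.
    return [f"doc from {source}"]

sources = ["a.pdf", "b.txt", "c.html"]

# pool.map preserves input order while loading sources concurrently
with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(load, sources))

# Flatten per-source lists into one document list
all_docs = [doc for docs in results for doc in docs]
```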
See also
AutoLoaderConfig: Configuration options
LoadingResult: Detailed loading results
BulkLoadingResult: Bulk operation results
LoaderPreference: Quality vs speed preferences
Initialize the AutoLoader with optional configuration and components.
Creates a new AutoLoader instance with the specified configuration. If no configuration is provided, uses sensible defaults optimized for balanced performance and quality.
- Parameters:
config (Optional[AutoLoaderConfig]) – Configuration object controlling loader behavior including concurrency, preferences, caching, and retry settings. If None, uses default balanced configuration.
registry (Optional[Any]) – Custom enhanced registry instance containing document loader mappings. If None, uses the global enhanced registry with all 230+ registered loaders.
path_analyzer (Optional[PathAnalyzer]) – Custom path analyzer for source type detection. If None, uses the default PathAnalyzer instance.
Examples
Default initialization:
loader = AutoLoader() # Uses balanced defaults
Custom configuration:
config = AutoLoaderConfig(
    preference=LoaderPreference.QUALITY,
    max_concurrency=5,
    enable_caching=True,
)
loader = AutoLoader(config)
Advanced with custom components:
custom_registry = MyCustomRegistry()
custom_analyzer = MyPathAnalyzer()
loader = AutoLoader(
    config=my_config,
    registry=custom_registry,
    path_analyzer=custom_analyzer,
)
Note
The AutoLoader automatically triggers source registration on first use. This process scans for available loaders and may take a few seconds on initial startup.
- async aload(path_or_url, **kwargs)[source]¶
Asynchronously load documents from any source.
- Parameters:
path_or_url (str) – Path, URL, or connection string to load from
**kwargs – Additional parameters passed to the source and loader
- Returns:
List of loaded Document objects
- Return type:
list[langchain_core.documents.Document]
Examples
Async document loading:
async def load_docs():
    documents = await loader.aload("https://example.com")
    return documents

documents = asyncio.run(load_docs())
- async aload_bulk(sources, **kwargs)[source]¶
Asynchronously load documents from multiple sources.
- Parameters:
sources (list[str | dict[str, Any]]) – List of sources to load from
**kwargs – Additional parameters passed to each source and loader
- Returns:
BulkLoadingResult with aggregated results
- Return type:
BulkLoadingResult
- async aload_documents(sources, **kwargs)[source]¶
Asynchronously load documents from multiple sources (standard langchain plural method name).
This is the async version of load_documents() that takes a list of sources and returns a flattened list of all documents.
- Parameters:
sources (list[str | dict[str, Any]]) – List of sources to load from
**kwargs – Additional parameters passed to each source and loader
- Returns:
Flattened list of Document objects from all sources
- Return type:
list[langchain_core.documents.Document]
Examples
Async load from multiple sources:
loader = AutoLoader()
docs = await loader.aload_documents([
    "document1.pdf",
    "document2.txt",
    "https://example.com",
])
- create_source_instance(source_info, path_or_url, **kwargs)[source]¶
Create a source instance for the detected source type.
- Parameters:
source_info (haive.core.engine.document.loaders.path_analyzer.SourceInfo) – Source information from detection
path_or_url (str) – Original path or URL
**kwargs – Additional parameters for source creation
- Returns:
Configured source instance
- Raises:
ValueError – If source cannot be created
- Return type:
haive.core.engine.document.loaders.sources.source_types.BaseSource
Examples
Create and configure source:
info = loader.detect_source("s3://bucket/file.pdf")
source = loader.create_source_instance(
    info,
    "s3://bucket/file.pdf",
    aws_access_key_id="key",
    aws_secret_access_key="secret",
)
- detect_source(path_or_url)[source]¶
Detect source type and get source information.
- Parameters:
path_or_url (str) – Path, URL, or connection string to analyze
- Returns:
SourceInfo containing detected source details
- Raises:
ValueError – If source type cannot be detected
- Return type:
haive.core.engine.document.loaders.path_analyzer.SourceInfo
Examples
Detect file source:
info = loader.detect_source("/path/to/document.pdf")
print(f"Source type: {info.source_type}")
print(f"Category: {info.category}")
Detect web source:
info = loader.detect_source("https://example.com")
print(f"Capabilities: {info.capabilities}")
- get_best_loader(source_info)[source]¶
Get the best loader for a source based on preferences.
- Parameters:
source_info (haive.core.engine.document.loaders.path_analyzer.SourceInfo) – Source information from detection
- Returns:
Tuple of (loader_name, loader_config)
- Raises:
ValueError – If no suitable loader is found
- Return type:
tuple[str, dict[str, Any]]
Examples
Get quality-focused loader:
config = AutoLoaderConfig(preference="quality")
loader = AutoLoader(config)
info = loader.detect_source("document.pdf")
loader_name, loader_config = loader.get_best_loader(info)
- get_capabilities(source_type)[source]¶
Get capabilities for a specific source type.
- Parameters:
source_type (str) – Name of the source type
- Returns:
List of capabilities supported by the source
- Return type:
list[haive.core.engine.document.loaders.sources.source_types.LoaderCapability]
Examples
Check source capabilities:
caps = loader.get_capabilities("pdf")
if LoaderCapability.BULK_LOADING in caps:
    print("Supports bulk loading")
- get_supported_sources()[source]¶
Get information about all supported source types.
Examples
List all supported sources:
sources = loader.get_supported_sources()
for source_type, info in sources.items():
    print(f"{source_type}: {info['description']}")
- load(path_or_url, **kwargs)[source]¶
Load documents from any source with automatic detection and optimization.
This is the primary interface for single-source document loading. The method performs automatic source type detection, intelligent loader selection based on configured preferences, and returns a list of loaded Document objects.
The loading process includes:
1. Source type detection and analysis
2. Best loader selection based on preference and capabilities
3. Source instance creation with provided parameters
4. Document loading with retry logic and error handling
5. Optional metadata enrichment and caching
- Parameters:
path_or_url (str) – Path, URL, or connection string to load from. Supports local files, web URLs, database connections, cloud storage URIs, and API endpoints. Examples:
- "/path/to/file.pdf" (local file)
- "https://example.com/doc.html" (web page)
- "postgresql://user:pass@host/db" (database)
- "s3://bucket/key" (cloud storage)
**kwargs – Additional parameters passed to the source and loader. Common parameters include:
- extract_images (bool): Whether to extract images from documents
- chunk_size (int): Text splitting chunk size
- timeout (int): Override default timeout
- headers (dict): HTTP headers for web requests
- query (str): SQL query for database sources
- recursive (bool): Recursive processing for directories
- Returns:
- List of loaded Document objects. Each Document contains:
page_content (str): Extracted text content
metadata (dict): Source metadata, extraction info, and enrichments
- Return type:
List[Document]
- Raises:
ValueError – If the source cannot be detected, is unsupported, or if required parameters are missing for the detected source type.
TimeoutError – If loading exceeds the configured timeout limit.
ConnectionError – If unable to connect to remote sources (web, database, API).
FileNotFoundError – If local files or directories do not exist.
PermissionError – If insufficient permissions to access the source.
Examples
Basic local file loading:
loader = AutoLoader()
docs = loader.load("/documents/report.pdf")
print(f"Loaded {len(docs)} pages")
Web page with custom parameters:
docs = loader.load(
    "https://docs.example.com/api",
    headers={"Authorization": "Bearer token"},
    timeout=120,
)
Database with custom query:
docs = loader.load(
    "postgresql://user:pass@localhost:5432/knowledge",
    query="SELECT title, content FROM articles WHERE published = true",
    chunk_size=2000,
)
Cloud storage with credentials:
docs = loader.load(
    "s3://company-docs/policies/security.pdf",
    aws_access_key_id="AKIA...",
    aws_secret_access_key="secret",
    region_name="us-east-1",
)
High-quality extraction:
config = AutoLoaderConfig(preference=LoaderPreference.QUALITY)
loader = AutoLoader(config)
docs = loader.load("complex_document.pdf", extract_images=True)
Note
Results are automatically cached if caching is enabled in configuration
Metadata enrichment adds source tracking information when enabled
The method is thread-safe and can be called concurrently
For multiple sources, consider using load_documents() or load_bulk()
See also
load_documents(): Load from multiple sources (standard langchain method)
load_bulk(): Bulk loading with detailed result information
load_all(): Recursive loading from directories or websites
aload(): Asynchronous version for high-performance scenarios
- load_all(path_or_url, **kwargs)[source]¶
Load all documents from a source recursively.
This method uses the “scrape_all” capability of sources to load all available documents from directories, websites, databases, etc.
- Parameters:
path_or_url (str) – Path, URL, or connection string to load from
**kwargs – Additional parameters for recursive loading
- Returns:
List of all documents found in the source
- Return type:
list[langchain_core.documents.Document]
Examples
Load entire directory:
documents = loader.load_all("/path/to/documents/")
Scrape entire website:
documents = loader.load_all("https://docs.example.com")
Load all tables from database:
documents = loader.load_all("postgresql://user:pass@host/db")
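For a local directory, the recursive traversal load_all performs can be approximated with os.walk. collect_files and the extension filter below are illustrative assumptions, not the loader's real discovery logic:

```python
import os
import tempfile

def collect_files(root, extensions=(".pdf", ".txt", ".md")):
    """Walk a directory tree and collect candidate files (hypothetical sketch)."""
    paths = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.lower().endswith(extensions):
                paths.append(os.path.join(dirpath, name))
    return sorted(paths)

# Demo on a throwaway directory tree
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "sub"))
for name in ("a.txt", os.path.join("sub", "b.pdf"), os.path.join("sub", "skip.bin")):
    open(os.path.join(root, name), "w").close()

found = collect_files(root)  # picks up a.txt and sub/b.pdf, skips skip.bin
```

Each collected path would then be fed through the normal detect-and-load pipeline.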
- load_bulk(sources, **kwargs)[source]¶
Load documents from multiple sources concurrently.
- Parameters:
sources (list[str | dict[str, Any]]) – List of sources to load from
**kwargs – Additional parameters passed to each source and loader
- Returns:
BulkLoadingResult with aggregated results
- Return type:
BulkLoadingResult
Examples
Bulk load multiple sources:
sources = [
    "file1.pdf",
    "file2.docx",
    {"path": "https://example.com", "timeout": 60},
]
result = loader.load_bulk(sources)
print(f"Total documents: {result.total_documents}")
With progress tracking:
def progress_callback(completed, total):
    print(f"Progress: {completed}/{total}")

result = loader.load_bulk(sources, progress_callback=progress_callback)
- load_detailed(path_or_url, **kwargs)[source]¶
Load documents with detailed result information.
- Parameters:
path_or_url (str) – Path, URL, or connection string to load from
**kwargs – Additional parameters passed to the source and loader
- Returns:
LoadingResult with documents and detailed metadata
- Return type:
LoadingResult
Examples
Get detailed loading information:
result = loader.load_detailed("/path/to/document.pdf")
print(f"Loaded {len(result.documents)} documents")
print(f"Using loader: {result.loader_used}")
print(f"Loading time: {result.loading_time:.2f}s")
print(f"Source type: {result.source_info.source_type}")
- load_documents(sources, **kwargs)[source]¶
Load documents from multiple sources with standard langchain interface.
This method implements the standard langchain convention for loading documents from multiple sources. It processes all sources concurrently, handles errors gracefully, and returns a flattened list of all successfully loaded documents.
This is the recommended method for loading from multiple sources as it follows langchain conventions and provides seamless integration with existing langchain workflows and chains.
- Parameters:
sources (List[Union[str, Dict[str, Any]]]) – List of sources to load from. Each source can be either:
- str: Simple path, URL, or connection string
- Dict[str, Any]: Configuration dict with source-specific parameters. Must contain either 'path' or 'url' key, plus optional parameters.
**kwargs – Default parameters applied to ALL sources and loaders. These are overridden by source-specific parameters in dict sources. Common parameters:
- max_workers (int): Override concurrency for this operation
- timeout (int): Timeout per source
- extract_images (bool): Extract images from documents
- chunk_size (int): Text splitting chunk size
- Returns:
Flattened list of Document objects from all successful source loads. Failed sources are silently skipped. Each Document contains page_content and metadata with source tracking information.
- Return type:
List[Document]
Examples
Basic multi-source loading:
loader = AutoLoader()
docs = loader.load_documents([
    "/reports/quarterly.pdf",
    "/docs/manual.docx",
    "https://company.com/policies.html",
])
print(f"Loaded {len(docs)} total documents")
Mixed source types with configurations:
docs = loader.load_documents([
    # Simple string sources
    "local_file.pdf",
    "https://simple-site.com",
    # Complex configured sources
    {
        "path": "complex_document.pdf",
        "extract_images": True,
        "chunk_size": 2000,
    },
    {
        "url": "https://api.service.com/docs",
        "headers": {"Authorization": "Bearer token"},
        "timeout": 120,
    },
    {
        "path": "s3://bucket/document.pdf",
        "aws_access_key_id": "key",
        "aws_secret_access_key": "secret",
    },
])
Enterprise data aggregation:
enterprise_sources = [
    "/shared/reports/2024/",  # Directory
    "https://wiki.company.com/procedures",
    "postgresql://db/knowledge_base",
    "salesforce://contracts",
    "sharepoint://policies/",
]
docs = loader.load_documents(enterprise_sources)
With global parameters:
docs = loader.load_documents(
    ["doc1.pdf", "doc2.pdf", "doc3.pdf"],
    extract_images=True,  # Applied to all sources
    chunk_size=1500,      # Applied to all sources
    max_workers=10,       # Override concurrency
)
- Performance:
Sources are processed concurrently based on max_concurrency setting
Failed sources don’t stop processing of other sources
Results are automatically cached if caching is enabled
Memory usage is optimized through document streaming
- Error Handling:
Individual source failures are logged but don’t stop processing
Failed sources are excluded from results
Use load_bulk() for detailed error information per source
Network timeouts and connection errors are handled gracefully
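The skip-and-continue behavior described above can be sketched as follows. load_documents_sketch and fake_load are hypothetical stand-ins for the real method and loaders:

```python
def load_documents_sketch(sources, load_fn):
    """Flatten documents from all sources; record failures instead of raising (sketch)."""
    documents, failed = [], []
    for source in sources:
        try:
            documents.extend(load_fn(source))
        except Exception as exc:
            # In the real loader this is logged; processing continues.
            failed.append((source, str(exc)))
    return documents, failed

def fake_load(source):
    """Stub loader that rejects one source type."""
    if source.endswith(".bad"):
        raise ValueError("unsupported source")
    return [f"doc:{source}"]

docs, failed = load_documents_sketch(["a.pdf", "x.bad", "b.txt"], fake_load)
```

load_documents() returns only the flattened docs list; the per-source failure detail shown here is what load_bulk() surfaces via BulkLoadingResult.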
- Langchain Compatibility:
This method follows the standard langchain DocumentLoader interface:
- Method name: load_documents() (plural)
- Return type: List[Document]
- Behavior: Load from multiple sources, return flattened results
- Integration: Works seamlessly with langchain chains and workflows
See also
load(): Load from a single source
load_bulk(): Get detailed results and error information
load_all(): Recursive loading from directories/websites
aload_documents(): Async version for high-performance scenarios
Note
For detailed loading results including error information and per-source statistics, use load_bulk() instead.
- validate_credentials(source_type, **credentials)[source]¶
Validate credentials for a source type.
- Parameters:
source_type (str) – Name of the source type
**credentials – Credential parameters to validate
- Returns:
True if credentials are valid
- Return type:
bool
Examples
Validate database credentials:
valid = loader.validate_credentials(
    "postgresql",
    host="localhost",
    username="user",
    password="pass",
)
- class haive.core.engine.document.loaders.auto_loader.AutoLoaderConfig(/, **data)[source]¶
Bases: pydantic.BaseModel
Configuration model for the AutoLoader system.
This class defines all configuration options for the AutoLoader, allowing fine-tuned control over loading behavior, performance characteristics, and operational parameters.
- preference¶
Loading preference balancing speed vs quality. Options: SPEED, QUALITY, BALANCED. Default: BALANCED.
- Type:
LoaderPreference
- max_concurrency¶
Maximum number of concurrent loading operations. Range: 1-100. Default: 10.
- Type:
int
- default_chunk_size¶
Default chunk size for text splitting. Range: 100-10000. Default: 1000.
- Type:
int
- credential_manager¶
Custom credential manager instance. Default: None.
- Type:
Optional[Any]
Examples
Basic quality-focused configuration:
config = AutoLoaderConfig(
    preference=LoaderPreference.QUALITY,
    max_concurrency=5,
    timeout=600,
    enable_metadata=True,
)
High-performance configuration with caching:
config = AutoLoaderConfig(
    preference=LoaderPreference.SPEED,
    max_concurrency=50,
    enable_caching=True,
    cache_ttl=7200,
    retry_attempts=1,
)
Balanced configuration for production:
config = AutoLoaderConfig(
    preference=LoaderPreference.BALANCED,
    max_concurrency=20,
    timeout=300,
    enable_caching=True,
    enable_metadata=True,
)
- Raises:
ValidationError – If any configuration values are outside valid ranges.
- Parameters:
data (Any)
Note
Higher concurrency improves performance but increases resource usage. Enable caching for repeated document access patterns. Quality preference may be slower but provides better text extraction.
Create a new model by parsing and validating input data from keyword arguments.
Raises [ValidationError][pydantic_core.ValidationError] if the input data cannot be validated to form a valid model.
self is explicitly positional-only to allow self as a field name.
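The documented range checks (max_concurrency 1-100, default_chunk_size 100-10000) can be illustrated with a plain dataclass. LoaderConfig below is a hypothetical stand-in, not the real pydantic model, and raises ValueError where pydantic would raise ValidationError:

```python
from dataclasses import dataclass

@dataclass
class LoaderConfig:
    """Minimal stand-in mirroring AutoLoaderConfig's documented ranges."""
    preference: str = "balanced"
    max_concurrency: int = 10
    default_chunk_size: int = 1000

    def __post_init__(self):
        # Enforce the documented ranges at construction time
        if not 1 <= self.max_concurrency <= 100:
            raise ValueError("max_concurrency must be in 1-100")
        if not 100 <= self.default_chunk_size <= 10000:
            raise ValueError("default_chunk_size must be in 100-10000")

config = LoaderConfig(max_concurrency=20)  # valid
```

Passing an out-of-range value, e.g. LoaderConfig(max_concurrency=0), fails at construction rather than at load time.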
- class haive.core.engine.document.loaders.auto_loader.BulkLoadingResult(/, **data)[source]¶
Bases: pydantic.BaseModel
Comprehensive result container for bulk document loading operations.
This Pydantic model provides detailed information about bulk loading operations, including individual source results, aggregate statistics, error tracking, and performance metrics across all sources.
- Parameters:
data (Any)
- total_documents¶
Total number of documents successfully loaded across all sources. Sum of documents from all successful LoadingResults.
- Type:
int
- results¶
List of individual LoadingResult objects, one for each source that was processed (both successful and failed). Provides detailed per-source information including errors.
- Type:
List[LoadingResult]
- failed_sources¶
List of tuples containing (source_identifier, error_message) for sources that failed to load. Allows easy identification of problematic sources.
- Type:
List[Tuple[str, str]]
- total_time¶
Total elapsed time for the entire bulk operation in seconds, including all concurrent processing and overhead.
- Type:
float
- summary¶
Dictionary containing aggregate statistics:
- total_sources (int): Number of sources processed
- successful_loads (int): Number of sources loaded successfully
- failed_loads (int): Number of sources that failed
- success_rate (float): Percentage of successful loads
- avg_loading_time (float): Average time per source
- total_errors (int): Total number of errors encountered
- Type:
Dict[str, Any]
Examples
Analyzing bulk loading results:
sources = ["doc1.pdf", "doc2.pdf", "invalid.pdf"]
result = loader.load_bulk(sources)

print(f"Loaded {result.total_documents} documents")
print(f"Success rate: {result.summary['success_rate']:.1f}%")
print(f"Total time: {result.total_time:.2f}s")

if result.failed_sources:
    print("Failed sources:")
    for source, error in result.failed_sources:
        print(f"  {source}: {error}")
Processing individual results:
for i, loading_result in enumerate(result.results):
    source = sources[i]
    if loading_result.errors:
        print(f"{source} failed: {loading_result.errors}")
    else:
        docs = len(loading_result.documents)
        time = loading_result.loading_time
        print(f"{source}: {docs} docs in {time:.2f}s")
Performance analysis:
print("Performance Summary:")
print(f"  Total sources: {result.summary['total_sources']}")
print(f"  Average time per source: {result.summary['avg_loading_time']:.2f}s")
print(f"  Concurrent efficiency: {result.summary['total_sources'] * result.summary['avg_loading_time'] / result.total_time:.1f}x")
Note
This class is returned by load_bulk() and aload_bulk() methods. For simple flattened document lists, use load_documents() instead.
- class haive.core.engine.document.loaders.auto_loader.LoadingResult(/, **data)[source]¶
Bases: pydantic.BaseModel
Comprehensive result container for single-source document loading operations.
This Pydantic model encapsulates all information about a document loading operation, including the loaded documents, source analysis results, performance metrics, and any errors encountered during the process.
- Parameters:
data (Any)
- documents¶
List of successfully loaded Document objects. Each Document contains page_content (str) and metadata (dict). Empty list if loading failed.
- Type:
List[Document]
- source_info¶
Detailed information about the detected source including source type, category, confidence score, and capabilities.
- Type:
SourceInfo
- loader_used¶
Name of the specific loader that was selected and used for this operation (e.g., “pypdf”, “beautiful_soup”, “csv”).
- Type:
str
- loading_time¶
Total time taken for the loading operation in seconds, including source detection, loader instantiation, and document extraction.
- Type:
float
- metadata¶
Additional metadata collected during loading including loader configuration, extraction settings, and performance info.
- Type:
Dict[str, Any]
- errors¶
List of error messages encountered during loading. Empty list indicates successful loading without errors.
- Type:
List[str]
Examples
Successful loading result:
result = loader.load_detailed("document.pdf")
print(f"Loaded {len(result.documents)} documents")
print(f"Source: {result.source_info.source_type}")
print(f"Loader: {result.loader_used}")
print(f"Time: {result.loading_time:.2f}s")
Error handling:
result = loader.load_detailed("invalid.pdf")
if result.errors:
    print(f"Loading failed: {result.errors}")
else:
    print(f"Success: {len(result.documents)} documents")
Note
This class is returned by load_detailed() and is included in BulkLoadingResult for individual source results in bulk operations.
- async haive.core.engine.document.loaders.auto_loader.aload_document(path_or_url, **kwargs)[source]¶
Convenience function to load documents asynchronously.
- Parameters:
path_or_url (str) – Path, URL, or connection string to load from
**kwargs – Additional parameters
- Returns:
List of loaded documents
- Return type:
list[langchain_core.documents.Document]
Examples
Async document loading:
from haive.core.engine.document.loaders import aload_document

documents = await aload_document("https://example.com")
- haive.core.engine.document.loaders.auto_loader.load_document(path_or_url, **kwargs)[source]¶
Convenience function to load documents automatically.
- Parameters:
path_or_url (str) – Path, URL, or connection string to load from
**kwargs – Additional parameters
- Returns:
List of loaded documents
- Return type:
list[langchain_core.documents.Document]
Examples
Quick document loading:
from haive.core.engine.document.loaders import load_document

documents = load_document("file.pdf")
documents = load_document("https://example.com")
- haive.core.engine.document.loaders.auto_loader.load_documents_bulk(sources, **kwargs)[source]¶
Convenience function to load multiple documents.
- Parameters:
sources (list[str | dict[str, Any]]) – List of sources to load from
**kwargs – Additional parameters passed to each source and loader
- Returns:
Flattened list of all loaded documents
- Return type:
list[langchain_core.documents.Document]
Examples
Bulk loading:
from haive.core.engine.document.loaders import load_documents_bulk

documents = load_documents_bulk([
    "file1.pdf",
    "file2.docx",
    "https://example.com",
])