haive.core.engine.document.loaders.path_analyzer

Path analysis for automatic source detection.

This module provides comprehensive path analysis that automatically detects the type of a document source from a path string. Auto-loading depends on this detection.

Classes

`FileCategory`	High-level file category.
`PathAnalysisResult`	Result of comprehensive path analysis.
`PathAnalyzer`	Analyzes paths to determine source type and characteristics.
`PathType`	Primary path type classification.
`SourceInfo`	Comprehensive information about a detected document source.

Functions

`analyze_path`(path)	Convenience function for path analysis.
`analyze_path_to_source_info`(path)	Analyze path and return SourceInfo directly.
`convert_to_source_info`(analysis)	Convert PathAnalysisResult to SourceInfo for compatibility.

Module Contents

class haive.core.engine.document.loaders.path_analyzer.FileCategory

Bases: str, enum.Enum

High-level file category.

class haive.core.engine.document.loaders.path_analyzer.PathAnalysisResult(/, **data)

Bases: pydantic.BaseModel

Result of comprehensive path analysis.

Create a new model by parsing and validating input data from keyword arguments.

Raises pydantic_core.ValidationError if the input data cannot be validated to form a valid model.

The self parameter is explicitly positional-only so that self can be used as a field name.

Parameters: data (Any)

class haive.core.engine.document.loaders.path_analyzer.PathAnalyzer

Analyzes paths to determine source type and characteristics.

classmethod analyze(path)

Perform comprehensive path analysis.

Parameters: path (str | pathlib.Path)
Return type: PathAnalysisResult
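
The kind of classification analyze() performs can be illustrated with a small standard-library sketch. This is not the haive implementation: SketchPathType, its members, the scheme tables, and classify_path are all hypothetical names introduced here, and the real PathType members are not listed in this reference.

```python
from enum import Enum
from urllib.parse import urlparse


class SketchPathType(str, Enum):
    """Hypothetical stand-in for PathType (real members are not documented here)."""

    URL = "url"
    DATABASE_URI = "database_uri"
    CLOUD_STORAGE = "cloud_storage"
    LOCAL_FILE = "local_file"


# Assumed scheme tables, chosen only for illustration.
DATABASE_SCHEMES = {"postgresql", "mysql", "sqlite", "mongodb"}
CLOUD_SCHEMES = {"s3", "gs"}


def classify_path(path: str) -> SketchPathType:
    """Classify a path string by its URL scheme, falling back to a local file."""
    scheme = urlparse(path).scheme.lower()
    if scheme in ("http", "https"):
        return SketchPathType.URL
    if scheme in DATABASE_SCHEMES:
        return SketchPathType.DATABASE_URI
    if scheme in CLOUD_SCHEMES:
        return SketchPathType.CLOUD_STORAGE
    # No recognised scheme (including a Windows drive letter such as "c"):
    # treat the string as a local filesystem path.
    return SketchPathType.LOCAL_FILE
```

Checking the scheme first keeps the local-file case as the default, so an unfamiliar string is never misreported as a remote source.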

class haive.core.engine.document.loaders.path_analyzer.PathType

Bases: str, enum.Enum

Primary path type classification.

class haive.core.engine.document.loaders.path_analyzer.SourceInfo(/, **data)

Bases: pydantic.BaseModel

Comprehensive information about a detected document source.

This Pydantic model contains the complete results of source detection and analysis, providing all information needed for optimal loader selection and configuration. Created by the PathAnalyzer during the source detection phase.

Parameters: data (Any)

source_type

Specific source type identifier used for loader selection. Examples: ‘pdf’, ‘web’, ‘csv’, ‘postgresql’, ‘s3’, ‘sharepoint’. This maps directly to registered loader implementations.

Type: str

category

High-level classification of the source type. Used for capability grouping and fallback logic. Categories include: FILE_DOCUMENT, WEB_SCRAPING, DATABASE_SQL, CLOUD_STORAGE, etc.

Type: SourceCategory

confidence

Detection confidence score from 0.0 to 1.0. Higher values indicate more certain detection. Values below 0.5 may trigger additional validation or fallback detection methods.

Type: float
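
One way a caller might act on the 0.5 threshold is sketched below. The Detection dataclass and the resolve helper are hypothetical, introduced only to show the pattern; how haive actually handles low-confidence detections is not documented here.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Detection:
    """Hypothetical stand-in carrying only the two fields this sketch needs."""

    source_type: str
    confidence: float


LOW_CONFIDENCE = 0.5  # threshold quoted in the attribute description


def resolve(
    primary: Detection,
    fallback: Callable[[], Optional[Detection]],
) -> Detection:
    """Keep a confident detection; otherwise consult a fallback detector."""
    if primary.confidence >= LOW_CONFIDENCE:
        return primary
    alternative = fallback()
    # Only switch when the fallback found something it is more sure about.
    if alternative is not None and alternative.confidence > primary.confidence:
        return alternative
    return primary
```

Passing the fallback as a callable means the extra detection work runs only when the first result is below the threshold.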

metadata

Rich metadata collected during analysis. Contains source-specific information such as:
- file_extension: File extension for local files
- mime_type: Detected MIME type
- estimated_size: Estimated content size
- url_components: Parsed URL components for web sources
- database_type: Database system type for database sources

Type: Dict[str, Any]
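
A few of these keys can be derived with the standard library alone, as the sketch below suggests. The collect_metadata function is a hypothetical helper, not part of haive, and the exact keys and value shapes the real analyzer emits may differ.

```python
import mimetypes
from pathlib import PurePosixPath
from typing import Any, Dict
from urllib.parse import urlparse


def collect_metadata(path: str) -> Dict[str, Any]:
    """Build a metadata dict for a web URL or a local file path."""
    parts = urlparse(path)
    if parts.scheme in ("http", "https"):
        # Web source: record the parsed URL components.
        return {
            "url_components": {
                "scheme": parts.scheme,
                "host": parts.hostname,
                "path": parts.path,
            },
        }
    # Local file: guess the MIME type from the file name only (no disk access).
    mime_type, _encoding = mimetypes.guess_type(path)
    return {
        "file_extension": PurePosixPath(path).suffix.lower(),
        "mime_type": mime_type,
    }
```

Because nothing is read from disk, mime_type is only a guess based on the extension and is None when the extension is unknown.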

capabilities

List of supported capabilities for this source type. Used for loader filtering and feature availability checks. None if not determined.

Type: Optional[List[LoaderCapability]]
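
The two uses mentioned, availability checks and loader filtering, might look roughly like the following. SketchCapability is a hypothetical stand-in whose member names are taken from the examples in this reference and whose values are assumed; supports and filter_loaders are illustrative helpers, not haive functions.

```python
from enum import Enum
from typing import Dict, List, Optional, Set


class SketchCapability(str, Enum):
    """Hypothetical stand-in for LoaderCapability; values are assumed."""

    TEXT_EXTRACTION = "text_extraction"
    METADATA_EXTRACTION = "metadata_extraction"
    BULK_LOADING = "bulk_loading"


def supports(
    capabilities: Optional[List[SketchCapability]],
    wanted: SketchCapability,
) -> Optional[bool]:
    """Return True or False when capabilities are known, None when undetermined."""
    if capabilities is None:
        return None
    return wanted in capabilities


def filter_loaders(
    loaders: Dict[str, Set[SketchCapability]],
    required: Set[SketchCapability],
) -> List[str]:
    """Return, in sorted order, the loaders that offer every required capability."""
    return sorted(name for name, caps in loaders.items() if required <= caps)
```

Returning None from supports keeps "not determined" distinct from "not supported", mirroring the None value the attribute itself can hold.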

Examples

PDF file detection result:

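# SourceCategory and LoaderCapability must also be imported; their module is not shown in this reference.
from haive.core.engine.document.loaders.path_analyzer import SourceInfo

source_info = SourceInfo(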
    source_type="pdf",
    category=SourceCategory.FILE_DOCUMENT,
    confidence=0.95,
    metadata={
        "file_extension": ".pdf",
        "mime_type": "application/pdf",
        "estimated_size": 1024000
    },
    capabilities=[
        LoaderCapability.TEXT_EXTRACTION,
        LoaderCapability.METADATA_EXTRACTION
    ]
)

Web source detection result:

source_info = SourceInfo(
    source_type="web",
    category=SourceCategory.WEB_SCRAPING,
    confidence=0.90,
    metadata={
        "protocol": "https",
        "domain": "docs.example.com",
        "url_components": {"scheme": "https", "host": "docs.example.com"}
    },
    capabilities=[
        LoaderCapability.WEB_SCRAPING,
        LoaderCapability.BULK_LOADING
    ]
)

Usage: This class is primarily used internally by the AutoLoader system for source detection and loader selection. Users typically don’t create SourceInfo instances directly; they receive them in LoadingResult objects and from the detect_source() method.