haive.core.engine.document.loaders.path_analyzer

Path analysis for automatic source detection.

This module analyzes a path string to detect which type of document source it refers to. Auto-loading depends on this detection to select a loader.
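
The detection idea described above can be sketched with the standard library alone. This is a minimal illustration, not the module's implementation: guess_source_type and the _SCHEME_TYPES table are hypothetical names, and the real detection rules are not shown on this page.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Hypothetical scheme table (an assumption for this sketch).
_SCHEME_TYPES = {
    "http": "web",
    "https": "web",
    "s3": "s3",
    "postgresql": "postgresql",
}


def guess_source_type(path: str) -> str:
    """Return a coarse source-type label for a path or URL string."""
    parsed = urlparse(path)
    # A known scheme followed by "://" marks a URL-like source.
    if "://" in path and parsed.scheme in _SCHEME_TYPES:
        return _SCHEME_TYPES[parsed.scheme]
    # Otherwise treat the string as a file path and use its extension.
    suffix = PurePosixPath(path).suffix.lower()
    return suffix.lstrip(".") or "unknown"


print(guess_source_type("https://docs.example.com/guide"))
print(guess_source_type("reports/q3.PDF"))
```

Checking the scheme before the extension keeps a URL such as s3://bucket/data.csv from being classified by its trailing ".csv".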

Classes

FileCategory

High-level file category.

PathAnalysisResult

Result of comprehensive path analysis.

PathAnalyzer

Analyzes paths to determine source type and characteristics.

PathType

Primary path type classification.

SourceInfo

Comprehensive information about a detected document source.

Functions

analyze_path(path)

Convenience function for path analysis.

analyze_path_to_source_info(path)

Analyze path and return SourceInfo directly.

convert_to_source_info(analysis)

Convert PathAnalysisResult to SourceInfo for compatibility.

Module Contents

class haive.core.engine.document.loaders.path_analyzer.FileCategory

Bases: str, enum.Enum

High-level file category.
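
Because the enum also derives from str, its members behave like plain strings in comparisons and JSON serialization. The sketch below shows this with a stand-in enum; DemoCategory and its members are hypothetical, since the actual members of FileCategory are not listed on this page.

```python
import json
from enum import Enum


class DemoCategory(str, Enum):
    """Stand-in for a str-based enum (member names are assumptions)."""

    DOCUMENT = "document"
    DATA = "data"


# Members compare equal to their string values.
matches_plain_string = DemoCategory.DOCUMENT == "document"

# Lookup by value returns the existing member.
looked_up = DemoCategory("data")

# json.dumps serializes the member as its string value.
payload = json.dumps({"category": DemoCategory.DATA})
print(payload)
```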


class haive.core.engine.document.loaders.path_analyzer.PathAnalysisResult(/, **data)

Bases: pydantic.BaseModel

Result of comprehensive path analysis.

Create a new model by parsing and validating input data from keyword arguments.

Raises pydantic_core.ValidationError if the input data cannot be validated to form a valid model.

self is explicitly positional-only to allow self as a field name.

Parameters:

data (Any)

class haive.core.engine.document.loaders.path_analyzer.PathAnalyzer

Analyzes paths to determine source type and characteristics.

classmethod analyze(path)

Perform comprehensive path analysis.

Parameters:

path (str | pathlib.Path)

Return type:

PathAnalysisResult

class haive.core.engine.document.loaders.path_analyzer.PathType

Bases: str, enum.Enum

Primary path type classification.


class haive.core.engine.document.loaders.path_analyzer.SourceInfo(/, **data)

Bases: pydantic.BaseModel

Comprehensive information about a detected document source.

This Pydantic model contains the complete results of source detection and analysis, providing all information needed for optimal loader selection and configuration. Created by the PathAnalyzer during the source detection phase.

Parameters:

data (Any)

source_type

Specific source type identifier used for loader selection. Examples: ‘pdf’, ‘web’, ‘csv’, ‘postgresql’, ‘s3’, ‘sharepoint’. This maps directly to registered loader implementations.

Type:

str

category

High-level classification of the source type. Used for capability grouping and fallback logic. Categories include: FILE_DOCUMENT, WEB_SCRAPING, DATABASE_SQL, CLOUD_STORAGE, etc.

Type:

SourceCategory

confidence

Detection confidence score from 0.0 to 1.0. Higher values indicate more certain detection. Values below 0.5 may trigger additional validation or fallback detection methods.

Type:

float

metadata

Rich metadata collected during analysis. Contains source-specific information such as:

  • file_extension: File extension for local files

  • mime_type: Detected MIME type

  • estimated_size: Estimated content size

  • url_components: Parsed URL components for web sources

  • database_type: Database system type for database sources

Type:

Dict[str, Any]

capabilities

List of supported capabilities for this source type. Used for loader filtering and feature availability checks. None if not determined.

Type:

Optional[List[LoaderCapability]]
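
Several of the metadata keys listed above can be derived with the standard library. The following is a rough sketch under that assumption; collect_metadata is a hypothetical helper, and the way the module actually populates metadata is not documented here.

```python
import mimetypes
from pathlib import PurePosixPath
from typing import Any, Dict
from urllib.parse import urlparse


def collect_metadata(path: str) -> Dict[str, Any]:
    """Build a metadata dict for a file path or an http(s) URL."""
    parsed = urlparse(path)
    if parsed.scheme in ("http", "https"):
        # Web source: keep the parsed URL pieces.
        return {
            "url_components": {
                "scheme": parsed.scheme,
                "host": parsed.hostname,
                "path": parsed.path,
            }
        }
    # File source: extension plus a MIME type guessed from the name.
    mime_type, _encoding = mimetypes.guess_type(path)
    return {
        "file_extension": PurePosixPath(path).suffix.lower(),
        "mime_type": mime_type,
    }


print(collect_metadata("report.pdf"))
```

mimetypes.guess_type only inspects the file name, so the guessed MIME type reflects the extension and not the file contents.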

Examples

PDF file detection result:

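from haive.core.engine.document.loaders.path_analyzer import SourceInfo

# SourceCategory and LoaderCapability must be imported as well; this page does not show their module.
source_info = SourceInfo(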
    source_type="pdf",
    category=SourceCategory.FILE_DOCUMENT,
    confidence=0.95,
    metadata={
        "file_extension": ".pdf",
        "mime_type": "application/pdf",
        "estimated_size": 1024000
    },
    capabilities=[
        LoaderCapability.TEXT_EXTRACTION,
        LoaderCapability.METADATA_EXTRACTION
    ]
)

Web source detection result:

source_info = SourceInfo(
    source_type="web",
    category=SourceCategory.WEB_SCRAPING,
    confidence=0.90,
    metadata={
        "protocol": "https",
        "domain": "docs.example.com",
        "url_components": {"scheme": "https", "host": "docs.example.com"}
    },
    capabilities=[
        LoaderCapability.WEB_SCRAPING,
        LoaderCapability.BULK_LOADING
    ]
)

Usage:

This class is primarily used internally by the AutoLoader system for source detection and loader selection. Users typically don’t create SourceInfo instances directly but receive them in LoadingResult objects and through the detect_source() method.
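
The confidence field notes that values below 0.5 may trigger fallback detection. One way such a rule could look is sketched below; Detection, MIN_CONFIDENCE, and choose_loader_key are hypothetical names, and the AutoLoader's real selection logic is not shown on this page.

```python
from dataclasses import dataclass
from typing import Optional

# Threshold taken from the confidence description; its use here is an assumption.
MIN_CONFIDENCE = 0.5


@dataclass
class Detection:
    """Stand-in carrying only the two fields this sketch needs."""

    source_type: str
    confidence: float


def choose_loader_key(primary: Detection, fallback: Optional[Detection] = None) -> str:
    """Pick a source type, preferring a fallback when the primary is weak."""
    if primary.confidence >= MIN_CONFIDENCE:
        return primary.source_type
    # Primary detection is weak: use the fallback only if it is stronger.
    if fallback is not None and fallback.confidence > primary.confidence:
        return fallback.source_type
    return primary.source_type


print(choose_loader_key(Detection("text", 0.3), Detection("csv", 0.8)))
```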

See also

  • PathAnalyzer: Creates SourceInfo instances

  • LoadingResult: Contains SourceInfo for completed operations

  • SourceCategory: Enumeration of source categories

  • LoaderCapability: Enumeration of loader capabilities

haive.core.engine.document.loaders.path_analyzer.analyze_path(path)

Convenience function for path analysis.

Parameters:

path (str | pathlib.Path)

Return type:

PathAnalysisResult

haive.core.engine.document.loaders.path_analyzer.analyze_path_to_source_info(path)

Analyze path and return SourceInfo directly.

Parameters:

path (str | pathlib.Path) – Path to analyze

Returns:

SourceInfo object with detected source information

Return type:

SourceInfo

haive.core.engine.document.loaders.path_analyzer.convert_to_source_info(analysis)

Convert PathAnalysisResult to SourceInfo for compatibility.

Parameters:

analysis (PathAnalysisResult) – PathAnalysisResult from path analysis

Returns:

SourceInfo object with detected information

Return type:

SourceInfo
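
The three functions above suggest a two-step flow: analyze, then convert. The sketch below mirrors that shape with stand-in dataclasses. Every name in it (DemoAnalysis, DemoSourceInfo, the demo_* functions) and the confidence values are hypothetical; the real field names of PathAnalysisResult are not listed on this page, and whether analyze_path_to_source_info is implemented as this composition is an assumption.

```python
from dataclasses import dataclass, field
from pathlib import PurePosixPath
from typing import Any, Dict


@dataclass
class DemoAnalysis:
    """Stand-in for an analysis result (fields are assumptions)."""

    original_path: str
    extension: str


@dataclass
class DemoSourceInfo:
    """Stand-in holding a subset of the SourceInfo attributes."""

    source_type: str
    confidence: float
    metadata: Dict[str, Any] = field(default_factory=dict)


def demo_analyze(path: str) -> DemoAnalysis:
    """Step 1: extract raw facts from the path."""
    return DemoAnalysis(original_path=path, extension=PurePosixPath(path).suffix.lower())


def demo_convert(analysis: DemoAnalysis) -> DemoSourceInfo:
    """Step 2: map the raw facts onto the loader-facing structure."""
    known = bool(analysis.extension)
    return DemoSourceInfo(
        source_type=analysis.extension.lstrip(".") if known else "unknown",
        confidence=0.9 if known else 0.1,  # arbitrary illustrative scores
        metadata={"file_extension": analysis.extension},
    )


def demo_analyze_to_source_info(path: str) -> DemoSourceInfo:
    """Convenience wrapper composing both steps."""
    return demo_convert(demo_analyze(path))


info = demo_analyze_to_source_info("notes/data.CSV")
print(info.source_type, info.metadata)
```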