haive.core.engine.document.loaders.path_analyzer
Path analysis for automatic source detection.
This module analyzes a path string to detect automatically which type of document source it refers to. Auto-loading relies on this detection.
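As a rough illustration of detecting a source type from a path string, the sketch below classifies a few common shapes using only the standard library. It is not the module's implementation: `classify_path`, the scheme sets, and the returned labels are hypothetical names chosen for this example.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Hypothetical scheme groups; the real module may recognise different ones.
WEB_SCHEMES = {"http", "https"}
DATABASE_SCHEMES = {"postgresql", "mysql", "sqlite", "mongodb"}
CLOUD_SCHEMES = {"s3", "gs"}


def classify_path(path: str) -> str:
    """Return a coarse, illustrative label for a path string."""
    scheme = urlparse(path).scheme.lower()
    # A one-letter "scheme" is almost certainly a Windows drive letter,
    # so only longer schemes are treated as URL-style sources.
    if len(scheme) > 1:
        if scheme in WEB_SCHEMES:
            return "web"
        if scheme in DATABASE_SCHEMES:
            return "database"
        if scheme in CLOUD_SCHEMES:
            return "cloud_storage"
        if scheme == "file":
            return "local_file"
        return "unknown"
    # No scheme: treat as a local path and use the suffix as a hint.
    return "local_file" if PurePosixPath(path).suffix else "local_directory"


if __name__ == "__main__":
    for sample in ("https://docs.example.com/guide", "s3://bucket/key.csv", "reports/q1.pdf"):
        print(sample, "->", classify_path(sample))
```

A real analyzer would likely also inspect the file system and file contents; this sketch looks only at the string itself.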
Classes
FileCategory | High-level file category.
PathAnalysisResult | Result of comprehensive path analysis.
PathAnalyzer | Analyzes paths to determine source type and characteristics.
PathType | Primary path type classification.
SourceInfo | Comprehensive information about a detected document source.
Functions
analyze_path(path) | Convenience function for path analysis.
analyze_path_to_source_info(path) | Analyze path and return SourceInfo directly.
convert_to_source_info(analysis) | Convert PathAnalysisResult to SourceInfo for compatibility.
Module Contents
- class haive.core.engine.document.loaders.path_analyzer.FileCategory
High-level file category.
- class haive.core.engine.document.loaders.path_analyzer.PathAnalysisResult(/, **data)
Bases: pydantic.BaseModel
Result of comprehensive path analysis.
Create a new model by parsing and validating input data from keyword arguments.
Raises pydantic_core.ValidationError if the input data cannot be validated to form a valid model.
self is explicitly positional-only to allow self as a field name.
- Parameters:
data (Any)
- class haive.core.engine.document.loaders.path_analyzer.PathAnalyzer
Analyzes paths to determine source type and characteristics.
- classmethod analyze(path)
Perform comprehensive path analysis.
- Parameters:
path (str | pathlib.Path)
- Return type:
PathAnalysisResult
- class haive.core.engine.document.loaders.path_analyzer.PathType
Primary path type classification.
- class haive.core.engine.document.loaders.path_analyzer.SourceInfo(/, **data)
Bases: pydantic.BaseModel
Comprehensive information about a detected document source.
This Pydantic model contains the complete results of source detection and analysis, providing all information needed for optimal loader selection and configuration. Created by the PathAnalyzer during the source detection phase.
- Parameters:
data (Any)
- source_type
Specific source type identifier used for loader selection. Examples: ‘pdf’, ‘web’, ‘csv’, ‘postgresql’, ‘s3’, ‘sharepoint’. This maps directly to registered loader implementations.
- Type:
str
- category
High-level classification of the source type. Used for capability grouping and fallback logic. Categories include: FILE_DOCUMENT, WEB_SCRAPING, DATABASE_SQL, CLOUD_STORAGE, etc.
- Type:
SourceCategory
- confidence
Detection confidence score from 0.0 to 1.0. Higher values indicate more certain detection. Values below 0.5 may trigger additional validation or fallback detection methods.
- Type:
float
- metadata
Rich metadata collected during analysis. Contains source-specific information such as:
- file_extension: File extension for local files
- mime_type: Detected MIME type
- estimated_size: Estimated content size
- url_components: Parsed URL components for web sources
- database_type: Database system type for database sources
- Type:
Dict[str, Any]
- capabilities
List of supported capabilities for this source type. Used for loader filtering and feature availability checks. None if not determined.
- Type:
Optional[List[LoaderCapability]]
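Several of the metadata keys described above can be derived from the path string with the standard library. The following is a minimal sketch under that assumption; `collect_metadata` is a hypothetical helper written for this example and is not part of the module.

```python
import mimetypes
from pathlib import PurePosixPath
from typing import Any, Dict
from urllib.parse import urlparse


def collect_metadata(path: str) -> Dict[str, Any]:
    """Build an illustrative metadata dict for a web URL or a local file path."""
    parsed = urlparse(path)
    if parsed.scheme in ("http", "https"):
        # Web source: keep the parsed URL pieces.
        return {
            "protocol": parsed.scheme,
            "domain": parsed.hostname,
            "url_components": {
                "scheme": parsed.scheme,
                "host": parsed.hostname,
                "path": parsed.path,
            },
        }
    # Local file: guess the MIME type from the extension only (no file access).
    mime_type, _encoding = mimetypes.guess_type(path)
    return {
        "file_extension": PurePosixPath(path).suffix.lower(),
        "mime_type": mime_type,
    }


if __name__ == "__main__":
    print(collect_metadata("docs/manual.pdf"))
    print(collect_metadata("https://docs.example.com/guide"))
```

Keys such as estimated_size and database_type are omitted here because they would require file system or connection-string inspection.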
Examples
PDF file detection result:
    source_info = SourceInfo(
        source_type="pdf",
        category=SourceCategory.FILE_DOCUMENT,
        confidence=0.95,
        metadata={
            "file_extension": ".pdf",
            "mime_type": "application/pdf",
            "estimated_size": 1024000,
        },
        capabilities=[
            LoaderCapability.TEXT_EXTRACTION,
            LoaderCapability.METADATA_EXTRACTION,
        ],
    )
Web source detection result:
    source_info = SourceInfo(
        source_type="web",
        category=SourceCategory.WEB_SCRAPING,
        confidence=0.90,
        metadata={
            "protocol": "https",
            "domain": "docs.example.com",
            "url_components": {"scheme": "https", "host": "docs.example.com"},
        },
        capabilities=[
            LoaderCapability.WEB_SCRAPING,
            LoaderCapability.BULK_LOADING,
        ],
    )
- Usage:
This class is primarily used internally by the AutoLoader system for source detection and loader selection. Users typically don’t create SourceInfo instances directly but receive them in LoadingResult objects and through the detect_source() method.
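Code that receives a detection result usually only needs to read it. The sketch below shows one way to act on the confidence score and the optional capability list; `DetectedSource` and `Capability` are hypothetical stand-ins for SourceInfo and LoaderCapability, defined here only so the example runs without the library, and the 0.5 cut-off is taken from the confidence description.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Any, Dict, List, Optional


class Capability(Enum):
    """Hypothetical stand-in for LoaderCapability."""
    TEXT_EXTRACTION = "text_extraction"
    METADATA_EXTRACTION = "metadata_extraction"


@dataclass
class DetectedSource:
    """Hypothetical stand-in mirroring some SourceInfo fields."""
    source_type: str
    confidence: float
    metadata: Dict[str, Any] = field(default_factory=dict)
    capabilities: Optional[List[Capability]] = None


# Assumed cut-off, based on the documented "below 0.5" note.
CONFIDENCE_THRESHOLD = 0.5


def needs_fallback(info: DetectedSource) -> bool:
    """True when detection is uncertain enough to warrant another check."""
    return info.confidence < CONFIDENCE_THRESHOLD


def supports(info: DetectedSource, capability: Capability) -> bool:
    """False when capabilities were not determined (None) or lack the item."""
    return info.capabilities is not None and capability in info.capabilities


if __name__ == "__main__":
    pdf = DetectedSource("pdf", 0.95, capabilities=[Capability.TEXT_EXTRACTION])
    print(needs_fallback(pdf), supports(pdf, Capability.TEXT_EXTRACTION))
```

Treating a missing capability list as "not supported" is a conservative choice for this sketch; a caller could equally treat None as "unknown" and probe further.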
See also
PathAnalyzer: Creates SourceInfo instances
LoadingResult: Contains SourceInfo for completed operations
SourceCategory: Enumeration of source categories
LoaderCapability: Enumeration of loader capabilities
- haive.core.engine.document.loaders.path_analyzer.analyze_path(path)
Convenience function for path analysis.
- Parameters:
path (str | pathlib.Path)
- Return type:
PathAnalysisResult
- haive.core.engine.document.loaders.path_analyzer.analyze_path_to_source_info(path)
Analyze path and return SourceInfo directly.
- Parameters:
path (str | pathlib.Path) – Path to analyze
- Returns:
SourceInfo object with detected source information
- Return type:
SourceInfo
- haive.core.engine.document.loaders.path_analyzer.convert_to_source_info(analysis)
Convert PathAnalysisResult to SourceInfo for compatibility.
- Parameters:
analysis (PathAnalysisResult) – PathAnalysisResult from path analysis
- Returns:
SourceInfo object with detected information
- Return type:
SourceInfo