haive.core.engine.document.processors

Document Processing Components.

This module provides document processing capabilities, including chunking and content transformation, that integrate with the DocumentEngine.

The processors handle:

- Content normalization
- Document chunking strategies
- Metadata extraction
- Format conversion

Classes

`ChunkingProcessor`	Processor for chunking documents into smaller pieces.
`ContentNormalizer`	Processor for normalizing document content.
`DocumentProcessor`	Base class for document processing operations.
`FormatDetector`	Processor for detecting document formats.
`MetadataExtractor`	Processor for extracting metadata from documents.

Module Contents

class haive.core.engine.document.processors.ChunkingProcessor(chunking_strategy=ChunkingStrategy.RECURSIVE, chunk_size=1000, chunk_overlap=200, **kwargs)

Bases: DocumentProcessor

Processor for chunking documents into smaller pieces.

Initialize the chunking processor.

Parameters:

chunking_strategy (haive.core.engine.document.config.ChunkingStrategy) – Strategy for chunking
chunk_size (int) – Size of chunks in characters
chunk_overlap (int) – Overlap between chunks
**kwargs – Additional configuration

chunk_text(text, strategy, chunk_size, chunk_overlap, metadata)

Chunk text according to the specified strategy.

Parameters:

text (str) – Text to chunk
strategy (haive.core.engine.document.config.ChunkingStrategy) – Chunking strategy
chunk_size (int) – Size of chunks
chunk_overlap (int) – Overlap between chunks
metadata (dict[str, Any]) – Base metadata for chunks

Returns:

List of document chunks

Return type:

list[haive.core.engine.document.config.DocumentChunk]
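To illustrate how chunk_size and chunk_overlap interact, the following is a minimal standalone sketch of fixed-size character windowing. It is not the library's RECURSIVE strategy and does not call haive; the function name chunk_fixed is hypothetical, and it returns plain strings rather than DocumentChunk objects.

```python
def chunk_fixed(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    """Split text into character windows that overlap by chunk_overlap.

    Hypothetical helper for illustration only; not part of haive.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must satisfy 0 <= chunk_overlap < chunk_size")

    # Each new window starts (chunk_size - chunk_overlap) characters after the last.
    step = chunk_size - chunk_overlap
    chunks: list[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


chunks = chunk_fixed("abcdefghij", chunk_size=4, chunk_overlap=1)
print(chunks)
```

With the defaults shown in the constructor signature (chunk_size=1000, chunk_overlap=200), a scheme like this would advance 800 characters per window, so the last 200 characters of each chunk repeat at the start of the next.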

class haive.core.engine.document.processors.ContentNormalizer(normalize_whitespace=True, remove_extra_newlines=True, strip_content=True, **kwargs)

Bases: DocumentProcessor

Processor for normalizing document content.

Initialize the content normalizer.

Parameters:

normalize_whitespace (bool) – Whether to normalize whitespace
remove_extra_newlines (bool) – Whether to remove extra newlines
strip_content (bool) – Whether to strip leading/trailing whitespace
**kwargs – Additional configuration

normalize_content(content)

Normalize document content.

Parameters:

content (str) – Content to normalize

Returns:

Normalized content

Return type:

str
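The three constructor flags suggest a pipeline of independent text clean-ups. The sketch below shows one plausible reading of them using only the standard library. The function name normalize and the exact rules (collapsing spaces and tabs, capping blank lines at one) are assumptions, not the documented behaviour of ContentNormalizer.

```python
import re


def normalize(
    content: str,
    normalize_whitespace: bool = True,
    remove_extra_newlines: bool = True,
    strip_content: bool = True,
) -> str:
    """Hypothetical stand-in for normalize_content; the rules are assumed."""
    if normalize_whitespace:
        # Collapse runs of spaces and tabs into one space; newlines are untouched.
        content = re.sub(r"[ \t]+", " ", content)
    if remove_extra_newlines:
        # Reduce three or more consecutive newlines to a single blank line.
        content = re.sub(r"\n{3,}", "\n\n", content)
    if strip_content:
        # Drop leading and trailing whitespace.
        content = content.strip()
    return content


result = normalize("  Hello   world\n\n\n\nNext\tline  ")
print(repr(result))
```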

class haive.core.engine.document.processors.DocumentProcessor(**kwargs)

Base class for document processing operations.

Initialize the processor.

abstractmethod process(document)

Process a document.

Parameters:

document (langchain_core.documents.Document) – Document to process

Returns:

Processed document

Return type:

haive.core.engine.document.config.ProcessedDocument
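Because process is abstract, a concrete processor has to override it before it can be instantiated. The sketch below shows that pattern with self-contained stand-ins, so it runs without haive or langchain installed. StubDocument, StubProcessedDocument, StubProcessor, and UppercaseProcessor are all hypothetical names, and the fields of the real ProcessedDocument are not documented here, so the ones used are assumptions.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Any


@dataclass
class StubDocument:
    """Stand-in mirroring the page_content/metadata shape of a LangChain Document."""
    page_content: str
    metadata: dict[str, Any] = field(default_factory=dict)


@dataclass
class StubProcessedDocument:
    """Stand-in for ProcessedDocument; these field names are assumed."""
    content: str
    metadata: dict[str, Any] = field(default_factory=dict)


class StubProcessor(ABC):
    """Stand-in for the DocumentProcessor base class."""

    def __init__(self, **kwargs: Any) -> None:
        self.config = kwargs

    @abstractmethod
    def process(self, document: StubDocument) -> StubProcessedDocument:
        """Subclasses must implement this."""


class UppercaseProcessor(StubProcessor):
    """Toy processor that upper-cases the document text."""

    def process(self, document: StubDocument) -> StubProcessedDocument:
        return StubProcessedDocument(
            content=document.page_content.upper(),
            metadata=dict(document.metadata),  # copy so the input is not mutated
        )


processed = UppercaseProcessor().process(StubDocument("hello", {"source": "a.txt"}))
print(processed)
```

Instantiating the base class directly raises TypeError, which is how Python enforces the abstractmethod contract.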

class haive.core.engine.document.processors.FormatDetector(**kwargs)

Bases: DocumentProcessor

Processor for detecting document formats.

Initialize the processor.

detect_format(content, metadata)

Detect document format from content and metadata.

Parameters:

content (str) – Document content
metadata (dict[str, Any]) – Document metadata

Returns:

Detected document format

Return type:

haive.core.engine.document.config.DocumentFormat
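Detection that takes both content and metadata typically checks a filename hint first and falls back to sniffing the text. The sketch below illustrates that idea only. The function guess_format, the "source" metadata key, the heuristics, and the string labels it returns are all assumptions; the real method returns a DocumentFormat value whose members are not listed on this page.

```python
import json
from pathlib import Path
from typing import Any


def guess_format(content: str, metadata: dict[str, Any]) -> str:
    """Hypothetical heuristic; returns a plain string label, not a DocumentFormat."""
    # 1. Prefer a file-extension hint from metadata (the "source" key is assumed).
    suffix = Path(str(metadata.get("source", ""))).suffix.lower()
    by_suffix = {
        ".md": "markdown",
        ".html": "html",
        ".htm": "html",
        ".json": "json",
        ".txt": "text",
    }
    if suffix in by_suffix:
        return by_suffix[suffix]

    # 2. Otherwise sniff the leading characters of the content.
    stripped = content.lstrip()
    if stripped.lower().startswith(("<!doctype html", "<html")):
        return "html"
    if stripped.startswith(("{", "[")):
        try:
            json.loads(stripped)
            return "json"
        except json.JSONDecodeError:
            pass  # looked like JSON but did not parse; keep checking
    if stripped.startswith("#"):
        return "markdown"
    return "text"


print(guess_format("# Title", {}))
```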

class haive.core.engine.document.processors.MetadataExtractor(**kwargs)

Bases: DocumentProcessor

Processor for extracting metadata from documents.

Initialize the processor.

extract_metadata(content, existing_metadata)

Extract additional metadata from document content.

Parameters:

content (str) – Document content
existing_metadata (dict[str, Any]) – Existing metadata

Returns:

Enhanced metadata dictionary

Return type:

dict[str, Any]
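An "enhanced" dictionary implies that derived fields are merged on top of what the caller already supplied. The sketch below shows one way that merge could work. The function extract_basic_metadata and the keys char_count, word_count, and line_count are hypothetical; which fields MetadataExtractor actually derives is not stated on this page.

```python
from typing import Any


def extract_basic_metadata(content: str, existing_metadata: dict[str, Any]) -> dict[str, Any]:
    """Hypothetical stand-in for extract_metadata; the derived keys are assumed."""
    # Work on a copy so the caller's dictionary is not mutated.
    enhanced = dict(existing_metadata)
    # setdefault keeps any value the caller already provided for the same key.
    enhanced.setdefault("char_count", len(content))
    enhanced.setdefault("word_count", len(content.split()))
    enhanced.setdefault("line_count", content.count("\n") + 1 if content else 0)
    return enhanced


meta = extract_basic_metadata("one two\nthree", {"source": "x"})
print(meta)
```

Using setdefault rather than plain assignment is a design choice in this sketch: existing metadata wins over derived values, so upstream loaders can override anything computed here.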