haive.agents.document_processing.agent

Comprehensive Document Processing Agent.

This agent provides end-to-end document processing capabilities, including:

- Document fetching with ReactAgent + search tools
- Auto-loading with bulk processing
- Transform/split/annotate/embed pipeline
- Advanced RAG features (refined queries, self-query, etc.)
- State management and persistence

The agent integrates all existing Haive document processing components into a unified, powerful system for document-based AI workflows.

Examples

Basic document processing:

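import asyncio
from haive.agents.document_processing.agent import DocumentProcessingAgent
agent = DocumentProcessingAgent()
result = asyncio.run(agent.process_query("Load and analyze reports from https://company.com/reports"))  # process_query is async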

Advanced RAG with custom retrieval:

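import asyncio
from haive.agents.document_processing.agent import (
    DocumentProcessingAgent,
    DocumentProcessingConfig,
)
config = DocumentProcessingConfig(
    retrieval_strategy="self_query",
    query_refinement=True,
    annotation_enabled=True,
    embedding_model="text-embedding-3-large",
)
agent = DocumentProcessingAgent(config=config)
# process_query is a coroutine; asyncio.run drives it from synchronous code
result = asyncio.run(agent.process_query("Find all financial projections from Q4 2024"))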

Multi-source document processing:

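import asyncio
from haive.agents.document_processing.agent import DocumentProcessingAgent
sources = [
    "/path/to/local/docs/",
    "https://wiki.company.com/procedures",
    "s3://bucket/documents/",
    {"url": "https://api.service.com/docs", "headers": {"Authorization": "Bearer token"}},
]
agent = DocumentProcessingAgent()
# process_sources is a coroutine; asyncio.run drives it from synchronous code
result = asyncio.run(agent.process_sources(sources, query="Extract key insights"))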

Author: Claude (Haive AI Agent Framework)
Version: 1.0.0

Classes

DocumentProcessingAgent

Comprehensive document processing agent with full pipeline capabilities.

DocumentProcessingConfig

Configuration for comprehensive document processing.

DocumentProcessingResult

Result from document processing operation.

DocumentProcessingState

State for document processing operations.

Module Contents

class haive.agents.document_processing.agent.DocumentProcessingAgent(config=None, engine=None, name='document_processor')

Comprehensive document processing agent with full pipeline capabilities.

This agent provides a complete document processing pipeline, including:

1. Document Discovery & Fetching (ReactAgent + search tools)
2. Auto-loading with bulk processing
3. Transform/split/annotate/embed pipeline
4. Advanced RAG features
5. State management and persistence

The agent integrates all existing Haive document processing components into a unified, powerful system for document-based AI workflows.

Initialize the document processing agent.

Parameters:
  • config (DocumentProcessingConfig | None) – Configuration for document processing

  • engine (haive.core.engine.aug_llm.AugLLMConfig | None) – LLM engine configuration

  • name (str) – Agent name for identification

get_capabilities()

Get agent capabilities and configuration.

Return type:

dict[str, Any]

async process_query(query, sources=None)

Process a query with comprehensive document processing pipeline.

Parameters:
  • query (str) – The user query to process

  • sources (list[str | dict[str, Any]] | None) – Optional list of specific sources to use

Returns:

DocumentProcessingResult with comprehensive results

Return type:

DocumentProcessingResult
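Because process_query is declared async, calling it returns a coroutine that must be awaited or driven by an event loop. The sketch below illustrates only that calling pattern. StubAgent and its canned return value are hypothetical stand-ins that mimic the documented signature; they are not part of Haive, and the real method returns a DocumentProcessingResult rather than a plain dict.

```python
from __future__ import annotations

import asyncio
from typing import Any


class StubAgent:
    """Hypothetical stand-in that mimics the documented async signature."""

    async def process_query(
        self, query: str, sources: list[str | dict[str, Any]] | None = None
    ) -> dict[str, Any]:
        await asyncio.sleep(0)  # yield to the event loop, as real I/O would
        return {"response": f"processed: {query}", "sources": sources or []}


async def main() -> dict[str, Any]:
    agent = StubAgent()
    # Inside a coroutine, await the call directly.
    return await agent.process_query("Summarize the Q4 report")


# From synchronous code, asyncio.run drives the coroutine to completion.
result = asyncio.run(main())
print(result["response"])
```

In an environment that already runs an event loop (for example a notebook or an async web handler), awaiting the call directly is usually the right choice, since asyncio.run cannot be nested inside a running loop.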

async process_sources(sources, query)

Process specific sources with a query.

Parameters:
  • sources (list[str | dict[str, Any]]) – List of sources to process

  • query (str) – Query to process against the sources

Returns:

DocumentProcessingResult with results

Return type:

DocumentProcessingResult
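The sources parameter accepts both plain strings and dicts. As a rough illustration of how such a mixed list could be normalized before loading, the sketch below maps each entry to a uniform shape. The helper describe_source, the "kind" labels, and the assumption that dict entries always carry a "url" key (as in the module example) are hypothetical; this is not how Haive necessarily resolves sources internally.

```python
from __future__ import annotations

from typing import Any
from urllib.parse import urlparse


def describe_source(source: str | dict[str, Any]) -> dict[str, Any]:
    """Hypothetical helper: normalize one source entry to a uniform dict."""
    if isinstance(source, dict):
        # Dict entries carry extra options (e.g. headers) next to the URL.
        location = source["url"]
        options = {k: v for k, v in source.items() if k != "url"}
    else:
        location, options = source, {}
    # Anything without a recognized URL scheme is treated as a local path.
    scheme = urlparse(location).scheme
    kind = {"http": "web", "https": "web", "s3": "s3"}.get(scheme, "local")
    return {"location": location, "kind": kind, "options": options}


sources = [
    "/path/to/local/docs/",
    "https://wiki.company.com/procedures",
    "s3://bucket/documents/",
    {"url": "https://api.service.com/docs", "headers": {"Authorization": "Bearer token"}},
]
kinds = [describe_source(s)["kind"] for s in sources]
print(kinds)
```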

class haive.agents.document_processing.agent.DocumentProcessingConfig(/, **data)

Bases: pydantic.BaseModel

Configuration for comprehensive document processing.

Parameters:

data (Any)

# Core Processing

auto_loader_config

Configuration for document auto-loading

enable_bulk_processing

Enable concurrent bulk document processing

max_concurrent_loads

Maximum concurrent document loads
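A cap such as max_concurrent_loads is commonly enforced with a semaphore. The following is a minimal sketch of that general technique using only asyncio; load_all, load_one, and the simulated delay are hypothetical, and nothing here is taken from Haive's loader implementation.

```python
import asyncio


async def load_all(sources: list[str], max_concurrent_loads: int) -> tuple[list[str], int]:
    """Load every source, never running more than max_concurrent_loads at once."""
    semaphore = asyncio.Semaphore(max_concurrent_loads)
    in_flight = 0
    peak = 0  # highest number of simultaneous loads observed

    async def load_one(source: str) -> str:
        nonlocal in_flight, peak
        async with semaphore:
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(0.01)  # placeholder for real fetch/parse I/O
            in_flight -= 1
            return f"loaded:{source}"

    # gather returns results in the same order as the inputs.
    documents = await asyncio.gather(*(load_one(s) for s in sources))
    return list(documents), peak


documents, peak = asyncio.run(
    load_all([f"doc{i}" for i in range(10)], max_concurrent_loads=3)
)
print(len(documents), peak)
```

The peak counter exists only to make the limit observable; a real loader would drop it.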

# Search & Retrieval

search_enabled

Enable web search for document discovery

search_depth

Search depth for web queries ("basic" or "advanced")

retrieval_strategy

Strategy for document retrieval

retrieval_config

Configuration for retrieval components

# Query Processing

query_refinement

Enable query refinement for better results

multi_query_enabled

Enable multiple query variations

query_expansion

Enable query expansion techniques

# Document Processing

annotation_enabled

Enable document annotation

summarization_enabled

Enable document summarization

kg_extraction_enabled

Enable knowledge graph extraction

# RAG Configuration

rag_strategy

RAG strategy to use

context_window_size

Context window size for RAG

chunk_size

Chunk size for document splitting

chunk_overlap

Overlap between chunks
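To make the relationship between chunk_size and chunk_overlap concrete, here is a minimal sketch of fixed-size splitting with overlap. It assumes both values are counted in characters, which this page does not state; Haive's splitter may count tokens or respect sentence boundaries instead. The function split_with_overlap is hypothetical.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into windows of chunk_size that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must satisfy 0 <= chunk_overlap < chunk_size")

    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


chunks = split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=2)
print(chunks)
```

Each chunk starts chunk_size - chunk_overlap characters after the previous one, so a larger overlap yields more chunks for the same text.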

# Embedding & Vectorization

embedding_model

Embedding model to use

vector_store_config

Vector store configuration

# Performance

enable_caching

Enable document caching

cache_ttl

Cache time-to-live in seconds

enable_streaming

Enable streaming responses
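The cache_ttl setting is expressed in seconds. As an illustration of what a time-to-live means in practice, the sketch below shows a small in-memory cache whose entries expire after a fixed interval. TTLCache and its injectable clock are hypothetical and say nothing about how Haive stores or evicts cached documents.

```python
import time
from typing import Any, Callable


class TTLCache:
    """Hypothetical in-memory cache whose entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._entries: dict[str, tuple[float, Any]] = {}

    def set(self, key: str, value: Any) -> None:
        # Store the absolute expiry time next to the value.
        self._entries[key] = (self._clock() + self._ttl, value)

    def get(self, key: str, default: Any = None) -> Any:
        entry = self._entries.get(key)
        if entry is None:
            return default
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._entries[key]  # drop the stale entry on access
            return default
        return value


cache = TTLCache(ttl_seconds=3600)
cache.set("https://company.com/reports", "parsed document")
print(cache.get("https://company.com/reports"))
```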

# Output

structured_output

Enable structured output generation

response_format

Format for agent responses

include_sources

Include source information in responses

include_metadata

Include processing metadata

Create a new model by parsing and validating input data from keyword arguments.

Raises pydantic_core.ValidationError if the input data cannot be validated to form a valid model.

self is explicitly positional-only to allow self as a field name.

class haive.agents.document_processing.agent.DocumentProcessingResult(/, **data)

Bases: pydantic.BaseModel

Result from document processing operation.

Parameters:

data (Any)

response

Main response content

sources

List of source documents used

metadata

Processing metadata

documents

Processed documents

query_info

Information about query processing

timing

Timing information

statistics

Processing statistics

Create a new model by parsing and validating input data from keyword arguments.

Raises pydantic_core.ValidationError if the input data cannot be validated to form a valid model.

self is explicitly positional-only to allow self as a field name.

class haive.agents.document_processing.agent.DocumentProcessingState(/, **data)

Bases: haive.core.schema.prebuilt.messages_state.MessagesState

State for document processing operations.

Extends MessagesState with document-specific fields for tracking document processing workflows.

Create a new model by parsing and validating input data from keyword arguments.

Raises pydantic_core.ValidationError if the input data cannot be validated to form a valid model.

self is explicitly positional-only to allow self as a field name.

Parameters:

data (Any)