haive.core.engine.document.transformers.base
Document Transformer Engine for Haive Framework.
This module provides an engine for transforming documents using various strategies such as HTML conversion, document reordering, deduplication, and more.
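Classes
DocTransformerEngine | Engine for transforming documents using various strategies.
DocTransformerRegistry | Registry for document transformer engines.
Functions
create_document_transformer | Create a document transformer engine with the specified configuration.
create_embeddings_filter_transformer | Create an embeddings-based filter transformer.
create_html_to_markdown_transformer | Create an HTML to markdown transformer.
create_html_to_text_transformer | Create an HTML to text transformer.
create_long_context_reorder_transformer | Create a long context reordering transformer.
create_translate_transformer | Create a document translation transformer.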
Module Contents
- class haive.core.engine.document.transformers.base.DocTransformerEngine
Bases:
haive.core.engine.base.InvokableEngine[list[langchain_core.documents.Document], list[langchain_core.documents.Document]]
Engine for transforming documents using various strategies.
This engine supports multiple document transformation techniques, including:
- HTML to text conversion
- HTML to markdown conversion
- HTML content extraction and cleaning
- Document reordering for long contexts
- Redundant document filtering
- Document clustering
- Text translation
- Metadata tagging
- create_runnable(runnable_config=None)
Create a document transformer based on the configuration.
- Parameters:
runnable_config (langchain_core.runnables.RunnableConfig | None) – Optional runtime configuration
- Returns:
Document transformer instance
- Return type:
Any
- class haive.core.engine.document.transformers.base.DocTransformerRegistry
Bases:
haive.core.registry.base.AbstractRegistry[DocTransformerEngine]
Registry for document transformer engines.
Initialize the registry with empty dictionaries.
- clear()
Clear the registry.
- Return type:
None
- find_by_id(id)
Find a document transformer by its unique ID.
- Parameters:
id (str)
- Return type:
DocTransformerEngine | None
- get(item_type, name)
Get a document transformer by type and name.
- Parameters:
item_type (Any)
name (str)
- Return type:
DocTransformerEngine | None
- get_all(item_type)
Get all document transformers.
- Parameters:
item_type (Any)
- Return type:
- classmethod get_instance()
Get singleton instance.
- Return type:
DocTransformerRegistry
- register(item)
Register a document transformer engine.
- Parameters:
item (DocTransformerEngine)
- Return type:
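The registry methods above follow a familiar singleton-registry pattern. The following is a minimal, self-contained sketch of that pattern and not the library's implementation: `SketchRegistry`, `SketchEngine`, the attribute names `id`, `name`, and `item_type`, and the return value of `register` are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class SketchEngine:
    """Hypothetical stand-in for an engine; field names are assumptions."""
    id: str
    name: str
    item_type: Any


class SketchRegistry:
    """Hypothetical registry mirroring the method names documented above."""

    _instance: Optional["SketchRegistry"] = None

    def __init__(self) -> None:
        # "Initialize the registry with empty dictionaries."
        self._by_type: Dict[Any, Dict[str, SketchEngine]] = {}
        self._by_id: Dict[str, SketchEngine] = {}

    @classmethod
    def get_instance(cls) -> "SketchRegistry":
        # Create the shared instance on first use, then reuse it.
        if cls._instance is None:
            cls._instance = cls()
        return cls._instance

    def register(self, item: SketchEngine) -> SketchEngine:
        # Index the item both by (type, name) and by its unique ID.
        # Returning the item is an assumption of this sketch.
        self._by_type.setdefault(item.item_type, {})[item.name] = item
        self._by_id[item.id] = item
        return item

    def get(self, item_type: Any, name: str) -> Optional[SketchEngine]:
        return self._by_type.get(item_type, {}).get(name)

    def find_by_id(self, id: str) -> Optional[SketchEngine]:
        return self._by_id.get(id)

    def get_all(self, item_type: Any) -> Dict[str, SketchEngine]:
        # Return a copy so callers cannot mutate the registry's state.
        return dict(self._by_type.get(item_type, {}))

    def clear(self) -> None:
        self._by_type.clear()
        self._by_id.clear()


registry = SketchRegistry.get_instance()
engine = registry.register(
    SketchEngine(id="e1", name="html_to_text", item_type="doc_transformer")
)
found = registry.get("doc_transformer", "html_to_text")
```

Because `get_instance` always returns the same object, anything registered in one part of a program is visible to lookups elsewhere, which is why `clear()` is useful in tests.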
- haive.core.engine.document.transformers.base.create_document_transformer(transformer_type, name=None, **kwargs)
Create a document transformer engine with the specified configuration.
- Parameters:
transformer_type (haive.core.engine.document.transformers.types.DocTransformerType) – Type of document transformer to create
name (str | None) – Name for the engine (generated if not provided)
**kwargs – Additional parameters for specific transformer types
- Returns:
Configured DocTransformerEngine
- Return type:
DocTransformerEngine
- haive.core.engine.document.transformers.base.create_embeddings_filter_transformer(embeddings_model, name='embeddings_filter_transformer', similarity_threshold=0.95, clustering=False)
Create an embeddings-based filter transformer.
This transformer removes redundant documents based on embedding similarity or clusters documents based on their embeddings.
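- Parameters:
embeddings_model – Embeddings model used to compute document embeddings
name (str) – Name for the engine
similarity_threshold (float) – Similarity threshold for treating documents as redundant (default 0.95)
clustering (bool) – Whether to cluster documents rather than filter redundant ones (default False)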
- Returns:
Configured DocTransformerEngine
- Return type:
DocTransformerEngine
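To illustrate what a `similarity_threshold` of 0.95 means for redundancy filtering, here is a minimal sketch that drops any document whose embedding is too close to one already kept. It is a greedy stand-in under stated assumptions, not the engine's algorithm: the helper names `cosine_similarity` and `filter_redundant` are hypothetical, and the vectors are hand-written rather than produced by an embeddings model.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def filter_redundant(
    texts: Sequence[str],
    vectors: Sequence[Sequence[float]],
    similarity_threshold: float = 0.95,
) -> List[str]:
    """Keep a text only if it is below the threshold against every text kept so far."""
    kept: List[int] = []
    for i, vec in enumerate(vectors):
        if all(
            cosine_similarity(vec, vectors[j]) < similarity_threshold for j in kept
        ):
            kept.append(i)
    return [texts[i] for i in kept]


# Hand-written 2-D "embeddings": the second vector nearly matches the first.
texts = ["intro", "intro (near duplicate)", "unrelated"]
vectors = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0]]
unique_texts = filter_redundant(texts, vectors)
```

With these toy vectors the first two documents have a cosine similarity above 0.95, so only the first of the pair survives. Lowering the threshold makes the filter more aggressive.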
- haive.core.engine.document.transformers.base.create_html_to_markdown_transformer(name='html_to_markdown_transformer', heading_style='ATX', autolinks=True)
Create an HTML to markdown transformer.
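- Parameters:
name (str) – Name for the engine
heading_style (str) – Markdown heading style (default 'ATX')
autolinks (bool) – Whether to use the autolink style for links (default True)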
- Returns:
Configured DocTransformerEngine
- Return type:
DocTransformerEngine
- haive.core.engine.document.transformers.base.create_html_to_text_transformer(name='html_to_text_transformer', ignore_links=True, ignore_images=True)
Create an HTML to text transformer.
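- Parameters:
name (str) – Name for the engine
ignore_links (bool) – Whether to ignore links in the output (default True)
ignore_images (bool) – Whether to ignore images in the output (default True)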
- Returns:
Configured DocTransformerEngine
- Return type:
DocTransformerEngine
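As a rough illustration of what `ignore_links` and `ignore_images` might control, the sketch below extracts text with Python's standard `html.parser`. It is a stand-in, not the converter the engine wraps: the `html_to_text` helper, the `(url)` and `[image: alt]` output formats, and the exact semantics of both flags are assumptions.

```python
from html.parser import HTMLParser
from typing import List, Optional


class _TextExtractor(HTMLParser):
    """Collect visible text; optionally keep link targets and image alt text."""

    def __init__(self, ignore_links: bool = True, ignore_images: bool = True) -> None:
        super().__init__()
        self.ignore_links = ignore_links
        self.ignore_images = ignore_images
        self.parts: List[str] = []
        self._href: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if tag == "a" and not self.ignore_links:
            # Remember the target so it can be emitted after the link text.
            self._href = attr_map.get("href")
        elif tag == "img" and not self.ignore_images:
            self.parts.append("[image: {}]".format(attr_map.get("alt") or ""))

    def handle_endtag(self, tag):
        if tag == "a" and self._href:
            self.parts.append(" ({})".format(self._href))
            self._href = None

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text(html: str, ignore_links: bool = True, ignore_images: bool = True) -> str:
    """Hypothetical helper: strip tags and collapse runs of whitespace."""
    parser = _TextExtractor(ignore_links, ignore_images)
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.parts).split())


sample = '<p>See <a href="https://example.com">the docs</a> <img src="x.png" alt="logo"></p>'
plain = html_to_text(sample)
with_links = html_to_text(sample, ignore_links=False)
```

With both flags at their defaults, the anchor text is kept but its URL and the image are dropped. Passing `ignore_links=False` appends the URL after the anchor text.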
- haive.core.engine.document.transformers.base.create_long_context_reorder_transformer(name='long_context_reorder_transformer')
Create a long context reordering transformer.
This transformer helps address the “lost in the middle” problem where performance degrades when models must access relevant information in the middle of long contexts.
- Parameters:
name (str) – Name for the engine
- Returns:
Configured DocTransformerEngine
- Return type:
DocTransformerEngine
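One common way to mitigate the "lost in the middle" effect is to place the most relevant documents at the start and end of the context and push the least relevant toward the middle. The sketch below shows that idea in isolation. The function name `reorder_for_long_context` is hypothetical, the input is assumed to be sorted from most to least relevant, and whether the engine uses exactly this ordering is an assumption.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def reorder_for_long_context(docs_by_relevance: Sequence[T]) -> List[T]:
    """Put the most relevant items at both ends, the least relevant in the middle.

    The input is assumed to be sorted from most to least relevant.
    """
    front: List[T] = []
    back: List[T] = []
    for rank, doc in enumerate(docs_by_relevance):
        # Even ranks (0, 2, 4, ...) fill the front; odd ranks fill the back.
        (front if rank % 2 == 0 else back).append(doc)
    # Reverse the back half so relevance rises again toward the end.
    return front + back[::-1]


ranked = ["d1", "d2", "d3", "d4", "d5"]  # d1 is the most relevant
reordered = reorder_for_long_context(ranked)
# reordered == ["d1", "d3", "d5", "d4", "d2"]
```

The two most relevant documents, `d1` and `d2`, end up at the first and last positions, while the least relevant, `d5`, sits in the middle.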
- haive.core.engine.document.transformers.base.create_translate_transformer(name='translate_transformer', target_language='en')
Create a document translation transformer.
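- Parameters:
name (str) – Name for the engine
target_language (str) – Target language for the translation (default 'en')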
- Returns:
Configured DocTransformerEngine
- Return type:
DocTransformerEngine