haive.core.engine.document.transformers.base

Document Transformer Engine for Haive Framework.

This module provides an engine for transforming documents using strategies such as HTML conversion, document reordering, redundancy filtering, clustering, translation, and metadata tagging.

Classes

DocTransformerEngine

Engine for transforming documents using various strategies.

DocTransformerRegistry

Registry for document transformer engines.

Functions

create_document_transformer(transformer_type[, name])

Create a document transformer engine with the specified configuration.

create_embeddings_filter_transformer(embeddings_model)

Create an embeddings-based filter transformer.

create_html_to_markdown_transformer([name, ...])

Create an HTML to markdown transformer.

create_html_to_text_transformer([name, ignore_links, ...])

Create an HTML to text transformer.

create_long_context_reorder_transformer([name])

Create a long context reordering transformer.

create_translate_transformer([name, target_language])

Create a document translation transformer.

Module Contents

class haive.core.engine.document.transformers.base.DocTransformerEngine

Bases: haive.core.engine.base.InvokableEngine[list[langchain_core.documents.Document], list[langchain_core.documents.Document]]

Engine for transforming documents using various strategies.

This engine supports multiple document transformation techniques, including:
  • HTML to text conversion

  • HTML to markdown conversion

  • HTML content extraction and cleaning

  • Document reordering for long contexts

  • Redundant document filtering

  • Document clustering

  • Text translation

  • Metadata tagging

create_runnable(runnable_config=None)

Create a document transformer based on the configuration.

Parameters:

runnable_config (langchain_core.runnables.RunnableConfig | None) – Optional runtime configuration

Returns:

Document transformer instance

Return type:

Any

get_input_fields()

Define input field requirements.

Return type:

dict[str, tuple[type, Any]]

get_output_fields()

Define output field requirements.

Return type:

dict[str, tuple[type, Any]]

invoke(input_data, runnable_config=None)

Transform documents using the configured transformer.

Parameters:
  • input_data (list[langchain_core.documents.Document] | dict[str, Any]) – List of documents or dictionary with documents key

  • runnable_config (langchain_core.runnables.RunnableConfig | None) – Optional runtime configuration

Returns:

List of transformed documents

Return type:

list[langchain_core.documents.Document]
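Because invoke accepts either a list of documents or a dictionary that carries them, a caller-side view of that normalization may help. The following is a minimal sketch, not the engine's actual code: the helper name normalize_documents and the "documents" key are assumptions made for illustration, and plain strings stand in for Document objects.

```python
from typing import Any


def normalize_documents(input_data: Any, key: str = "documents") -> list:
    """Return the document list from either a bare list or a dict holding one.

    Hypothetical helper: the key name "documents" is an assumption.
    """
    if isinstance(input_data, dict):
        if key not in input_data:
            raise KeyError(f"expected a {key!r} entry in the input dictionary")
        input_data = input_data[key]
    if not isinstance(input_data, list):
        raise TypeError("documents must be provided as a list")
    return input_data


docs = ["doc a", "doc b"]  # plain strings stand in for Document objects
same = normalize_documents(docs) == normalize_documents({"documents": docs})
```

Both call shapes resolve to the same list, so downstream transformation code only needs to handle one input form.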

class haive.core.engine.document.transformers.base.DocTransformerRegistry

Bases: haive.core.registry.base.AbstractRegistry[DocTransformerEngine]

Registry for document transformer engines.

Initialize the registry with empty dictionaries.

clear()

Clear the registry.

Return type:

None

find_by_id(id)

Find a document transformer by its unique ID.

Parameters:

id (str)

Return type:

DocTransformerEngine | None

get(item_type, name)

Get a document transformer by type and name.

Parameters:
  • item_type (Any)

  • name (str)

Return type:

DocTransformerEngine | None

get_all(item_type)

Get all document transformers.

Parameters:

item_type (Any)

Return type:

dict[str, DocTransformerEngine]

classmethod get_instance()

Get singleton instance.

Return type:

DocTransformerRegistry

list(item_type)

List all document transformers.

Parameters:

item_type (Any)

Return type:

list[str]

register(item)

Register a document transformer engine.

Parameters:

item (DocTransformerEngine)

Return type:

DocTransformerEngine
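The registry methods above follow a familiar singleton pattern: one shared instance, with items stored by type and then by name. The sketch below is a self-contained toy that mirrors the documented method names. SketchRegistry and FakeEngine are hypothetical stand-ins, and the internal storage layout is an assumption rather than the library's implementation.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class FakeEngine:
    """Hypothetical stand-in for DocTransformerEngine."""

    name: str
    engine_type: str = "document_transformer"


class SketchRegistry:
    """Toy singleton registry keyed by engine type, then by name."""

    _instance: Optional["SketchRegistry"] = None

    def __init__(self) -> None:
        # Assumed layout: {engine_type: {name: engine}}
        self._items: Dict[str, Dict[str, FakeEngine]] = {}

    @classmethod
    def get_instance(cls) -> "SketchRegistry":
        # Create the shared instance on first use, then reuse it.
        if cls._instance is None:
            cls._instance = cls()
        return cls._instance

    def register(self, item: FakeEngine) -> FakeEngine:
        self._items.setdefault(item.engine_type, {})[item.name] = item
        return item

    def get(self, item_type: str, name: str) -> Optional[FakeEngine]:
        return self._items.get(item_type, {}).get(name)

    def clear(self) -> None:
        self._items.clear()

    def list(self, item_type: str) -> List[str]:
        return [name for name in self._items.get(item_type, {})]


registry = SketchRegistry.get_instance()
engine = registry.register(FakeEngine(name="html_to_text_transformer"))
found = registry.get("document_transformer", "html_to_text_transformer")
```

Returning the item from register lets a caller create and register an engine in a single expression, which matches the documented return type.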

haive.core.engine.document.transformers.base.create_document_transformer(transformer_type, name=None, **kwargs)

Create a document transformer engine with the specified configuration.

Parameters:
  • transformer_type (haive.core.engine.document.transformers.types.DocTransformerType) – Type of document transformer to create

  • name (str | None) – Name for the engine (generated if not provided)

  • **kwargs – Additional parameters for specific transformer types

Returns:

Configured DocTransformerEngine

Return type:

DocTransformerEngine

haive.core.engine.document.transformers.base.create_embeddings_filter_transformer(embeddings_model, name='embeddings_filter_transformer', similarity_threshold=0.95, clustering=False)

Create an embeddings-based filter transformer.

This transformer removes redundant documents based on embedding similarity or, when clustering is enabled, clusters documents based on their embeddings.

Parameters:
  • embeddings_model (haive.core.engine.embeddings.EmbeddingsEngineConfig) – Embeddings model to use

  • name (str) – Name for the engine

  • similarity_threshold (float) – Threshold for embedding similarity filtering

  • clustering (bool) – Whether to use clustering filter instead of redundancy filter

Returns:

Configured DocTransformerEngine

Return type:

DocTransformerEngine
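To illustrate what a similarity threshold such as 0.95 means in practice, here is a minimal standard-library sketch of greedy redundancy filtering over precomputed vectors. It is an illustration of the general idea only: the function names, the keep-first strategy, and the strict comparison against the threshold are assumptions, and the engine itself may behave differently.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def filter_redundant(
    texts: Sequence[str],
    vectors: Sequence[Sequence[float]],
    similarity_threshold: float = 0.95,
) -> List[str]:
    """Keep a text only if it is not too similar to any text already kept."""
    kept: List[int] = []
    for i, vec in enumerate(vectors):
        if all(
            cosine_similarity(vec, vectors[j]) < similarity_threshold for j in kept
        ):
            kept.append(i)
    return [texts[i] for i in kept]


texts = ["intro", "intro (copy)", "pricing"]
vectors = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0]]  # toy embeddings
unique = filter_redundant(texts, vectors)
print(unique)  # ['intro', 'pricing']
```

Raising the threshold toward 1.0 keeps more near-duplicates; lowering it removes documents more aggressively.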

haive.core.engine.document.transformers.base.create_html_to_markdown_transformer(name='html_to_markdown_transformer', heading_style='ATX', autolinks=True)

Create an HTML to markdown transformer.

Parameters:
  • name (str) – Name for the engine

  • heading_style (str) – Heading style for markdown conversion

  • autolinks (bool) – Whether to use automatic link style

Returns:

Configured DocTransformerEngine

Return type:

DocTransformerEngine

haive.core.engine.document.transformers.base.create_html_to_text_transformer(name='html_to_text_transformer', ignore_links=True, ignore_images=True)

Create an HTML to text transformer.

Parameters:
  • name (str) – Name for the engine

  • ignore_links (bool) – Whether to ignore links in HTML

  • ignore_images (bool) – Whether to ignore images in HTML

Returns:

Configured DocTransformerEngine

Return type:

DocTransformerEngine
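A usage sketch built only from the signatures documented on this page may be useful here. It assumes that haive and langchain_core are installed and has not been checked against a specific release; the wrapper function html_to_text and the engine name "docs_html_to_text" are hypothetical. Imports are deferred into the function body so that defining it does not require either package.

```python
def html_to_text(pages):
    """Convert raw HTML strings to text documents via the documented factory."""
    # Deferred imports: both packages are assumed to be installed at call time.
    from langchain_core.documents import Document
    from haive.core.engine.document.transformers.base import (
        create_html_to_text_transformer,
    )

    engine = create_html_to_text_transformer(
        name="docs_html_to_text",  # hypothetical engine name
        ignore_links=False,  # keep link text; the documented default is True
        ignore_images=True,
    )
    documents = [Document(page_content=html) for html in pages]
    # invoke is documented to return a list of transformed documents.
    return engine.invoke(documents)
```

Calling html_to_text(["<p>Hello</p>"]) would then be expected to return a list of Document objects, per the return type documented for invoke.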

haive.core.engine.document.transformers.base.create_long_context_reorder_transformer(name='long_context_reorder_transformer')

Create a long context reordering transformer.

This transformer helps address the “lost in the middle” problem, where model performance degrades when the relevant information sits in the middle of a long context.

Parameters:

name (str) – Name for the engine

Returns:

Configured DocTransformerEngine

Return type:

DocTransformerEngine
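One common way to counter the "lost in the middle" effect is to move the most relevant documents to the start and end of the list and push the least relevant toward the middle. The sketch below shows that reordering on a list already sorted from most to least relevant. It illustrates the general technique under that assumption and is not taken from the engine's source; the function name is hypothetical.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def reorder_for_long_context(documents: Sequence[T]) -> List[T]:
    """Place the most relevant items at both ends, the least relevant mid-list.

    Assumes `documents` is sorted from most to least relevant.
    """
    reordered: List[T] = []
    # Walk from least to most relevant, alternating front and back insertion,
    # so later (more relevant) items end up at the outer edges.
    for i, doc in enumerate(reversed(documents)):
        if i % 2 == 1:
            reordered.append(doc)
        else:
            reordered.insert(0, doc)
    return reordered


ranked = ["d1", "d2", "d3", "d4", "d5"]  # d1 is the most relevant
print(reorder_for_long_context(ranked))  # ['d1', 'd3', 'd5', 'd4', 'd2']
```

In the result, d1 and d2 occupy the two ends while d5, the least relevant, sits in the middle.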

haive.core.engine.document.transformers.base.create_translate_transformer(name='translate_transformer', target_language='en')

Create a document translation transformer.

Parameters:
  • name (str) – Name for the engine

  • target_language (str) – Target language code

Returns:

Configured DocTransformerEngine

Return type:

DocTransformerEngine