haive.agents.document_modifiers.tnt.utils
=========================================
.. py:module:: haive.agents.document_modifiers.tnt.utils
.. autoapi-nested-parse::
Utility functions for taxonomy generation and document processing.
This module provides utility functions for parsing, formatting, and managing
taxonomy-related data structures. It includes functions for handling XML-formatted
outputs, document summaries, and taxonomy clusters.
.. note::
All XML parsing functions assume well-formed XML input with specific expected tags.
Malformed XML may raise parsing errors.
.. rubric:: Example
Basic usage for taxonomy parsing::
xml_output = '''
<id>1</id>
<name>Category A</name>
<description>Description text</description>
'''
taxonomy = parse_taxonomy(xml_output)
Functions
---------
.. autoapisummary::
haive.agents.document_modifiers.tnt.utils.format_docs
haive.agents.document_modifiers.tnt.utils.format_taxonomy
haive.agents.document_modifiers.tnt.utils.format_taxonomy_md
haive.agents.document_modifiers.tnt.utils.get_content
haive.agents.document_modifiers.tnt.utils.parse_labels
haive.agents.document_modifiers.tnt.utils.parse_summary
haive.agents.document_modifiers.tnt.utils.parse_taxonomy
haive.agents.document_modifiers.tnt.utils.reduce_summaries
Module Contents
---------------
.. py:function:: format_docs(docs)
Format documents as XML table for taxonomy generation.
:param docs: List of Document objects, each of which must have:
- id: Document identifier
- summary: Document summary text
:returns: XML-formatted string containing conversation summaries
:rtype: str
.. rubric:: Example
>>> docs = [Document(id="1", summary="text")]
>>> xml = format_docs(docs)
>>> print(xml)
text
.. py:function:: format_taxonomy(clusters)
Convert taxonomy clusters to XML format.
:param clusters: List of cluster dictionaries, each containing:
- id (str): Cluster identifier
- name (str): Cluster name
- description (str): Cluster description
:returns: XML-formatted taxonomy string
:rtype: str
.. rubric:: Example
>>> clusters = [{"id": "1", "name": "Tech", "description": "Technology"}]
>>> xml = format_taxonomy(clusters)
>>> print(xml)
1
Tech
Technology
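A minimal sketch of how cluster dictionaries could be serialized, assuming one element per key. The wrapper tags ``cluster_table`` and ``cluster``, the indentation, and the name ``format_taxonomy_sketch`` are hypothetical and are not taken from this module; only the ``id``, ``name``, and ``description`` keys come from the parameter description above.

.. code-block:: python

    from xml.sax.saxutils import escape


    def format_taxonomy_sketch(clusters):
        """Serialize cluster dicts to XML text (hypothetical layout)."""
        lines = ["<cluster_table>"]  # assumed wrapper tag
        for cluster in clusters:
            lines.append("  <cluster>")  # assumed per-cluster tag
            for key in ("id", "name", "description"):
                # Escape &, <, and > so the values cannot break the markup.
                lines.append(f"    <{key}>{escape(str(cluster[key]))}</{key}>")
            lines.append("  </cluster>")
        lines.append("</cluster_table>")
        return "\n".join(lines)


    xml_text = format_taxonomy_sketch(
        [{"id": "1", "name": "R&D", "description": "Research"}]
    )

Escaping the values matters when cluster names or descriptions contain characters such as ``&``, which would otherwise make the output malformed.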
.. py:function:: format_taxonomy_md(clusters)
Format taxonomy clusters as a Markdown table.
:param clusters: List of cluster dictionaries, each containing:
- id (str): Cluster identifier
- name (str): Cluster name
- description (str): Cluster description
:returns: Markdown-formatted table string
:rtype: str
.. rubric:: Example
>>> clusters = [{"id": "1", "name": "Tech", "description": "Technology"}]
>>> md_table = format_taxonomy_md(clusters)
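The example above does not show the resulting table, so the following sketch illustrates one plausible layout. The column headings, the pipe escaping, and the name ``format_taxonomy_md_sketch`` are assumptions rather than the module's documented behavior.

.. code-block:: python

    def _cell(value):
        """Render a value as a table cell, escaping literal pipe characters."""
        return str(value).replace("|", "\\|")


    def format_taxonomy_md_sketch(clusters):
        """Build a Markdown table with one row per cluster (assumed headings)."""
        lines = ["| ID | Name | Description |", "| --- | --- | --- |"]
        for cluster in clusters:
            row = [_cell(cluster[key]) for key in ("id", "name", "description")]
            lines.append("| " + " | ".join(row) + " |")
        return "\n".join(lines)


    md_table = format_taxonomy_md_sketch(
        [{"id": "1", "name": "Tech", "description": "Technology"}]
    )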
.. py:function:: get_content(state)
Extract document content from taxonomy generation state.
:param state: Current state of the taxonomy generation process.
Must contain a 'documents' key with a list of document dictionaries.
:returns:
List of dictionaries, each containing:
- content (str): The content of a document
:rtype: list
.. rubric:: Example
>>> state = {"documents": [{"content": "doc1"}, {"content": "doc2"}]}
>>> contents = get_content(state)
>>> print(contents)
[{'content': 'doc1'}, {'content': 'doc2'}]
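As a rough illustration of the behavior described above, the extraction can be written as a single list comprehension. The name ``get_content_sketch`` is hypothetical, and dropping every key other than ``content`` is an assumption based on the documented return value.

.. code-block:: python

    def get_content_sketch(state):
        """Return one {'content': ...} dict per document in the state."""
        # Only the 'content' key is carried over (assumed from the return docs).
        return [{"content": doc["content"]} for doc in state["documents"]]


    contents = get_content_sketch(
        {"documents": [{"id": 1, "content": "doc1"}, {"id": 2, "content": "doc2"}]}
    )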
.. py:function:: parse_labels(output_text)
Parse category labels from prediction output.
Extracts category information from XML-formatted prediction text.
Handles multiple categories but returns only the first one.
:param output_text: XML-formatted string containing category predictions.
Expected format::
<category>Label Name</category>
:returns:
Dictionary containing:
- category (str): The first category label found
:rtype: dict
.. note::
If multiple categories are found, a warning is logged and only
the first category is returned.
.. rubric:: Example
>>> text = "<category>Technology</category>"
>>> result = parse_labels(text)
>>> print(result)
{'category': 'Technology'}
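The "first label wins" rule from the note can be sketched with a regular expression. This is an illustration only: the ``<category>`` tag name and the function name ``parse_labels_sketch`` are assumptions, and returning an empty string when no label is present is a choice made for this sketch, since the behavior for that case is not documented here.

.. code-block:: python

    import logging
    import re

    logger = logging.getLogger(__name__)


    def parse_labels_sketch(output_text):
        """Return the first <category> label found in the prediction text."""
        labels = [
            label.strip()
            for label in re.findall(
                r"<category>(.*?)</category>", output_text, re.DOTALL
            )
        ]
        if len(labels) > 1:
            # Mirror the documented behavior: warn, then keep only the first.
            logger.warning("Found %d categories; keeping the first", len(labels))
        # Assumption: an empty string stands in for a missing label.
        return {"category": labels[0] if labels else ""}


    result = parse_labels_sketch(
        "<category>Technology</category><category>Science</category>"
    )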
.. py:function:: parse_summary(xml_string)
Parse summary and explanation from XML-formatted string.
Extracts the content within ``<summary>`` and ``<explanation>`` tags from the input XML string.
If tags are not found, returns empty strings for the missing elements.
:param xml_string: XML-formatted string containing ``<summary>`` and ``<explanation>`` tags.
Example::
<summary>Main points...</summary>
<explanation>Detailed analysis...</explanation>
:returns:
Dictionary containing:
- summary (str): Content within the ``<summary>`` tags
- explanation (str): Content within the ``<explanation>`` tags
:rtype: dict
.. rubric:: Example
>>> xml = "<summary>Key points</summary><explanation>Details</explanation>"
>>> result = parse_summary(xml)
>>> print(result)
{'summary': 'Key points', 'explanation': 'Details'}
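The fallback to empty strings can be illustrated with a small helper that searches for one tag at a time. This is a sketch under the assumption that the tags are ``<summary>`` and ``<explanation>``; the names ``_tag_text`` and ``parse_summary_sketch`` are hypothetical and the module's real implementation may differ.

.. code-block:: python

    import re


    def _tag_text(tag, xml_string):
        """Return the stripped text inside <tag>...</tag>, or '' if absent."""
        match = re.search(rf"<{tag}>(.*?)</{tag}>", xml_string, re.DOTALL)
        return match.group(1).strip() if match else ""


    def parse_summary_sketch(xml_string):
        """Extract summary and explanation, defaulting each to ''."""
        return {
            "summary": _tag_text("summary", xml_string),
            "explanation": _tag_text("explanation", xml_string),
        }


    full = parse_summary_sketch(
        "<summary>Key points</summary><explanation>Details</explanation>"
    )
    partial = parse_summary_sketch("<summary>Key points</summary>")

Searching for each tag independently means a missing ``<explanation>`` does not prevent the summary from being returned.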
.. py:function:: parse_taxonomy(output_text)
Parse taxonomy information from LLM-generated output.
Extracts cluster information including IDs, names, and descriptions from
XML-formatted output text.
:param output_text: XML-formatted string containing taxonomy clusters.
Expected format::
<id>1</id>
<name>Category Name</name>
<description>Category Description</description>
:returns:
Dictionary containing:
- clusters (list): List of dictionaries, each with:
- id (str): Cluster identifier
- name (str): Cluster name
- description (str): Cluster description
:rtype: dict
.. rubric:: Example
>>> text = "<id>1</id><name>Tech</name><description>Technology</description>"
>>> taxonomy = parse_taxonomy(text)
>>> print(taxonomy)
{'clusters': [{'id': '1', 'name': 'Tech', 'description': 'Technology'}]}
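One way to picture the extraction step is a regular expression that captures each ID, name, and description triple in order. The sketch below assumes the fields appear as consecutive ``<id>``, ``<name>``, and ``<description>`` elements; the pattern and the name ``parse_taxonomy_sketch`` are hypothetical stand-ins, not the module's code.

.. code-block:: python

    import re

    # Assumed field layout: <id>, <name>, <description> in that order.
    _CLUSTER_RE = re.compile(
        r"<id>(.*?)</id>\s*<name>(.*?)</name>\s*<description>(.*?)</description>",
        re.DOTALL,
    )


    def parse_taxonomy_sketch(output_text):
        """Collect id/name/description triples into a {'clusters': [...]} dict."""
        clusters = [
            {
                "id": cluster_id.strip(),
                "name": name.strip(),
                "description": description.strip(),
            }
            for cluster_id, name, description in _CLUSTER_RE.findall(output_text)
        ]
        return {"clusters": clusters}


    taxonomy = parse_taxonomy_sketch(
        "<id>1</id><name>Tech</name><description>Technology</description>"
    )

Text that does not match the pattern is skipped, so an output with no recognizable clusters yields an empty list under this sketch.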
.. py:function:: reduce_summaries(combined)
Merge summarized content with original documents.
Takes a dictionary containing both original documents and their summaries,
and combines them into a single state object.
:param combined: Dictionary containing:
- documents (list): Original document list
- summaries (list): Corresponding summaries
:returns:
Updated state containing:
- documents (list): List of documents with added summaries
:rtype: TaxonomyGenerationState
.. rubric:: Example
>>> combined = {
... "documents": [{"id": 1, "content": "text"}],
... "summaries": [{"summary": "sum", "explanation": "exp"}]
... }
>>> state = reduce_summaries(combined)
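The merge can be sketched by pairing each document with the summary at the same position. Positional alignment of the two lists, the plain ``dict`` return value in place of ``TaxonomyGenerationState``, and the name ``reduce_summaries_sketch`` are all assumptions made for illustration.

.. code-block:: python

    def reduce_summaries_sketch(combined):
        """Attach each summary dict to the document at the same index."""
        documents = [
            # Keys from the summary (summary, explanation) are added to the doc.
            {**doc, **summary}
            for doc, summary in zip(combined["documents"], combined["summaries"])
        ]
        return {"documents": documents}


    state = reduce_summaries_sketch(
        {
            "documents": [{"id": 1, "content": "text"}],
            "summaries": [{"summary": "sum", "explanation": "exp"}],
        }
    )

Note that ``zip`` stops at the shorter list, so under this sketch a document without a matching summary would be dropped rather than kept unmodified.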