haive.agents.document_modifiers.tnt.utils
=========================================

.. py:module:: haive.agents.document_modifiers.tnt.utils

.. autoapi-nested-parse::

   Utility functions for taxonomy generation and document processing.

   This module provides utility functions for parsing, formatting, and managing
   taxonomy-related data structures. It includes functions for handling
   XML-formatted outputs, document summaries, and taxonomy clusters.

   .. note::

      All XML parsing functions assume well-formed XML input with specific
      expected tags. Malformed XML may raise parsing errors.

   .. rubric:: Example

   Basic usage for taxonomy parsing::

       xml_output = '''
       <id>1</id>
       <name>Category A</name>
       <description>Description text</description>
       '''
       taxonomy = parse_taxonomy(xml_output)


Functions
---------

.. autoapisummary::

   haive.agents.document_modifiers.tnt.utils.format_docs
   haive.agents.document_modifiers.tnt.utils.format_taxonomy
   haive.agents.document_modifiers.tnt.utils.format_taxonomy_md
   haive.agents.document_modifiers.tnt.utils.get_content
   haive.agents.document_modifiers.tnt.utils.parse_labels
   haive.agents.document_modifiers.tnt.utils.parse_summary
   haive.agents.document_modifiers.tnt.utils.parse_taxonomy
   haive.agents.document_modifiers.tnt.utils.reduce_summaries


Module Contents
---------------

.. py:function:: format_docs(docs)

   Format documents as an XML table for taxonomy generation.

   :param docs: List of Document objects, each of which must have:

                - id: Document identifier
                - summary: Document summary text

   :returns: XML-formatted string containing conversation summaries
   :rtype: str

   .. rubric:: Example

   >>> docs = [Document(id="1", summary="text")]
   >>> xml = format_docs(docs)
   >>> print(xml)
   text


.. py:function:: format_taxonomy(clusters)

   Convert taxonomy clusters to XML format.

   :param clusters: List of cluster dictionaries, each containing:

                    - id (str): Cluster identifier
                    - name (str): Cluster name
                    - description (str): Cluster description

   :returns: XML-formatted taxonomy string
   :rtype: str

   .. rubric:: Example

   >>> clusters = [{"id": "1", "name": "Tech", "description": "Technology"}]
   >>> xml = format_taxonomy(clusters)
   >>> print(xml)
   1 Tech Technology


.. py:function:: format_taxonomy_md(clusters)

   Format taxonomy clusters as a Markdown table.

   :param clusters: List of cluster dictionaries, each containing:

                    - id (str): Cluster identifier
                    - name (str): Cluster name
                    - description (str): Cluster description

   :returns: Markdown-formatted table string
   :rtype: str

   .. rubric:: Example

   >>> clusters = [{"id": "1", "name": "Tech", "description": "Technology"}]
   >>> md_table = format_taxonomy_md(clusters)


.. py:function:: get_content(state)

   Extract document content from the taxonomy generation state.

   :param state: Current state of the taxonomy generation process. Must contain
                 a 'documents' key with a list of document dictionaries.

   :returns: List of dictionaries, each containing:

             - content (str): The content of a document
   :rtype: list

   .. rubric:: Example

   >>> state = {"documents": [{"content": "doc1"}, {"content": "doc2"}]}
   >>> contents = get_content(state)
   >>> print(contents)
   [{'content': 'doc1'}, {'content': 'doc2'}]


.. py:function:: parse_labels(output_text)

   Parse category labels from prediction output.

   Extracts category information from XML-formatted prediction text. The input
   may contain multiple categories, but only the first one is returned.

   :param output_text: XML-formatted string containing category predictions.
                       Expected format::

                           <category>Label Name</category>

   :returns: Dictionary containing:

             - category (str): The first category label found
   :rtype: dict

   .. note::

      If multiple categories are found, a warning is logged and only the first
      category is returned.

   .. rubric:: Example

   >>> text = "<category>Technology</category>"
   >>> result = parse_labels(text)
   >>> print(result)
   {'category': 'Technology'}


.. py:function:: parse_summary(xml_string)

   Parse summary and explanation from an XML-formatted string.

   Extracts the content within ``<summary>`` and ``<explanation>`` tags from the
   input XML string. If a tag is not found, an empty string is returned for the
   missing element.

   :param xml_string: XML-formatted string containing ``<summary>`` and
                      ``<explanation>`` tags. Example::

                          <summary>Main points...</summary>
                          <explanation>Detailed analysis...</explanation>

   :returns: Dictionary containing:

             - summary (str): Content within ``<summary>`` tags
             - explanation (str): Content within ``<explanation>`` tags
   :rtype: dict

   .. rubric:: Example

   >>> xml = "<summary>Key points</summary><explanation>Details</explanation>"
   >>> result = parse_summary(xml)
   >>> print(result)
   {'summary': 'Key points', 'explanation': 'Details'}


.. py:function:: parse_taxonomy(output_text)

   Parse taxonomy information from LLM-generated output.

   Extracts cluster information, including IDs, names, and descriptions, from
   XML-formatted output text.

   :param output_text: XML-formatted string containing taxonomy clusters.
                       Expected format::

                           <id>1</id>
                           <name>Category Name</name>
                           <description>Category Description</description>

   :returns: Dictionary containing:

             - clusters (list): List of dictionaries, each with:

               - id (str): Cluster identifier
               - name (str): Cluster name
               - description (str): Cluster description
   :rtype: dict

   .. rubric:: Example

   >>> text = "<id>1</id><name>Tech</name><description>Technology</description>"
   >>> taxonomy = parse_taxonomy(text)
   >>> print(taxonomy)
   {'clusters': [{'id': '1', 'name': 'Tech', 'description': 'Technology'}]}


.. py:function:: reduce_summaries(combined)

   Merge summarized content with the original documents.

   Takes a dictionary containing both the original documents and their
   summaries, and combines them into a single state object.

   :param combined: Dictionary containing:

                    - documents (list): Original document list
                    - summaries (list): Corresponding summaries

   :returns: Updated state containing:

             - documents (list): List of documents with added summaries
   :rtype: TaxonomyGenerationState

   .. rubric:: Example

   >>> combined = {
   ...     "documents": [{"id": 1, "content": "text"}],
   ...     "summaries": [{"summary": "sum", "explanation": "exp"}]
   ... }
   >>> state = reduce_summaries(combined)
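
.. rubric:: Example

The parsing and merging behaviour documented for this module can be sketched with regular expressions and plain dictionaries. This is a minimal sketch under stated assumptions, not the module's implementation: every ``sketch_*`` name is hypothetical, the tag names (``summary``, ``explanation``, ``id``, ``name``, ``description``) are inferred from the documented dictionary keys, the Markdown column headers are illustrative, and pairing documents with summaries by position is an assumption.

.. code-block:: python

   import re


   def sketch_parse_summary(xml_string: str) -> dict:
       """Return the text inside <summary> and <explanation>, or '' if a tag is missing."""

       def first(tag: str) -> str:
           # Non-greedy match so only the first occurrence of the tag is captured.
           match = re.search(rf"<{tag}>(.*?)</{tag}>", xml_string, re.DOTALL)
           return match.group(1).strip() if match else ""

       return {"summary": first("summary"), "explanation": first("explanation")}


   def sketch_parse_taxonomy(output_text: str) -> dict:
       """Collect every id/name/description triple (assumed tag names) into clusters."""
       pattern = (
           r"<id>(.*?)</id>\s*<name>(.*?)</name>\s*"
           r"<description>(.*?)</description>"
       )
       clusters = [
           {"id": i.strip(), "name": n.strip(), "description": d.strip()}
           for i, n, d in re.findall(pattern, output_text, re.DOTALL)
       ]
       return {"clusters": clusters}


   def sketch_format_taxonomy_md(clusters: list) -> str:
       """Render clusters as a Markdown table (column headers are an assumption)."""
       rows = ["| ID | Name | Description |", "| --- | --- | --- |"]
       rows += [f"| {c['id']} | {c['name']} | {c['description']} |" for c in clusters]
       return "\n".join(rows)


   def sketch_reduce_summaries(combined: dict) -> dict:
       """Attach each summary to the document at the same position (assumed pairing)."""
       merged = [
           {**doc, **summ}
           for doc, summ in zip(combined["documents"], combined["summaries"])
       ]
       return {"documents": merged}


   # Usage: parse a taxonomy, render it, and merge one summary into one document.
   taxonomy = sketch_parse_taxonomy(
       "<id>1</id><name>Tech</name><description>Technology</description>"
   )
   table = sketch_format_taxonomy_md(taxonomy["clusters"])
   state = sketch_reduce_summaries(
       {
           "documents": [{"id": 1, "content": "text"}],
           "summaries": [{"summary": "sum", "explanation": "exp"}],
       }
   )

A regular-expression approach tolerates surrounding free text in model output, which a strict XML parser would reject; the trade-off is that nested or malformed tags are silently skipped rather than reported.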