
Multi-Agent | Agent to Agent Approach

The goal of our project was to build a proof of concept showing that multiple agents can be chained together within the Spritz orchestration system to carry out a sequence of tasks. The aim was to highlight how coordinated agent-based systems can automate and streamline intricate processes.

In collaboration with Good Morning Italia, a renowned Italian journalistic outlet, our focus was on leveraging the capabilities of Spritz to integrate various agents and perform targeted tasks. Good Morning Italia, known for its succinct and meticulously verified informational briefings, served as an ideal partner for testing the efficacy of our platform in handling real-world journalistic scenarios. Through this collaboration, we aimed to demonstrate the potential of Spritz in facilitating dynamic interactions among agents, thereby contributing to the evolution of AI-driven solutions in information processing and dissemination.

What have we done?

The successful integration of the Agent to Agent (A2A) concept into the Spritz platform relied on a meticulously designed methodology. The Spritz platform, functioning as a facilitator for seamless communication among agents, employs a structured approach to handle task queues and orchestrate the distribution of jobs. The core components of an agent within Spritz include:

Discovery: AI agents employ the /discover endpoint to communicate their capabilities and requirements dynamically. This information is crucial for task orchestration, providing essential insights into each agent's abilities.
Task Execution: The /execute endpoint is utilized to initiate tasks based on the information gathered during the discovery stage. This endpoint supports intricate interactions, allowing for custom prompts for AI-based agents. Webhooks play a pivotal role in providing real-time feedback on task status, ensuring continuous updates for Spritz and other agents.
Task Abortion: The /abort endpoint allows for the termination of tasks when necessary, offering flexibility in managing resources efficiently and adapting to changing requirements in real-time.
Status Checking: The /status endpoint facilitates the request for the current status of tasks, offering a valuable tool for scenarios where webhook updates may not have been received or for double-checking task progress and outcomes.
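The four operations above can be pictured with a small in-memory sketch. This is not the Spritz implementation: the class name `SketchAgent`, the payload fields, and the status values `running`, `aborted`, and `unknown` are assumptions made purely for illustration, and no HTTP layer or webhook delivery is modeled.

```python
from dataclasses import dataclass, field
from typing import Dict, List
import uuid


@dataclass
class SketchAgent:
    """Hypothetical in-memory stand-in for an agent's four operations."""

    name: str
    capabilities: List[str]
    tasks: Dict[str, str] = field(default_factory=dict)  # task id -> status

    def discover(self) -> dict:
        # Mirrors /discover: advertise what this agent can do.
        return {"agent": self.name, "capabilities": list(self.capabilities)}

    def execute(self, task: str) -> dict:
        # Mirrors /execute: accept only tasks that were advertised.
        if task not in self.capabilities:
            raise ValueError(f"{self.name} cannot run task {task!r}")
        task_id = uuid.uuid4().hex
        self.tasks[task_id] = "running"  # assumed status value
        return {"id": task_id, "status": "running"}

    def abort(self, task_id: str) -> dict:
        # Mirrors /abort: only a running task can be aborted.
        if self.tasks.get(task_id) == "running":
            self.tasks[task_id] = "aborted"  # assumed status value
        return self.status(task_id)

    def status(self, task_id: str) -> dict:
        # Mirrors /status: a polling fallback when no webhook update arrived.
        return {"id": task_id, "status": self.tasks.get(task_id, "unknown")}
```

An orchestrator would typically call `discover()` first, pick an advertised task, start it with `execute()`, and fall back to `status()` if no webhook update arrives.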

The task initiation through the /execute endpoint allows for specifying subsequent tasks upon completion, introducing a queue-like behavior where tasks can be chained or scheduled based on agent capabilities and workload. The utilization of webhooks for real-time updates enhances this orchestration, allowing for dynamic adjustments and feedback loops.

The integration of the A2A concept into Spritz involved the creation of a group of agents, each specializing in a specific task:

Parser Agent: Parses text, including HTML tags, in JSON-formatted data and calls the relevant agents based on predefined scenarios.
Translator Agent: Receives the parsed text in HTML format together with the target language, translates the text, and then calls either the Page-Designer Agent or the Speaker Agent, depending on the translation requirements.
Page-Designer Agent: Generates PDF files in a specific format as per the established standards.
Speaker Agent: Generates speech in MP3 format using AWS Polly, using two different voices (male and female) for each designated tag and paragraph in the requested language. This agent also generates SSML markup for Polly to improve the quality of the synthesized speech.
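As a rough illustration of the Speaker Agent's last step, the sketch below builds the arguments for one Polly synthesis request per paragraph. The helper name `build_polly_requests`, the choice of the voices Joanna and Matthew, and the simple alternation between voices are assumptions; the actual agent's voice assignment per tag and paragraph is not described in detail. The code only prepares the requests and does not contact AWS.

```python
from typing import Dict, List, Sequence


def build_polly_requests(
    ssml_fragments: Sequence[str],
    voices: Sequence[str] = ("Joanna", "Matthew"),  # assumed voice pair
) -> List[Dict[str, str]]:
    """Build one set of synthesize_speech arguments per SSML fragment."""
    requests = []
    for index, fragment in enumerate(ssml_fragments):
        requests.append(
            {
                # Polly expects SSML input to be wrapped in a <speak> root.
                "Text": f"<speak>{fragment}</speak>",
                "TextType": "ssml",
                "OutputFormat": "mp3",
                # Assumption: voices simply alternate between paragraphs.
                "VoiceId": voices[index % len(voices)],
            }
        )
    return requests
```

With valid AWS credentials, each dictionary could be passed to boto3 as `boto3.client("polly").synthesize_speech(**request)`, and the returned audio stream written to an MP3 file.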


The nextTask variable specifies which task to run next and which agent to call for it:

{
  "id": "TASK5678",
  "status": "completed",
  "data": {
    "info": "Task transferred to Translation Agent",
    "output": {
      "name": "next_task",
      "type": "nextTask",
      "data": [
        {
          "nextTask": {
            "agentIdentifier": "Translation Agent",
            "taskDetails": {
              "task": "Translate Html",
              "inputs": [
                {
                  "name": "language",
                  "type": "shortText",
                  "data": "English"
                },
                {
                  "name": "html_input",
                  "type": "longText",
                  "data": "HTML Input"
                }
              ]
            }
          }
        }
      ]
    }
  }
}
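To make the chaining concrete, here is a minimal sketch of how an orchestrator might read a response of this shape and hand the work to the next agent. The function names `extract_next_tasks` and `run_chain`, and the idea of a dictionary mapping agent identifiers to callables, are hypothetical; only the JSON field names are taken from the example above.

```python
from typing import Callable, Dict, List, Tuple


def extract_next_tasks(response: dict) -> List[Tuple[str, dict]]:
    """Return (agentIdentifier, taskDetails) pairs from a task response."""
    output = response.get("data", {}).get("output", {})
    if output.get("type") != "nextTask":
        return []  # nothing is chained after this task
    pairs = []
    for item in output.get("data", []):
        next_task = item["nextTask"]
        pairs.append((next_task["agentIdentifier"], next_task["taskDetails"]))
    return pairs


def run_chain(
    response: dict, agents: Dict[str, Callable[[dict], dict]]
) -> List[str]:
    """Follow nextTask links until none remain; return the agents called."""
    visited = []
    pending = extract_next_tasks(response)
    while pending:
        agent_name, details = pending.pop(0)  # first in, first out
        visited.append(agent_name)
        # Each handler is assumed to return a response of the same shape.
        pending.extend(extract_next_tasks(agents[agent_name](details)))
    return visited
```

Processing the pending list in first-in, first-out order gives the queue-like behavior described earlier: a task that names a successor simply appends it to the work still to be done.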

The integration predominantly uses GPT-4 (gpt-4-0125-preview, gpt-4-turbo-preview, and gpt-4-1106-preview) and GPT-3.5 (gpt-3.5-turbo-0125) for agent collaboration. Each model is driven by a system prompt that is carefully tailored to guide it and elicit the desired output. The system prompt used for text-to-speech conversion is shown below:

"""You are a text to speech conversion assistant.
	  You can use the following SSML tags:
	  - Adding a pause    <break>
	  - Specifying another language for specific words    <lang>
	  - Placing a custom tag in your text    <mark>
	  - Adding a pause between paragraphs    <p>
	  - Using phonetic pronunciation    <phoneme>
	  - Adding a pause between sentences    <s>
	  - Controlling how special types of words are spoken    <say-as>
	  - do not add   <speak>
	  - Pronouncing acronyms and abbreviations    <sub>
	  - Improving pronunciation by specifying parts of speech    <w>
  
		Do not use any other tag, if any text is enclosed in ###, 
		keep both the hashtags and the text as it is, there might 
		be multiple ### in the text, so keep the hashtags and text as it is.
	  - Add breaks tags to imitate the pauses of a human speaker
	  - Add breaks/pauses in mid-sentence if it is too long
	  - if words in double quotes are present, use appropriate ssml tags
	  - Do not add backslashes \ in the text
	  - text in ###text### should not be enclosed in tags, the tags 
			should be placed after the ###text###
	  - the text enclosed in lang tags should not have any special symbol 
 """

System prompts such as the one above serve as the guiding instructions for the AI models during the training and execution phases. Feeding the language model smaller chunks of text, such as individual news paragraphs, proved more effective in achieving the desired outputs, particularly for tasks like translation or formatting changes. While GPT-3.5 (turbo) occasionally omitted certain details specified in the prompt, GPT-4 (0125-preview) occasionally overcomplicated the task and returned outputs in a slightly different format. Consequently, we opted to stick with GPT-4 (turbo-preview), as it consistently produced the desired output.
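The paragraph-by-paragraph approach can be sketched as follows. The splitting rule (blank lines separate paragraphs) and the `call_model` parameter are assumptions: `call_model` stands in for whatever function sends one chunk plus the system prompt to the chosen model and returns its reply.

```python
from typing import Callable, List


def split_paragraphs(text: str) -> List[str]:
    """Split on blank lines and drop empty chunks."""
    chunks = [chunk.strip() for chunk in text.split("\n\n")]
    return [chunk for chunk in chunks if chunk]


def process_in_chunks(text: str, call_model: Callable[[str], str]) -> str:
    """Send each paragraph to the model separately, then rejoin the results."""
    return "\n\n".join(call_model(p) for p in split_paragraphs(text))
```

Keeping each request to a single paragraph gives the model less text to track per call, which matches the observation that smaller inputs led to more reliable formatting and translation.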

Conclusion

In conclusion, the successful implementation of the Agent to Agent concept on the Spritz platform marks a significant achievement in enabling seamless communication among agents. Through the initiation of tasks with specific instructions for one agent, Spritz efficiently orchestrates a collaborative effort among different agents, adapting to various scenarios such as the generation of PDF files and speeches in multiple languages. Notably, the system's ability to interpret and execute tasks based on instructions demonstrates its versatility and potential for widespread applications.

Future Work

Looking ahead, the evolution of this platform holds promising prospects, particularly in the realm of autonomous decision-making by agents. By equipping decision-maker agents with a comprehensive understanding of the Agent to Agent structure, we aim to empower them to independently trigger and coordinate various agents to fulfill complex tasks. For instance, a user could interact with the system by simply typing a command in a text box, such as "generate me a GMI briefing speech in ‘this’ language". The decision-maker agent would interpret this input, initiate the relevant agents, and orchestrate the generation of the requested output. This approach simplifies the interaction and makes it accessible to a broader audience. The same model can be extended to voice-controlled assistants: users state their requests verbally, the system converts speech to text, and the tasks are executed in the same way. The envisioned future work aims at a more intuitive and efficient way of interacting with the platform, one that adapts to diverse user preferences and requirements.