Published on April 19, 2023
Microsoft Research has released Visual ChatGPT, a chatbot system that can generate and manipulate images in response to human textual prompts. To support multi-modal interactions, the system combines OpenAI’s ChatGPT with 22 different visual foundation models (VFMs).
The system is described in a paper published on arXiv. Users can interact with the bot by typing text or uploading images. In addition to textual responses, the bot can generate images, either from scratch or by manipulating earlier images in the chat history. A key component of the bot is the Prompt Manager, which converts raw text from the user into a “chain of thought” prompt that helps ChatGPT determine whether a VFM is required.
While ChatGPT and other large language models (LLMs) have demonstrated impressive natural language processing capabilities, they are trained to handle only one mode of input: text. Instead of training a new model to handle multi-modal inputs, the Microsoft team developed a Prompt Manager that generates text inputs for ChatGPT; the resulting output can then be used to invoke VFMs such as CLIP or Stable Diffusion for computer vision tasks.
The Prompt Manager is based on a LangChain Agent, and the VFMs are defined as LangChain Agent Tools. To determine whether a tool is needed, the agent combines the user’s prompt and the conversation history, which includes image filenames, with a prompt prefix and suffix.
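As a rough illustration of how such a prompt might be assembled, the sketch below joins a prefix, the conversation history (including image filenames), and the new user input. It is plain Python rather than LangChain code, and every name in it (PREFIX, SUFFIX, Tool, build_prompt, the tool names, and the filename) is a hypothetical stand-in, not taken from the Visual ChatGPT source.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical prompt text; the project's real prefix and suffix differ.
PREFIX = (
    "You can use the following tools:\n"
    "{tool_descriptions}\n"
    'Before answering, ask yourself: "Do I need a tool?"\n'
)
SUFFIX = "Previous conversation:\n{history}\nNew input: {user_input}\n"


@dataclass
class Tool:
    """A VFM exposed to the agent by name and description (assumed shape)."""
    name: str
    description: str


def build_prompt(tools: List[Tool],
                 history: List[Tuple[str, str]],
                 user_input: str) -> str:
    """Combine prefix, chat history, and the new input into one text prompt."""
    tool_descriptions = "\n".join(f"- {t.name}: {t.description}" for t in tools)
    rendered_history = "\n".join(f"{speaker}: {text}" for speaker, text in history)
    return (
        PREFIX.format(tool_descriptions=tool_descriptions)
        + SUFFIX.format(history=rendered_history, user_input=user_input)
    )


# Example: the history carries an image filename the model can refer to.
tools = [
    Tool("generate_image", "create an image from a text description"),
    Tool("edit_image", "modify an existing image given its filename"),
]
history = [("Human", "uploaded image/abc.png"), ("AI", "Received.")]
prompt = build_prompt(tools, history, "Make the sky in that image purple")
```

Because images are referenced by filename inside plain text, the language model never has to process pixels itself; it only has to name a tool and the file that tool should work on.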
Additional text in the prefix guides ChatGPT to ask itself “Do I need a tool?” to perform the user’s desired task and, if so, to output the name of the tool along with its required inputs, such as an image filename or a text description of the image to generate. The agent iteratively invokes VFM tools, sending each resulting image to the chat, until it no longer requires a tool. At that point, the last generated text output is sent to the chat.
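The loop described above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the project's implementation: the "Action:" / "Action Input:" / "AI:" reply format, the run_agent and scripted_llm helpers, and the tool name are all hypothetical, and a scripted stand-in replaces the real ChatGPT call.

```python
import re
from typing import Callable, Dict, List, Tuple

# Assumed reply format: a tool request is two lines, "Action: <tool>" and
# "Action Input: <argument>". A reply without them is the final answer.
ACTION_RE = re.compile(r"Action:\s*(?P<tool>.+?)\s*\nAction Input:\s*(?P<arg>.+)")


def run_agent(llm: Callable[[str], str],
              tools: Dict[str, Callable[[str], str]],
              prompt: str,
              max_steps: int = 5) -> Tuple[str, List[str]]:
    """Call tools until the model stops asking for one.

    Returns the final text plus the image filenames produced on the way.
    """
    scratchpad = ""          # tool requests and results seen so far
    images: List[str] = []   # each tool result is "sent to the chat"
    for _ in range(max_steps):
        reply = llm(prompt + scratchpad)
        match = ACTION_RE.search(reply)
        if match is None:
            # No tool requested: the last generated text is the answer.
            return reply.split("AI:", 1)[-1].strip(), images
        tool_name = match.group("tool")
        tool_input = match.group("arg").strip()
        if tool_name in tools:
            observation = tools[tool_name](tool_input)
            images.append(observation)
        else:
            observation = f"unknown tool {tool_name}"
        scratchpad += f"\n{reply}\nObservation: {observation}"
    return "Stopped: step limit reached.", images


def scripted_llm(replies: List[str]) -> Callable[[str], str]:
    """Stand-in for the language model: returns canned replies in order."""
    remaining = iter(replies)
    return lambda _prompt: next(remaining)


# Example run with one hypothetical tool that "produces" an image file.
tools = {"generate_image": lambda description: "image/0001.png"}
llm = scripted_llm([
    "Thought: Do I need a tool? Yes\nAction: generate_image\nAction Input: a red apple",
    "Thought: Do I need a tool? No\nAI: Here is your image.",
])
final_text, images = run_agent(llm, tools, "Draw a red apple")
```

The step limit is a design choice added for the sketch: it keeps a model that always requests another tool from looping indefinitely.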
One user noted in a Hacker News thread about the work that the VFMs consume much less memory than language models.
The source code for Visual ChatGPT is available on GitHub.