
Microsoft Open-Sources Visual ChatGPT Multimodal Chatbot

Published on April 19, 2023

Microsoft Research has released Visual ChatGPT, a chatbot system that can generate and manipulate images in response to human textual prompts. To support multi-modal interactions, the system combines OpenAI’s ChatGPT with 22 different visual foundation models (VFMs).

An arXiv paper describes the system. Users can interact with the bot by typing text or uploading images. In addition to generating textual responses, the bot can generate images, either from scratch or by manipulating previous images in the chat history. A key component of the bot is the Prompt Manager, which converts raw text from the user into a “chain of thought” prompt that ChatGPT uses to determine whether a VFM is required.

While ChatGPT and other large language models (LLMs) have demonstrated impressive natural language processing capabilities, they are trained to handle only one mode of input: text. Instead of training a new model to handle multimodal inputs, the Microsoft team developed a Prompt Manager that generates text inputs for ChatGPT. ChatGPT’s text output can then be used to invoke VFMs such as CLIP or Stable Diffusion for computer vision tasks.

The Prompt Manager is based on a LangChain Agent, and the VFMs are defined as LangChain Agent Tools. To determine whether a tool is needed, the agent takes the user’s prompt and the conversation history, which includes image filenames, and wraps them in a prompt prefix and suffix.
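The prompt assembly described above can be sketched in a few lines of Python. This is a minimal illustration and not the project's actual code: `ToolSpec`, `build_prompt`, and the `PREFIX` and `SUFFIX` wording are all assumptions made for the example, and it deliberately avoids the LangChain API.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ToolSpec:
    """Hypothetical description of one VFM exposed to the agent as a tool."""
    name: str
    description: str


# Illustrative wording only; the real prefix and suffix are not quoted in this article.
PREFIX = "You can call these tools to work with images:\n{tools}\nAlways ask yourself: Do I need a tool?\n"
SUFFIX = "Begin.\n"


def build_prompt(tools: List[ToolSpec], history: List[str], user_input: str) -> str:
    """Wrap conversation history and the new user input in a prefix and a suffix."""
    tool_lines = "\n".join(f"- {t.name}: {t.description}" for t in tools)
    parts = [
        PREFIX.format(tools=tool_lines),
        "Conversation so far:\n" + "\n".join(history) + "\n",
        f"Human: {user_input}\n",
        SUFFIX,
    ]
    return "".join(parts)
```

Because image filenames appear in the history as plain text, the language model can refer to earlier images by name without ever processing pixel data itself.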

Additional text in the prefix guides ChatGPT to ask itself “Do I need a tool?” to perform the user’s desired task and, if so, to output the name of the tool along with the required inputs, such as an image filename or a text description of the image to generate. The agent iteratively invokes VFM tools, sending each resulting image to the chat, until it no longer requires a tool. It then sends the last generated text output to the chat.
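The iterative loop can be illustrated with a small, self-contained Python sketch. Everything here is an assumption made for illustration: the `Action:` / `Action Input:` / `AI:` reply format, the `run_agent` and `scripted_llm` helpers, and the `max_steps` limit are hypothetical, and the language model is replaced by a scripted stand-in so the example runs without any external service.

```python
import re
from typing import Callable, Dict, Iterable, List, Tuple

# Assumed reply format: the model names a tool and its input on two consecutive lines.
ACTION_RE = re.compile(r"Action:\s*(.+)\nAction Input:\s*(.+)")


def run_agent(
    llm: Callable[[str], str],
    tools: Dict[str, Callable[[str], str]],
    prompt: str,
    max_steps: int = 5,
) -> Tuple[str, List[str]]:
    """Call tools until the model stops asking for one, then return its final text."""
    images: List[str] = []
    scratchpad = prompt
    for _ in range(max_steps):
        reply = llm(scratchpad)
        match = ACTION_RE.search(reply)
        if match is None:
            # No tool requested: the text after "AI:" (if present) is the final answer.
            return reply.split("AI:", 1)[-1].strip(), images
        name, tool_input = match.group(1).strip(), match.group(2).strip()
        if name not in tools:
            raise ValueError(f"model requested unknown tool: {name}")
        result = tools[name](tool_input)  # e.g. the filename of a generated image
        images.append(result)
        # Feed the tool result back so the next model call can build on it.
        scratchpad += f"{reply}\nObservation: {result}\n"
    raise RuntimeError("agent did not finish within max_steps")


def scripted_llm(replies: Iterable[str]) -> Callable[[str], str]:
    """Stand-in for a real language model: returns canned replies in order."""
    it = iter(replies)
    return lambda _prompt: next(it)
```

A scripted run would pass one reply that requests a tool and a second that does not. `run_agent` then returns the final text together with the list of image filenames produced along the way, mirroring how images are sent to the chat before the closing text message.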

One user noted in a Hacker News thread about the work that the VFMs consume much less memory than language models.

On GitHub, you can find the source code for Visual ChatGPT.
