Multimodal Foundation Models: From Specialists to General-Purpose Assistants

AUTHOR Yang, Zhengyuan; Gan, Zhe; Li, Chunyuan
PUBLISHER Now Publishers (05/06/2024)
PRODUCT TYPE Paperback (Paperback)

Description
This monograph presents a comprehensive survey of the taxonomy and evolution of multimodal foundation models that demonstrate vision and vision-language capabilities, focusing on the transition from specialist models to general-purpose assistants.


The monograph covers five core topics in two classes. (i) Well-established research areas: multimodal foundation models pre-trained for specific purposes, covering two topics, namely methods of learning vision backbones for visual understanding, and text-to-image generation. (ii) Recent advances in exploratory, open research areas: multimodal foundation models that aim to play the role of general-purpose assistants, covering three topics, namely unified vision models inspired by large language models (LLMs), end-to-end training of multimodal LLMs, and chaining multimodal tools with LLMs.


The target audience of the monograph is researchers, graduate students, and professionals in computer vision and vision-language multimodal communities who are eager to learn the basics and recent advances in multimodal foundation models.

Product Format
Product Details
ISBN-13: 9781638283362
ISBN-10: 1638283362
Binding: Paperback or Softback (Trade Paperback (US))
Content Language: English
More Product Details
Page Count: 230
Carton Quantity: 34
Product Dimensions: 6.14 x 0.48 x 9.21 inches
Weight: 0.72 pound(s)
Country of Origin: US
Subject Information
BISAC Categories
Computers | Software Development & Engineering - Computer Graphics
Computers | Artificial Intelligence - Computer Vision & Pattern Recognition
Computers | User Interfaces
List Price $99.00
Your Price  $98.01