Meta, formerly known as Facebook, has introduced “CM3leon,” an artificial intelligence (AI) model capable of both text-to-image and image-to-text generation. CM3leon is the first multimodal model trained using a recipe adapted from text-only language models, incorporating large-scale retrieval-augmented pre-training and multitask supervised fine-tuning stages.
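Meta's announcement does not detail the retrieval pipeline, but the general shape of retrieval-augmented pre-training is straightforward: each training example is paired with related documents found by embedding similarity, and those neighbors are prepended to the model's input context. The sketch below illustrates the idea only; the `embed` and `retrieve_context` functions, the toy bag-of-letters embedding, and the `<doc>` separator are hypothetical stand-ins, not Meta's implementation, which relies on a learned dense multimodal retriever.

```python
import numpy as np

def embed(texts: list[str]) -> np.ndarray:
    """Stand-in embedding: bag-of-letters counts (hypothetical).
    A real system would use a dense multimodal retriever, e.g. a
    CLIP-style encoder shared between queries and documents."""
    vecs = np.zeros((len(texts), 26))
    for i, text in enumerate(texts):
        for ch in text.lower():
            if "a" <= ch <= "z":
                vecs[i, ord(ch) - ord("a")] += 1
    return vecs

def retrieve_context(query: str, corpus: list[str], k: int = 2) -> list[str]:
    """Return the k corpus documents most similar to the query,
    ranked by cosine similarity in the shared embedding space."""
    q = embed([query])[0]
    docs = embed(corpus)
    sims = docs @ q / (np.linalg.norm(docs, axis=1) * np.linalg.norm(q) + 1e-8)
    top = np.argsort(-sims)[:k]
    return [corpus[i] for i in top]

# During pre-training, each example is augmented with its retrieved
# neighbors before being fed to the sequence model:
corpus = ["a photo of a red fox", "a diagram of a transformer", "a cat on a sofa"]
example = "a fox running through snow"
context = retrieve_context(example, corpus) + [example]
training_sequence = " <doc> ".join(context)  # "<doc>" separator is illustrative
print(training_sequence)
```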
Meta Introduces CM3leon: A State-of-the-Art Text-to-Image Generation Model Outperforming Google’s Parti
According to Meta, CM3leon’s image generation tools can produce more coherent imagery that aligns better with input prompts. Additionally, CM3leon requires five times less computing power and a smaller training dataset than previous transformer-based methods.
When evaluated on the widely used zero-shot MS-COCO image generation benchmark, CM3leon achieved an FID (Fréchet Inception Distance) score of 4.88, where lower scores indicate that generated images are statistically closer to real ones. This establishes a new state of the art in text-to-image generation, surpassing Google’s text-to-image model, Parti.
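For readers unfamiliar with the metric: FID measures the distance between the feature distributions of real and generated images (conventionally Inception-v3 activations), using the means and covariances of the two sets. Below is a minimal sketch of the standard FID computation over pre-extracted feature arrays; the function name and the random toy data are illustrative and unrelated to Meta's evaluation code.

```python
import numpy as np
from scipy.linalg import sqrtm

def fid_score(real_feats: np.ndarray, gen_feats: np.ndarray) -> float:
    """Frechet Inception Distance between two sets of feature vectors.

    FID = ||mu_r - mu_g||^2 + Tr(S_r + S_g - 2 * (S_r @ S_g)^{1/2})
    Lower is better: 0 means the two feature distributions match exactly.
    """
    mu_r, mu_g = real_feats.mean(axis=0), gen_feats.mean(axis=0)
    # rowvar=False: rows are samples, columns are feature dimensions
    sigma_r = np.cov(real_feats, rowvar=False)
    sigma_g = np.cov(gen_feats, rowvar=False)

    covmean = sqrtm(sigma_r @ sigma_g)
    if np.iscomplexobj(covmean):  # discard tiny imaginary parts
        covmean = covmean.real    # introduced by numerical error

    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(sigma_r + sigma_g - 2.0 * covmean))

# Toy usage with random "features"; real evaluations extract Inception-v3
# activations for tens of thousands of MS-COCO images.
rng = np.random.default_rng(0)
real = rng.normal(size=(1000, 64))
gen = rng.normal(loc=0.1, size=(1000, 64))
print(f"FID: {fid_score(real, gen):.2f}")
```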
Meta’s CM3leon: A Multimodal Model Excelling in Vision-Language Tasks and Advancing Image Generation
Furthermore, Meta highlights that CM3leon excels in various vision-language tasks, including visual question answering and long-form captioning. Despite being trained on a relatively small dataset of only three billion text tokens, CM3leon achieves zero-shot performance that compares favorably with larger models trained on more extensive datasets.
Meta believes that CM3leon’s strong performance across diverse tasks is a step toward higher-fidelity image generation and understanding. The company envisions models like CM3leon boosting creativity and enabling better applications in the metaverse, and says it is eager to explore the boundaries of multimodal language models and release more models in the future.