Meta’s Transfusion model handles text and images in a single architecture

Last updated: August 31, 2024 7:42 pm
Published August 31, 2024


Multi-modal models that can process both text and images are a growing area of research in artificial intelligence. However, training these models presents a unique challenge: language models deal with discrete values (words and tokens), while image generation models must handle continuous pixel values.

Current multi-modal models use techniques that degrade the quality of the data representation. In a new research paper, scientists from Meta and the University of Southern California introduce Transfusion, a novel technique that enables a single model to seamlessly handle both discrete and continuous modalities.

The challenges of multi-modal models

Existing approaches to address the multi-modality challenge typically involve different tradeoffs. Some techniques use separate architectures for language and image processing, often pre-training each component individually. This is the method used in models such as LLaVA. These models struggle to learn the complex interactions between different modalities, especially when processing documents where images and text are interleaved.

Other techniques quantize images into discrete values, effectively converting them into a sequence of tokens similar to text. This is the approach used by Meta's Chameleon, which was introduced earlier this year. While it enables language models to process images, it loses the information contained in the continuous pixel values.

Meta's Chameleon encoding and decoding logic. Source: arxiv
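To make that information bottleneck concrete, here is a minimal sketch of the kind of vector-quantization step such tokenizers rely on (not Chameleon's actual tokenizer; all sizes are illustrative): every continuous patch vector is snapped to its nearest entry in a fixed codebook, so any pixel-level detail the codebook cannot express is discarded.

```python
import torch

# Illustrative sizes: a codebook of 8,192 discrete image tokens, 64-dim patches.
codebook = torch.randn(8192, 64)

def quantize_patches(patch_vecs: torch.Tensor) -> torch.Tensor:
    """Snap continuous patch vectors (N, 64) to their nearest codebook entries.

    Each patch collapses to a single integer id, which is the bottleneck
    Zhou describes: detail the codebook cannot represent is lost.
    """
    dists = torch.cdist(patch_vecs, codebook)  # (N, 8192) pairwise distances
    return dists.argmin(dim=-1)                # (N,) discrete image "tokens"

patches = torch.randn(256, 64)           # e.g. a 16x16 grid of patch embeddings
image_tokens = quantize_patches(patches)
print(image_tokens.shape)                # torch.Size([256]), one id per patch
```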

Chunting Zhou, Senior Research Scientist at Meta AI and co-author of the paper, previously worked on the Chameleon paper.


“We noticed that the quantization method creates an information bottleneck for image representations, where discrete representations of images are highly compressed and lose information present in the original images,” she told VentureBeat. “And in the meantime it is very difficult to train a good discrete image tokenizer. Thus, we asked the question, ‘Can we just use the more natural continuous representations of images when we train a multi-modal model together with discrete text?’”

Transfusion: A unified approach to multi-modal learning

“Diffusion models and next-token-prediction autoregressive models represent the best worlds for generating continuous and discrete data respectively,” Zhou said. “This inspired us to develop a new multi-modal method that combines the best of both worlds in a natural and simple way.”

Transfusion is a recipe for training a single model that can handle both discrete and continuous modalities without quantization or separate modules. The core idea behind Transfusion is to train one model with two objectives: language modeling for text and diffusion for images.

Transfusion combines these two objectives to train a transformer model that can process and generate both text and images. During training, the model is exposed to both text and image data, and the loss functions for language modeling and diffusion are applied simultaneously.
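As a rough illustration of that training step (a sketch under assumptions, not the paper's code), the two losses can be computed on the same forward pass and summed. The model interface and the balancing coefficient below are assumptions made for the example.

```python
import torch.nn.functional as F

def transfusion_step(model, text_tokens, noisy_latents, noise, timesteps, lam=5.0):
    """One dual-objective training step, sketched.

    `model` is assumed to return next-token logits for the text positions
    and a noise prediction for the image-latent positions.
    """
    logits, noise_pred = model(text_tokens, noisy_latents, timesteps)

    # Language-modeling objective: cross-entropy on next-token prediction.
    lm_loss = F.cross_entropy(
        logits[:, :-1].flatten(0, 1),   # predictions for positions 1..T
        text_tokens[:, 1:].flatten(),   # shifted targets
    )

    # Diffusion objective: regress the noise that was added to the latents.
    diffusion_loss = F.mse_loss(noise_pred, noise)

    # Both objectives are applied simultaneously; `lam` stands in for
    # whatever balancing coefficient the training recipe uses.
    return lm_loss + lam * diffusion_loss
```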

Meta's Transfusion uses a single transformer architecture to process both text and images. Source: arxiv

“We show that it is possible to fully integrate both modalities, with no information loss, by training a single model to both predict discrete text tokens and diffuse continuous images,” the researchers write.


Transfusion uses a unified architecture and vocabulary to process mixed-modality inputs. The model consists of lightweight modality-specific components that convert text tokens and image patches into the appropriate representations before they are processed by the transformer, as in the sketch below.
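A minimal sketch of what such lightweight components might look like (the layer choices and dimensions are assumptions, not the paper's exact modules): an embedding table for discrete text tokens and a linear projection for continuous image patches, both mapping into the shared transformer's width.

```python
import torch
import torch.nn as nn

class MixedModalInput(nn.Module):
    """Per-modality input layers in front of a shared transformer (sketch)."""

    def __init__(self, vocab_size=32000, patch_dim=512, d_model=1024):
        super().__init__()
        self.tok_embed = nn.Embedding(vocab_size, d_model)  # discrete text side
        self.patch_proj = nn.Linear(patch_dim, d_model)     # continuous image side

    def forward(self, text_tokens, patch_latents):
        text_vecs = self.tok_embed(text_tokens)        # (B, T_text, d_model)
        image_vecs = self.patch_proj(patch_latents)    # (B, T_img, d_model)
        # One mixed-modality sequence for the shared transformer to attend over.
        return torch.cat([text_vecs, image_vecs], dim=1)
```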

To improve the representation of image data, Transfusion uses variational autoencoders (VAEs), neural networks that can learn to represent complex data, such as images, in a lower-dimensional continuous space. In Transfusion, a VAE encodes each 8×8 patch of an image into a list of continuous values.

Transfusion uses variational autoencoders (VAE) to break images down into 8×8 patches rather than diffusing them at the pixel level.
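The patch-encoding step can be pictured as follows. A single strided convolution stands in for the trained VAE encoder here, and the latent channel count is illustrative; the point is only the shapes: every 8×8 pixel block becomes one short vector of continuous values.

```python
import torch
import torch.nn as nn

# Stand-in for the VAE encoder: kernel 8, stride 8 turns every 8x8 pixel
# block into one continuous latent vector (8 values per patch here).
encoder = nn.Conv2d(in_channels=3, out_channels=8, kernel_size=8, stride=8)

image = torch.randn(1, 3, 256, 256)             # one RGB image
latents = encoder(image)                        # (1, 8, 32, 32): 32x32 patch grid
patch_seq = latents.flatten(2).transpose(1, 2)  # (1, 1024, 8): a sequence of
                                                # continuous patch vectors
print(patch_seq.shape)                          # torch.Size([1, 1024, 8])
```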

“Our main innovation is demonstrating that we can use separate losses for different modalities (language modeling for text, diffusion for images) over shared data and parameters,” the researchers write.

Transfusion outperforms quantization-based approaches

The researchers trained a 7-billion-parameter model based on Transfusion and evaluated it on a variety of standard uni-modal and cross-modal benchmarks, including text-to-text, text-to-image, and image-to-text tasks. They compared its performance to an equally sized model based on Chameleon, the current prominent open-science method for training native mixed-modal models.

In their experiments, Transfusion consistently outperformed Chameleon across all modalities. In text-to-image generation, Transfusion achieved better results at less than a third of Chameleon's computational cost. Similarly, in image-to-text generation, Transfusion matched Chameleon's performance using only 21.8% of the computational resources.

Surprisingly, Transfusion also showed better performance on text-only benchmarks, even though both Transfusion and Chameleon use the same language modeling objective for text. This suggests that training on quantized image tokens can negatively impact text performance.


“As an alternative, Transfusion scales better than the commonly adopted multi-modal training approaches with discrete image tokens by a large margin across the board,” Zhou said.

Examples of images generated with a 7B Transfusion model

The researchers also ran separate experiments on image generation, comparing Transfusion with other image generation models. Transfusion outperformed popular models such as DALL-E 2 and Stable Diffusion XL while also being able to generate text.

“Transfusion opens up a lot of new opportunities for multi-modal learning and new interesting use cases,” Zhou said. “As Transfusion works just like an LLM but on multi-modality data, this potentially unlocks new applications with better controllability over interactive sessions of user inputs, e.g. interactive editing of images and videos.”

