Chinese researchers unveil LLaVA-o1 to challenge OpenAI’s o1 model

Last updated: November 23, 2024 4:56 pm
Published November 23, 2024

OpenAI's o1 model has shown that inference-time scaling (using more compute during inference) can significantly boost a language model's reasoning abilities. LLaVA-o1, a new model developed by researchers from several universities in China, brings this paradigm to open-source vision language models (VLMs).

Early open-source VLMs typically use a direct prediction approach, generating answers without reasoning about the prompt and the steps required to solve it. Without a structured reasoning process, they are less effective at tasks that require logical reasoning. Advanced prompting techniques such as chain-of-thought (CoT) prompting, where the model is encouraged to generate intermediate reasoning steps, produce some marginal improvements. But VLMs still often produce errors or hallucinate.

The researchers observed that a key issue is that the reasoning process in existing VLMs is not sufficiently systematic and structured. The models do not generate reasoning chains and often get stuck in reasoning processes where they do not know what stage they are at and what specific problem they need to solve.

“We observe that VLMs often initiate responses without adequately organizing the problem and the available information,” the researchers write. “Moreover, they frequently deviate from logical reasoning toward conclusions, instead presenting a conclusion prematurely and subsequently attempting to justify it. Given that language models generate responses token-by-token, once an erroneous conclusion is introduced, the model typically continues along a flawed reasoning path.”

Multistage reasoning

OpenAI o1 uses inference-time scaling to address the systematic and structured reasoning problem, allowing the model to pause and review its results as it gradually solves the problem. While OpenAI has not released much detail about o1's underlying mechanism, its results show promising directions for improving the reasoning abilities of foundation models.

Inspired by o1, the researchers designed LLaVA-o1 to perform stage-by-stage reasoning. Instead of generating a direct reasoning chain, LLaVA-o1 breaks the reasoning process down into four distinct stages:

Summary: The model first provides a high-level summary of the question, outlining the core problem it needs to address.

Caption: If an image is present, the model describes the relevant parts, focusing on elements related to the question.

Reasoning: Building on the summary, the model performs structured, logical reasoning to derive a preliminary answer.

Conclusion: Finally, the model presents a concise summary of the answer based on the preceding reasoning.

Only the conclusion stage is visible to the user; the other three stages represent the model's internal reasoning process, similar to the hidden reasoning trace of o1.

“This structured approach enables the model to independently manage its reasoning process, improving its adaptability and performance on complex reasoning tasks,” the researchers write.
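To illustrate the idea, here is a minimal sketch of how a four-stage structured response might be parsed so that only the conclusion reaches the user. The tag names and the sample response are assumptions for illustration, not the model's confirmed output format.

```python
import re

# Hypothetical four-stage response following the paper's stage structure.
# The tag names are an assumption about the output format.
response = (
    "<SUMMARY>The question asks for the total cost on the receipt.</SUMMARY>"
    "<CAPTION>The image shows a receipt with two items: $3.50 and $4.25.</CAPTION>"
    "<REASONING>Adding the two line items: 3.50 + 4.25 = 7.75.</REASONING>"
    "<CONCLUSION>The total cost is $7.75.</CONCLUSION>"
)

def extract_stage(text: str, stage: str) -> str:
    """Pull the content of one tagged stage from a structured response."""
    match = re.search(rf"<{stage}>(.*?)</{stage}>", text, re.DOTALL)
    return match.group(1).strip() if match else ""

# Only the conclusion is shown to the user; the other stages stay internal.
print(extract_stage(response, "CONCLUSION"))  # The total cost is $7.75.
```

Keeping each stage in a delimited span is also what makes per-stage verification possible, as the beam-search technique below relies on.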

Stage-level beam search (right) vs. other inference-time scaling techniques. Source: arXiv

LLaVA-o1 also introduces a novel inference-time scaling technique called “stage-level beam search.” Stage-level beam search generates multiple candidate outputs at each reasoning stage, then selects the best candidate at each stage to continue the generation process. This is in contrast to the classic best-of-N approach, in which the model is prompted to generate several complete responses before selecting one.
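The contrast with best-of-N can be sketched as follows. Both the candidate generator and the scoring function below are stand-ins (the paper has the model itself generate and judge stage candidates); only the control flow, keeping the best candidate per stage rather than per full response, reflects the technique.

```python
import random

STAGES = ["summary", "caption", "reasoning", "conclusion"]

def generate_candidate(prefix, stage):
    # Stand-in for a model call that extends `prefix` with one more stage.
    return prefix + [f"{stage}-v{random.randint(0, 9)}"]

def score(candidate):
    # Stand-in verifier; in the paper the model judges candidates itself.
    return -len(candidate[-1])

def stage_level_beam_search(beam_size=2):
    """Keep only the best of `beam_size` candidates at each stage,
    instead of sampling N complete responses and picking one (best-of-N)."""
    chain = []
    for stage in STAGES:
        candidates = [generate_candidate(chain, stage) for _ in range(beam_size)]
        chain = max(candidates, key=score)
    return chain

print(stage_level_beam_search(beam_size=2))
```

The per-stage pruning is why structured output matters: without clear stage boundaries, there would be no natural point at which to verify and discard candidates.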

“Notably, it is the structured output design of LLaVA-o1 that makes this approach feasible, enabling efficient and accurate verification at each stage,” the researchers write. “This validates the effectiveness of structured output in improving inference-time scaling.”

Training LLaVA-o1

LLaVA-o1 training data is annotated with GPT-4o. Source: arXiv

To train LLaVA-o1, the researchers compiled a new dataset of around 100,000 image-question-answer pairs drawn from several widely used VQA datasets. The dataset covers a variety of tasks, from multi-turn question answering to chart interpretation and geometric reasoning.

The researchers used GPT-4o to generate the detailed four-stage reasoning process for each example, including the summary, caption, reasoning and conclusion stages.
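An annotation pipeline along these lines can be sketched as below. The prompt wording and `call_annotator` are hypothetical stand-ins for the real GPT-4o API call; only the overall flow (iterate over VQA pairs, ask a stronger model for a four-stage trace, store it as the training target) follows the paper's description.

```python
# Hypothetical annotation loop: for each VQA example, ask a stronger model
# (GPT-4o in the paper) to rewrite the answer as a four-stage trace.
ANNOTATION_PROMPT = (
    "Given the question and answer below, produce a four-stage response "
    "with summary, caption, reasoning, and conclusion sections.\n"
    "Question: {question}\nAnswer: {answer}"
)

def call_annotator(prompt: str) -> str:
    # Stand-in for an API call to the annotating model.
    return ("<SUMMARY>...</SUMMARY><CAPTION>...</CAPTION>"
            "<REASONING>...</REASONING><CONCLUSION>...</CONCLUSION>")

def build_training_set(vqa_examples):
    """Turn raw (image, question, answer) triples into structured
    four-stage training targets for fine-tuning."""
    dataset = []
    for ex in vqa_examples:
        prompt = ANNOTATION_PROMPT.format(
            question=ex["question"], answer=ex["answer"])
        dataset.append({"image": ex["image"],
                        "question": ex["question"],
                        "response": call_annotator(prompt)})
    return dataset

examples = [{"image": "img_0001.jpg",
             "question": "What does the chart show?",
             "answer": "Quarterly revenue growth."}]
print(len(build_training_set(examples)))  # 1
```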

The researchers then fine-tuned Llama-3.2-11B-Vision-Instruct on this dataset to obtain the final LLaVA-o1 model. They have not released the model but plan to release the dataset, called LLaVA-o1-100k.

LLaVA-o1 in action

The researchers evaluated LLaVA-o1 on several multimodal reasoning benchmarks. Despite being trained on only 100,000 examples, LLaVA-o1 showed significant performance improvements over the base Llama model, with an average benchmark score increase of 6.9%.

LLaVA-o1 vs. other open and closed models. Source: arXiv

Moreover, stage-level beam search led to additional performance gains, demonstrating the effectiveness of inference-time scaling. Due to computational resource constraints, the researchers were only able to test the technique with a beam size of 2. They expect even greater improvements with larger beam sizes.

Impressively, LLaVA-o1 outperformed not only other open-source models of the same size or larger but also some closed-source models, such as GPT-4o-mini and Gemini 1.5 Pro.

“LLaVA-o1 establishes a new standard for multimodal reasoning in VLMs, offering robust performance and scalability, especially at inference time,” the researchers write. “Our work paves the way for future research on structured reasoning in VLMs, including potential expansions with external verifiers and the use of reinforcement learning to further enhance complex multimodal reasoning capabilities.”

