Details the Q-Former architecture: a 12-layer BERT-based model using 32 learnable query embeddings. These queries use cross-attention to extract visual information for MLLM input.

Visual Prompt Generation: Cross-Attention in Q-Former

Abstract and 1 Introduction

  2. Related Work

    2.1. Multimodal Learning

    2.2. Multiple Instance Learning

  3. Methodology

    3.1. Preliminaries and Notations

    3.2. Relations between Attention-based VPG and MIL

    3.3. MIVPG for Multiple Visual Inputs

    3.4. Unveiling Instance Correlation in MIVPG for Enhanced Multi-instance Scenarios

  4. Experiments and 4.1. General Setup

    4.2. Scenario 1: Samples with Single Image

    4.3. Scenario 2: Samples with Multiple Images, with Each Image as a General Embedding

    4.4. Scenario 3: Samples with Multiple Images, with Each Image Having Multiple Patches to be Considered and 4.5. Case Study

  5. Conclusion and References

Supplementary Material

A. Detailed Architecture of QFormer

B. Proof of Proposition

C. More Experiments

Figure 7. Overview of QFormer

A. Detailed Architecture of QFormer

The architecture overview is depicted in Figure 7. Specifically, QFormer is initialized as a BERT-based model [8] comprising a total of L = 12 layers. In contrast to typical BERT models, which process textual inputs, QFormer takes R = 32 learnable query embeddings as input. These embeddings are used to extract visual information from the input visual data during Stage-1 pretraining in BLIP-2 [22]; after projection, they serve as visual prompt embeddings for the LLM inputs.
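
As a minimal sketch (assuming PyTorch; the hidden size 768, the LLM embedding size 4096, and the `qformer` callable are illustrative placeholders rather than the actual BLIP-2 implementation), the learnable queries and their projection onto LLM inputs look roughly like this:

```python
import torch
import torch.nn as nn

R, HIDDEN, LLM_DIM = 32, 768, 4096   # R query tokens; dimensions are assumptions

# R learnable query embeddings, shared across all inputs
query_tokens = nn.Parameter(torch.zeros(1, R, HIDDEN))
nn.init.trunc_normal_(query_tokens, std=0.02)

# projection from the QFormer hidden space into the LLM input embedding space
llm_proj = nn.Linear(HIDDEN, LLM_DIM)

def visual_prompts(qformer, image_embeds):
    """image_embeds: (B, N_patches, HIDDEN) from a frozen vision encoder."""
    B = image_embeds.size(0)
    queries = query_tokens.expand(B, -1, -1)   # broadcast the queries to the batch
    out = qformer(queries, image_embeds)       # queries cross-attend to visual features
    return llm_proj(out)                       # (B, R, LLM_DIM) visual prompt embeddings
```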

Inside the QFormer, each layer contains a self-attention module composed of a Multi-Head Attention component and a Forward module (Linear, LayerNorm, and a residual connection). A cross-attention module, initialized with random weights, is inserted every G layers; there, the learnable query embeddings interact with the visual embeddings. In the main paper, for conciseness, we condensed the representation of the multi-head attention and forward modules into self-(cross-)attention modules, and we illustrated only the modifications made to the cross-attention module in MIVPG, since the self-attention modules remain unchanged. The final QFormer output is the query embeddings from the last layer.
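
The following sketch illustrates this layer structure in PyTorch. It is an assumption-laden illustration, not the released BLIP-2/MIVPG code: the module names, the feed-forward expansion factor, and the choice G = 2 (cross-attention in every other layer) are placeholders.

```python
import torch.nn as nn

class QFormerLayer(nn.Module):
    """One QFormer layer: self-attention over the queries, cross-attention to the
    visual embeddings (inserted every G layers), then a feed-forward block.
    Each residual connection is followed by LayerNorm, as in BERT."""

    def __init__(self, hidden=768, heads=12, layer_idx=0, G=2):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(hidden, heads, batch_first=True)
        self.self_norm = nn.LayerNorm(hidden)
        self.has_cross = (layer_idx % G == 0)      # cross-attention only every G layers
        if self.has_cross:
            # randomly initialized, unlike the BERT-initialized self-attention weights
            self.cross_attn = nn.MultiheadAttention(hidden, heads, batch_first=True)
            self.cross_norm = nn.LayerNorm(hidden)
        self.ffn = nn.Sequential(nn.Linear(hidden, 4 * hidden), nn.GELU(),
                                 nn.Linear(4 * hidden, hidden))
        self.ffn_norm = nn.LayerNorm(hidden)

    def forward(self, queries, visual_embeds):
        # self-attention among the R query embeddings
        q, _ = self.self_attn(queries, queries, queries)
        queries = self.self_norm(queries + q)
        # cross-attention: queries (Q) read from the visual embeddings (K, V)
        if self.has_cross:
            c, _ = self.cross_attn(queries, visual_embeds, visual_embeds)
            queries = self.cross_norm(queries + c)
        # position-wise feed-forward with residual connection
        return self.ffn_norm(queries + self.ffn(queries))
```

Stacking L = 12 such layers and reading out the last layer's query embeddings yields the QFormer output described above.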

For a more comprehensive understanding, readers are encouraged to refer to [22].


:::info Authors:

(1) Wenliang Zhong, The University of Texas at Arlington (wxz9204@mavs.uta.edu);

(2) Wenyi Wu, Amazon (wenyiwu@amazon.com);

(3) Qi Li, Amazon (qlimz@amazon.com);

(4) Rob Barton, Amazon (rab@amazon.com);

(5) Boxin Du, Amazon (boxin@amazon.com);

(6) Shioulin Sam, Amazon (shioulin@amazon.com);

(7) Karim Bouyarmane, Amazon (bouykari@amazon.com);

(8) Ismail Tutar, Amazon (ismailt@amazon.com);

(9) Junzhou Huang, The University of Texas at Arlington (jzhuang@uta.edu).

:::


:::info This paper is available on arxiv under CC by 4.0 Deed (Attribution 4.0 International) license.

:::
