Details the Q-Former architecture: a 12-layer BERT-based model using 32 learnable query embeddings. These queries use cross-attention to extract visual information for MLLM input.

Visual Prompt Generation: Cross-Attention in Q-Former

Abstract and 1 Introduction

  2. Related Work

    2.1. Multimodal Learning

    2.2. Multiple Instance Learning

  3. Methodology

    3.1. Preliminaries and Notations

    3.2. Relations between Attention-based VPG and MIL

    3.3. MIVPG for Multiple Visual Inputs

    3.4. Unveiling Instance Correlation in MIVPG for Enhanced Multi-instance Scenarios

  4. Experiments and 4.1. General Setup

    4.2. Scenario 1: Samples with Single Image

    4.3. Scenario 2: Samples with Multiple Images, with Each Image as a General Embedding

    4.4. Scenario 3: Samples with Multiple Images, with Each Image Having Multiple Patches to be Considered and 4.5. Case Study

  5. Conclusion and References

Supplementary Material

A. Detailed Architecture of QFormer

B. Proof of Proposition

C. More Experiments

Figure 7. Overview of QFormer

A. Detailed Architecture of QFormer

The architecture overview is depicted in Figure 7. Specifically, QFormer is initialized as a BERT-based model [8] comprising a total of L = 12 layers. In contrast to typical BERT models that process textual inputs, QFormer takes R = 32 learnable query embeddings as input. These embeddings are used to extract visual information from the input visual data during Stage-1 pretraining in BLIP-2 [22]. After projection, they serve as the visual prompt embeddings for the LLM inputs.
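For concreteness, below is a minimal PyTorch sketch of this flow; it is an illustration under assumptions, not the BLIP-2 implementation. The class name, the choice of `llm_dim`, and the use of `nn.TransformerDecoderLayer` as a stand-in for the BERT-initialized layers are assumptions (the actual per-layer structure, including where cross-attention is inserted, is described in the next paragraph).

```python
import torch
import torch.nn as nn

class QFormerPromptSketch(nn.Module):
    """Hedged sketch of the QFormer prompt flow: R = 32 learnable queries
    pass through L = 12 layers, then are projected to the LLM width."""

    def __init__(self, num_queries=32, hidden=768, num_layers=12, llm_dim=4096):
        super().__init__()
        # R = 32 learnable query embeddings, shared across all samples.
        self.queries = nn.Parameter(torch.randn(num_queries, hidden) * 0.02)
        # Stand-in for the 12 BERT-initialized layers; note the real QFormer
        # inserts cross-attention only every G layers (see the layer sketch
        # after the next paragraph), whereas this stand-in has it everywhere.
        self.layers = nn.ModuleList(
            nn.TransformerDecoderLayer(
                d_model=hidden, nhead=12, activation="gelu", batch_first=True
            )
            for _ in range(num_layers)
        )
        # Projection mapping the final query states into the LLM input space.
        self.llm_proj = nn.Linear(hidden, llm_dim)

    def forward(self, visual_embeds):
        # visual_embeds: (B, num_patches, hidden) from a frozen image encoder.
        q = self.queries.unsqueeze(0).expand(visual_embeds.size(0), -1, -1)
        for layer in self.layers:
            # Queries self-attend, then cross-attend to the visual embeddings.
            q = layer(tgt=q, memory=visual_embeds)
        # (B, 32, llm_dim): visual prompt embeddings for the LLM input.
        return self.llm_proj(q)

# Example: a batch of 4 images, each encoded into 257 ViT patch embeddings.
prompts = QFormerPromptSketch()(torch.randn(4, 257, 768))
print(prompts.shape)  # torch.Size([4, 32, 4096])
```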

Inside the QFormer, each layer contains a self-attention module composed of a multi-head attention component and a feed-forward module (Linear layers with LayerNorm and a residual connection). A cross-attention module, initialized with random values, is inserted every G layers; there, the learnable query embeddings interact with the visual embeddings. In the main paper, for conciseness, we condensed the multi-head attention and feed-forward modules into self-(cross-)attention modules, and we illustrated only the modifications MIVPG makes to the cross-attention module, since the self-attention modules remain unchanged. The final QFormer output is the last layer's query embeddings.
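The following sketches one such layer under stated assumptions: post-LayerNorm ordering as in BERT, a 4x feed-forward expansion, and G = 2 (the default cross-attention frequency in the BLIP-2 reference code); all class and variable names are illustrative, not the released implementation.

```python
import torch
import torch.nn as nn

class QFormerLayerSketch(nn.Module):
    """One QFormer layer: self-attention plus feed-forward, with an optional
    randomly initialized cross-attention module in between."""

    def __init__(self, hidden=768, heads=12, has_cross_attn=False):
        super().__init__()
        # BERT-initialized self-attention over the query embeddings.
        self.self_attn = nn.MultiheadAttention(hidden, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(hidden)
        self.has_cross_attn = has_cross_attn
        if has_cross_attn:
            # New relative to BERT, hence initialized with random values.
            self.cross_attn = nn.MultiheadAttention(hidden, heads, batch_first=True)
            self.norm2 = nn.LayerNorm(hidden)
        # Feed-forward module: Linear layers, with a residual connection and
        # LayerNorm applied around it in forward().
        self.ffn = nn.Sequential(
            nn.Linear(hidden, 4 * hidden), nn.GELU(), nn.Linear(4 * hidden, hidden)
        )
        self.norm3 = nn.LayerNorm(hidden)

    def forward(self, queries, visual_embeds):
        # Queries attend to each other.
        q = self.norm1(queries + self.self_attn(queries, queries, queries)[0])
        if self.has_cross_attn:
            # Queries (Q) attend to visual embeddings (K, V); this is where
            # visual information enters the query states.
            q = self.norm2(q + self.cross_attn(q, visual_embeds, visual_embeds)[0])
        return self.norm3(q + self.ffn(q))

# Cross-attention inserted every G layers across the L = 12 layer stack.
G, L = 2, 12
layers = nn.ModuleList(
    QFormerLayerSketch(has_cross_attn=(i % G == 0)) for i in range(L)
)
```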

For a more comprehensive understanding, readers are encouraged to refer to [22].


:::info Authors:

(1) Wenliang Zhong, The University of Texas at Arlington (wxz9204@mavs.uta.edu);

(2) Wenyi Wu, Amazon (wenyiwu@amazon.com);

(3) Qi Li, Amazon (qlimz@amazon.com);

(4) Rob Barton, Amazon (rab@amazon.com);

(5) Boxin Du, Amazon (boxin@amazon.com);

(6) Shioulin Sam, Amazon (shioulin@amazon.com);

(7) Karim Bouyarmane, Amazon (bouykari@amazon.com);

(8) Ismail Tutar, Amazon (ismailt@amazon.com);

(9) Junzhou Huang, The University of Texas at Arlington (jzhuang@uta.edu).

:::


:::info This paper is available on arxiv under CC by 4.0 Deed (Attribution 4.0 International) license.

:::

