AdaMix, a parameter-efficient fine-tuning method, outperforms full model fine-tuning on few-shot NLU tasks across benchmarks such as GLUE. Using a prompt-based fine-tuning strategy without any additional validation or unlabeled data, AdaMix consistently improves performance with both BERT and RoBERTa encoders, demonstrating stability and efficiency in few-shot settings.

Smarter AI Training with Few-Shot Natural Language Tasks


Abstract and 1. Introduction

  2. Background

    2.1 Mixture-of-Experts

    2.2 Adapters

  3. Mixture-of-Adaptations

    3.1 Routing Policy

    3.2 Consistency regularization

    3.3 Adaptation module merging and 3.4 Adaptation module sharing

    3.5 Connection to Bayesian Neural Networks and Model Ensembling

  4. Experiments

    4.1 Experimental Setup

    4.2 Key Results

    4.3 Ablation Study

  5. Related Work

  6. Conclusions

  7. Limitations

  Acknowledgment and References

Appendix

A. Few-shot NLU Datasets

B. Ablation Study

C. Detailed Results on NLU Tasks

D. Hyper-parameter

A Few-shot NLU Datasets

Data. In contrast to the fully supervised setting in the above experiments, we also perform few-shot experiments following the prior study (Wang et al., 2021) on six tasks including MNLI (Williams et al., 2018), RTE (Dagan et al., 2005; Bar Haim et al., 2006; Giampiccolo et al., 2007; Bentivogli et al., 2009), QQP[1] and SST-2 (Socher et al., 2013). The results are reported on their development sets following (Zhang et al., 2021). MPQA (Wiebe et al., 2005) and Subj (Pang and Lee, 2004) are used for polarity and subjectivity detection, where we follow (Gao et al., 2021) and keep 2,000 examples for testing. The few-shot model only has access to |K| labeled samples for any task. Following the true few-shot learning setting (Perez et al., 2021; Wang et al., 2021), we do not use any additional validation set for hyper-parameter tuning or early stopping. The performance of each model is reported after a fixed number of training epochs. For a fair comparison, we use the same set of few-shot labeled instances for training as in (Wang et al., 2021). We train each model with 5 different seeds and report the average performance with standard deviation across the runs. In the few-shot experiments, we follow (Wang et al., 2021) and train AdaMix via the prompt-based fine-tuning strategy. In contrast to (Wang et al., 2021), we do not use any unlabeled data.
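The evaluation protocol above is straightforward to express in code. The sketch below is a minimal Python illustration of that protocol, not code from the AdaMix release; the helper names (`sample_few_shot`, `run_task`), the seed values, and the choice of K are placeholders introduced here for illustration.

```python
import random
import statistics

# Minimal sketch of the true few-shot protocol described above (assumptions:
# helper names, seed values, and K are illustrative, not from the paper's code).
# No validation set is used; training runs for a fixed number of epochs, and the
# same |K| labeled instances are reused across methods for a fair comparison.

SEEDS = [13, 21, 42, 87, 100]   # 5 runs, reported as mean +/- std (placeholder seeds)
K = 16                          # |K| labeled examples per task (assumed value)
NUM_EPOCHS = 20                 # fixed in advance; no early stopping (assumed value)

def sample_few_shot(train_pool, k, seed):
    """Draw the |K|-example labeled training set deterministically from a seed."""
    rng = random.Random(seed)
    return rng.sample(train_pool, k)

def run_task(train_pool, train_fn, eval_fn):
    """Train with each seed on the same few-shot set size and aggregate test scores."""
    scores = []
    for seed in SEEDS:
        few_shot_set = sample_few_shot(train_pool, K, seed)
        # train_fn stands in for prompt-based fine-tuning of AdaMix;
        # eval_fn stands in for evaluation on the task's test/dev set.
        model = train_fn(few_shot_set, epochs=NUM_EPOCHS, seed=seed)
        scores.append(eval_fn(model))
    return statistics.mean(scores), statistics.stdev(scores)
```

In this setup the only per-task decision fixed in advance is the training length; nothing is tuned against held-out labeled data, which is what makes the setting "true" few-shot.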


B Ablation Study

Table 11: Ablation study demonstrating the impact of parameter sharing in the AdaMix adapter framework.


C Detailed Results on NLU Tasks

The results on the NLU tasks are included in Table 1 and Table 13. AdaMix with the RoBERTa-large encoder achieves the best performance in terms of the different task metrics on the GLUE benchmark. AdaMix with adapters is the only PEFT method that outperforms full model fine-tuning on all the tasks and on the average score. Additionally, the improvement brought by AdaMix is more pronounced with BERT-base as the encoder, showing 2.2% and 1.2% improvements over full model fine-tuning and the best-performing baseline UNIPELT with BERT-base, respectively. The improvement is consistent across every task, as observed with RoBERTa-large. The NLG results are included in Table 4 and Table 5.

Table 12: Varying the bottleneck dimension of adapters in AdaMix with BERT-base and RoBERTa-large encoders. * denotes the bottleneck dimension used in AdaMix with adapters.

D Hyper-parameter

The detailed hyper-parameter configurations for the different tasks are presented in Table 15 and Table 16.
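For orientation, the sketch below shows the kind of settings such a configuration covers; every value is an illustrative placeholder introduced here, and the actual per-task settings should be read from Table 15 and Table 16.

```python
# Illustrative placeholder configuration only; the actual per-task values
# are those reported in Table 15 and Table 16, not the numbers below.
adamix_few_shot_config = {
    "encoder": "roberta-large",   # or "bert-base-uncased"
    "adaptation_modules": 4,      # number of adaptation modules mixed (assumed)
    "bottleneck_dim": 16,         # adapter bottleneck dimension (assumed)
    "learning_rate": 1e-4,        # placeholder
    "batch_size": 32,             # placeholder
    "epochs": 20,                 # fixed number of epochs, no early stopping
    "max_seq_length": 128,        # placeholder
}
```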


:::info Authors:

(1) Yaqing Wang, Purdue University (wang5075@purdue.edu);

(2) Sahaj Agarwal, Microsoft (sahagar@microsoft.com);

(3) Subhabrata Mukherjee, Microsoft Research (submukhe@microsoft.com);

(4) Xiaodong Liu, Microsoft Research (xiaodl@microsoft.com);

(5) Jing Gao, Purdue University (jinggao@purdue.edu);

(6) Ahmed Hassan Awadallah, Microsoft Research (hassanam@microsoft.com);

(7) Jianfeng Gao, Microsoft Research (jfgao@microsoft.com).

:::


:::info This paper is available on arxiv under CC BY 4.0 DEED license.

:::

[1] https://www.quora.com/q/quoradata/
