publications

2025

Preprint
SealQA: Raising the Bar for Reasoning in Search-Augmented Language Models

Thinh Pham, Nguyen Nguyen, Pratibha Zunjare, Weiyuan Chen, Yu-Min Tseng, and Tu Vu

In arXiv preprint arXiv:2506.01062, 2025

We introduce SealQA, a new challenge benchmark for evaluating SEarch-Augmented Language models on fact-seeking questions where web search yields conflicting, noisy, or unhelpful results. SealQA comes in three flavors: (1) Seal-0 (main) and (2) Seal-Hard, which assess factual accuracy and reasoning capabilities, with Seal-0 focusing on the most challenging questions where chat models (e.g., GPT-4.1) typically achieve near-zero accuracy; and (3) LongSeal, which extends SealQA to test long-context, multi-document reasoning in "needle-in-a-haystack" settings. Our evaluation reveals critical limitations in current models: Even frontier LLMs perform poorly across all SealQA flavors. On Seal-0, frontier agentic models equipped with tools like o3 and o4-mini achieve only 17.1% and 6.3% accuracy, respectively, at their best reasoning efforts. We find that advanced reasoning models such as DeepSeek-R1-671B and o3-mini are highly vulnerable to noisy search results. Notably, increasing test-time compute does not yield reliable gains across o3-mini, o4-mini, and o3, with performance often plateauing or even declining early. Additionally, while recent models are less affected by the "lost-in-the-middle" issue, they still fail to reliably identify relevant documents in LongSeal when faced with numerous distractors. To facilitate future work, we release SealQA at http://huggingface.co/datasets/vtllms/sealqa.
@inproceedings{pham-etal-2025-sealqa,
  title = {SealQA: Raising the Bar for Reasoning in Search-Augmented Language Models},
  author = {Pham, Thinh and Nguyen, Nguyen and Zunjare, Pratibha and Chen, Weiyuan and Tseng, Yu-Min and Vu, Tu},
  booktitle = {arXiv preprint arXiv:2506.01062},
  year = {2025},
  pdf = {https://arxiv.org/pdf/2506.01062},
  preprint = {true},
}
// Our benchmark dataset has been used by Google’s Gemini, DeepSeek, and Kimi
EMNLP
Efficient Model Development through Fine-tuning Transfer

Pin-Jie Lin, Rishab Balasubramanian, Fengyuan Liu, Nikhil Kandpal, and Tu Vu

In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, 2025

Modern LLMs struggle with efficient updates, as each new pretrained model version requires repeating expensive alignment processes. This challenge also applies to domain- or language-specific models, where fine-tuning on specialized data must be redone for every new base model release. In this paper, we explore the transfer of fine-tuning updates between model versions. Specifically, we derive the diff vector from one source model version, which represents the weight changes from fine-tuning, and apply it to the base model of a different target version. Through empirical evaluations on various open-weight model versions, we show that transferring diff vectors can significantly improve the target base model, often achieving performance comparable to its fine-tuned counterpart. For example, reusing the fine-tuning updates from Llama 3.0 8B leads to an absolute accuracy improvement of 10.7% on GPQA over the base Llama 3.1 8B without additional training, surpassing Llama 3.1 8B Instruct. In a multilingual model development setting, we show that this approach can significantly increase performance on target-language tasks without retraining, achieving an absolute improvement of 4.7% and 15.5% on Global MMLU for Malagasy and Turkish, respectively, compared to Llama 3.1 8B Instruct. Our controlled experiments reveal that fine-tuning transfer is most effective when the source and target models are linearly connected in the parameter space. Additionally, we demonstrate that fine-tuning transfer offers a stronger and more computationally efficient starting point for further fine-tuning. Finally, we propose an iterative recycling-then-finetuning approach for continuous model development, which improves both efficiency and effectiveness. Our findings suggest that fine-tuning transfer is a viable strategy to reduce training costs while maintaining model performance.
@inproceedings{lin-etal-2025-efficient,
  title = {Efficient Model Development through Fine-tuning Transfer},
  author = {Lin, Pin-Jie and Balasubramanian, Rishab and Liu, Fengyuan and Kandpal, Nikhil and Vu, Tu},
  booktitle = {Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing},
  year = {2025},
  pdf = {https://arxiv.org/pdf/2503.20110},
}
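The diff-vector transfer described in the abstract above (subtract the source base weights from the source fine-tuned weights, then add the result to a different base model) can be illustrated with a minimal sketch. The `Weights` type, the function names, and the toy numbers are all hypothetical stand-ins; this is not the authors' implementation, and real models would use tensors rather than Python lists.

```python
from typing import Dict, List

# Hypothetical stand-in for a model state dict: parameter name -> flat weights.
Weights = Dict[str, List[float]]


def diff_vector(fine_tuned: Weights, base: Weights) -> Weights:
    """Element-wise weight change introduced by fine-tuning the source model."""
    return {
        name: [ft - b for ft, b in zip(fine_tuned[name], base[name])]
        for name in base
    }


def apply_diff(target_base: Weights, diff: Weights) -> Weights:
    """Add the source diff vector to the base model of another version."""
    return {
        name: [w + d for w, d in zip(target_base[name], diff[name])]
        for name in target_base
    }


# Toy numbers for illustration only, not real model weights.
source_base = {"layer.weight": [1.0, 2.0, 3.0]}
source_tuned = {"layer.weight": [1.5, 1.0, 3.0]}
target_base = {"layer.weight": [2.0, 2.0, 2.0]}

delta = diff_vector(source_tuned, source_base)
transferred = apply_diff(target_base, delta)
print(transferred)
```

The sketch assumes the source and target versions share parameter names and shapes, which is what makes element-wise addition possible.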
Preprint
ATEB: Evaluating and Improving Advanced NLP Tasks for Text Embedding Models

Simeng Han, Frank Palma Gomez, Tu Vu, Zefei Li, Daniel Cer, Hansi Zeng, Chris Tar, Arman Cohan, and Gustavo Hernandez Abrego

In arXiv preprint arXiv:2502.16766, 2025

Traditional text embedding benchmarks primarily evaluate embedding models’ capabilities to capture semantic similarity. However, more advanced NLP tasks, such as safety and factuality, require a deeper understanding of text. These tasks demand an ability to comprehend and process complex information, often involving the handling of sensitive content or the verification of factual statements against reliable sources. We introduce a new benchmark designed to assess embedding models trained on existing information retrieval data mixtures and to highlight their limitations on advanced capabilities, which include factuality, safety, instruction following, reasoning, and document-level understanding. This benchmark includes a diverse set of tasks that simulate real-world scenarios where these capabilities are critical, and it identifies gaps in current advanced embedding models. Furthermore, we propose a novel method that reformulates these various tasks as retrieval tasks. By framing tasks like safety or factuality classification as retrieval problems, we leverage the strengths of retrieval models in capturing semantic relationships while also pushing them to develop a deeper understanding of context and content. Using this approach with single-task fine-tuning, we achieved performance gains of 8% on factuality classification and 13% on safety classification. Our code and data will be publicly available.
@inproceedings{han-etal-2025-ateb,
  title = {ATEB: Evaluating and Improving Advanced NLP Tasks for Text Embedding Models},
  author = {Han, Simeng and Gomez, Frank Palma and Vu, Tu and Li, Zefei and Cer, Daniel and Zeng, Hansi and Tar, Chris and Cohan, Arman and Abrego, Gustavo Hernandez},
  booktitle = {arXiv preprint arXiv:2502.16766},
  year = {2025},
  pdf = {https://arxiv.org/pdf/2502.16766},
}
TMLR
What Matters for Model Merging at Scale?

Prateek Yadav, Tu Vu, Jonathan Lai, Alexandra Chronopoulou, Manaal Faruqui, Mohit Bansal, and Tsendsuren Munkhdalai

In Transactions on Machine Learning Research, 2025

Model merging aims to combine multiple expert models into a more capable single model, offering benefits such as reduced storage and serving costs, improved generalization, and support for decentralized model development. Despite its promise, previous studies have primarily focused on merging a few small models. This leaves many unanswered questions about the effect of scaling model size and how it interplays with other key factors (such as the base model quality and the number of expert models) to affect the merged model’s performance. This work systematically evaluates the utility of model merging at scale, examining the impact of these different factors. We experiment with merging fully fine-tuned models using 4 popular merging methods (Averaging, Task Arithmetic, Dare, and TIES) across model sizes ranging from 1B to 64B parameters and merging up to 8 different expert models. We evaluate the merged models on both held-in tasks, i.e., the expert’s training tasks, and zero-shot generalization to unseen held-out tasks. Our experiments provide several new insights about model merging at scale and the interplay between different factors. First, we find that merging is more effective when experts are created from strong base models, i.e., models with good zero-shot performance. Second, larger models facilitate easier merging. Third, merging consistently improves generalization capabilities. Notably, when merging 8 large expert models, the merged models often generalize better compared to the multitask trained models. Fourth, we can better merge more expert models when working with larger models. Fifth, different merging methods behave very similarly at larger scales. Overall, our findings shed light on some interesting properties of model merging while also highlighting some limitations. We hope that this study will serve as a reference point on large-scale merging for upcoming research.
@inproceedings{yadav-etal-2024-matters,
  title = {What Matters for Model Merging at Scale?},
  author = {Yadav, Prateek and Vu, Tu and Lai, Jonathan and Chronopoulou, Alexandra and Faruqui, Manaal and Bansal, Mohit and Munkhdalai, Tsendsuren},
  booktitle = {Transactions on Machine Learning Research},
  year = {2025},
  url = {https://arxiv.org/abs/2410.03617},
  pdf = {https://arxiv.org/pdf/2410.03617},
}
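Two of the merging methods named in the abstract above, Averaging and Task Arithmetic, can be sketched in a few lines. This is a toy illustration under stated assumptions, not the paper's code: the `Weights` type, the function names, and the `scale` parameter follow the commonly described formulation of these methods, and Dare and TIES are not covered.

```python
from typing import Dict, List

# Hypothetical stand-in for a model state dict: parameter name -> flat weights.
Weights = Dict[str, List[float]]


def average_merge(experts: List[Weights]) -> Weights:
    """Averaging: take the element-wise mean of the expert weights."""
    n = len(experts)
    return {
        name: [sum(vals) / n for vals in zip(*(e[name] for e in experts))]
        for name in experts[0]
    }


def task_arithmetic_merge(
    base: Weights, experts: List[Weights], scale: float = 1.0
) -> Weights:
    """Task arithmetic: add the scaled sum of (expert - base) to the base."""
    return {
        name: [
            b + scale * sum(e[name][i] - b for e in experts)
            for i, b in enumerate(base_vals)
        ]
        for name, base_vals in base.items()
    }


# Toy numbers for illustration only.
base = {"w": [0.0, 1.0]}
expert_a = {"w": [1.0, 1.0]}
expert_b = {"w": [0.0, 3.0]}

averaged = average_merge([expert_a, expert_b])
summed = task_arithmetic_merge(base, [expert_a, expert_b], scale=1.0)
halved = task_arithmetic_merge(base, [expert_a, expert_b], scale=0.5)
print(averaged, summed, halved)
```

In this formulation, setting `scale` to one over the number of experts makes task arithmetic coincide with plain averaging, which is why `halved` equals `averaged` here.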

2024

EMNLP
Foundational Autoraters: Taming Large Language Models for Better Automatic Evaluation

Tu Vu*, Kalpesh Krishna*, Salaheddin Alzubi, Chris Tar, Manaal Faruqui, and Yun-Hsuan Sung

In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 2024

As large language models (LLMs) advance, it becomes more challenging to reliably evaluate their output due to the high costs of human evaluation. To make progress towards better LLM autoraters, we introduce FLAMe, a family of Foundational Large Autorater Models. FLAMe is trained on our large and diverse collection of 100+ quality assessment tasks comprising 5M+ human judgments, curated and standardized using publicly released human evaluations from previous research. FLAMe significantly improves generalization to a wide variety of held-out tasks, outperforming LLMs trained on proprietary data like GPT-4 and Claude-3 on many tasks. We show that FLAMe can also serve as a powerful starting point for further downstream fine-tuning, using reward modeling evaluation as a case study (FLAMe-RM). Notably, on RewardBench, our FLAMe-RM-24B model (with an accuracy of 87.8%) is the top-performing generative model trained exclusively on permissively licensed data, outperforming both GPT-4-0125 (85.9%) and GPT-4o (84.7%). Additionally, we explore a more computationally efficient approach using a novel tail-patch fine-tuning strategy to optimize our FLAMe multitask mixture for reward modeling evaluation (FLAMe-Opt-RM), offering competitive RewardBench performance while requiring approximately 25x fewer training datapoints. Overall, our FLAMe variants outperform all popular proprietary LLM-as-a-Judge models we consider across 8 out of 12 autorater evaluation benchmarks, encompassing 53 quality assessment tasks, including RewardBench and LLM-AggreFact. Finally, our analysis reveals that FLAMe is significantly less biased than these LLM-as-a-Judge models on the CoBBLEr autorater bias benchmark, while effectively identifying high-quality responses for code generation.
@inproceedings{vu-etal-2024-foundational,
  title = {Foundational Autoraters: Taming Large Language Models for Better Automatic Evaluation},
  author = {Vu*, Tu and Krishna*, Kalpesh and Alzubi, Salaheddin and Tar, Chris and Faruqui, Manaal and Sung, Yun-Hsuan},
  booktitle = {Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing},
  year = {2024},
  url = {https://arxiv.org/abs/2407.10817},
  pdf = {https://arxiv.org/pdf/2407.10817.pdf},
}
// The top-performing generative model on RewardBench as of July 15, 2024, trained only on publicly available data
ACL
FreshLLMs: Refreshing large language models with search engine augmentation

Tu Vu, Mohit Iyyer, Xuezhi Wang, Noah Constant, Jerry Wei, Jason Wei, Chris Tar, Yun-Hsuan Sung, Denny Zhou, Quoc Le, and Thang Luong

In Findings of the Association for Computational Linguistics: ACL 2024, 2024

Most large language models (LLMs) are trained once and never updated; thus, they lack the ability to dynamically adapt to our ever-changing world. In this work, we perform a detailed study of the factuality of LLM-generated text in the context of answering questions that test current world knowledge. Specifically, we introduce FreshQA, a novel dynamic QA benchmark encompassing a diverse range of question and answer types, including questions that require fast-changing world knowledge as well as questions with false premises that need to be debunked. We benchmark a diverse array of both closed and open-source LLMs under a two-mode evaluation procedure that allows us to measure both correctness and hallucination. Through human evaluations involving more than 50K judgments, we shed light on limitations of these models and demonstrate significant room for improvement: for instance, all models (regardless of model size) struggle on questions that involve fast-changing knowledge and false premises. Motivated by these results, we present FreshPrompt, a simple few-shot prompting method that substantially boosts the performance of an LLM on FreshQA by incorporating relevant and up-to-date information retrieved from a search engine into the prompt. Our experiments show that FreshPrompt outperforms both competing search engine-augmented prompting methods such as Self-Ask (Press et al., 2022) as well as commercial systems such as Perplexity.AI. Further analysis of FreshPrompt reveals that both the number of retrieved evidences and their order play a key role in influencing the correctness of LLM-generated answers. Additionally, instructing the LLM to generate concise and direct answers helps reduce hallucination compared to encouraging more verbose answers. To facilitate future work, we release FreshQA at github.com/freshllms/freshqa and commit to updating it at regular intervals.
@inproceedings{vu-etal-2024-freshllms,
  title = {{F}resh{LLM}s: Refreshing large language models with search engine augmentation},
  author = {Vu, Tu and Iyyer, Mohit and Wang, Xuezhi and Constant, Noah and Wei, Jerry and Wei, Jason and Tar, Chris and Sung, Yun-Hsuan and Zhou, Denny and Le, Quoc and Luong, Thang},
  booktitle = {Findings of the Association for Computational Linguistics: ACL 2024},
  year = {2024},
  pdf = {https://arxiv.org/pdf/2310.03214.pdf},
}
// Our dataset and method have inspired or been used for the development of Google’s Gemini, Perplexity.AI’s Online LLMs, You.com, and Contextual AI’s RAG 2.0
ICLR
Mixture-of-experts meets instruction tuning: A winning combination for large language models

Sheng Shen, Le Hou, Yanqi Zhou, Nan Du, Shayne Longpre, Jason Wei, Hyung Won Chung, Barret Zoph, William Fedus, Xinyun Chen, Tu Vu, Yuexin Wu, Wuyang Chen, Albert Webson, Yunxuan Li, Vincent Zhao, Hongkun Yu, Kurt Keutzer, Trevor Darrell, and Denny Zhou

In Proceedings of the 12th International Conference on Learning Representations, 2024

Sparse Mixture-of-Experts (MoE) is a neural architecture design that can be utilized to add learnable parameters to Large Language Models (LLMs) without increasing inference cost. Instruction tuning is a technique for training LLMs to follow instructions. We advocate combining these two approaches, as we find that MoE models benefit more from instruction tuning than dense models. In particular, we conduct empirical studies across three experimental setups: (i) Direct finetuning on individual downstream tasks devoid of instruction tuning; (ii) Instruction tuning followed by in-context few-shot or zero-shot generalization on downstream tasks; and (iii) Instruction tuning supplemented by further finetuning on individual downstream tasks. In the first scenario, MoE models overall underperform dense models of identical computational capacity. This narrative, however, dramatically changes with the introduction of instruction tuning (second and third scenarios), used independently or in conjunction with task-specific finetuning. Our most powerful model, FLAN-MOE-32B, surpasses the performance of FLAN-PALM-62B on four benchmark tasks, while using only a third of the FLOPs. The advancements embodied by FLAN-MOE inspire a reevaluation of the design principles of large-scale, high-performance language models in the framework of task-agnostic learning.
@inproceedings{shen2023mixture,
  title = {Mixture-of-experts meets instruction tuning: A winning combination for large language models},
  author = {Shen, Sheng and Hou, Le and Zhou, Yanqi and Du, Nan and Longpre, Shayne and Wei, Jason and Chung, Hyung Won and Zoph, Barret and Fedus, William and Chen, Xinyun and Vu, Tu and Wu, Yuexin and Chen, Wuyang and Webson, Albert and Li, Yunxuan and Zhao, Vincent and Yu, Hongkun and Keutzer, Kurt and Darrell, Trevor and Zhou, Denny},
  booktitle = {Proceedings of the 12th International Conference on Learning Representations},
  year = {2024},
  pdf = {https://arxiv.org/pdf/2305.14705.pdf},
}

2023

Technical report
Gemini: A Family of Highly Capable Multimodal Models

Google Gemini Team: Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew Dai, Anja Hauth, and others including Tu Vu

In arXiv preprint arXiv:2312.11805, 2023

This report introduces a new family of multimodal models, Gemini, that exhibit remarkable capabilities across image, audio, video, and text understanding. The Gemini family consists of Ultra, Pro, and Nano sizes, suitable for applications ranging from complex reasoning tasks to on-device memory-constrained use-cases. Evaluation on a broad range of benchmarks shows that our most-capable Gemini Ultra model advances the state of the art in 30 of 32 of these benchmarks - notably being the first model to achieve human-expert performance on the well-studied exam benchmark MMLU, and improving the state of the art in every one of the 20 multimodal benchmarks we examined. We believe that the new capabilities of the Gemini family in cross-modal reasoning and language understanding will enable a wide variety of use cases. We discuss our approach toward post-training and deploying Gemini models responsibly to users through services including Gemini, Gemini Advanced, Google AI Studio, and Cloud Vertex AI.
@inproceedings{geminiteam2024gemini,
  title = {Gemini: A Family of Highly Capable Multimodal Models},
  author = {Anil, Rohan and Borgeaud, Sebastian and Wu, Yonghui and Alayrac, Jean-Baptiste and Yu, Jiahui and Soricut, Radu and Schalkwyk, Johan and Dai, Andrew and Hauth, Anja and others},
  booktitle = {arXiv preprint arXiv:2312.11805},
  year = {2023},
  pdf = {https://arxiv.org/pdf/2312.11805},
}
// Google AI Blog
ICML
The Flan Collection: Designing Data and Methods for Effective Instruction Tuning

Shayne Longpre, Le Hou, Tu Vu, Albert Webson, Hyung Won Chung, Yi Tay, Denny Zhou, Quoc V Le, Barret Zoph, Jason Wei, and Adam Roberts

In Proceedings of the 40th International Conference on Machine Learning, 2023

We study the design decisions of publicly available instruction tuning methods by reproducing and breaking down the development of Flan 2022 (Chung et al., 2022). Through careful ablation studies on the Flan Collection of tasks and methods, we tease apart the effect of design decisions which enable Flan-T5 to outperform prior work by 3-17% across evaluation settings. We find task balancing and enrichment techniques are overlooked but critical to effective instruction tuning, and in particular, training with mixed prompt settings (zero-shot, few-shot, chain-of-thought) actually yields equivalent or stronger (2%) performance in all settings. In further experiments, we show Flan-T5 requires less finetuning to converge higher and faster than T5 on single downstream tasks, motivating instruction-tuned models as more computationally efficient starting checkpoints for new tasks. Finally, to accelerate research on instruction tuning, we make the Flan 2022 collection of datasets, templates, and methods publicly available.
@inproceedings{pmlr-v202-longpre23a,
  title = {The Flan Collection: Designing Data and Methods for Effective Instruction Tuning},
  author = {Longpre, Shayne and Hou, Le and Vu, Tu and Webson, Albert and Chung, Hyung Won and Tay, Yi and Zhou, Denny and Le, Quoc V and Zoph, Barret and Wei, Jason and Roberts, Adam},
  booktitle = {Proceedings of the 40th International Conference on Machine Learning},
  pages = {22631--22648},
  year = {2023},
  volume = {202},
  series = {Proceedings of Machine Learning Research(PMLR)},
  publisher = {PMLR},
  url = {https://proceedings.mlr.press/v202/longpre23a.html},
  pdf = {https://proceedings.mlr.press/v202/longpre23a/longpre23a.pdf},
}
// Google Research Blog
NeurIPS
Self-Evaluation Improves Selective Generation in Large Language Models

Jie Ren, Yao Zhao, Tu Vu, Peter J Liu, and Balaji Lakshminarayanan

In Proceedings on "I Can’t Believe It’s Not Better! - Failure Modes in the Age of Foundation Models" at NeurIPS 2023 Workshops, 2023

Safe deployment of large language models (LLMs) may benefit from a reliable method for assessing their generated content to determine when to abstain or to selectively generate. While likelihood-based metrics such as perplexity are widely employed, recent research has demonstrated the limitations of using sequence-level probability estimates given by LLMs as reliable indicators of generation quality. Conversely, LLMs have demonstrated strong calibration at the token level, particularly when it comes to choosing correct answers in multiple-choice questions or evaluating true/false statements. In this work, we reformulate open-ended generation tasks into token-level prediction tasks, and leverage LLMs’ superior calibration at the token level. We instruct an LLM to self-evaluate its answers, employing either a multi-way comparison or a point-wise evaluation approach, with the option to include a "None of the above" option to express the model’s uncertainty explicitly. We benchmark a range of scoring methods based on self-evaluation and evaluate their performance in selective generation using TruthfulQA and TL;DR. Through experiments with PaLM-2 and GPT-3, we demonstrate that self-evaluation based scores not only improve accuracy, but also correlate better with the overall quality of generated content.
@inproceedings{ren-etal-2023-self,
  title = {Self-Evaluation Improves Selective Generation in Large Language Models},
  author = {Ren, Jie and Zhao, Yao and Vu, Tu and Liu, Peter J and Lakshminarayanan, Balaji},
  booktitle = {Proceedings on "I Can't Believe It's Not Better! - Failure Modes in the Age of Foundation Models" at NeurIPS 2023 Workshops},
  year = {2023},
  pdf = {https://arxiv.org/pdf/2312.09300.pdf},
}
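The selective-generation idea in the abstract above (score candidate answers via a multiple-choice self-evaluation that includes a "None of the above" option, then abstain when confidence is low) can be sketched as a simple decision rule. Everything here is a hypothetical illustration: `option_probs` stands in for the probabilities a model assigns to each option, and the `threshold` and the exact abstention rule are assumptions, not the scoring methods benchmarked in the paper.

```python
from typing import Dict, Optional

NONE_OPTION = "None of the above"


def select_or_abstain(
    option_probs: Dict[str, float], threshold: float = 0.5
) -> Optional[str]:
    """Return the chosen answer, or None to abstain.

    option_probs maps each candidate answer (plus NONE_OPTION) to the
    probability the model assigns to it in a multiple-choice self-evaluation.
    """
    best = max(option_probs, key=option_probs.get)
    # Abstain if the model prefers "None of the above" or is not confident.
    if best == NONE_OPTION or option_probs[best] < threshold:
        return None
    return best


# Toy probabilities for illustration only.
confident = select_or_abstain({"A": 0.7, "B": 0.2, NONE_OPTION: 0.1})
unsure = select_or_abstain({"A": 0.4, "B": 0.35, NONE_OPTION: 0.25})
rejected = select_or_abstain({"A": 0.2, NONE_OPTION: 0.8})
print(confident, unsure, rejected)
```

Raising `threshold` trades coverage for accuracy: the rule answers fewer questions but only when the self-evaluation score is higher.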
ACL
Dialect-robust Evaluation of Generated Text

Jiao Sun, Thibault Sellam, Elizabeth Clark, Tu Vu, Timothy Dozat, Dan Garrette, Aditya Siddhant, Jacob Eisenstein, and Sebastian Gehrmann

In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2023

Text generation metrics that are not robust to dialect variation make it impossible to tell how well systems perform for many groups of users, and can even penalize systems for producing text in lower-resource dialects. In this paper, we introduce a suite of methods to assess whether metrics are dialect robust. These methods show that state-of-the-art metrics are not dialect robust: they often prioritize dialect similarity over semantics, preferring outputs that are semantically incorrect over outputs that match the semantics of the reference but contain dialect differences. As a step towards dialect-robust metrics for text generation, we propose NANO, which introduces regional and language information to the metric’s pretraining. NANO significantly improves dialect robustness while preserving the correlation between automated metrics and human ratings. It also enables a more ambitious approach to evaluation, dialect awareness, in which system outputs are scored by both semantic match to the reference and appropriateness in any specified dialect.
@inproceedings{sun-etal-2023-dialect,
  title = {Dialect-robust Evaluation of Generated Text},
  author = {Sun, Jiao and Sellam, Thibault and Clark, Elizabeth and Vu, Tu and Dozat, Timothy and Garrette, Dan and Siddhant, Aditya and Eisenstein, Jacob and Gehrmann, Sebastian},
  booktitle = {Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
  year = {2023},
  url = {https://aclanthology.org/2023.acl-long.331},
  pages = {6010--6028},
  pdf = {https://arxiv.org/pdf/2211.00922.pdf},
}

2022

ACL
SPoT: Better Frozen Model Adaptation through Soft Prompt Transfer

Tu Vu, Brian Lester, Noah Constant, Rami Al-Rfou, and Daniel Cer

In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2022

There has been growing interest in parameter-efficient methods to apply pre-trained language models to downstream tasks. Building on the Prompt Tuning approach of Lester et al. (2021), which learns task-specific soft prompts to condition a frozen pre-trained model to perform different tasks, we propose a novel prompt-based transfer learning approach called SPoT: Soft Prompt Transfer. SPoT first learns a prompt on one or more source tasks and then uses it to initialize the prompt for a target task. We show that SPoT significantly boosts the performance of Prompt Tuning across many tasks. More remarkably, across all model sizes, SPoT matches or outperforms standard Model Tuning (which fine-tunes all model parameters) on the SuperGLUE benchmark, while using up to 27,000× fewer task-specific parameters. To understand where SPoT is most effective, we conduct a large-scale study on task transferability with 26 NLP tasks in 160 combinations, and demonstrate that many tasks can benefit each other via prompt transfer. Finally, we propose an efficient retrieval approach that interprets task prompts as task embeddings to identify similar tasks and predict the most transferable source tasks for a novel target task.
@inproceedings{vu-etal-2022-spot,
  title = {{SP}o{T}: Better Frozen Model Adaptation through Soft Prompt Transfer},
  author = {Vu, Tu and Lester, Brian and Constant, Noah and Al-Rfou, Rami and Cer, Daniel},
  booktitle = {Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
  year = {2022},
  url = {https://aclanthology.org/2022.acl-long.346},
  pages = {5039--5059},
  pdf = {https://arxiv.org/pdf/2110.07904.pdf},
}
// Headlines of Google AI’s Natural Language Accelerated Newsletter Q1, 2022
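The retrieval step mentioned at the end of the SPoT abstract (treat task prompts as task embeddings and look up similar source tasks) might look roughly like the sketch below. It assumes each task is already represented by a single vector and uses cosine similarity as the measure; both choices, along with the function names and toy vectors, are assumptions made for illustration rather than details taken from the source.

```python
import math
from typing import Dict, List


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_source_tasks(
    target: List[float], sources: Dict[str, List[float]]
) -> List[str]:
    """Order source tasks from most to least similar to the target task."""
    return sorted(
        sources, key=lambda name: cosine(target, sources[name]), reverse=True
    )


# Toy task embeddings for illustration only.
target_embedding = [1.0, 0.0]
source_embeddings = {
    "task_a": [0.9, 0.1],
    "task_b": [0.0, 1.0],
    "task_c": [-1.0, 0.0],
}

ranking = rank_source_tasks(target_embedding, source_embeddings)
print(ranking)
```

The top-ranked source task would then supply the prompt used to initialize the target task's prompt, as the abstract describes.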
EMNLP
Overcoming Catastrophic Forgetting in Zero-Shot Cross-Lingual Generation

Tu Vu, Aditya Barua, Brian Lester, Daniel Cer, Mohit Iyyer, and Noah Constant

In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, 2022

In this paper, we explore the challenging problem of performing a generative task in a target language when labeled data is only available in English, using summarization as a case study. We assume a strict setting with no access to parallel data or machine translation and find that common transfer learning approaches struggle in this setting, as a generative multilingual model fine-tuned purely on English catastrophically forgets how to generate non-English. Given the recent rise of parameter-efficient adaptation techniques, we conduct the first investigation into how one such method, prompt tuning (Lester et al., 2021), can overcome catastrophic forgetting to enable zero-shot cross-lingual generation. Our experiments show that parameter-efficient prompt tuning provides gains over standard fine-tuning when transferring between less-related languages, e.g., from English to Thai. However, a significant gap still remains between these methods and fully-supervised baselines. To improve cross-lingual transfer further, we explore several approaches, including: (1) mixing in unlabeled multilingual data, and (2) explicitly factoring prompts into recombinable language and task components. Our approaches can provide further quality gains, suggesting that robust zero-shot cross-lingual generation is within reach.
@inproceedings{vu-etal-2022-overcoming,
  title = {Overcoming Catastrophic Forgetting in Zero-Shot Cross-Lingual Generation},
  author = {Vu, Tu and Barua, Aditya and Lester, Brian and Cer, Daniel and Iyyer, Mohit and Constant, Noah},
  booktitle = {Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing},
  year = {2022},
  url = {https://aclanthology.org/2022.emnlp-main.630},
  pages = {9279--9300},
  pdf = {https://arxiv.org/pdf/2205.12647.pdf},
}
EMNLP
Leveraging QA Datasets to Improve Generative Data Augmentation

Dheeraj Mekala, Tu Vu, Timo Schick, and Jingbo Shang

In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, 2022

The ability of generative language models (GLMs) to generate text has improved considerably in the last few years, enabling their use for generative data augmentation. In this work, we propose CONDA, an approach to further improve GLM’s ability to generate synthetic data by reformulating data generation as context generation for a given question-answer (QA) pair and leveraging QA datasets for training context generators. Then, we cast downstream tasks into the same question answering format and adapt the fine-tuned context generators to the target task domain. Finally, we use the fine-tuned GLM to generate relevant contexts, which are in turn used as synthetic training data for their corresponding tasks. We perform extensive experiments on multiple classification datasets and demonstrate substantial improvements in performance for both few- and zero-shot settings. Our analysis reveals that QA datasets that require high-level reasoning abilities (e.g., abstractive and common-sense QA datasets) tend to give the best boost in performance in both few-shot and zero-shot settings.
@inproceedings{mekala-etal-2022-leveraging,
  title = {Leveraging {QA} Datasets to Improve Generative Data Augmentation},
  author = {Mekala, Dheeraj and Vu, Tu and Schick, Timo and Shang, Jingbo},
  booktitle = {Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing},
  year = {2022},
  url = {https://aclanthology.org/2022.emnlp-main.660},
  pages = {9737--9750},
  pdf = {https://arxiv.org/pdf/2205.12604.pdf},
}

2021

EMNLP
STraTA: Self-Training with Task Augmentation for Better Few-shot Learning

Tu Vu, Thang Luong, Quoc Le, Grady Simon, and Mohit Iyyer

In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, 2021

Despite their recent successes in tackling many NLP tasks, large-scale pre-trained language models do not perform as well in few-shot settings where only a handful of training examples are available. To address this shortcoming, we propose STraTA, which stands for Self-Training with Task Augmentation, an approach that builds on two key ideas for effective leverage of unlabeled data. First, STraTA uses task augmentation, a novel technique that synthesizes a large amount of data for auxiliary-task fine-tuning from target-task unlabeled texts. Second, STraTA performs self-training by further fine-tuning the strong base model created by task augmentation on a broad distribution of pseudo-labeled data. Our experiments demonstrate that STraTA can substantially improve sample efficiency across 12 few-shot benchmarks. Remarkably, on the SST-2 sentiment dataset, STraTA, with only 8 training examples per class, achieves comparable results to standard fine-tuning with 67K training examples. Our analyses reveal that task augmentation and self-training are both complementary and independently effective.
@inproceedings{vu-etal-2021-strata, title = {{ST}ra{TA}: Self-Training with Task Augmentation for Better Few-shot Learning}, author = {Vu, Tu and Luong, Thang and Le, Quoc and Simon, Grady and Iyyer, Mohit}, booktitle = {Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing}, year = {2021}, url = {https://aclanthology.org/2021.emnlp-main.462}, pages = {5715--5731}, pdf = {https://arxiv.org/pdf/2109.06270.pdf}, }

2020

EMNLP
Exploring and Predicting Transferability across NLP Tasks

Tu Vu, Tong Wang, Tsendsuren Munkhdalai, Alessandro Sordoni, Adam Trischler, Andrew Mattarella-Micke, Subhransu Maji, and Mohit Iyyer

In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, 2020

Recent advances in NLP demonstrate the effectiveness of training large-scale language models and transferring them to downstream tasks. Can fine-tuning these models on tasks other than language modeling further improve performance? In this paper, we conduct an extensive study of the transferability between 33 NLP tasks across three broad classes of problems (text classification, question answering, and sequence labeling). Our results show that transfer learning is more beneficial than previously thought, especially when target task data is scarce, and can improve performance even with low-data source tasks that differ substantially from the target task (e.g., part-of-speech tagging transfers well to the DROP QA dataset). We also develop task embeddings that can be used to predict the most transferable source tasks for a given target task, and we validate their effectiveness in experiments controlled for source and target data size. Overall, our experiments reveal that factors such as data size, task and domain similarity, and task complexity all play a role in determining transferability.
@inproceedings{vu-etal-2020-exploring, title = {Exploring and Predicting Transferability across {NLP} Tasks}, author = {Vu, Tu and Wang, Tong and Munkhdalai, Tsendsuren and Sordoni, Alessandro and Trischler, Adam and Mattarella-Micke, Andrew and Maji, Subhransu and Iyyer, Mohit}, booktitle = {Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing}, year = {2020}, url = {https://aclanthology.org/2020.emnlp-main.635}, pages = {7882--7926}, pdf = {https://arxiv.org/pdf/2005.00770.pdf}, }

2019

ACL
Encouraging Paragraph Embeddings to Remember Sentence Identity Improves Classification

Tu Vu and Mohit Iyyer

In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019

While paragraph embedding models are remarkably effective for downstream classification tasks, what they learn and encode into a single vector remains opaque. In this paper, we investigate a state-of-the-art paragraph embedding method proposed by Zhang et al. (2017) and discover that it cannot reliably tell whether a given sentence occurs in the input paragraph or not. We formulate a sentence content task to probe for this basic linguistic property and find that even a much simpler bag-of-words method has no trouble solving it. This result motivates us to replace the reconstruction-based objective of Zhang et al. (2017) with our sentence content probe objective in a semi-supervised setting. Despite its simplicity, our objective improves over paragraph reconstruction in terms of (1) downstream classification accuracies on benchmark datasets, (2) faster training, and (3) better generalization ability.
@inproceedings{vu-iyyer-2019-encouraging, title = {Encouraging Paragraph Embeddings to Remember Sentence Identity Improves Classification}, author = {Vu, Tu and Iyyer, Mohit}, booktitle = {Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics}, year = {2019}, url = {https://aclanthology.org/P19-1638}, pages = {6331--6338}, pdf = {https://arxiv.org/pdf/1906.03656.pdf}, }

2018

NAACL
Sentence Simplification with Memory-Augmented Neural Networks

Tu Vu, Baotian Hu, Tsendsuren Munkhdalai, and Hong Yu

In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), 2018

Sentence simplification aims to simplify the content and structure of complex sentences, and thus make them easier to interpret for human readers, and easier to process for downstream NLP applications. Recent advances in neural machine translation have paved the way for novel approaches to the task. In this paper, we adapt an architecture with augmented memory capacities called Neural Semantic Encoders (Munkhdalai and Yu, 2017) for sentence simplification. Our experiments demonstrate the effectiveness of our approach on different simplification datasets, both in terms of automatic evaluation measures and human judgments.
@inproceedings{vu-etal-2018-sentence, title = {Sentence Simplification with Memory-Augmented Neural Networks}, author = {Vu, Tu and Hu, Baotian and Munkhdalai, Tsendsuren and Yu, Hong}, booktitle = {Proceedings of the 2018 Conference of the North {A}merican Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers)}, year = {2018}, url = {https://aclanthology.org/N18-2013}, pages = {79--85}, pdf = {https://arxiv.org/pdf/1804.07445.pdf}, }

*SEM@NAACL
Integrating Multiplicative Features into Supervised Distributional Methods for Lexical Entailment

Tu Vu and Vered Shwartz

In Proceedings of the Seventh Joint Conference on Lexical and Computational Semantics, 2018

Supervised distributional methods are applied successfully in lexical entailment, but recent work questioned whether these methods actually learn a relation between two words. Specifically, Levy et al. (2015) claimed that linear classifiers learn only separate properties of each word. We suggest a cheap and easy way to boost the performance of these methods by integrating multiplicative features into commonly used representations. We provide an extensive evaluation with different classifiers and evaluation setups, and suggest a suitable evaluation setup for the task, eliminating biases existing in previous ones.
@inproceedings{vu-shwartz-2018-integrating, title = {Integrating Multiplicative Features into Supervised Distributional Methods for Lexical Entailment}, author = {Vu, Tu and Shwartz, Vered}, booktitle = {Proceedings of the Seventh Joint Conference on Lexical and Computational Semantics}, year = {2018}, url = {https://aclanthology.org/S18-2020}, pages = {160--166}, pdf = {https://arxiv.org/pdf/1804.08845.pdf}, }