Tu Vu

I am an Assistant Professor at Virginia Tech (VT). I am also a Faculty Researcher at Google. Prior to joining VT, I held the position of Research Scientist at Google DeepMind for a year after receiving my PhD in Computer Science from the University of Massachusetts Amherst, advised by Mohit Iyyer.

🔍 My research aims to develop effective and efficient methods for advancing and democratizing artificial intelligence in the era of large language models (LLMs). Specific areas of focus include:

Advancing LLMs: improving LLMs’ critical capabilities, including reasoning and instruction following, and their emergent use in evaluation (e.g., LM-as-a-Judge or LLM-as-a-Critic)
Transfer learning: reusing learned knowledge or components effectively across settings (e.g., tasks, languages, modalities, or models)
LLM updating: keeping LLMs current by efficiently incorporating factual and up-to-date information with minimal retraining
Parameter-efficient adaptation: adjusting LLMs to new distributions (e.g., unseen tasks, domains, or languages) efficiently, especially in low-resource settings.

⭐ For prospective PhD students

I plan to recruit one new PhD student every year. If you are interested in joining my group, please apply to the VT Graduate School and list me as a potential advisor. Please also check out the application deadlines and information for prospective students. Due to the high volume of emails I receive, I may not be able to respond to each one individually; please don't be discouraged — I may still review your application.

⭐ For Undergraduate and Masters students at VT

I am happy to collaborate on research with current VT students who have at least one full academic year until graduation. If you are interested, feel free to email me. I will follow up if there is a good fit.

Recent news

Jun. 2025	One paper to appear at TMLR 2025 on model merging at scale!
Jun. 2025	Invited guest lecture at The New Turing Institute
Jun. 2025	New preprint on a challenge benchmark for LLM reasoning over conflicting evidence
Apr. 2025	I received the New Faculty Mentoring Grant from VT
Mar. 2025	New preprint on fine-tuning transfer for efficient model development
Nov. 2024	Our lab received a research gift from Adobe
Nov. 2024	✈️ Attended EMNLP 2024 in Miami, Florida 🌴
Nov. 2024	Invited talk at Qualcomm Seminar Series
Oct. 2024	Invited talk at Mila / McGill NLP seminar
Oct. 2024	New preprint on model merging at scale
Sep. 2024	One paper to appear at EMNLP 2024 on foundational autoraters (FLAMe)!
Aug. 2024	I started my professorship at Virginia Tech
Jul. 2024	New preprint on Foundational Autoraters (FLAMe)
May. 2024	FreshLLMs got accepted to ACL 2024 Findings!
Feb. 2024	I am now serving as an Area Chair for ACL Rolling Review (ARR)
Jan. 2024	Flan-MoE got accepted to ICLR 2024!
Nov. 2023	Invited talk at Graph Neural Networks Reading Group, Google
Oct. 2023	New preprint on LLM factuality (FreshLLMs)
Aug. 2023	I joined Google DeepMind in Mountain View, CA as a Research Scientist
Jul. 2023	I successfully defended my PhD thesis!

Teaching

	CS 4804: Introduction to AI (Fall 2025)
	CS 5624: Natural Language Processing (Spring 2025)

Advisees

Group:

	Weiyuan Chen (Incoming PhD student @ VT // advanced reasoning)
	Jing Chen (Incoming PhD student @ VT // TBD)
	Yu-Min Tseng (Incoming PhD student @ VT // TBD)
	Noah Provenzano (1^st year MS student @ VT // advanced reasoning)
	Rituraj Sharma (Senior student & incoming MS student @ VT // advanced reasoning)
	Nguyen Nguyen (Sophomore student @ VT // search-augmented LLMs)
	Thinh Pham (1^st year PhD student @ VT // search-augmented LLMs)
	Rishab Balasubramanian (1^st year PhD student @ VT // cross-model knowledge transfer)
	Quyet Do (1^st year PhD student @ VT // instruction following)
	Pin-Jie Lin (1^st year PhD student @ VT // efficient model development)

Others:

	Zhenting Qi (Student Researcher @ Google, Summer 2025)
	Prateek Yadav (Research Intern @ Google DeepMind, Summer 2024 — Spring 2025)
	Simeng Han (Student Researcher @ Google DeepMind, Summer 2024 — Spring 2025)
	Salaheddin Alzubi (Masters student @ UMass Amherst, Fall 2022 — Spring 2023)
	Dheeraj Mekala (PhD student @ UCSD, Spring — Summer 2022)

Preprints

Preprint
SealQA: Raising the Bar for Reasoning in Search-Augmented Language Models

Thinh Pham, Nguyen Nguyen, Pratibha Zunjare, Weiyuan Chen, Yu-Min Tseng, and Tu Vu

In arXiv preprint arXiv:2506.01062, 2025

Abs arXiv Bib PDF

We introduce SealQA, a new challenge benchmark for evaluating SEarch-Augmented Language models on fact-seeking questions where web search yields conflicting, noisy, or unhelpful results. SealQA comes in three flavors: (1) Seal-0 (main) and (2) Seal-Hard, which assess factual accuracy and reasoning capabilities, with Seal-0 focusing on the most challenging questions where chat models (e.g., GPT-4.1) typically achieve near-zero accuracy; and (3) LongSeal, which extends SealQA to test long-context, multi-document reasoning in "needle-in-a-haystack" settings. Our evaluation reveals critical limitations in current models: Even frontier LLMs perform poorly across all SealQA flavors. On Seal-0, frontier agentic models equipped with tools like o3 and o4-mini achieve only 17.1% and 6.3% accuracy, respectively, at their best reasoning efforts. We find that advanced reasoning models such as DeepSeek-R1-671B and o3-mini are highly vulnerable to noisy search results. Notably, increasing test-time compute does not yield reliable gains across o3-mini, o4-mini, and o3, with performance often plateauing or even declining early. Additionally, while recent models are less affected by the "lost-in-the-middle" issue, they still fail to reliably identify relevant documents in LongSeal when faced with numerous distractors. To facilitate future work, we release SealQA at http://huggingface.co/datasets/vtllms/sealqa.
@inproceedings{pham-etal-2025-sealqa, title = {SealQA: Raising the Bar for Reasoning in Search-Augmented Language Models}, author = {Pham, Thinh and Nguyen, Nguyen and Zunjare, Pratibha and Chen, Weiyuan and Tseng, Yu-Min and Vu, Tu}, booktitle = {arXiv preprint arXiv:2506.01062}, year = {2025}, pdf = {https://arxiv.org/pdf/2506.01062}, preprint = {true}, }
Preprint
Efficient Model Development through Fine-tuning Transfer

Pin-Jie Lin, Rishab Balasubramanian, Fengyuan Liu, Nikhil Kandpal, and Tu Vu

In arXiv preprint arXiv:2503.20110, 2025

Abs arXiv Bib PDF

Modern LLMs struggle with efficient updates, as each new pretrained model version requires repeating expensive alignment processes. This challenge also applies to domain- or language-specific models, where fine-tuning on specialized data must be redone for every new base model release. In this paper, we explore the transfer of fine-tuning updates between model versions. Specifically, we derive the diff vector from one source model version, which represents the weight changes from fine-tuning, and apply it to the base model of a different target version. Through empirical evaluations on various open-weight model versions, we show that transferring diff vectors can significantly improve the target base model, often achieving performance comparable to its fine-tuned counterpart. For example, reusing the fine-tuning updates from Llama 3.0 8B leads to an absolute accuracy improvement of 10.7% on GPQA over the base Llama 3.1 8B without additional training, surpassing Llama 3.1 8B Instruct. In a multilingual model development setting, we show that this approach can significantly increase performance on target-language tasks without retraining, achieving an absolute improvement of 4.7% and 15.5% on Global MMLU for Malagasy and Turkish, respectively, compared to Llama 3.1 8B Instruct. Our controlled experiments reveal that fine-tuning transfer is most effective when the source and target models are linearly connected in the parameter space. Additionally, we demonstrate that fine-tuning transfer offers a stronger and more computationally efficient starting point for further fine-tuning. Finally, we propose an iterative recycling-then-finetuning approach for continuous model development, which improves both efficiency and effectiveness. Our findings suggest that fine-tuning transfer is a viable strategy to reduce training costs while maintaining model performance.
@inproceedings{lin-etal-2025-efficient, title = {Efficient Model Development through Fine-tuning Transfer}, author = {Lin, Pin-Jie and Balasubramanian, Rishab and Liu, Fengyuan and Kandpal, Nikhil and Vu, Tu}, booktitle = {arXiv preprint arXiv:2503.20110}, year = {2025}, pdf = {https://arxiv.org/pdf/2503.20110}, preprint = {true}, }

Selected publications

For an up-to-date list of my research papers, please see my Google Scholar profile. * denotes equal contribution.

EMNLP
Foundational Autoraters: Taming Large Language Models for Better Automatic Evaluation

Tu Vu*, Kalpesh Krishna*, Salaheddin Alzubi, Chris Tar, Manaal Faruqui, and Yun-Hsuan Sung

In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 2024

Abs arXiv Bib PDF

As large language models (LLMs) advance, it becomes more challenging to reliably evaluate their output due to the high costs of human evaluation. To make progress towards better LLM autoraters, we introduce FLAMe, a family of Foundational Large Autorater Models. FLAMe is trained on our large and diverse collection of 100+ quality assessment tasks comprising 5M+ human judgments, curated and standardized using publicly released human evaluations from previous research. FLAMe significantly improves generalization to a wide variety of held-out tasks, outperforming LLMs trained on proprietary data like GPT-4 and Claude-3 on many tasks. We show that FLAMe can also serve as a powerful starting point for further downstream fine-tuning, using reward modeling evaluation as a case study (FLAMe-RM). Notably, on RewardBench, our FLAMe-RM-24B model (with an accuracy of 87.8%) is the top-performing generative model trained exclusively on permissively licensed data, outperforming both GPT-4-0125 (85.9%) and GPT-4o (84.7%). Additionally, we explore a more computationally efficient approach using a novel tail-patch fine-tuning strategy to optimize our FLAMe multitask mixture for reward modeling evaluation (FLAMe-Opt-RM), offering competitive RewardBench performance while requiring approximately 25x less training datapoints. Overall, our FLAMe variants outperform all popular proprietary LLM-as-a-Judge models we consider across 8 out of 12 autorater evaluation benchmarks, encompassing 53 quality assessment tasks, including RewardBench and LLM-AggreFact. Finally, our analysis reveals that FLAMe is significantly less biased than these LLM-as-a-Judge models on the CoBBLEr autorater bias benchmark, while effectively identifying high-quality responses for code generation.
@inproceedings{vu-etal-2024-foundational, title = {Foundational Autoraters: Taming Large Language Models for Better Automatic Evaluation}, author = {Vu*, Tu and Krishna*, Kalpesh and Alzubi, Salaheddin and Tar, Chris and Faruqui, Manaal and Sung, Yun-Hsuan}, booktitle = {Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing}, year = {2024}, url = {https://arxiv.org/abs/2407.10817}, pdf = {https://arxiv.org/pdf/2407.10817.pdf}, }
// The top-performing generative model on RewardBench as of July 15, 2024, trained only on publicly available data
TMLR
What Matters for Model Merging at Scale?

Prateek Yadav, Tu Vu, Jonathan Lai, Alexandra Chronopoulou, Manaal Faruqui, Mohit Bansal, and Tsendsuren Munkhdalai

In Transactions on Machine Learning Research, 2025

Abs arXiv Bib PDF

Model merging aims to combine multiple expert models into a more capable single model, offering benefits such as reduced storage and serving costs, improved generalization, and support for decentralized model development. Despite its promise, previous studies have primarily focused on merging a few small models. This leaves many unanswered questions about the effect of scaling model size and how it interplays with other key factors – like the base model quality and number of expert models – , to affect the merged model’s performance. This work systematically evaluates the utility of model merging at scale, examining the impact of these different factors. We experiment with merging fully fine-tuned models using 4 popular merging methods – Averaging, Task Arithmetic, Dare, and TIES – across model sizes ranging from 1B-64B parameters and merging up to 8 different expert models. We evaluate the merged models on both held-in tasks, i.e., the expert’s training tasks, and zero-shot generalization to unseen held-out tasks. Our experiments provide several new insights about model merging at scale and the interplay between different factors. First, we find that merging is more effective when experts are created from strong base models, i.e., models with good zero-shot performance. Second, larger models facilitate easier merging. Third merging consistently improves generalization capabilities. Notably, when merging 8 large expert models, the merged models often generalize better compared to the multitask trained models. Fourth, we can better merge more expert models when working with larger models. Fifth, different merging methods behave very similarly at larger scales. Overall, our findings shed light on some interesting properties of model merging while also highlighting some limitations. We hope that this study will serve as a reference point on large-scale merging for upcoming research.
@inproceedings{yadav-etal-2024-matters, title = {What Matters for Model Merging at Scale?}, author = {Yadav, Prateek and Vu, Tu and Lai, Jonathan and Chronopoulou, Alexandra and Faruqui, Manaal and Bansal, Mohit and Munkhdalai, Tsendsuren}, booktitle = {Transactions on Machine Learning Research}, year = {2025}, url = {https://arxiv.org/abs/2410.03617}, pdf = {https://arxiv.org/pdf/2410.03617}, }
Technical report
Gemini: A Family of Highly Capable Multimodal Models

Google Gemini Team: Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew Dai, Anja Hauth, and others including Tu Vu

In arXiv preprint arXiv:2312.11805, 2023

Abs arXiv Bib PDF

This report introduces a new family of multimodal models, Gemini, that exhibit remarkable capabilities across image, audio, video, and text understanding. The Gemini family consists of Ultra, Pro, and Nano sizes, suitable for applications ranging from complex reasoning tasks to on-device memory-constrained use-cases. Evaluation on a broad range of benchmarks shows that our most-capable Gemini Ultra model advances the state of the art in 30 of 32 of these benchmarks - notably being the first model to achieve human-expert performance on the well-studied exam benchmark MMLU, and improving the state of the art in every one of the 20 multimodal benchmarks we examined. We believe that the new capabilities of the Gemini family in cross-modal reasoning and language understanding will enable a wide variety of use cases. We discuss our approach toward post-training and deploying Gemini models responsibly to users through services including Gemini, Gemini Advanced, Google AI Studio, and Cloud Vertex AI.
@inproceedings{geminiteam2024gemini, title = {Gemini: A Family of Highly Capable Multimodal Models}, author = {Anil, Rohan and Borgeaud, Sebastian and Wu, Yonghui and Alayrac, Jean-Baptiste and Yu, Jiahui and Soricut, Radu and Schalkwyk, Johan and Dai, Andrew and Hauth, Anja and others}, booktitle = {arXiv preprint arXiv:2312.11805}, year = {2023}, pdf = {https://arxiv.org/pdf/2312.11805}, }
// Google AI Blog
ACL
FreshLLMs: Refreshing large language models with search engine augmentation

Tu Vu, Mohit Iyyer, Xuezhi Wang, Noah Constant, Jerry Wei, Jason Wei, Chris Tar, Yun-Hsuan Sung, Denny Zhou, Quoc Le, and Thang Luong

In Findings of the Association for Computational Linguistics: ACL 2024, 2024

Abs arXiv Bib PDF

Most large language models (LLMs) are trained once and never updated; thus, they lack the ability to dynamically adapt to our ever-changing world. In this work, we perform a detailed study of the factuality of LLM-generated text in the context of answering questions that test current world knowledge. Specifically, we introduce FreshQA, a novel dynamic QA benchmark encompassing a diverse range of question and answer types, including questions that require fast-changing world knowledge as well as questions with false premises that need to be debunked. We benchmark a diverse array of both closed and open-source LLMs under a two-mode evaluation procedure that allows us to measure both correctness and hallucination. Through human evaluations involving more than 50K judgments, we shed light on limitations of these models and demonstrate significant room for improvement: for instance, all models (regardless of model size) struggle on questions that involve fast-changing knowledge and false premises. Motivated by these results, we present FreshPrompt, a simple few-shot prompting method that substantially boosts the performance of an LLM on FreshQA by incorporating relevant and up-to-date information retrieved from a search engine into the prompt. Our experiments show that FreshPrompt outperforms both competing search engine-augmented prompting methods such as Self-Ask (Press et al., 2022) as well as commercial systems such as Perplexity.AI. Further analysis of FreshPrompt reveals that both the number of retrieved evidences and their order play a key role in influencing the correctness of LLM-generated answers. Additionally, instructing the LLM to generate concise and direct answers helps reduce hallucination compared to encouraging more verbose answers. To facilitate future work, we release FreshQA at github.com/freshllms/freshqa and commit to updating it at regular intervals.
@inproceedings{vu-etal-2024-freshllms, title = {{F}resh{LLM}s: Refreshing large language models with search engine augmentation}, author = {Vu, Tu and Iyyer, Mohit and Wang, Xuezhi and Constant, Noah and Wei, Jerry and Wei, Jason and Tar, Chris and Sung, Yun-Hsuan and Zhou, Denny and Le, Quoc and Luong, Thang}, booktitle = {Findings of the Association for Computational Linguistics: ACL 2024}, year = {2024}, url = {}, pages = {}, pdf = {https://arxiv.org/pdf/2310.03214.pdf}, }
// Our dataset and method have inspired or been used for the development of Google’s Gemini, Perplexity.AI’s Online LLMs, You.com, and Contextual AI’s RAG 2.0
ICML
The Flan Collection: Designing Data and Methods for Effective Instruction Tuning

Shayne Longpre, Le Hou, Tu Vu, Albert Webson, Hyung Won Chung, Yi Tay, Denny Zhou, Quoc V Le, Barret Zoph, Jason Wei, and Adam Roberts

In Proceedings of the 40th International Conference on Machine Learning, 2023

Abs arXiv Bib PDF Data

We study the design decision of publicly available instruction tuning methods, by reproducing and breaking down the development of Flan 2022 (Chung et al., 2022). Through careful ablation studies on the Flan Collection of tasks and methods, we tease apart the effect of design decisions which enable Flan-T5 to outperform prior work by 3-17% across evaluation settings. We find task balancing and enrichment techniques are overlooked but critical to effective instruction tuning, and in particular, training with mixed prompt settings (zero-shot, few-shot, chain-of-thought) actually yields equivalent or stronger (2%) performance in all settings. In further experiments we show Flan-T5 requires less finetuning to converge higher and faster than T5 on single downstream tasks – motivating instruction-tuned models as more computationally-efficient starting checkpoints for new tasks. Finally, to accelerate research on instruction tuning, we make the Flan 2022 collection of datasets, templates, and methods publicly available.
@inproceedings{pmlr-v202-longpre23a, title = {The Flan Collection: Designing Data and Methods for Effective Instruction Tuning}, author = {Longpre, Shayne and Hou, Le and Vu, Tu and Webson, Albert and Chung, Hyung Won and Tay, Yi and Zhou, Denny and Le, Quoc V and Zoph, Barret and Wei, Jason and Roberts, Adam}, booktitle = {Proceedings of the 40th International Conference on Machine Learning}, pages = {22631--22648}, year = {2023}, volume = {202}, series = {Proceedings of Machine Learning Research(PMLR)}, publisher = {PMLR}, url = {https://proceedings.mlr.press/v202/longpre23a.html}, pdf = {https://proceedings.mlr.press/v202/longpre23a/longpre23a.pdf}, }
// Google Research Blog
ICLR
Mixture-of-experts meets instruction tuning: A winning combination for large language models

Sheng Shen, Le Hou, Yanqi Zhou, Nan Du, Shayne Longpre, Jason Wei, Hyung Won Chung, Barret Zoph, William Fedus, Xinyun Chen, Tu Vu, Yuexin Wu, Wuyang Chen, Albert Webson, Yunxuan Li, Vincent Zhao, Hongkun Yu, Kurt Keutzer, Trevor Darrell, and Denny Zhou

In Proceedings of the 12th International Conference on Learning Representations, 2024

Abs arXiv Bib PDF

Sparse Mixture-of-Experts (MoE) is a neural architecture design that can be utilized to add learnable parameters to Large Language Models (LLMs) without increasing inference cost. Instruction tuning is a technique for training LLMs to follow instructions. We advocate combining these two approaches, as we find that MoE models benefit more from instruction tuning than dense models. In particular, we conduct empirical studies across three experimental setups: (i) Direct finetuning on individual downstream tasks devoid of instruction tuning; (ii) Instructiontuning followed by in-context few-shot or zero-shot generalization on downstream tasks; and (iii) Instruction tuning supplemented by further finetuning on individual downstream tasks. In the first scenario, MoE models overall underperform dense models of identical computational capacity. This narrative, however, dramatically changes with the introduction of instruction tuning (second and third scenario), used independently or in conjunction with task-specific finetuning. Our most powerful model, FLAN-MOE-32B, surpasses the performance of FLAN-PALM-62B on four benchmark tasks, while using only a third of the FLOPs. The advancements embodied byFLAN-MOE inspire a reevaluation of the design principles of large-scale, high-performance language models in the framework of task-agnostic learning.
@inproceedings{shen2023mixture, title = {Mixture-of-experts meets instruction tuning: A winning combination for large language models}, author = {Shen, Sheng and Hou, Le and Zhou, Yanqi and Du, Nan and Longpre, Shayne and Wei, Jason and Chung, Hyung Won and Zoph, Barret and Fedus, William and Chen, Xinyun and Vu, Tu and Wu, Yuexin and Chen, Wuyang and Webson, Albert and Li, Yunxuan and Zhao, Vincent and Yu, Hongkun and Keutzer, Kurt and Darrell, Trevor and Zhou, Denny}, booktitle = {Proceedings of the 12th International Conference on Learning Representations}, year = {2024}, pdf = {https://arxiv.org/pdf/2305.14705.pdf}, }
ACL
SPoT: Better Frozen Model Adaptation through Soft Prompt Transfer

Tu Vu, Brian Lester, Noah Constant, Rami Al-Rfou, and Daniel Cer

In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2022

Abs arXiv Bib PDF

There has been growing interest in parameter-efficient methods to apply pre-trained language models to downstream tasks. Building on the Prompt Tuning approach of Lester et al. (2021), which learns task-specific soft prompts to condition a frozen pre-trained model to perform different tasks, we propose a novel prompt-based transfer learning approach called SPoT: Soft Prompt Transfer. SPoT first learns a prompt on one or more source tasks and then uses it to initialize the prompt for a target task. We show that SPoT significantly boosts the performance of Prompt Tuning across many tasks. More remarkably, across all model sizes, SPoT matches or outperforms standard Model Tuning (which fine-tunes all model parameters) on the SuperGLUE benchmark, while using up to 27,000\mbox\times fewer task-specific parameters. To understand where SPoT is most effective, we conduct a large-scale study on task transferability with 26 NLP tasks in 160 combinations, and demonstrate that many tasks can benefit each other via prompt transfer. Finally, we propose an efficient retrieval approach that interprets task prompts as task embeddings to identify similar tasks and predict the most transferable source tasks for a novel target task.
@inproceedings{vu-etal-2022-spot, title = {{SP}o{T}: Better Frozen Model Adaptation through Soft Prompt Transfer}, author = {Vu, Tu and Lester, Brian and Constant, Noah and Al-Rfou, Rami and Cer, Daniel}, booktitle = {Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)}, year = {2022}, url = {https://aclanthology.org/2022.acl-long.346}, pages = {5039--5059}, pdf = {https://arxiv.org/pdf/2110.07904.pdf}, }
// Headlines of Google AI’s Natural Language Accelerated Newsletter Q1, 2022
EMNLP
Overcoming Catastrophic Forgetting in Zero-Shot Cross-Lingual Generation

Tu Vu, Aditya Barua, Brian Lester, Daniel Cer, Mohit Iyyer, and Noah Constant

In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, 2022

Abs arXiv Bib PDF

In this paper, we explore the challenging problem of performing a generative task in a target language when labeled data is only available in English, using summarization as a case study. We assume a strict setting with no access to parallel data or machine translation and find that common transfer learning approaches struggle in this setting, as a generative multilingual model fine-tuned purely on English catastrophically forgets how to generate non-English. Given the recent rise of parameter-efficient adaptation techniques, we conduct the first investigation into how one such method, prompt tuning (Lester et al., 2021), can overcome catastrophic forgetting to enable zero-shot cross-lingual generation. Our experiments show that parameter-efficient prompt tuning provides gains over standard fine-tuning when transferring between less-related languages, e.g., from English to Thai. However, a significant gap still remains between these methods and fully-supervised baselines. To improve cross-lingual transfer further, we explore several approaches, including: (1) mixing in unlabeled multilingual data, and (2) explicitly factoring prompts into recombinable language and task components. Our approaches can provide further quality gains, suggesting that robust zero-shot cross-lingual generation is within reach.
@inproceedings{vu-etal-2022-overcoming, title = {Overcoming Catastrophic Forgetting in Zero-Shot Cross-Lingual Generation}, author = {Vu, Tu and Barua, Aditya and Lester, Brian and Cer, Daniel and Iyyer, Mohit and Constant, Noah}, booktitle = {Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing}, year = {2022}, url = {https://aclanthology.org/2022.emnlp-main.630}, pages = {9279--9300}, pdf = {https://arxiv.org/pdf/2205.12647.pdf}, }
EMNLP
STraTA: Self-Training with Task Augmentation for Better Few-shot Learning

Tu Vu, Thang Luong, Quoc Le, Grady Simon, and Mohit Iyyer

In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, 2021

Abs arXiv Bib PDF

Despite their recent successes in tackling many NLP tasks, large-scale pre-trained language models do not perform as well in few-shot settings where only a handful of training examples are available. To address this shortcoming, we propose STraTA, which stands for Self-Training with Task Augmentation, an approach that builds on two key ideas for effective leverage of unlabeled data. First, STraTA uses task augmentation, a novel technique that synthesizes a large amount of data for auxiliary-task fine-tuning from target-task unlabeled texts. Second, STraTA performs self-training by further fine-tuning the strong base model created by task augmentation on a broad distribution of pseudo-labeled data. Our experiments demonstrate that STraTA can substantially improve sample efficiency across 12 few-shot benchmarks. Remarkably, on the SST-2 sentiment dataset, STraTA, with only 8 training examples per class, achieves comparable results to standard fine-tuning with 67K training examples. Our analyses reveal that task augmentation and self-training are both complementary and independently effective.
@inproceedings{vu-etal-2021-strata, title = {{ST}ra{TA}: Self-Training with Task Augmentation for Better Few-shot Learning}, author = {Vu, Tu and Luong, Thang and Le, Quoc and Simon, Grady and Iyyer, Mohit}, booktitle = {Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing}, year = {2021}, url = {https://aclanthology.org/2021.emnlp-main.462}, pages = {5715--5731}, pdf = {https://arxiv.org/pdf/2109.06270.pdf}, }
EMNLP
Exploring and Predicting Transferability across NLP Tasks

Tu Vu, Tong Wang, Tsendsuren Munkhdalai, Alessandro Sordoni, Adam Trischler, Andrew Mattarella-Micke, Subhransu Maji, and Mohit Iyyer

In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, 2020

Abs arXiv Bib PDF

Recent advances in NLP demonstrate the effectiveness of training large-scale language models and transferring them to downstream tasks. Can fine-tuning these models on tasks other than language modeling further improve performance? In this paper, we conduct an extensive study of the transferability between 33 NLP tasks across three broad classes of problems (text classification, question answering, and sequence labeling). Our results show that transfer learning is more beneficial than previously thought, especially when target task data is scarce, and can improve performance even with low-data source tasks that differ substantially from the target task (e.g., part-of-speech tagging transfers well to the DROP QA dataset). We also develop task embeddings that can be used to predict the most transferable source tasks for a given target task, and we validate their effectiveness in experiments controlled for source and target data size. Overall, our experiments reveal that factors such as data size, task and domain similarity, and task complexity all play a role in determining transferability.
@inproceedings{vu-etal-2020-exploring, title = {Exploring and Predicting Transferability across {NLP} Tasks}, author = {Vu, Tu and Wang, Tong and Munkhdalai, Tsendsuren and Sordoni, Alessandro and Trischler, Adam and Mattarella-Micke, Andrew and Maji, Subhransu and Iyyer, Mohit}, booktitle = {Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing}, year = {2020}, url = {https://aclanthology.org/2020.emnlp-main.635}, pages = {7882--7926}, pdf = {https://arxiv.org/pdf/2005.00770.pdf}, }