Quantifying the Importance of Data Alignment in Downstream Model Performance

Contrary to the conventional emphasis on dataset size, we explore the role of data alignment, an often overlooked aspect of data quality, in training capable Large Language Models (LLMs). To do so, we use the Task2Vec-based alignment coefficient, a quantitative measure of the similarity between two datasets, to quantify the impact of alignment between training data and evaluation data on downstream performance. In particular, we conduct controlled interventional experiments in two settings: (1) the impact of increasing the alignment coefficient between various pre-training (pt) datasets and evaluation datasets, and (2) the impact of increasing the alignment coefficient between domain-specific fine-tuning (ft) data and domain-specific evaluation data. The domain-specific task we explore is Autoformalization, the machine translation task between natural language and code for formal verification. In both settings, we find a strong, predictable negative correlation between the alignment coefficient of a model's training and evaluation data and the model's loss/perplexity on the respective downstream task. These findings suggest a re-evaluation of LLM training approaches, demonstrating the relevance of data alignment compared to data quantity, especially in specialized downstream tasks such as Autoformalization.

Publication:

arXiv e-prints

Pub Date:

January 2025

arXiv:

arXiv:2501.08496

Bibcode:

2025arXiv250108496C

Keywords:

Computer Science - Computation and Language;
Computer Science - Artificial Intelligence;
Computer Science - Machine Learning;
Computer Science - Programming Languages
