IR2: Information Regularization for Information Retrieval
Abstract
Effective information retrieval (IR) in settings with limited training data, particularly for complex queries, remains a challenging task. This paper introduces IR2, Information Regularization for Information Retrieval, a technique for reducing overfitting during synthetic data generation. This approach, representing a novel application of regularization techniques in synthetic data creation for IR, is tested on three recent IR tasks characterized by complex queries: DORIS-MAE, ArguAna, and WhatsThatBook. Experimental results indicate that our regularization techniques not only outperform previous synthetic query generation methods on the tasks considered but also reduce cost by up to 50%. Furthermore, this paper categorizes and explores three regularization methods at different stages of the query synthesis pipeline-input, prompt, and output-each offering varying degrees of performance improvement compared to models where no regularization is applied. This provides a systematic approach for optimizing synthetic data generation in data-limited, complex-query IR scenarios. All code, prompts and synthetic data are available at https://github.com/Info-Regularization/Information-Regularization.
- Publication:
-
arXiv e-prints
- Pub Date:
- February 2024
- DOI:
- arXiv:
- arXiv:2402.16200
- Bibcode:
- 2024arXiv240216200W
- Keywords:
-
- Computer Science - Information Retrieval;
- Computer Science - Artificial Intelligence;
- Computer Science - Computation and Language;
- Computer Science - Machine Learning
- E-Print:
- Accepted by LREC-COLING 2024 - The 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation