Nomic Embed: Training a Reproducible Long Context Text Embedder
Abstract
This technical report describes the training of nomic-embed-text-v1, the first fully reproducible, open-source, open-weights, open-data, 8192 context length English text embedding model that outperforms both OpenAI Ada-002 and OpenAI text-embedding-3-small on short and long-context tasks. We release the training code and model weights under an Apache 2 license. In contrast with other open-source models, we release a training data loader with 235 million curated text pairs that allows for the full replication of nomic-embed-text-v1. You can find code and data to replicate the model at https://github.com/nomic-ai/contrastors
- Publication:
-
arXiv e-prints
- Pub Date:
- February 2024
- DOI:
- arXiv:
- arXiv:2402.01613
- Bibcode:
- 2024arXiv240201613N
- Keywords:
-
- Computer Science - Computation and Language;
- Computer Science - Artificial Intelligence