## Hugging Face Guide: Nomic Embed Text v1

### Introduction
`nomic-embed-text-v1` is an 8192-context-length text encoder from Nomic AI, available on Hugging Face, that surpasses OpenAI text-embedding-ada-002 and text-embedding-3-small on both short- and long-context tasks. This guide covers the model's features, usage, and training details.

### Overview
The table below compares the performance of different text encoders:

| Name | SeqLen | MTEB | LoCo | Jina Long Context | Open Weights | Open Training Code | Open Data |
|----------------------------|--------|-------|-------|-------------------|--------------|--------------------|-----------|
| nomic-embed-text-v1        | 8192   | 62.39 | 85.53 | 54.16             | ✅           | ✅                 | ✅        |
| jina-embeddings-v2-base-en | 8192   | 60.39 | 85.45 | 51.90             | ✅           | ❌                 | ❌        |
| text-embedding-3-small     | 8191   | 62.26 | 82.40 | 58.20             | ❌           | ❌                 | ❌        |
| text-embedding-ada-002     | 8191   | 60.99 | 52.70 | 55.25             | ❌           | ❌                 | ❌        |

### Hosted Inference API
The easiest way to get started with Nomic Embed is through the Nomic Embedding API. You can generate embeddings using the `nomic` Python client as shown below:

```python
from nomic import embed

output = embed.text(
    texts=['Nomic Embedding API', '#keepAIOpen'],
    model='nomic-embed-text-v1',
    task_type='search_document'
)

print(output)
```
For more information, refer to the [API reference](https://docs.nomic.ai/reference/endpoints/nomic-embed-text).
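For retrieval, documents and queries are typically embedded with different task types and then compared by cosine similarity. The snippet below is a minimal sketch, assuming a `search_query` task type and an `embeddings` key in the response as described in the API reference; check the reference for the exact response schema.

```python
import numpy as np
from nomic import embed

# Assumption: the response dict exposes an 'embeddings' key and
# 'search_query' is an accepted task_type (see the API reference).
docs = embed.text(
    texts=['Nomic Embedding API', '#keepAIOpen'],
    model='nomic-embed-text-v1',
    task_type='search_document'
)
query = embed.text(
    texts=['open source embedding models'],
    model='nomic-embed-text-v1',
    task_type='search_query'
)

doc_vecs = np.array(docs['embeddings'])       # shape: (num_docs, dim)
query_vec = np.array(query['embeddings'][0])  # shape: (dim,)

# Cosine similarity between the query and each document.
scores = doc_vecs @ query_vec / (
    np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec)
)
print(scores)
```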

### Data Visualization
You can explore a 5M-point sample of the contrastive pretraining data in the Nomic Atlas map [here](https://atlas.nomic.ai/map/nomic-text-embed-v1-5m-sample).

### Training Details
The Nomic embedder is trained with a multi-stage pipeline. Starting from a long-context BERT model, an unsupervised contrastive stage trains on weakly related text pairs, such as question-answer pairs from forums like StackExchange and Quora, title-body pairs from Amazon reviews, and summarizations of news articles. A second finetuning stage leverages higher-quality labeled datasets, such as search queries and answers from web searches; data curation and hard-example mining are crucial in this stage. The training data is released in its entirety; see the `contrastors` [repository](https://github.com/nomic-ai/contrastors) for details. For a full description, refer to the [Technical Report](https://static.nomic.ai/reports/2024_Nomic_Embed_Text_Technical_Report.pdf) and the corresponding [blog post](https://blog.nomic.ai/posts/nomic-embed-text-v1).
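To make the contrastive stage concrete, here is a minimal, generic sketch of an in-batch InfoNCE-style loss over paired embeddings. It illustrates the technique only; it is not Nomic's actual training code, which lives in the `contrastors` repository.

```python
import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(query_emb, doc_emb, temperature=0.05):
    """Generic InfoNCE-style loss: each query should match its paired
    document and treat every other document in the batch as a negative.
    Illustrative sketch only, not the contrastors training code."""
    query_emb = F.normalize(query_emb, dim=-1)
    doc_emb = F.normalize(doc_emb, dim=-1)
    logits = query_emb @ doc_emb.T / temperature   # (batch, batch) similarity matrix
    labels = torch.arange(logits.size(0))          # positives sit on the diagonal
    return F.cross_entropy(logits, labels)

# Random embeddings stand in for encoder outputs in this example.
q = torch.randn(8, 768)
d = torch.randn(8, 768)
print(in_batch_contrastive_loss(q, d))
```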

### Usage
#### Sentence Transformers
You can use the `nomic-embed-text-v1` model with the Sentence Transformers library in Python:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("nomic-ai/nomic-embed-text-v1", trust_remote_code=True)
sentences = ['What is TSNE?', 'Who is Laurens van der Maaten?']
embeddings = model.encode(sentences)
print(embeddings)
```
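Once encoded, the vectors can be compared directly. The snippet below is a small usage sketch, not part of the model card, using `sentence_transformers.util.cos_sim` to score the two example sentences against each other.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("nomic-ai/nomic-embed-text-v1", trust_remote_code=True)

# Compare the two example sentences by cosine similarity of their embeddings.
embeddings = model.encode(['What is TSNE?', 'Who is Laurens van der Maaten?'])
similarity = util.cos_sim(embeddings[0], embeddings[1])
print(similarity)  # 1x1 tensor holding the cosine similarity
```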

#### Transformers
You can also use the `nomic-embed-text-v1` model with the Transformers library in Python:

```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

sentences = ['What is TSNE?', 'Who is Laurens van der Maaten?']

# model_max_length=8192 and rotary_scaling_factor=2 enable sequence lengths past 2048 tokens
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased', model_max_length=8192)
model = AutoModel.from_pretrained('nomic-ai/nomic-embed-text-v1', trust_remote_code=True, rotary_scaling_factor=2)
model.eval()

encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

with torch.no_grad():
    model_output = model(**encoded_input)

embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
embeddings = F.normalize(embeddings, p=2, dim=1)
print(embeddings)
```

The model natively supports scaling the sequence length past 2048 tokens: pass `model_max_length=8192` to the tokenizer and `rotary_scaling_factor=2` to the model, as shown above.
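As a quick, illustrative sanity check (not from the model card), the snippet below shows that with `model_max_length=8192` the tokenizer truncates longer inputs to the model's 8192-token window before they are embedded.

```python
from transformers import AutoTokenizer

# Same tokenizer configuration as above; 8192 is the model's maximum sequence length.
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased', model_max_length=8192)

long_text = "nomic " * 20000  # deliberately longer than the context window
encoded = tokenizer(long_text, padding=True, truncation=True, return_tensors='pt')
print(encoded['input_ids'].shape)  # sequence dimension is capped at 8192
```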

#### Transformers.js
Using the `nomic-embed-text-v1` model with Transformers.js:
```javascript
import { pipeline } from '@xenova/transformers';

const extractor = await pipeline('feature-extraction', 'nomic-ai/nomic-embed-text-v1', {
    quantized: false,
});

const texts = ['What is TSNE?', 'Who is Laurens van der Maaten?'];
const embeddings = await extractor(texts, { pooling: 'mean', normalize: true });
console.log(embeddings);
```

This guide has provided an overview of the `nomic-embed-text-v1` model, its features, usage, and training details. For further information or technical support, refer to the [Nomic Embed documentation](https://docs.nomic.ai/) and the model card on Hugging Face.
