# Guide to HuggingFace’s BGE-M3

For more details about the BGE-M3 project, please refer to our GitHub repository:
[GitHub Repo](https://github.com/FlagOpen/FlagEmbedding)

## Introduction
In this project, we introduce BGE-M3, which stands out for its versatility in Multi-Functionality, Multi-Linguality, and Multi-Granularity.
- **Multi-Functionality:** BGE-M3 can simultaneously perform three common retrieval functionalities of embedding models: dense retrieval, multi-vector retrieval, and sparse retrieval.
- **Multi-Linguality:** It supports more than 100 working languages.
- **Multi-Granularity:** It can process inputs of different granularities, from short sentences to long documents of up to 8192 tokens (see the encoding snippet below).
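
As a small illustration of the long-input support, the sketch below passes a larger `max_length` when encoding; the `batch_size`/`max_length` values and the placeholder document are illustrative only.

```python
from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel('BAAI/bge-m3', use_fp16=True)

# Placeholder long text; in practice this would be a real document.
long_document = " ".join(["BGE-M3 handles long inputs."] * 400)

# Raising max_length (up to 8192 tokens) lets the model embed long documents;
# smaller values encode faster when inputs are short.
embedding = model.encode([long_document], batch_size=1, max_length=8192)['dense_vecs']
print(embedding.shape)
```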

## Suggestions for Retrieval Pipeline in RAG
We recommend using the following pipeline: hybrid retrieval + re-ranking.
- **Hybrid Retrieval:** Leverages the strengths of multiple methods, offering higher accuracy and stronger generalization.
- **Re-Ranker:** As cross-encoder models, re-rankers demonstrate higher accuracy than bi-encoder embedding models. Using a re-ranking model (e.g., bge-reranker, cohere-reranker) after retrieval can further filter the selected text; a minimal sketch of the full pipeline follows below.
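
The sketch below is one minimal way to wire this pipeline with FlagEmbedding; it is an illustration, not the project's reference implementation. The choice of `BAAI/bge-reranker-large`, the 0.6/0.4 fusion weights, and the toy corpus are assumptions made for the example.

```python
from FlagEmbedding import BGEM3FlagModel, FlagReranker

# Hypothetical corpus and query for illustration.
corpus = [
    "BGE M3 is an embedding model supporting dense, sparse and multi-vector retrieval.",
    "BM25 is a bag-of-words retrieval function based on query-term occurrence.",
    "Paris is the capital of France.",
]
query = "What is BGE M3?"

embedder = BGEM3FlagModel('BAAI/bge-m3', use_fp16=True)
q_out = embedder.encode([query], return_dense=True, return_sparse=True)
d_out = embedder.encode(corpus, return_dense=True, return_sparse=True)

# Hybrid score: dense dot-product score fused with the lexical (sparse) score.
# The 0.6/0.4 weights are illustrative assumptions, not recommended values.
hybrid = []
for i in range(len(corpus)):
    dense = float(q_out['dense_vecs'][0] @ d_out['dense_vecs'][i])
    lexical = embedder.compute_lexical_matching_score(
        q_out['lexical_weights'][0], d_out['lexical_weights'][i])
    hybrid.append(0.6 * dense + 0.4 * lexical)

# Keep the top-2 candidates, then re-rank them with a cross-encoder.
top = sorted(range(len(corpus)), key=lambda i: hybrid[i], reverse=True)[:2]
reranker = FlagReranker('BAAI/bge-reranker-large', use_fp16=True)
rerank_scores = reranker.compute_score([[query, corpus[i]] for i in top])
best = top[max(range(len(top)), key=lambda j: rerank_scores[j])]
print(corpus[best])
```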

## Model Specs

## FAQ

### 1. Introduction to Different Retrieval Methods
- **Dense Retrieval:** Map the text into a single embedding, e.g., DPR, BGE-v1.5.
- **Sparse Retrieval (Lexical Matching):** A vector of size equal to the vocabulary, with the majority of positions set to zero, calculating a weight only for tokens present in the text, e.g., BM25, unicoil, and splade.
- **Multi-Vector Retrieval:** Use multiple vectors to represent a text, e.g., ColBERT. BGE-M3 can produce all three representations in one call, as sketched below.
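
BGE-M3 can return all three representations from a single `encode` call. The short sketch below only inspects each output; the key names (`dense_vecs`, `lexical_weights`, `colbert_vecs`) match the usage examples later on this page, and the printed shapes are illustrative.

```python
from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel('BAAI/bge-m3', use_fp16=True)
out = model.encode(
    ["What is BGE M3?"],
    return_dense=True,         # one vector per text (dense retrieval)
    return_sparse=True,        # per-token lexical weights (sparse retrieval)
    return_colbert_vecs=True,  # one vector per token (multi-vector retrieval)
)

print(out['dense_vecs'].shape)       # e.g. (1, 1024): one embedding per input
print(out['lexical_weights'][0])     # token id -> weight mapping for the first text
print(out['colbert_vecs'][0].shape)  # (num_tokens, dim): multi-vector representation
```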

### 2. How to Use BGE-M3 in Other Projects?
For embedding retrieval, you can employ the BGE-M3 model using the same approach as BGE. The only difference is that the BGE-M3 model no longer requires adding instructions to the queries. For sparse retrieval methods, most open-source libraries currently do not support direct utilization of the BGE-M3 model. Contributions from the community are welcome.
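
For dense retrieval the model can be used like any other BGE/sentence-embedding model. The snippet below is a hedged example via sentence-transformers (assuming the model's sentence-transformers integration); note that, unlike earlier BGE models, no instruction prefix is added to the query.

```python
from sentence_transformers import SentenceTransformer

# No instruction/prefix is prepended to the query, unlike earlier BGE models.
model = SentenceTransformer('BAAI/bge-m3')
query_emb = model.encode(["What is BGE M3?"], normalize_embeddings=True)
doc_emb = model.encode(
    ["BGE M3 is an embedding model supporting dense, sparse and multi-vector retrieval."],
    normalize_embeddings=True,
)
print(query_emb @ doc_emb.T)
```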

### 3. How to Fine-Tune BGE-M3 Model?
You can follow the common example in this [example](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/finetune) to fine-tune the dense embedding. Our code and data for unified fine-tuning (dense, sparse, and multi-vectors) will be released.
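
For orientation, the FlagEmbedding fine-tuning example expects training data as JSON lines containing a query with positive and negative passages. The snippet below is a minimal sketch of that format; the file name and example texts are made up.

```python
import json

# Hypothetical training example in the query/pos/neg JSONL format
# used by the FlagEmbedding fine-tuning example.
example = {
    "query": "What is BGE M3?",
    "pos": ["BGE M3 is an embedding model supporting dense, sparse and multi-vector retrieval."],
    "neg": ["BM25 is a bag-of-words retrieval function.",
            "Paris is the capital of France."],
}

with open("toy_finetune_data.jsonl", "w", encoding="utf-8") as f:
    f.write(json.dumps(example) + "\n")
```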

## Usage
### Install
```bash
git clone https://github.com/FlagOpen/FlagEmbedding.git
cd FlagEmbedding
pip install -e .
```
Or:
```bash
pip install -U FlagEmbedding
```

### Generate Embedding for Text
#### Example using Python
You can use the `FlagEmbedding` library to generate both dense and sparse embeddings for text. Refer to [baai_general_embedding](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/baai_general_embedding#usage) for details.

### Compute Score for Text Pairs
Input a list of text pairs to get the scores computed by different methods.

## Evaluation
- Multilingual (Miracl dataset)
![Miracl Evaluation](https://huggingface.co/BAAI/bge-m3/blob/main/imgs/miracl.jpg)
- Cross-lingual (MKQA dataset)
![MKQA Evaluation](https://huggingface.co/BAAI/bge-m3/blob/main/imgs/mkqa.jpg)

## Training
- Self-Knowledge Distillation
- Efficient Batching
- MCLS (method for long text without fine-tuning)

## Models
We release two versions:
- BAAI/bge-m3-unsupervised: the model after contrastive learning on a large-scale dataset
- BAAI/bge-m3: the final model fine-tuned from BAAI/bge-m3-unsupervised

## Acknowledgment
Thanks to the authors of open-sourced datasets, including Miracl, MKQA, NarrativeQA, etc.

## Citation
If you find this repository useful, please consider giving a star ⭐️ and citation.

# Huggingface – BGE-M3 Tutorial

## About BGE-M3
In this project, we introduce BGE-M3, which is distinguished for its versatility in Multi-Functionality, Multi-Linguality, and Multi-Granularity.
- Multi-Functionality: It can simultaneously perform the three common retrieval functionalities of embedding models: dense retrieval, multi-vector retrieval, and sparse retrieval.
- Multi-Linguality: It supports more than 100 working languages.
- Multi-Granularity: It can process inputs of different granularities, from short sentences to long documents of up to 8192 tokens.

## Suggestions for Retrieval Pipeline in RAG
We recommend using the following pipeline: hybrid retrieval + re-ranking.
- Hybrid retrieval leverages the strengths of multiple methods, offering higher accuracy and stronger generalization. Using both embedding retrieval and the BM25 algorithm is a classic example; now you can use BGE-M3, which supports both embedding and sparse retrieval. This lets you obtain token weights (similar to BM25) at no additional cost when generating dense embeddings.
- As cross-encoder models, re-rankers demonstrate higher accuracy than bi-encoder embedding models. Using a re-ranking model (e.g., bge-reranker, cohere-reranker) after retrieval can further filter the selected text.

## Model Specs
- Model Name: BGE-M3
- Publisher: FlagOpen
- Version: 1.5

## 1. Introduction to Different Retrieval Methods
- Dense Retrieval: Map the text into a single embedding, e.g., DPR, BGE-v1.5.
- Sparse Retrieval (Lexical Matching): A vector of size equal to the vocabulary, with the majority of positions set to zero, calculating a weight only for tokens present in the text, e.g., BM25, unicoil, splade.
- Multi-Vector Retrieval: Use multiple vectors to represent a text, e.g., ColBERT.

## 2. How to Use BGE-M3 in Other Projects?
For embedding retrieval, you can employ the BGE-M3 model using the same approach as BGE. The only difference is that the BGE-M3 model no longer requires adding instructions to the queries. For sparse retrieval methods, most open-source libraries currently do not support direct utilization of the BGE-M3 model. Contributions from the community are welcome.

## 3. How to Fine-Tune BGE-M3 Model?
You can follow the common example provided in the project repository to fine-tune the dense embedding. Code and data for unified fine-tuning (dense, sparse, and multi-vectors) will be released.

## Usage
To install BGE-M3, use the following command:
```bash
git clone https://github.com/FlagOpen/FlagEmbedding.git
cd FlagEmbedding
pip install -e .
```
Or install using pip:
```bash
pip install -U FlagEmbedding
```

### Generate Embedding for Text
```python
from FlagEmbedding import BGEM3FlagModel

# use_fp16=True speeds up encoding at a slight cost in accuracy.
model = BGEM3FlagModel('BAAI/bge-m3', use_fp16=True)

sentences_1 = ["What is BGE M3?", "Definition of BM25"]
sentences_2 = ["BGE M3 is an embedding model supporting dense retrieval, lexical matching and multi-vector interaction.",
               "BM25 is a bag-of-words retrieval function that ranks a set of documents based on the query terms appearing in each document"]

# Dense embeddings: one vector per sentence; similarity via dot product.
embeddings_1 = model.encode(sentences_1)['dense_vecs']
embeddings_2 = model.encode(sentences_2)['dense_vecs']
similarity = embeddings_1 @ embeddings_2.T
print(similarity)
```
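
You can also use sentence-transformers or Hugging Face transformers to generate dense embeddings; refer to [baai_general_embedding](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/baai_general_embedding#usage) for details.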

### Sparse Embedding (Lexical Weight)
```python
from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel('BAAI/bge-m3', use_fp16=True)

sentences_1 = ["What is BGE M3?", "Definition of BM25"]
sentences_2 = ["BGE M3 is an embedding model supporting dense retrieval, lexical matching and multi-vector interaction.",
               "BM25 is a bag-of-words retrieval function that ranks a set of documents based on the query terms appearing in each document"]

# return_sparse=True adds per-token lexical weights to the output.
output_1 = model.encode(sentences_1, return_dense=True, return_sparse=True, return_colbert_vecs=False)
output_2 = model.encode(sentences_2, return_dense=True, return_sparse=True, return_colbert_vecs=False)

# Show the token -> weight mapping for each sentence in sentences_1.
print(model.convert_id_to_token(output_1['lexical_weights']))
```
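
Continuing from the block above, the lexical weights can be used to compute a term-matching score between two texts:

```python
# Score between "What is BGE M3?" and the BGE M3 definition.
lexical_scores = model.compute_lexical_matching_score(output_1['lexical_weights'][0], output_2['lexical_weights'][0])
print(lexical_scores)
# Unrelated texts share few tokens, so this score is near zero.
print(model.compute_lexical_matching_score(output_1['lexical_weights'][0], output_1['lexical_weights'][1]))
```

### Multi-Vector (ColBERT)
BGE-M3 also supports multi-vector (ColBERT-style) retrieval:

```python
from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel('BAAI/bge-m3', use_fp16=True)

sentences_1 = ["What is BGE M3?", "Definition of BM25"]
sentences_2 = ["BGE M3 is an embedding model supporting dense retrieval, lexical matching and multi-vector interaction.",
               "BM25 is a bag-of-words retrieval function that ranks a set of documents based on the query terms appearing in each document"]

output_1 = model.encode(sentences_1, return_dense=True, return_sparse=True, return_colbert_vecs=True)
output_2 = model.encode(sentences_2, return_dense=True, return_sparse=True, return_colbert_vecs=True)

# Late-interaction (ColBERT-style) scores for a matching and a mismatched pair.
print(model.colbert_score(output_1['colbert_vecs'][0], output_2['colbert_vecs'][0]))
print(model.colbert_score(output_1['colbert_vecs'][0], output_2['colbert_vecs'][1]))
```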

### Compute Score for Text Pairs
```python
from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel('BAAI/bge-m3', use_fp16=True)

sentences_1 = ["What is BGE M3?", "Definition of BM25"]
sentences_2 = ["BGE M3 is an embedding model supporting dense retrieval, lexical matching and multi-vector interaction.",
               "BM25 is a bag-of-words retrieval function that ranks a set of documents based on the query terms appearing in each document"]

# Score every (sentence_1, sentence_2) pair with the different retrieval modes.
sentence_pairs = [[i, j] for i in sentences_1 for j in sentences_2]
print(model.compute_score(sentence_pairs))
```
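
The returned scores cover the individual retrieval modes and their combinations. To the best of my understanding, recent FlagEmbedding versions also let you weight the modes when combining them; the call below is a hedged sketch, and the 0.4/0.2/0.4 weights and `max_passage_length` value are illustrative.

```python
# Hedged sketch: argument names follow the FlagEmbedding usage docs;
# the weights (dense, sparse, multi-vector) are illustrative, not tuned.
print(model.compute_score(sentence_pairs,
                          max_passage_length=128,
                          weights_for_different_modes=[0.4, 0.2, 0.4]))
```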

## Evaluation
- Multilingual (Miracl dataset)
![Miracl Dataset](https://huggingface.co/BAAI/bge-m3/blob/main/imgs/miracl.jpg)
- Cross-lingual (MKQA dataset)
![MKQA Dataset](https://huggingface.co/BAAI/bge-m3/blob/main/imgs/mkqa.jpg)

## Training
- Self-knowledge Distillation: combining the outputs of the different retrieval modes as a reward signal to enhance single-mode performance (especially sparse retrieval and multi-vector (ColBERT) retrieval).
- Efficient Batching: improves efficiency when fine-tuning on long text. The small-batch strategy is simple but effective and can also be used to fine-tune large embedding models.
- MCLS: a simple method to improve performance on long text without fine-tuning; useful when you lack the resources to fine-tune the model on long text.

Refer to the BGE-M3 report for more details.

## Models
- BAAI/bge-m3-unsupervised: the model after contrastive learning on a large-scale dataset
- BAAI/bge-m3: the final model fine-tuned from BAAI/bge-m3-unsupervised

## Acknowledgement
Thanks to the authors of open-sourced datasets, including Miracl, MKQA, NarrativeQA, etc.

## Citation
If you find this repository useful, please consider giving a star and citing it.

For more details, please refer to the [GitHub repository](https://github.com/FlagOpen/FlagEmbedding).

