Here is a beginner’s guide to the PairRM model by LLM-Blender:

### Introduction
The Pairwise Reward Model (PairRM) is designed to take an instruction and a pair of output candidates as input and provide a score for each candidate to measure their relative quality. It can be used to rank a list of candidate outputs, enhance decoding using best-of-n sampling, and align instruction-tuned LLMs with RLHF methods. Unlike other reward models, PairRM takes a pair of candidates and compares them side-by-side to identify subtle differences between them.

### Installation
To install PairRM, you need to first install the `llm-blender` package using the following pip command:

```
pip install git+https://github.com/yuchenlin/LLM-Blender.git
```
After that, you can load the PairRM model using Python code.

### Usage
#### Use Case 1: Comparing/Ranking output candidates given an instruction
Use the `rank` function to rank a list of candidate responses or the `compare` function to directly compare two candidate responses.

#### Use Case 2: Best-of-n Sampling (Decoding Enhancement)
By using PairRM with best-of-n sampling, you can improve the quality of LLM responses with only a few changes to your inference code.

#### Use Case 3: RLHF
PairRM has been trained on various high-quality and large-scale datasets with human preference annotations. You can use it with popular RLHF toolkits such as TRL.

### Statistics
- **Context Length**: PairRM allows for a longer context length than the previous pair-ranker model.
- **Training Datasets**: PairRM has been trained on diverse human-preference datasets.
- **Performance**: Despite its small size (0.4B parameters), PairRM correlates well with human preferences, and its pairwise comparison agreement approaches that of GPT-4.

### Citation & Credits
If you are using PairRM in your research, please cite the LLM-Blender paper from the 61st Annual Meeting of the Association for Computational Linguistics (ACL 2023).

For more details and examples, you can check out the complete guide [here](https://github.com/yuchenlin/LLM-Blender/blob/main/blender_usage.ipynb).

By following this guide, you’ll have a good understanding of PairRM and how to use it effectively in your natural language processing tasks.


Introduction:
PairRM is a model that takes an instruction and a pair of output candidates as input, and outputs a score for each candidate to measure their relative quality. This model can be used for (re-)ranking a list of candidate outputs, as well as for enhancing decoding through best-of-n sampling. It can further align instruction-tuned LLMs with RLHF methods. PairRM is based on Microsoft’s DeBERTa-v3-large model and is small and efficient, at 0.4B parameters.

Installation:
To install PairRM, first, you need to install “llm-blender” using the following command:
```
pip install git+https://github.com/yuchenlin/LLM-Blender.git
```
In Python, you can then import the library and load the PairRM ranker:
```
import llm_blender
blender = llm_blender.Blender()
blender.loadranker("llm-blender/PairRM")
```

Usage:
There are several use cases for PairRM, including comparing/ranking output candidates given an instruction, best-of-n sampling (decoding enhancement), and RLHF alignment.

Use Case 1: Comparing/Ranking output candidates given an instruction
- You can rank a list of candidate responses using the `blender.rank()` function.
- You can directly compare two candidate responses using the `blender.compare()` function.
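One way to see how pairwise comparisons can produce a full ranking is to count wins across all pairs. The sketch below is only an illustration of that idea: `is_better` and `toy_judge` are hypothetical stand-ins for a pairwise judge such as PairRM, and the win-counting rule is an assumption, not necessarily the aggregation that `blender.rank()` uses internally.

```python
from itertools import combinations
from typing import Callable, List


def rank_by_pairwise_wins(
    candidates: List[str],
    is_better: Callable[[str, str], bool],
) -> List[int]:
    """Return 1-based ranks (1 = best) by counting pairwise wins.

    `is_better(a, b)` is a hypothetical pairwise judge that returns
    True when `a` is preferred over `b`.
    """
    wins = [0] * len(candidates)
    for i, j in combinations(range(len(candidates)), 2):
        if is_better(candidates[i], candidates[j]):
            wins[i] += 1
        else:
            wins[j] += 1
    # Sort indices by descending win count; ties keep their original order.
    order = sorted(range(len(candidates)), key=lambda k: -wins[k])
    ranks = [0] * len(candidates)
    for position, idx in enumerate(order, start=1):
        ranks[idx] = position
    return ranks


def toy_judge(a: str, b: str) -> bool:
    """Toy judge for illustration only: prefers the longer string."""
    return len(a) > len(b)


example_ranks = rank_by_pairwise_wins(
    ["bye!", "hi! I am fine, thanks!", "get out!"], toy_judge
)
print(example_ranks)
```

Comparing every pair costs a number of comparisons that grows quadratically with the number of candidates, which is one reason a small, fast judge is attractive for this use.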

Use Case 2: Best-of-n Sampling (Decoding Enhancement)
This case involves using PairRM for best-of-n sampling, also known as rejection sampling, to enhance the response quality of language models. This can be accomplished through a few changes in the inference code.
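Conceptually, best-of-n sampling is a sample-then-select loop. The following is a minimal sketch of that loop under stated assumptions: `sample` and `score` are hypothetical callables standing in for an LLM sampler and a reward model, and the toy versions below exist only so the example runs. It is not the implementation behind the library's own best-of-n helper.

```python
from itertools import cycle
from typing import Callable


def best_of_n(
    prompt: str,
    sample: Callable[[str], str],
    score: Callable[[str, str], float],
    n: int = 10,
) -> str:
    """Draw n samples for `prompt` and return the highest-scoring one."""
    if n < 1:
        raise ValueError("n must be at least 1")
    candidates = [sample(prompt) for _ in range(n)]
    return max(candidates, key=lambda c: score(prompt, c))


# Toy stand-ins (hypothetical): a sampler that cycles through fixed replies
# and a scorer that simply prefers longer replies.
_pool = cycle(["ok", "a longer reply", "mid reply"])


def toy_sample(prompt: str) -> str:
    return next(_pool)


def toy_score(prompt: str, candidate: str) -> float:
    return float(len(candidate))


best = best_of_n("hello", toy_sample, toy_score, n=3)
print(best)
```

Larger values of `n` give the reward model more candidates to choose from, at the cost of proportionally more generation time.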

Use Case 3: RLHF
PairRM can be applied to popular RLHF toolkits such as “trl” using the `blender.compare()` function.

Statistics:
Context Length:
PairRM has a maximum source length of 1224 and a maximum candidate length of 412, allowing for a total maximum length of 2048.
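These limits are consistent with one source segment plus two candidate segments: 1224 + 2 × 412 = 2048. The sketch below checks that arithmetic and clips token-id lists to the per-segment limits. The helper name `truncate_pair` and the choice to keep the leading tokens are assumptions made for illustration; the library's own truncation strategy and its handling of special tokens may differ.

```python
from typing import List, Tuple

SOURCE_MAX_LEN = 1224      # maximum source (instruction) length
CANDIDATE_MAX_LEN = 412    # maximum length of each candidate
TOTAL_MAX_LEN = 2048       # maximum total length

# One source plus two candidates fills the total budget exactly.
assert SOURCE_MAX_LEN + 2 * CANDIDATE_MAX_LEN == TOTAL_MAX_LEN


def truncate_pair(
    source_ids: List[int],
    cand_a_ids: List[int],
    cand_b_ids: List[int],
) -> Tuple[List[int], List[int], List[int]]:
    """Clip each segment to its limit, keeping the leading tokens (an assumption)."""
    return (
        source_ids[:SOURCE_MAX_LEN],
        cand_a_ids[:CANDIDATE_MAX_LEN],
        cand_b_ids[:CANDIDATE_MAX_LEN],
    )


src, cand_a, cand_b = truncate_pair(
    list(range(3000)), list(range(500)), list(range(10))
)
print(len(src), len(cand_a), len(cand_b))
```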

Training Datasets:
PairRM has been trained on various high-quality and large-scale datasets with human preference annotations, exhibiting great correlation with human preferences.

Performance:
PairRM’s pairwise comparison agreement performance closely approaches that of GPT-4, showcasing its efficacy for language model evaluation.

Citation & Credits:
If you use PairRM in your research, please cite LLM-Blender using the following BibTeX citation:
```
@inproceedings{llm-blender-2023,
    title = "LLM-Blender: Ensembling Large Language Models with Pairwise Comparison and Generative Fusion",
    author = "Jiang, Dongfu and Ren, Xiang and Lin, Bill Yuchen",
    booktitle = "Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL 2023)",
    year = "2023"
}
```

For a more detailed guide, you can also check out the example Jupyter notebook in our LLM-Blender GitHub repository: `blender_usage.ipynb`.

This manual and tutorial provides an overview of the PairRM model and its applications, along with instructions for installation and usage.






Introduction

Pairwise Reward Model (PairRM) takes an instruction and a pair of output candidates as the input,
and outputs a score for each candidate to measure their relative quality.
PairRM can be used to (re-)rank a list of candidate outputs and thus can be used as an LLM evaluator to efficiently assess the quality of LLMs in a local environment.
PairRM can also be used to enhance decoding by best-of-n sampling (i.e., reranking N sampled outputs).
Apart from that, one can also use PairRM to further align instruction-tuned LLMs with RLHF methods.

Unlike other RMs that encode and score each candidate individually,
PairRM takes a pair of candidates and compares them side-by-side to identify the subtle differences between them.
Also, PairRM is based on microsoft/deberta-v3-large, and thus it is very efficient: only 0.4B parameters.
We trained PairRM on a diverse collection of six human-preference datasets (see more here).

PairRM is part of the LLM-Blender project (ACL 2023). Please see our paper to learn more.



Installation

  • First install llm-blender
pip install git+https://github.com/yuchenlin/LLM-Blender.git
  • Then load the PairRM ranker in Python
import llm_blender
blender = llm_blender.Blender()
blender.loadranker("llm-blender/PairRM")



Usage



Use Case 1: Comparing/Ranking output candidates given an instruction

  • Ranking a list of candidate responses
inputs = ["hello, how are you!", "I love you!"]
candidates_texts = [["get out!", "hi! I am fine, thanks!", "bye!"], 
                    ["I love you too!", "I hate you!", "Thanks! You're a good guy!"]]
ranks = blender.rank(inputs, candidates_texts, return_scores=False, batch_size=1)


"""
ranks -->
array([[3, 1, 2], # "hi! I am fine, thanks!" ranks 1st, "bye!" ranks 2nd, and "get out!" ranks 3rd.
       [1, 3, 2]], # "I love you too!" ranks 1st, and "I hate you!" ranks 3rd.
       dtype=int32)
"""
  • Directly comparing two candidate responses
inputs = ["hello!", "I love you!"]
candidates_A = ["hi!", "I hate you!"]
candidates_B = ["f**k off!", "I love you, too!"]
comparison_results = blender.compare(inputs, candidates_A, candidates_B)

       

  • Comparing two multi-turn conversations
conv1 = [
    {
        "content": "hello",
        "role": "USER"
    },
    {
        "content": "[assistant1's response 1]",
        "role": "ASSISTANT"
    },
    ...
]
conv2 = [
    {
        "content": "hello",
        "role": "USER"
    },
    {
        "content": "[assistant2's response 1]",
        "role": "ASSISTANT"
    },
    ...
]
comparison_results = blender.compare_conversations([conv1], [conv2])
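
Returning to the ranking example at the start of this use case: once you have a rank array, picking the top candidate for each input is a small post-processing step. The sketch below hard-codes the ranks shown in that example so it runs without loading the model; the helper name `top_candidates` is hypothetical.

```python
from typing import List

inputs = ["hello, how are you!", "I love you!"]
candidates_texts = [
    ["get out!", "hi! I am fine, thanks!", "bye!"],
    ["I love you too!", "I hate you!", "Thanks! You're a good guy!"],
]
# Ranks copied from the example output above (1 = best).
ranks = [[3, 1, 2], [1, 3, 2]]


def top_candidates(
    candidates_texts: List[List[str]], ranks: List[List[int]]
) -> List[str]:
    """Return the rank-1 candidate for each input."""
    best = []
    for cands, row in zip(candidates_texts, ranks):
        best_index = min(range(len(row)), key=lambda i: row[i])
        best.append(cands[best_index])
    return best


best_responses = top_candidates(candidates_texts, ranks)
for prompt, response in zip(inputs, best_responses):
    print(f"{prompt!r} -> {response!r}")
```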



Use Case 2: Best-of-n Sampling (Decoding Enhancement)

Best-of-n sampling, also known as rejection sampling, is a strategy to enhance response quality by selecting the response that the reward model ranks highest
(see more in OpenAI WebGPT section 3.2 and OpenAI Blog).
Best-of-n sampling with PairRM is a very easy way to improve your LLMs with only a few changes to your inference code:


import llm_blender
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the generator model and its tokenizer
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")
model = AutoModelForCausalLM.from_pretrained("HuggingFaceH4/zephyr-7b-beta", device_map="auto")
system_message = {"role": "system", "content": "You are a friendly chatbot."}

# Format the inputs with the model's chat template
inputs = ["can you tell me a joke about OpenAI?"]
messages = [[system_message, {"role": "user", "content": _input}] for _input in inputs]
prompts = [tokenizer.apply_chat_template(m, tokenize=False, add_generation_prompt=True) for m in messages]

# Conventional generation: sample a single output
# (the input tensor is moved to the model's device before generating)
input_ids = tokenizer(prompts[0], return_tensors="pt").input_ids.to(model.device)
sampled_outputs = model.generate(input_ids, do_sample=True, top_k=50, top_p=0.95, num_return_sequences=1)
print(tokenizer.decode(sampled_outputs[0][len(input_ids[0]):], skip_special_tokens=False))

# Best-of-n sampling: PairRM selects the best of n=10 sampled outputs
blender = llm_blender.Blender()
blender.loadranker("llm-blender/PairRM")
outputs = blender.best_of_n_generate(model, tokenizer, prompts, n=10)

print("### Prompt:\n", prompts[0])
print("### best-of-n generations:\n", outputs[0])

""" 
Sure, here's a joke about OpenAI:

Why did OpenAI decide to hire a mime as their new AI researcher?

Because they wanted someone who could communicate complex ideas without making a sound!

(Note: This is a joke, not a reflection of OpenAI's actual hiring practices.)
"""



Use Case 3: RLHF

PairRM has been trained on various high-quality and large-scale datasets with human preference annotations
and has shown great correlation with human preferences with an extremely small model size (0.4B),
approaching the performance of GPT-4.
PairRM can help make the future alignment of LLMs more efficient and effective.
With the blender.compare() function, you can apply PairRM to popular RLHF toolkits such as trl.
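
One common way to feed pairwise judgements into preference-based training is to turn them into prompt/chosen/rejected records. The sketch below assumes that the comparison step yields one boolean per input, with True meaning candidate A is preferred over candidate B; that interpretation, the hard-coded `a_is_better` values, and the helper name `build_preference_pairs` are all assumptions for illustration. Check the notebook referenced in this section for the exact return type before relying on it.

```python
from typing import Dict, List, Sequence


def build_preference_pairs(
    prompts: Sequence[str],
    candidates_a: Sequence[str],
    candidates_b: Sequence[str],
    a_is_better: Sequence[bool],
) -> List[Dict[str, str]]:
    """Turn pairwise judgements into prompt/chosen/rejected records."""
    if not (len(prompts) == len(candidates_a) == len(candidates_b) == len(a_is_better)):
        raise ValueError("all inputs must have the same length")
    pairs = []
    for prompt, a, b, a_wins in zip(prompts, candidates_a, candidates_b, a_is_better):
        chosen, rejected = (a, b) if a_wins else (b, a)
        pairs.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    return pairs


prompts = ["hello!", "I love you!"]
candidates_a = ["hi!", "I hate you!"]
candidates_b = ["go away!", "I love you, too!"]
# Hypothetical judgements; in practice these would come from the comparison step.
a_is_better = [True, False]

preference_pairs = build_preference_pairs(prompts, candidates_a, candidates_b, a_is_better)
for record in preference_pairs:
    print(record)
```

Records in this shape can then be wrapped in whatever dataset object your training toolkit expects.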

🔥 See more details in our example Jupyter notebook: blender_usage.ipynb

Learn more in our LLM-Blender GitHub README.md



Statistics



Context length

| PairRanker type | Source max length | Candidate max length | Total max length |
|---|---|---|---|
| pair-ranker (our previous version) | 128 | 128 | 384 |
| PairRM (this model) | 1224 | 412 | 2048 |



Training Datasets



Performance

PairRM has been trained on various high-quality and large-scale datasets with human preference annotations and exhibits great correlation with human preferences
with an extremely small model size (0.4B), approaching the performance of GPT-4.

We test pairwise comparison on the Auto-J pairwise test data, HHH-Alignment, and MT-bench human judgements.

All of the following results are reported as pairwise comparison accuracies (agreements).
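
As a reference point for reading the tables, pairwise comparison accuracy is the percentage of pairs on which the evaluator's preference matches the human label. The sketch below shows that computation on made-up labels; the function name and the label encoding ("A" / "B") are hypothetical, and individual benchmarks may treat ties or other label sets differently.

```python
from typing import Sequence


def pairwise_agreement(predicted: Sequence[str], human: Sequence[str]) -> float:
    """Percentage of pairs where the predicted preference equals the human label."""
    if len(predicted) != len(human):
        raise ValueError("predicted and human must have the same length")
    if not predicted:
        raise ValueError("at least one pair is required")
    matches = sum(p == h for p, h in zip(predicted, human))
    return 100.0 * matches / len(predicted)


# Made-up labels for illustration: the evaluator agrees on 3 of 4 pairs.
accuracy = pairwise_agreement(["A", "B", "A", "A"], ["A", "B", "B", "A"])
print(f"{accuracy:.1f}")
```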



Auto-J Pairwise test data performance

| Model | Summ | Exam | Code | Rewriting | Crea W | Func W | Comm | NLP | Overall |
|---|---|---|---|---|---|---|---|---|---|
| **Closed-source models** | | | | | | | | | |
| ChatGPT | 33.3 | 40.3 | 36.6 | 31.6 | 48.2 | 40.4 | 47.6 | 45.8 | 42.7 |
| Claude-2 | 30.6 | 36.1 | 41.7 | 34.2 | 48.1 | 42.5 | 40.6 | 48.5 | 42.4 |
| GPT-4 | 59.7 | 51.4 | 69.2 | 58.3 | 66.7 | 60.4 | 58.3 | 65.2 | 61.9 |
| **Open-source models** | | | | | | | | | |
| SteamSHP | 33.3 | 29.2 | 26.7 | 33.3 | 40.7 | 31.3 | 51.4 | 51.9 | 40.6 |
| PandaLM | 29.2 | 33.3 | 31.7 | 23.3 | 43.5 | 32.9 | 44.8 | 48.9 | 38.9 |
| LLaMA-2-Chat-13B | 20.8 | 27.8 | 19.2 | 20 | 31.5 | 27.5 | 35.8 | 31.8 | 29 |
| Vicuna-13B-v1.5 | 30.6 | 23.6 | 35 | 28.3 | 36.1 | 37.5 | 45.5 | 39.8 | 37.3 |
| WizardLM-13B-v1.2 | 22.2 | 20.8 | 32.5 | 19.2 | 28.7 | 25.4 | 29.2 | 33 | 27.8 |
| LLaMA-2-Chat-70B | 34.7 | 33.3 | 36.7 | 35.8 | 51.4 | 54.2 | 47.2 | 47.7 | 45.9 |
| Auto-J (13B) | 45.8 | 38.9 | 59.2 | 47.5 | 54.6 | 57.1 | 58 | 57.6 | 54.8 |
| UltraRM (13B) | 56.94 | 43.06 | 55.0 | 53.33 | 67.13 | 64.17 | 56.25 | 59.85 | 59.85 |
| PairRM (0.4B) | 56.94 | 52.78 | 58.33 | 55.83 | 61.57 | 59.17 | 57.64 | 62.5 | 59.05 |



HHH-Alignment and MT-bench human judgements

| Evaluator LM | HHH Alignment: Help. | HHH Alignment: Harm. | HHH Alignment: Hon. | HHH Alignment: Other | HHH Alignment: Total Avg. | MT-Bench Human Judg.: Human Preference |
|---|---|---|---|---|---|---|
| Random | 50 | 50 | 50 | 50 | 50 | 34.26 |
| StanfordNLP Reward Model | 69.49 | 60.34 | 52.46 | 51.16 | 58.82 | 44.79 |
| ALMoST Reward Model | 74.58 | 67.24 | 78.69 | 86.05 | 76.02 | 49.9 |
| Llama2-Chat 7B | 66.1 | 81.03 | 70.49 | 74.42 | 72.85 | 51.78 |
| Llama2-Chat 13B | 74.58 | 87.93 | 55.74 | 79.07 | 73.76 | 52.34 |
| Llama2-Chat 70B | 66.1 | 89.66 | 67.21 | 74.42 | 74.21 | 53.67 |
| Llama2-Chat 13B+Coarse. | 68.74 | 68.97 | 65.57 | 67.44 | 67.42 | 46.89 |
| GPT-3.5-Turbo-0613 | 76.27 | 87.93 | 67.21 | 86.05 | 78.73 | 57.12 |
| Prometheus 7B | 69.49 | 84.48 | 78.69 | 90.7 | 80.09 | 55.14 |
| Prometheus 13B | 81.36 | 82.76 | 75.41 | 76.74 | 79.19 | 57.72 |
| UltraRM (13B) | 86.44 | 79.31 | 81.97 | 88.37 | 83.71 | 56 |
| PairRM (0.4B) | 84.75 | 84.48 | 80.33 | 90.7 | 84.62 | 59 |
| GPT-4-0613 | 91.53 | 93.1 | 85.25 | 83.72 | 88.69 | 63.87 |

Although PairRM is an extremely small model (0.4B) based on DeBERTa, its pairwise comparison agreement approaches GPT-4’s performance!

We attribute this to two factors:

  • PairRM's model architecture is specifically designed for pairwise comparison through bidirectional attention (see the LLM-Blender paper for more details)
  • The high-quality and large-scale human preference annotation data it was trained on (see the training dataset list on this Hugging Face page)



Citation & Credits

If you are using PairRM in your research, please cite LLM-Blender.

@inproceedings{llm-blender-2023,
    title = "LLM-Blender: Ensembling Large Language Models with Pairwise Comparison and Generative Fusion",
    author = "Jiang, Dongfu and Ren, Xiang and Lin, Bill Yuchen",
    booktitle = "Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL 2023)",
    year = "2023"
}
