HuggingFace Guide

Welcome to the temporary HF space where you can try out the model and test the inference endpoint.

Model Testing:
– To try out the model, visit the Snorkel-Mistral-PairRM-DPO Space at https://huggingface.co/spaces/snorkelai/snorkelai_mistral_pairrm_dpo_text_inference
– The inference endpoint may take a few minutes to activate initially but will eventually operate at the standard speed of HF’s 7B model text inference endpoint.

Code Example:
Here is an example of querying the model using Python:

```python
import requests

API_URL = "https://t1q6ks6fusyg1qq7.us-east-1.aws.endpoints.huggingface.cloud"
headers = {
    "Accept": "application/json",
    "Content-Type": "application/json"
}

def query(payload):
    response = requests.post(API_URL, headers=headers, json=payload)
    return response.json()

output = query({
    "inputs": "[INST] Recommend me some Hollywood movies [/INST]",
    "parameters": {}
})
```
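
If the endpoint follows the standard Hugging Face text-generation response format, "output" is a list of dictionaries containing a "generated_text" field. The snippet below continues from the example above and is only a small sketch under that assumption:

```python
# Continues from the example above. Assumes the standard text-generation
# response shape: [{"generated_text": "..."}]. A cold endpoint may instead
# return a JSON object with an "error" field.
if isinstance(output, list) and output and "generated_text" in output[0]:
    print(output[0]["generated_text"])
else:
    print("Unexpected response:", output)
```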

Dataset:
– Training dataset: snorkelai/Snorkel-Mistral-PairRM-DPO-Dataset
– Only the prompts from UltraFeedback are utilized, with no external LLM responses used.

Methodology:
1. Generate response variations for each prompt using Mistral-7B-Instruct-v0.2
2. Apply PairRM for response reranking
3. Update the LLM by applying Direct Preference Optimization (DPO)
4. Repeat the process three times in total

Training Recipe:
– The provided data is compatible with Hugging Face’s Zephyr recipe.

Key Premises:
– Specialization Requirement
– Ease of Model Building
– Alignment Recipe

Applications:
– We focus on the general approach to alignment using PairRM and Mistral-7B-Instruct-v0.2

Results:
– The model scored 30.22 and ranked 3rd on Alpaca-Eval 2.0
– With PairRM-best-of-16 post-processing, the score increased to 34.86, placing the model 2nd

Limitations:
– The model does not have any moderation mechanisms.

Contemporary Work and Acknowledgements:
– Acknowledgements to the Mistral AI team, the authors of the DPO and PairRM papers, and the HuggingFace team

The Snorkel AI Team:
Hoang Tran, Chris Glaze, Braden Hancock

For more details, refer to the Snorkel blog at https://snorkel.ai/blog/


Hugging Face Tutorial

Introduction

This tutorial is designed to help you understand and utilize the Hugging Face model for text inference and alignment. Hugging Face is a platform that provides various natural language processing models and tools to developers and data scientists.

Testing the Model

To get started, you can try out the model in the temporary HF space provided by Hugging Face. Simply visit the Snorkel-Mistral-PairRM-DPO Space.

Additionally, you can use the provided inference endpoint to test the model. The code below demonstrates how to query the model using Python and the requests library.

```python
import requests

API_URL = "https://t1q6ks6fusyg1qq7.us-east-1.aws.endpoints.huggingface.cloud"
headers = {
    "Accept": "application/json",
    "Content-Type": "application/json"
}

def query(payload):
    response = requests.post(API_URL, headers=headers, json=payload)
    return response.json()

output = query({
    "inputs": "[INST] Recommend me some Hollywood movies [/INST]",
    "parameters": {}
})
```
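
Because the endpoint can take a few minutes to activate, the first requests may come back with a non-200 status (commonly 503 while the endpoint scales up). The sketch below reuses API_URL and headers from the example above and retries with a simple backoff; the retry count and wait time are arbitrary choices, not part of the original guide:

```python
import time

import requests

def query_with_retry(payload, retries=10, wait_seconds=30):
    # Retry while the endpoint is still warming up; give up after `retries` attempts.
    for _ in range(retries):
        response = requests.post(API_URL, headers=headers, json=payload)
        if response.status_code == 200:
            return response.json()
        time.sleep(wait_seconds)
    response.raise_for_status()
    return response.json()

output = query_with_retry({
    "inputs": "[INST] Recommend me some Hollywood movies [/INST]",
    "parameters": {}
})
```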

Dataset

The model is trained using the dataset available at snorkelai/Snorkel-Mistral-PairRM-DPO-Dataset. It utilizes only the prompts from the UltraFeedback dataset, and no external LLM responses are used.
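
To take a quick look at the data, the dataset can be loaded with the datasets library. This is only a sketch; check the dataset card for the exact split names and column layout:

```python
from datasets import load_dataset

# Download the DPO dataset and inspect its splits and columns.
dataset = load_dataset("snorkelai/Snorkel-Mistral-PairRM-DPO-Dataset")
print(dataset)  # available splits and their sizes

first_split = next(iter(dataset.values()))
print(first_split.column_names)  # e.g. prompt / chosen / rejected style columns
print(first_split[0])            # one example record
```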

Methodology

  1. Generate response variations using the LLM – Mistral-7B-Instruct-v0.2.
  2. Apply PairRM for response re-ranking.
  3. Update the LLM using Direct Preference Optimization (DPO).
  4. Repeat the process for three iterations.

More detailed results and findings can be found on the Snorkel blog.

Training Recipe

  • The data is formatted to be compatible with Hugging Face’s Zephyr recipe.

Key Premises

  • Specialization Requirement: Additional fine-tuning and alignment are necessary for enterprise use cases.
  • Ease of Model Building: Creating ranking/scoring/classification models is simpler than developing high-quality, manually annotated datasets for long-form responses.
  • Alignment Recipe: Using smaller but specialized teacher models can incrementally align LLMs towards specific axes.

Applications

The model is designed for general-purpose alignment. However, for specialized internal reward models, please contact the Snorkel AI team or consider attending the Enterprise LLM Summit.

Result

On Alpaca-Eval 2.0:

  • The model scored 30.22, ranking 3rd at the time of publication.
  • Using post-processing with PairRM-best-of-16, the model scored 34.86, ranking 2nd. The best model on the leaderboard is “gpt-4-turbo”.

For the current goal of aligning with general “human preferences,” Alpaca-Eval 2.0 serves as a suitable and representative benchmark.

Limitations

The model is a quick demonstration and does not have moderation mechanisms.

Acknowledgements

  • The Mistral AI Team for developing and releasing the Mistral-7B-Instruct-v0.2 model.
  • The authors of the Direct Preference Optimization paper.
  • The authors of the Pairwise Reward Model for LLMs paper.
  • The HuggingFace team for the DPO implementation under The Alignment Handbook.

We also acknowledge contemporary work published by other researchers in the same field.

The Snorkel AI Team

Hoang Tran, Chris Glaze, Braden Hancock

For more detailed information and resources, please visit the official Hugging Face website and documentation.

We offer a temporary HF space for everyone to try out the model: Snorkel-Mistral-PairRM-DPO Space

We also provide an inference endpoint for everyone to test the model.
It may initially take a few minutes to activate, but will eventually operate at the standard speed of HF’s 7B model text inference endpoint.
The speed of inference depends on HF endpoint performance and is not related to Snorkel offerings.
This endpoint is designed for initial trials, not for ongoing production use. Have fun!

```python
import requests

API_URL = "https://t1q6ks6fusyg1qq7.us-east-1.aws.endpoints.huggingface.cloud"
headers = {
    "Accept": "application/json",
    "Content-Type": "application/json"
}

def query(payload):
    response = requests.post(API_URL, headers=headers, json=payload)
    return response.json()

output = query({
    "inputs": "[INST] Recommend me some Hollywood movies [/INST]",
    "parameters": {}
})
```
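
The empty "parameters" field in the request above can carry standard text-generation-inference options such as max_new_tokens or temperature. The values below are illustrative only; which options the endpoint honors depends on how it is configured:

```python
# Reuses the query() helper defined above. Parameter names follow the
# text-generation-inference API; values here are arbitrary examples.
output = query({
    "inputs": "[INST] Recommend me some Hollywood movies [/INST]",
    "parameters": {
        "max_new_tokens": 256,  # cap on the length of the generated reply
        "temperature": 0.7,     # sampling temperature
        "top_p": 0.9,           # nucleus sampling threshold
        "do_sample": True
    }
})
```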



Dataset:

Training dataset: snorkelai/Snorkel-Mistral-PairRM-DPO-Dataset

We utilize ONLY the prompts from UltraFeedback; no external LLM responses are used.



Methodology:

  1. Generate five response variations for each prompt in a subset of 20,000 UltraFeedback prompts using the LLM – to start, we used Mistral-7B-Instruct-v0.2.
  2. Apply PairRM for response reranking.
  3. Update the LLM by applying Direct Preference Optimization (DPO) on the top (chosen) and bottom (rejected) responses.
  4. Use this LLM as the base model for the next iteration, repeating three times in total.

This overview provides a high-level summary of our approach.
We plan to release more detailed results and findings in the coming weeks on the Snorkel blog.
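
As a rough illustration of one such iteration, the sketch below generates candidate responses with transformers and leaves the PairRM ranking as a placeholder; rank_with_pairrm and build_dpo_pair are hypothetical helpers for this sketch, not part of any released code:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE_MODEL = "mistralai/Mistral-7B-Instruct-v0.2"
N_RESPONSES = 5  # five variations per prompt, as described above

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
model = AutoModelForCausalLM.from_pretrained(BASE_MODEL, device_map="auto")

def generate_candidates(prompt: str, n: int = N_RESPONSES) -> list[str]:
    """Sample n response variations for a single prompt."""
    text = f"[INST] {prompt} [/INST]"
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    outputs = model.generate(
        **inputs,
        do_sample=True,
        max_new_tokens=512,
        num_return_sequences=n,
    )
    prompt_len = inputs["input_ids"].shape[1]
    return [tokenizer.decode(o[prompt_len:], skip_special_tokens=True) for o in outputs]

def rank_with_pairrm(prompt: str, candidates: list[str]) -> list[str]:
    """Placeholder: return candidates ordered best-to-worst by a PairRM reward model."""
    raise NotImplementedError("plug in a PairRM-based ranker here")

def build_dpo_pair(prompt: str) -> dict:
    ranked = rank_with_pairrm(prompt, generate_candidates(prompt))
    # The top-ranked response becomes "chosen", the bottom-ranked "rejected";
    # the resulting pairs feed the DPO update of the base model.
    return {"prompt": prompt, "chosen": ranked[0], "rejected": ranked[-1]}
```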

The prompt format follows the Mistral model:

[INST] {prompt} [/INST]
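
For reference, the same format can also be produced from the base model's chat template via transformers (a small sketch; the exact output may include special tokens such as the beginning-of-sequence marker):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")
messages = [{"role": "user", "content": "Recommend me some Hollywood movies"}]
# Renders the conversation into the "[INST] ... [/INST]" prompt format.
prompt = tokenizer.apply_chat_template(messages, tokenize=False)
print(prompt)
```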



Training recipe:

  • The provided data is formatted to be compatible with Hugging Face’s Zephyr recipe.
    We executed the n-th DPO iteration using the "train/test_iteration_{n}" splits; a rough sketch of one such iteration follows below.
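
The sketch below outlines what one DPO iteration could look like with trl's DPOTrainer. The split names follow the "train/test_iteration_{n}" convention above, the keyword arguments vary across trl releases, and the preprocessing that the Zephyr recipe normally performs is omitted, so treat this as an outline rather than the exact training code:

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import DPOTrainer

ITERATION = 1
BASE_MODEL = "mistralai/Mistral-7B-Instruct-v0.2"  # or the checkpoint from the previous iteration

# Split names assumed from the "train/test_iteration_{n}" convention above.
train_ds = load_dataset("snorkelai/Snorkel-Mistral-PairRM-DPO-Dataset",
                        split=f"train_iteration_{ITERATION}")
eval_ds = load_dataset("snorkelai/Snorkel-Mistral-PairRM-DPO-Dataset",
                       split=f"test_iteration_{ITERATION}")

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
model = AutoModelForCausalLM.from_pretrained(BASE_MODEL)

# Exact DPOTrainer arguments differ between trl versions; this mirrors the
# commonly documented interface with prompt/chosen/rejected style data.
trainer = DPOTrainer(
    model,
    ref_model=None,   # trl builds a frozen reference copy when None is passed
    beta=0.1,         # DPO temperature commonly used with the Zephyr recipe
    args=TrainingArguments(
        output_dir=f"dpo_iteration_{ITERATION}",
        per_device_train_batch_size=2,
        num_train_epochs=1,
    ),
    train_dataset=train_ds,
    eval_dataset=eval_ds,
    tokenizer=tokenizer,
)
trainer.train()
```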



Key Premises:

  • Specialization Requirement: For most enterprise use cases, using LLMs “off-the-shelf” falls short of production quality, necessitating additional fine-tuning and alignment.
  • Ease of Model Building: Creating ranking/scoring/classification models is simpler than developing high-quality, manually annotated datasets for long-form responses.
  • Alignment Recipe: Using smaller but specialized teacher models (reward models) can incrementally align LLMs towards specific axes.



Applications:

Unlike our customers, who have very specific use cases to align LLMs to, the AlpacaEval 2.0 leaderboard measures the ability of LLMs to follow general user instructions. With this demonstration, we focus on the general approach to alignment. Thus, we use a general-purpose reward model – the performant PairRM model – with the Mistral-7B-Instruct-v0.2 model as our base LLM.

If you are interested in building specialized internal reward models that reflect your enterprise's needs, please contact the Snorkel AI team or consider attending our Enterprise LLM Summit: Building GenAI with Your Data on January 25, 2024, to learn more about “Programmatically scaling human preferences and alignment in GenAI”.



Result:

On Alpaca-Eval 2.0:

After applying the above methodology:

  • This model scored 30.22 – ranked 3rd and the highest for an open-source base model at the time of publication.
  • When post-processing the model outputs with PairRM-best-of-16, which involved generating 16 responses and selecting the highest-scoring response by PairRM, we scored 34.86 – ranked 2nd; a sketch of this selection step follows below.
    The best model on the leaderboard is “gpt-4-turbo”, which is also the judge of optimal responses.
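
A condensed sketch of that best-of-16 selection step is shown below; pairrm_scores is a hypothetical placeholder standing in for the released PairRM reward model, and generate_fn for any response generator:

```python
N_CANDIDATES = 16

def pairrm_scores(prompt, candidates):
    """Placeholder: return one reward score per candidate from a PairRM model."""
    raise NotImplementedError("plug in a PairRM-based scorer here")

def best_of_n(prompt, generate_fn, n=N_CANDIDATES):
    # Sample n responses and keep the one the reward model scores highest.
    candidates = [generate_fn(prompt) for _ in range(n)]
    scores = pairrm_scores(prompt, candidates)
    return max(zip(scores, candidates), key=lambda pair: pair[0])[1]
```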

We recognize that the Alpaca-Eval 2.0 benchmark does not entirely capture the full range of capabilities and performances of LLMs.
However, in our current work, where the goal is to align with general “human preferences,” Alpaca-Eval 2.0 serves as a suitable and representative benchmark.
Moving forward, we anticipate further contributions from the community regarding new alignment axes, and we will conduct evaluations using other appropriate benchmarks.

The Alpaca-Eval 2.0 evaluator, “gpt-4-turbo,” exhibits a bias towards longer responses. This tendency might also be present in our chosen reward model, resulting in our model producing lengthier responses after DPO iterations, which may be among the factors contributing to our higher rank on the leaderboard. Future work could include measures to control response length and other relevant metrics.



Limitations:

The model is a quick demonstration that the LLMs can be programmatically aligned using smaller specialized reward models.
It does not have any moderation mechanisms.
We look forward to continuing to engage with the research community and our customers to explore optimal methods for getting models to respect guardrails, allowing for deployment in environments requiring moderated outputs.



Contemporary Work and Acknowledgements:

  • The Mistral AI Team for developing and releasing the advanced Mistral-7B-Instruct-v0.2 model.
  • The authors of the Direct Preference Optimization paper for the innovative approach.
  • The authors of the Pairwise Reward Model for LLMs paper for the powerful general-purpose reward model.
  • The HuggingFace team for the DPO implementation under The Alignment Handbook.
  • We would also like to acknowledge contemporary work published independently on arXiv on 2024-01-18 by Meta & NYU (Yuan, et al) in a paper called Self-Rewarding Language Models,
    which proposes a similar general approach for creating alignment pairs from a larger set of candidate responses, but using the LLM as the reward model.
    While this may work for general-purpose models, our experience has shown that task-specific reward models guided by SMEs are necessary for most
    enterprise applications of LLMs for specific use cases, which is why we focus on the use of external reward models.



The Snorkel AI Team

Hoang Tran, Chris Glaze, Braden Hancock
