# Guide to Hugging Face

## Better Bilingual Multimodal Model

Hugging Face is a platform that hosts the Yi Visual Language (Yi-VL) model. Yi-VL is the open-source, multimodal version of the Yi Large Language Model (LLM) series, enabling content comprehension, recognition, and multi-round conversations about images. Below is an overview of Yi-VL and the resources available for it on Hugging Face.

### Models

Yi-VL has released the following versions:

– Yi-VL-6B
– Yi-VL-34B

### Features

Yi-VL offers the following features:

– Multi-round text-image conversations
– Bilingual text support (English and Chinese)
– Strong image comprehension
– Fine-grained image resolution (448×448)

### Architecture

Yi-VL adopts the LLaVA architecture, which consists of three primary components:

– Vision Transformer (ViT)
– Projection Module
– Large Language Model (LLM)

### Training

Yi-VL undergoes a comprehensive three-stage training process, with different sets of parameters trained at each stage.

### Limitations

This is the initial release of Yi-VL, and it comes with known limitations that should be carefully evaluated before adoption.

### Benchmarks

Yi-VL outperforms all existing open-source models on advanced benchmarks such as MMMU and CMMMU (based on data available up to January 2024).

### Showcases

The model card includes representative examples of detailed description and visual question answering that showcase the capabilities of Yi-VL.

### Quick Start

Inference can be performed using the code from LLaVA. Detailed steps for startup and system prompt modifications are provided on the Hugging Face platform.

### Hardware Requirements

For model inference, the recommended GPU examples are provided based on the version of Yi-VL being used.

### Acknowledgements and Attributions

This section provides a list of open-source projects used in the development of Yi-VL.

### License

The Yi series models are fully open for academic research and free for commercial use. The Yi Series Models Community License Agreement 2.1 governs the usage of these models.

For more information and detailed documentation, visit the [Hugging Face website](https://huggingface.co). Also, feel free to ask questions or discuss ideas on the [GitHub page](https://github.com/01-ai/Yi/discussions) or join their community on WeChat (Chinese).

Happy Hugging Face modeling! 🤗


## Yi-VL: Better Bilingual Multimodal Model

🤗 Hugging Face • 🤖 ModelScope • ✡️ WiseModel

👩‍🚀 Ask questions or discuss ideas on [GitHub](https://github.com/01-ai/Yi/discussions)!

👋 Join us on 💬 WeChat (Chinese)!

📚 Grow at the Yi Learning Hub!





### Overview

  • Yi Visual Language (Yi-VL) model is the open-source, multimodal version of the Yi Large Language Model (LLM) series, enabling content comprehension, recognition, and multi-round conversations about images.

  • Yi-VL demonstrates exceptional performance, ranking first among all existing open-source models in the latest benchmarks including MMMU in English and CMMMU in Chinese (based on data available up to January 2024).

  • Yi-VL-34B is the first open-source 34B vision language model worldwide.



### Models

Yi-VL has released the following versions:

  • Yi-VL-6B

  • Yi-VL-34B



### Features

Yi-VL offers the following features:

  • Multi-round text-image conversations: Yi-VL can take both text and images as inputs and produce text outputs. Currently, it supports multi-round visual question answering with one image.

  • Bilingual text support: Yi-VL supports conversations in both English and Chinese, including text recognition in images.

  • Strong image comprehension: Yi-VL is adept at analyzing visuals, making it an efficient tool for tasks like extracting, organizing, and summarizing information from images.

  • Fine-grained image resolution: Yi-VL supports image understanding at a higher resolution of 448×448.



### Architecture

Yi-VL adopts the LLaVA architecture, which is composed of three primary components:

  • Vision Transformer (ViT): it is initialized with the CLIP ViT-H/14 model and used for image encoding.

  • Projection Module: it is designed to align image features with the text feature space, consisting of a two-layer Multilayer Perceptron (MLP) with layer normalizations.

  • Large Language Model (LLM): it is initialized with Yi-34B-Chat or Yi-6B-Chat, demonstrating exceptional proficiency in understanding and generating both English and Chinese.

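The three components above can be pictured as a small pipeline. The following is a minimal, illustrative PyTorch sketch of that LLaVA-style flow (image encoder → two-layer MLP projection with layer norms → LLM). The class names, hidden sizes, and the `inputs_embeds` interface are assumptions for illustration, not the actual Yi-VL implementation.

```python
# Illustrative sketch only -- not the actual Yi-VL code. It mirrors the
# architecture described above: ViT image encoder -> two-layer MLP projection
# (with layer norms) -> LLM operating on the concatenated embeddings.
import torch
import torch.nn as nn


class ProjectionModule(nn.Module):
    """Two-layer MLP with layer normalizations that aligns image features
    with the text feature space (dimensions are placeholders)."""

    def __init__(self, vision_dim: int, text_dim: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(vision_dim, text_dim),
            nn.LayerNorm(text_dim),
            nn.GELU(),
            nn.Linear(text_dim, text_dim),
            nn.LayerNorm(text_dim),
        )

    def forward(self, image_features: torch.Tensor) -> torch.Tensor:
        return self.mlp(image_features)


class VisionLanguageModel(nn.Module):
    """Skeleton of the LLaVA-style pipeline: encode the image, project the
    patch features into the LLM embedding space, prepend them to the text
    embeddings, and let the LLM produce the answer."""

    def __init__(self, vision_tower: nn.Module, projector: ProjectionModule, llm: nn.Module):
        super().__init__()
        self.vision_tower = vision_tower   # e.g. a CLIP ViT-H/14 encoder
        self.projector = projector
        self.llm = llm                     # e.g. a Yi-6B-Chat / Yi-34B-Chat backbone

    def forward(self, pixel_values: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        image_features = self.vision_tower(pixel_values)   # (B, num_patches, vision_dim)
        image_embeds = self.projector(image_features)      # (B, num_patches, text_dim)
        inputs_embeds = torch.cat([image_embeds, text_embeds], dim=1)
        # Assumes an LLM that accepts inputs_embeds, as Hugging Face transformers models do.
        return self.llm(inputs_embeds=inputs_embeds)
```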



### Training



#### Training process

Yi-VL is trained to align visual information with the semantic space of the Yi LLM through a comprehensive three-stage training process:

  • Stage 1: The parameters of ViT and the projection module are trained using an image resolution of 224×224. The LLM weights are frozen. The training leverages an image caption dataset comprising 100 million image-text pairs from LAION-400M. The primary objective is to enhance the ViT’s knowledge acquisition within our specified architecture and to achieve better alignment between the ViT and the LLM.

  • Stage 2: The image resolution of ViT is scaled up to 448×448, and the parameters of ViT and the projection module are trained. It aims to further boost the model’s capability for discerning intricate visual details. The dataset used in this stage includes about 25 million image-text pairs, such as LAION-400M, CLLaVA, LLaVAR, Flickr, VQAv2, RefCOCO, Visual7w and so on.

  • Stage 3: The parameters of the entire model (that is, ViT, projection module, and LLM) are trained. The primary goal is to enhance the model’s proficiency in multimodal chat interactions, thereby endowing it with the ability to seamlessly integrate and interpret visual and linguistic inputs. To this end, the training dataset encompasses a diverse range of sources, totalling approximately 1 million image-text pairs, including GQA, VizWiz VQA, TextCaps, OCR-VQA, Visual Genome, LAION GPT4V and so on. To ensure data balancing, we impose a cap on the maximum data contribution from any single source, restricting it to no more than 50,000 pairs.
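
To make the Stage 3 data-balancing rule concrete (no single source contributes more than 50,000 pairs), here is a small, hedged sketch; the dictionary-of-lists input format is an assumption for illustration, not the actual Yi-VL data pipeline.

```python
# Illustrative sketch of the Stage-3 data-balancing rule described above:
# no single source may contribute more than 50,000 image-text pairs.
import random

MAX_PAIRS_PER_SOURCE = 50_000


def balance_sources(pairs_by_source: dict[str, list]) -> list:
    """Cap each source's contribution, then pool everything together.

    `pairs_by_source` maps a source name (e.g. "GQA", "TextCaps") to its
    list of image-text pairs; the concrete record format is an assumption.
    """
    balanced = []
    for source, pairs in pairs_by_source.items():
        if len(pairs) > MAX_PAIRS_PER_SOURCE:
            pairs = random.sample(pairs, MAX_PAIRS_PER_SOURCE)
        balanced.extend(pairs)
    random.shuffle(balanced)
    return balanced
```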

Below are the parameters configured for each stage.

| Stage      | Global batch size | Learning rate | Gradient clip | Epochs |
|------------|-------------------|---------------|---------------|--------|
| Stage 1, 2 | 4096              | 1e-4          | 0.5           | 1      |
| Stage 3    | 256               | 2e-5          | 1.0           | 2      |
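
As a rough illustration of how the stages differ in practice, with the LLM frozen in Stages 1 and 2 and the whole model trainable in Stage 3 using the hyperparameters from the table above, here is a hedged sketch that reuses the illustrative module names from the architecture sketch; the optimizer choice and function shape are assumptions, not the released training code.

```python
# Illustrative sketch of the stage-wise parameter freezing and hyperparameters
# described above. Module names (vision_tower, projector, llm) and the use of
# AdamW are assumptions for illustration.
import torch

STAGE_CONFIG = {
    "stage1_2": {"global_batch_size": 4096, "lr": 1e-4, "grad_clip": 0.5, "epochs": 1},
    "stage3":   {"global_batch_size": 256,  "lr": 2e-5, "grad_clip": 1.0, "epochs": 2},
}


def configure_stage(model, stage: str):
    """Freeze/unfreeze parameters for the given stage and build an optimizer."""
    cfg = STAGE_CONFIG[stage]
    train_llm = stage == "stage3"              # only Stage 3 trains the entire model
    for p in model.llm.parameters():
        p.requires_grad = train_llm
    for p in model.vision_tower.parameters():
        p.requires_grad = True                 # ViT is trained in every stage
    for p in model.projector.parameters():
        p.requires_grad = True                 # projection module is trained in every stage
    trainable = [p for p in model.parameters() if p.requires_grad]
    optimizer = torch.optim.AdamW(trainable, lr=cfg["lr"])
    # During training, gradients would be clipped with:
    #   torch.nn.utils.clip_grad_norm_(trainable, cfg["grad_clip"])
    return optimizer, cfg
```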






### Limitations

This is the initial release of Yi-VL, which comes with some known limitations. It is recommended to carefully evaluate potential risks before adopting the models.



### Benchmarks

Yi-VL outperforms all existing open-source models in MMMU and CMMMU, two advanced benchmarks that include massive multi-discipline multimodal questions (based on data available up to January 2024).




### Showcases

Representative examples of detailed description and visual question answering showcase the capabilities of Yi-VL.




### Quick start

You can perform inference using the code from LLaVA. For detailed steps, see simple startup for pretraining.

Notes:

  • You need to modify the system prompt as follows.

    This is a chat between an inquisitive human and an AI assistant. Assume the role of the AI assistant. Read all the images carefully, and respond to the human's questions with informative, helpful, detailed and polite answers. 这是一个好奇的人类和一个人工智能助手之间的对话。假设你扮演这个AI助手的角色。仔细阅读所有的图像,并对人类的问题做出信息丰富、有帮助、详细的和礼貌的回答。
    
    ### Human: <image_placeholder>
    What is it in the image?
    ### Assistant:
    
  • You need to set the parameter mm_vision_tower in config.json to the local ViT path.
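
The two notes above can be scripted. Below is a hedged helper sketch that rewrites mm_vision_tower in config.json and assembles a prompt in the documented template; the function names and example paths are illustrative assumptions, not part of the official code.

```python
# Illustrative helpers for the two notes above: point mm_vision_tower at a
# local ViT checkpoint and build a prompt in the documented template.
# Paths and function names are assumptions for illustration.
import json
from pathlib import Path

SYSTEM_PROMPT = (
    "This is a chat between an inquisitive human and an AI assistant. "
    "Assume the role of the AI assistant. Read all the images carefully, "
    "and respond to the human's questions with informative, helpful, "
    "detailed and polite answers. "
    "这是一个好奇的人类和一个人工智能助手之间的对话。假设你扮演这个AI助手的角色。"
    "仔细阅读所有的图像,并对人类的问题做出信息丰富、有帮助、详细的和礼貌的回答。"
)


def set_local_vision_tower(model_dir: str, vit_path: str) -> None:
    """Rewrite mm_vision_tower in the model's config.json to a local ViT path."""
    config_path = Path(model_dir) / "config.json"
    config = json.loads(config_path.read_text())
    config["mm_vision_tower"] = vit_path
    config_path.write_text(json.dumps(config, indent=2, ensure_ascii=False))


def build_prompt(question: str) -> str:
    """Assemble a single-round prompt in the documented format."""
    return (
        f"{SYSTEM_PROMPT}\n\n"
        f"### Human: <image_placeholder>\n"
        f"{question}\n"
        f"### Assistant:"
    )


if __name__ == "__main__":
    # Placeholder paths -- replace with your local model and ViT directories.
    set_local_vision_tower("./Yi-VL-6B", "/path/to/local/clip-vit-h-14")
    print(build_prompt("What is it in the image?"))
```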



### Hardware requirements

For model inference, the recommended GPU examples are:

  • Yi-VL-6B: RTX 3090, RTX 4090, A10, A30

  • Yi-VL-34B: 4 × RTX 4090, A800 (80 GB)
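
Before loading a checkpoint, it can be handy to check whether the visible GPUs roughly match these recommendations. The snippet below simply reports what PyTorch sees; it is a convenience sketch, not part of the official instructions.

```python
# Quick check of available CUDA devices and their total memory, to compare
# against the recommended GPUs listed above. Purely a convenience sketch.
import torch

if not torch.cuda.is_available():
    print("No CUDA device detected; Yi-VL inference requires a GPU.")
else:
    for idx in range(torch.cuda.device_count()):
        props = torch.cuda.get_device_properties(idx)
        total_gb = props.total_memory / 1024**3
        print(f"GPU {idx}: {props.name}, {total_gb:.1f} GB total memory")
```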



### Acknowledgements and attributions

This project makes use of open-source software/components. We acknowledge and are grateful to these developers for their contributions to the open-source community.



#### List of used open-source projects

  1. LLaVA
     • Authors: Haotian Liu, Chunyuan Li, Qingyang Wu, Yuheng Li, and Yong Jae Lee
     • Source: https://github.com/haotian-liu/LLaVA
     • License: Apache-2.0 license
     • Description: The codebase is based on LLaVA code.
  2. OpenClip
     • Authors: Gabriel Ilharco, Mitchell Wortsman, Ross Wightman, Cade Gordon, Nicholas Carlini, Rohan Taori, Achal Dave, Vaishaal Shankar, Hongseok Namkoong, John Miller, Hannaneh Hajishirzi, Ali Farhadi, and Ludwig Schmidt
     • Source: https://huggingface.co/laion/CLIP-ViT-H-14-laion2B-s32B-b79K
     • License: MIT
     • Description: The ViT is initialized using the weights of OpenClip.

#### Notes

  • This attribution does not claim to cover all open-source components used. Please check individual components and their respective licenses for full details.

  • The use of the open-source components is subject to the terms and conditions of the respective licenses.

We appreciate the open-source community for their invaluable contributions to the technology world.



### License

For the license of the source code, please refer to the acknowledgements and attributions as well as the individual components.

The Yi series models are fully open for academic research and free for commercial use, permissions of which are automatically granted upon application.

All usage must adhere to the Yi Series Models Community License Agreement 2.1.

For free commercial use, you only need to send an email to get official commercial permission.

