ABOUT ME

-

Today
-
Yesterday
-
Total
-
  • Llama 3.1
    Research/NLP_reference 2024. 9. 8. 17:38

    https://huggingface.co/blog/llama31 (July 23, 2024)


    Llama 3.1 - 405B, 70B & 8B with multilinguality and long context

    Llama 3.1 is out! Today we welcome the next iteration of the Llama family to Hugging Face. We are excited to collaborate with Meta to ensure the best integration in the Hugging Face ecosystem. Eight open-weight models (3 base models and 5 fine-tuned ones) are available on the Hub.

     

    Llama 3.1 comes in three sizes: 8B for efficient deployment and development on consumer-size GPU, 70B for large-scale AI native applications, and 405B for synthetic data, LLM as a Judge or distillation. All three come in base and instruction-tuned variants.

     

    In addition to the six generative models, Meta released two new models: Llama Guard 3 and Prompt Guard. Prompt Guard is a small classifier that detects prompt injections and jailbreaks. Llama Guard 3 is a safeguard model that can classify LLM inputs and generations.

     

    Among the features and integrations being released, we have:

    • Models on the Hub
    • Hugging Face Transformers and TGI integration
    • Hugging Chat integration for Meta Llama 3.1 405B Instruct
    • Inference & Deployment Integration with Inference Endpoints, Google Cloud, Amazon SageMaker & DELL Enterprise Hub
    • Quantization for FP8, AWQ and GPTQ for easier inference
    • Fine-tuning Llama 3.1 8B on a single GPU with 🤗 TRL
    • Generate synthetic data using Llama 3.1 70B and 405B with Distilabel

    What's new with Llama 3.1?

    Why is Llama 3.1 so exciting? On top of the features the predecessor offers, Llama 3.1 has some key new features:

    • A large context length of 128K tokens (vs original 8K)
    • Multilingual capabilities
    • Tool usage capabilities
    • A very large dense model of 405 billion parameters
    • A more permissive license

    Let’s dive into these!

     

    The Llama 3.1 release introduces six new open LLM models based on the Llama 3 architecture. They come in three sizes: 8B, 70B, and 405B parameters, each with base (pre-trained) and instruct-tuned versions. All the variants support a context length of 128K tokens and 8 languages, including English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai. Llama 3.1 continues to use Grouped-Query Attention (GQA), an efficient representation that should help with longer contexts.

    In addition to these 6 language models, Llama Guard 3 and Prompt Guard were released.

    • Llama Guard 3 is the latest iteration in the Llama Guard family, fine-tuned on Llama 3.1 8B. It is built for production use cases, with a 128k context length and multilingual capabilities. Llama Guard 3 can classify LLM inputs (prompts) and responses to detect content that would be considered unsafe in a risk taxonomy.
    • Prompt Guard, on the other hand, is a small 279M parameter BERT-based classifier that can detect prompt injection and jailbreaking. It was trained on a large corpus of attacks and is suggested to be further fine-tuned with application-specific data.

    New in Llama 3.1 compared to Llama 3 is that the instruct models are fine-tuned on tool calling for agentic use cases. There are two built-in tools (search, mathematical reasoning with Wolfram Alpha) that can be expanded with custom JSON functions.

     

    The Llama 3.1 models were trained on over 15 trillion tokens on a custom-built GPU cluster with a total of 39.3M GPU hours (1.46M for 8B, 7.0M for 70B, 30.84M for 405B). We don’t know the exact details of the training dataset mix, and we can only guess it has a more diverse curation for multilingualism. Llama 3.1 Instruct has been optimized for instruction following and was trained on publicly available instruction datasets, as well as over 25M synthetically generated examples with supervised fine-tuning (SFT) and reinforcement learning with human feedback (RLHF). Meta developed LLM-based classifiers to filter and curate high-quality prompts and responses during the creation of the data mix.

     

    Regarding the licensing terms, Llama 3.1 comes with a very similar license with one key difference: it enables using model outputs that can be used to improve other LLMs. This means that synthetic data generation and distillation are allowed, even with different models! This is especially important for the 405B model, as discussed later. The license allows for redistribution, fine-tuning, and creation of derivative work and still requires derived models to include "Llama" at the beginning of their name, and any derivative works or services must mention "Built with Llama". For full details, please make sure to read the official license.


    How much memory does Llama 3.1 need?

    Llama 3.1 brings exciting advancements. However, running it requires careful consideration of your hardware resources. We broke down the memory requirements for both training and inference across the three model sizes.

    Inference Memory Requirements

    For inference, the memory requirements depend on the model size and the precision of the weights. Here's a table showing the approximate memory needed for different configurations:

     

    Note: The above-quoted numbers indicate the GPU VRAM required just to load the model checkpoint. They don’t include torch reserved space for kernels or CUDA graphs.

     

    As an example, an H100 node (of 8x H100) has ~640GB of VRAM, so the 405B model would need to be run in a multi-node setup or run at a lower precision (e.g. FP8), which would be the recommended approach.

     

    Keep in mind that lower precision (e.g., INT4) may result in some loss of accuracy but can significantly reduce memory requirements and increase inference speed. In addition to the model weights, you will also need to keep the KV Cache in memory. It contains keys and values of all the tokens in the model’s context such that they don’t need to be recomputed when generating a new token. Especially when making use of the long available context length, it becomes a significant factor. In FP16, the KV cache memory requirements are:

    Especially for the small model the cache uses as much memory as the weights when approaching the context length maximum.

    Training Memory Requirements

    The following table outlines the approximate memory requirements for training Llama 3.1 models using different techniques:

    Note: These are estimated values and may vary based on specific implementation details and optimizations.


    Llama 3.1 evalation

    Note: We are currently evaluating Llama 3.1 individually on the new Open LLM Leaderboard 2 and will update this section later today. Below is an excerpt from the official evaluation from Meta.


    Using Hugging Face Transformers

    Llama 3.1 requires a minor modeling update to handle RoPE scaling effectively. With Transformers release 4.43.2, you can use the new Llama 3.1 models and leverage all the tools within the Hugging Face ecosystem. Make sure to use the latest transformers release:

    pip install "transformers>=4.43.2" --upgrade

     

    A couple of details:

    • Transformers loads the model in bfloat16 by default. This is the type used by the original checkpoint published by Meta, so it’s the recommended way to run to ensure the best precision or conduct evaluations.
    • Assistant responses may end with the special token <|eot_id|>, but we must also stop generation if the regular EOS token is found. We can stop generation early by providing a list of terminators in the eos_token_id parameter.
    • We used the default sampling parameters (temperature and top_p) taken from the original meta codebase. We haven’t had time to conduct extensive tests yet, feel free to explore!

    The following snippet shows how to use meta-llama/Meta-Llama-3.1-8B-Instruct. It requires about 16 GB of VRAM, which fits many consumer GPUs. The same snippet works for meta-llama/Meta-Llama-3.1-70B-Instruct, which, at 140GB of VRAM & meta-llama/Meta-Llama-3.1-405B-Instruct (requiring 810GB VRAM), makes it a very interesting model for production use cases. Memory consumption can be further reduced by loading in 8-bit or 4-bit mode.

    from transformers import pipeline
    import torch
    
    model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"
    pipe = pipeline(
        "text-generation",
        model=model_id,
        model_kwargs={"torch_dtype": torch.bfloat16},
        device="cuda",
    )
    
    messages = [
        {"role": "user", "content": "Who are you? Please, answer in pirate-speak."},
    ]
    outputs = pipe(
        messages,
        max_new_tokens=256,
        do_sample=False,
    )
    assistant_response = outputs[0]["generated_text"][-1]["content"]
    print(assistant_response)
    # Arrrr, me hearty! Yer lookin' fer a bit o' information about meself, eh? Alright then, matey! I be a language-generatin' swashbuckler, a digital buccaneer with a penchant fer spinnin' words into gold doubloons o' knowledge! Me name be... (dramatic pause)...Assistant! Aye, that be me name, and I be here to help ye navigate the seven seas o' questions and find the hidden treasure o' answers! So hoist the sails and set course fer adventure, me hearty! What be yer first question?

     

    You can also automatically quantize the model, loading it in 8-bit or even 4-bit mode with bitsandbytes. 4-bit loading of the large 70B version takes about 34 GB of memory to run. This is how you’d load the generation pipeline in 4-bit:

    pipeline = pipeline(
        "text-generation",
        model=model_id,
        model_kwargs={
            "torch_dtype": torch.bfloat16,
            "quantization_config": {"load_in_4bit": True}
        },
    )

     

    For more details on using the models with transformers, please check the model cards.

     

    Note: Transformers takes care of all pesky prompt template issues and more, if you want to know more about prompting then check out the next section.


    How to prompt Llama 3.1

    The base models have no prompt format. Like other base models, they can be used to continue an input sequence with a plausible continuation or for zero-shot/few-shot inference. They are also a great foundation for fine-tuning your own use cases.

     

    The Instruct versions support conversational format with 4 roles:

    1. system: Sets the context for the conversation. It allows including rules, guidelines, or necessary information that help to respond effectively. It’s also used to enable tool use when appropriate.
    2. user: User inputs, commands, and questions for the models.
    3. assistant: The assistant's response, based on the context provided in the ‘system’ and ‘user’ prompts.
    4. ipython: A new role introduced in Llama 3.1. This role is used as the output of a tool call when sent back to the LLM.

    The Instruct versions use the following conversation structure for simple conversations:

    <|begin_of_text|><|start_header_id|>system<|end_header_id|>
    
    {{ system_prompt }}<|eot_id|><|start_header_id|>user<|end_header_id|>
    
    {{ user_msg_1 }}<|eot_id|><|start_header_id|>assistant<|end_header_id|>
    
    {{ model_answer_1 }}<|eot_id|>

     

    Llama 3.1 Instruct models now support tool calling, including three built-in tools (brave_search, wolfram_alpha, and code_interpreter) and custom tool calling via JSON function calling. The built-in tools use Python syntax. The ability to output Python code for function calling is part of the code interpreter tool, which must be enabled in the system prompt using the Environment keyword, as shown below.

    Built-in Tool calling

    Including “Environment: ipython” turns on the code interpreter mode, and the model can generate Python code that it expects to be executed. The message body of the assistant response starts with a special tag <|python_tag|> and ends with <|eom_id|> instead of just the standard <|eot_id|>. The latter indicates the turn is finished, while the former indicates continued multi-step reasoning.

    <|begin_of_text|><|start_header_id|>system<|end_header_id|>
    
    
    Environment: ipython
    Tools: brave_search, wolfram_alpha
    
    Cutting Knowledge Date: 01 March 2023
    Today's Date: 13 July 2024
    
    
    You are a helpful Assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>
    
    Weather in Menlo Park, California<|eot_id|><|start_header_id|>assistant<|end_header_id|>

     

    The response from the model at this point would include Python code to call one of the supported tools (brave_search in this case):

    <|python_tag|>brave_search.call(query="current weather in Menlo Park, California")<|eom_id|>

     

    The response from executing the call is then sent back to the model to retrieve the final response. For brevity, the following would be appended to the message shown in the previous snippet:

    <|python_tag|>brave_search.call(query="Menlo Park California weather")<|eom_id|><|start_header_id|>ipython<|end_header_id|>
    
    {"query": "Menlo Park California weather", "top_k": [{"title": "10-Day Weather Forecast for West Menlo Park, CA - The Weather Channel | weather.com", "url": "https://weather.com/weather/tenday/l/West+Menlo+Park+CA?canonicalCityId=b2375713aa1943aad7d1a13a85e1c0adad13c1b10563b2bbaad70734dc61cf11", "description": "Be prepared with the most accurate 10-day forecast for West <strong>Menlo</strong> <strong>Park</strong>, CA with highs, lows, chance of precipitation from The <strong>Weather</strong> Channel and <strong>Weather</strong>.com", "type": "search_result"},....}<|eot_id|><|start_header_id|>assistant<|end_header_id|>

     

    The final response from the LLM would then be:

    The current weather in Menlo Park, California is mostly sunny with a high of 77°F and a low of 56°F.<|eot_id|>

    Custom Tool calling

    Llama 3.1 Instruct supports custom function calls from a single user message. The following prompts provide an example of how custom functions can be called from the output of the model. In custom function calling, the model outputs <|eot_id|> instead of <|eom_id|>. The system prompt needs to be adjusted to inform the model how to deal with function call outputs.

    <|begin_of_text|><|start_header_id|>system<|end_header_id|>
    
    You are a helpful assistant with tool calling capabilities. When you receive a tool call response, use the output to format an answer to the orginal user question.<|eot_id|><|start_header_id|>user<|end_header_id|>
    
    Given the following functions, please respond with a JSON for a function call with its proper arguments that best answers the given prompt.
    
    Respond in the format {"name": function name, "parameters": dictionary of argument name and its value}. Do not use variables.
    
    {
        "type": "function",
        "function": {
        "name": "get_current_conditions",
        "description": "Get the current weather conditions for a specific location",
        "parameters": {
            "type": "object",
            "properties": {
            "location": {
                "type": "string",
                "description": "The city and state, e.g., San Francisco, CA"
            },
            "unit": {
                "type": "string",
                "enum": ["Celsius", "Fahrenheit"],
                "description": "The temperature unit to use. Infer this from the user's location."
            }
            },
            "required": ["location", "unit"]
        }
        }
    }
    
    Question: what is the weather like in Menlo Park?<|eot_id|><|start_header_id|>assitant<|end_header_id|>
    
    {"name": "get_current_conditions", "parameters": {"location": "Menlo Park, CA", "unit": "Fahrenheit"}}<|eot_id|><|start_header_id|>ipython<|end_header_id|>

     

    When we retrieve the output from the selected tool, we pass it back to the model using the same <|python_tag|> delimiter. <|python_tag|> does not imply Python use. It’s only meant to signal the beginning of outputs from any tool.

    <|python_tag|>{
        "tool_call_id": "get_current_conditions"
        "output": "Clouds giving way to sun Hi: 76° Tonight: Mainly clear early, then areas of low clouds forming Lo: 56°"
    }<|eot_id|><|start_header_id|>assistant<|end_header_id|>
    
    The weather in Menlo Park is currently cloudy with a high of 76° and a low of 56°, with clear skies expected tonight.<|eot_id|>

     

    This format has to be exactly reproduced for effective use. The chat template available in transformers makes it straightforward to format the prompt correctly.

    Demo

    You can experiment with the three Instruct models in the following demos:

    The whole stack is open-source. Hugging Chat is powered by chat-ui and text-generation-inference.


    Llama 3.1 405B quantization with FP8, AWQ, and GPTQ

    Meta created an official FP8 quantized version of Llama 3.1 405B with minimal accuracy degradation. To achieve this, FP8 quantization was only applied to the major linear operators of the model, such as the gate and up and down projections for the FFNs (covering 75% of the inference FLOPs). We worked together to ensure that this FP8 quantization checkpoint is compatible across the community (transformers, TGI, VLLM).

     

    Additionally, we created AWQ and GPTQ quantized variants in INT4 with AutoAWQ and AutoGPTQ, respectively. For AWQ, all the linear layers were quantized using the GEMM kernels performing zero-point quantization down to 4 bits with a group size of 128; and for GPTQ the same setting only using the GPTQ kernels instead. We ensured that the INT4 checkpoints are compatible with transformers and TGI, including Marlin kernel support to speed up inference in TGI for the GPTQ quants.

     

    Available quantized weights for Llama 3.1 405B:

    The Hugging Quants organization contains quantized checkpoints for the 70B and 8B version as well.


    Inference Integrations

    Hugging Face Inference API

    Hugging Face PRO users now have access to exclusive API endpoints hosting Llama 3.1 8B Instruct, Llama 3.1 70B Instruct and Llama 3.1 405B Instruct AWQ powered by text-generation-inference. All versions support the Messages API, so they are compatible with OpenAI client libraries, including LangChain and LlamaIndex.

     

    Note: Update to the latest huggingface_hub version with pip install "huggingface_hub>=0.24.1.

    from huggingface_hub import InferenceClient
    
    # Initialize the client, pointing it to one of the available models
    client = InferenceClient()
    
    chat_completion = client.chat.completions.create(
        model="meta-llama/Meta-Llama-3.1-405B-Instruct-FP8",
        messages=[
            {"role": "system", "content": "You are a helpful an honest programming assistant."},
            {"role": "user", "content": "Is Rust better than Python?"},
        ],
        stream=True,
        max_tokens=500
    )
    
    # iterate and print stream
    for message in chat_completion:
        print(message.choices[0].delta.content, end="")

     

    For more details about the use of the Messages API, please check this post.

    Hugging Face Inference Endpoints

    You can deploy Llama 3.1 on Hugging Face's Inference Endpoints, which uses Text Generation Inference as the backend. Text Generation Inference is a production-ready inference container developed by Hugging Face with support for FP8, continuous batching, token streaming, tensor parallelism for fast inference on multiple GPUs. To deploy Llama 3.1, go to the model page and click on the Deploy -> Inference Endpoints widget:

    from huggingface_hub import InferenceClient
    
    # Initialize the client, pointing it to one of the available models
    client = InferenceClient(
        base_url="<ENDPOINT_URL>",
    )
    
    # Create a chat completion
    chat_completion = client.chat.completions.create(
        model="ENDPOINT",
        messages=[
            {"role": "system", "content": "You are a helpful an honest programming assistant."},
            {"role": "user", "content": "Is Rust better than Python?"},
        ],
        stream=True,
        max_tokens=500
    )
    
    # iterate and print stream
    for message in chat_completion:
        print(message.choices[0].delta.content, end="")

    Fine-tuning with Hugging Face TRL

    In this section, we’ll look at the tools available in the Hugging Face ecosystem to efficiently train Llama 3.1 on consumer-size GPUs. An example command to fine-tune Llama 3.1 8B on OpenAssistant’s chat dataset can be found below. We use 4-bit quantization and QLoRA to conserve memory to target all the attention blocks' linear layers.

     

    First, install the nightly version of 🤗 TRL and clone the repo to access the training script:

    pip install "transformers>=4.43.2" --upgrade
    pip install --upgrade bitsandbytes
    pip install --ugprade peft
    pip install git+https://github.com/huggingface/trl
    git clone https://github.com/huggingface/trl
    cd trl

     

    Then you can run the script:

    python \
        examples/scripts/sft.py \
        --model_name meta-llama/Meta-Llama-3.1-8B \
        --dataset_name OpenAssistant/oasst_top1_2023-08-25 \
        --dataset_text_field="text" \
        --per_device_train_batch_size 1 \
        --per_device_eval_batch_size 1 \
        --gradient_accumulation_steps 4 \
        --learning_rate 2e-4 \
        --report_to "none" \
        --bf16 \
        --max_seq_length 1024 \
        --lora_r 16 --lora_alpha 32 \
        --lora_target_modules q_proj k_proj v_proj o_proj \
        --load_in_4bit \
        --use_peft \
        --attn_implementation "flash_attention_2" \
        --logging_steps=10 \
        --gradient_checkpointing \
        --output_dir llama31

     

    If you have more GPUs to spare, you can run training with DeepSpeed and ZeRO Stage 3:

    accelerate launch --config_file=examples/accelerate_configs/deepspeed_zero3.yaml \
        examples/scripts/sft.py \
        --model_name meta-llama/Meta-Llama-3.1-8B \
        --dataset_name OpenAssistant/oasst_top1_2023-08-25 \
        --dataset_text_field="text" \
        --per_device_train_batch_size 1 \
        --per_device_eval_batch_size 1 \
        --gradient_accumulation_steps 4 \
        --learning_rate 2e-5 \
        --report_to wandb \
        --bf16 \
        --max_seq_length 1024 \
        --attn_implementation eager \
        --logging_steps=10 \
        --gradient_checkpointing \
        --output_dir models/llama

    Synthetic data generation with distilabel

    A big change in Llama 3.1’s license is that it allows using model outputs to improve other LLMs, which means you can generate synthetic datasets with Llama 3.1 models and use them to fine-tune smaller, more specialized models.

     

    Let’s look at an example of how to generate a preference dataset with distilabel, an open-source framework for synthetic data generation. This dataset can be used to fine-tune models with the preference optimization methods offered by TRL like DPO or KTO.

     

    First install the latest distilabel release including the hf-inference-endpoints extra with pip as follows:

    pip install “distilabel[hf-inference-endpoints]” --upgrade

     

    Then define a pipeline that:

    • loads a dataset with instructions from the Hugging Face Hub.
    • generates a response with Llama 3.1 70B Instruct and Llama 3.1 405B Instruct via Hugging Face Inference Endpoints.
    • finally, uses Llama 3.1 405B Instruct as a judge to rate the responses using UltraFeedback prompts. From these ratings, chosen and rejected responses can be selected and used to fine-tune a model with preference optimization methods.

    See the code below to define the pipeline or run it yourself using this Colab notebook and explore the generated dataset in the Hub.

    from distilabel.llms import InferenceEndpointsLLM
    from distilabel.pipeline import Pipeline
    from distilabel.steps import LoadDataFromHub, CombineColumns
    from distilabel.steps.tasks import TextGeneration, UltraFeedback
    
    llama70B = InferenceEndpointsLLM(
        model_id="meta-llama/Meta-Llama-3.1-70B-Instruct"
    )
    llama405B = InferenceEndpointsLLM(
        model_id="meta-llama/Meta-Llama-3.1-405B-Instruct-FP8"
    )
    
    with Pipeline(name="synthetic-data-with-llama3") as pipeline:
        # load dataset with prompts
        load_dataset = LoadDataFromHub(
            repo_id="argilla/10Kprompts-mini"
        )
        # generate two responses for each prompt
        generate = [
            TextGeneration(llm=llama70B),
            TextGeneration(llm=llama405B)
        ]
        # combine responses into one column
        combine = CombineColumns(
            columns=["generation", "model_name"],
            output_columns=["generations", "model_names"]
        )
        # rate responses with 405B LLM-as-a-judge
        rate = UltraFeedback(aspect="overall-rating", llm=llama405B)
        # define the pipeline
        load_dataset >> generate >> combine >> rate
    
    if __name__ == "__main__":
        distiset = pipeline.run()

     

    What’s next? Besides the example above, distilabel comes with exciting approaches for synthetic data generation with LLMs in a wide range of scenarios and topics. It includes implementations from the current SOTA literature for tasks like evaluating outputs with LLM-as-a-judge methods, evolving instructions, data filtering, as well as defining custom components.


    Additional Resources


     

    'Research > NLP_reference' 카테고리의 다른 글

    KV Cache  (0) 2024.09.29
    Gemma 2  (0) 2024.09.08
    Making LLMs even more accessible with bitsandbytes, 4-bit quantization and QLoRA  (0) 2024.09.08
    How to Successfully Run a LLM Fine-Tuning Project  (0) 2024.09.07
    RoPE  (0) 2024.08.27
Designed by Tistory.