Introduction to Phi-3

In this article, we will explore the Phi-3 model as presented in the official technical report published by Microsoft.

Phi-3 family of models.
Figure 1. Phi-3 family of models (source: https://azure.microsoft.com/en-us/products/phi-3).

Language models, especially Small Language Models (SLMs), have come a long way in the last couple of years. We have seen the scale of open-source language models drop from 70B to 30B to 14B and now even below 7B parameters, all while trying to keep performance close to that of large language models. In this regard, the Phi family of models from Microsoft researchers is arguably at the forefront. In this article, we will cover some of the most important aspects of the Phi-3 model as described in the Phi-3 technical report.

What will we cover in this article?

  • Why do we need Phi-3?
  • What is the architecture of Phi-3 and what are its variants?
  • What is the dataset collection strategy and training regime of Phi-3?
  • How does Phi-3 stack up against larger models and similar models in its class?

Note: This is not an in-depth guide to Phi-3. Rather, the aim is to condense the most important parts of the technical report into a quick overview of the model.

The Motivation Behind Phi-3

The primary reason behind the creation of Phi-3, especially the Phi-3 Mini with 3.8B parameters, is to rival larger models in the 7B parameter range. At the moment, the AI industry considers 7B models to be SLMs (Small Language Models). Although they are cheap in compute and can run fast enough locally, they are still difficult to deploy on mobile and edge devices.

This led to the creation of Phi-3 models at various scales, each competing with models containing many more parameters.

Phi-3 Architecture and Variants

The Phi-3 Mini model is based on the now well-known Transformer decoder architecture. There are two variants of the model, one with a 4K context length and another with a 128K context length. It uses the same tokenizer as Llama 2, with a vocabulary size of 32064.

The following table shows more details.

| Model | Tokenizer | Vocabulary | Other Details |
| --- | --- | --- | --- |
| Phi-3 Mini (3.8B) | Llama 2 | 32064 | 3072 hidden dimensions, 32 heads, 32 layers |
| Phi-3 Small (7B) | Tiktoken | 100352 | 4096 hidden dimensions, 32 heads, 32 layers |
| Phi-3 Medium (14B) | Llama 2 | 32064 | 5120 hidden dimensions, 40 heads, 40 layers |
Table 1. Phi-3 language model variants.
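
To sanity-check these numbers yourself, here is a minimal sketch that pulls the Phi-3 Mini configuration and tokenizer from the Hugging Face Hub. The hub id microsoft/Phi-3-mini-4k-instruct and the use of the transformers library are assumptions on my part; the report itself does not prescribe any loading code.

```python
# Minimal sketch (not from the report): inspect the Phi-3 Mini
# architecture details listed in Table 1 via Hugging Face transformers.
# Assumes the hub id "microsoft/Phi-3-mini-4k-instruct"; the 128K-context
# variant is published as "microsoft/Phi-3-mini-128k-instruct".
from transformers import AutoConfig, AutoTokenizer

model_id = "microsoft/Phi-3-mini-4k-instruct"

config = AutoConfig.from_pretrained(model_id, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)

print(config.hidden_size)              # 3072 hidden dimensions
print(config.num_attention_heads)      # 32 attention heads
print(config.num_hidden_layers)        # 32 layers
print(config.max_position_embeddings)  # 4096-token context for this variant
print(config.vocab_size)               # 32064, matching the Llama 2 tokenizer row
```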

Additionally, there is the Phi-3 Vision model, a multimodal model capable of processing image and text prompts. It combines a Vision Transformer encoder with a Transformer decoder; specifically, the encoder is CLIP ViT-L/14. The model is capable of natural image understanding, reasoning, and OCR.

Phi-3 vision capabilities.
Figure 2. Phi-3 vision capabilities (source: https://arxiv.org/abs/2404.14219).
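
As a quick illustration, below is a hedged sketch of prompting Phi-3 Vision with an image and a text question through transformers. The hub id microsoft/Phi-3-vision-128k-instruct, the <|image_1|> placeholder convention, and the local image path follow the public model card rather than the report, so treat them as assumptions.

```python
# Hedged sketch: image + text prompting with Phi-3 Vision.
# Model id and the "<|image_1|>" placeholder follow the public model
# card for "microsoft/Phi-3-vision-128k-instruct" (an assumption here).
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Phi-3-vision-128k-instruct"
model = AutoModelForCausalLM.from_pretrained(
    model_id, trust_remote_code=True, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

image = Image.open("receipt.png")  # placeholder path; any local image works
messages = [
    {"role": "user", "content": "<|image_1|>\nWhat text does this image contain?"}
]
prompt = processor.tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = processor(prompt, [image], return_tensors="pt").to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=128)
# Decode only the newly generated tokens, skipping the prompt.
answer = processor.batch_decode(
    output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
print(answer)
```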

Data Collection and Training Regime for Phi-3

Phi-3 follows the data collection strategy of the previous Phi models, i.e., heavily filtered web, textbook, and academic data.

The models were trained in two phases: the first used mostly web data to teach general knowledge and language understanding, while the second merged even more heavily filtered web data with synthetic data to teach logical reasoning and niche skills.

In the post-training phase, the models underwent SFT (Supervised Fine-Tuning) and DPO (Direct Preference Optimization).
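
For intuition, here is a minimal sketch of the DPO objective in plain PyTorch. This illustrates the general technique only; it is not code from the report, and the function name and default beta are my own.

```python
# Illustrative sketch of the DPO loss (not from the report).
# The log-probabilities are summed token log-probs of whole responses,
# from the policy being tuned and from a frozen reference model.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    # Implicit rewards: how much more (or less) likely the policy makes
    # each response compared to the reference model.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # -log(sigmoid(margin)) is small when the chosen response is clearly
    # preferred over the rejected one, pushing the policy toward it.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```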

You can find more details on fine-tuning Phi 1.5 with different methodologies in our earlier articles on this blog.

Phi-3 Benchmarks

The authors stack the Phi-3 models against several industry-leading models such as Mistral, Llama-3, Gemma, Mixtral, and GPT-3.5.

Interestingly, the Phi-3 Mini model with just 3.8B parameters beats almost all of the 7B parameter models on several benchmarks, including MMLU, HellaSwag, ARC, and HumanEval. In fact, it even beats the Llama-3 8B model in some cases, and the Phi-3 Small 7B model beats Llama-3 8B in all of the benchmarks.

The following table from the technical report gives a more detailed overview.

Phi-3 model benchmark.
Figure 3. Phi-3 model benchmarks (source: https://arxiv.org/abs/2404.14219).

Phi-3 on Mobile

Given the small size of Phi-3 Mini, the authors were able to run a 4-bit quantized version of the model natively on an iPhone with the A16 Bionic chip. The model generated more than 12 tokens per second.

Phi-3 running natively on iPhone.
Figure 4. Phi-3 running natively on iPhone.

This gives us a much better idea of what the world of Small Language Models will look like on mobile devices. We will be able to run highly performant models locally on our mobile and edge devices instead of making API calls.
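
The on-device deployment in the report relies on 4-bit quantization. As a rough desktop analogue, the snippet below loads Phi-3 Mini in 4-bit with bitsandbytes through transformers. The NF4 settings and the hub id are my assumptions; the report's actual iPhone setup is not described here and likely uses a dedicated mobile runtime.

```python
# Rough desktop analogue of the 4-bit deployment described above
# (not the report's actual iPhone setup). Assumes a CUDA GPU, the
# bitsandbytes library, and common NF4 quantization settings.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "microsoft/Phi-3-mini-4k-instruct"
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

messages = [{"role": "user", "content": "Summarize Phi-3 in one sentence."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=64)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```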

Notable Sections from the Technical Report to Go Through

Here are some sections from the technical report that are important but that we skipped for brevity:

  • Section 4 – Safety
  • Section 5 – Weakness

Summary and Conclusion

We covered a short overview of the Phi-3 technical report here. We started with the motivation for creating Phi-3, moved on to the model architectures, and briefly discussed the benchmarks. I hope this gives you a good idea of where Phi-3 stands in the LLM landscape. Going through the entire report will surely provide more insights.

If you have any doubts, thoughts, or suggestions, please leave them in the comment section. I will surely address them.

You can contact me using the Contact section. You can also find me on LinkedIn, and Twitter.

References

  • Abdin et al., "Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone," Microsoft. https://arxiv.org/abs/2404.14219
