Tiny Phi 3 Models Deliver GPT Power on Your Phone: Open Source and Ready for Production
2024-08-25
You’ve been watching these language models get bigger and bigger, thinking, “How the hell am I supposed to run a language model on anything other than a supercomputer?”
Everybody wants to maximize capabilities while keeping deployments cost-effective.
Whether you’re optimizing real-time interactions, autonomous systems, or apps that demand low latency, you want models to deliver the speed and efficiency you need.
And whether you deploy in the cloud, at the edge, or even on-device, your models should give you the flexibility to integrate AI where it matters most.
Small language models (SLMs) such as Phi-3-mini are the answer.
In this article, I will explain:
- why Phi-3-mini is important for developers
- what’s under the hood of Phi-3-mini and why it’s significant
- how you can use it with Ollama + OpenAI + Python
Let’s GO!
Phi is a family of open AI models developed by Microsoft.
Phi Open Models
Phi Open Models are Microsoft’s suite of highly efficient, small language models (SLMs) designed to deliver exceptional performance with minimal cost and ultra-low latency.
If you have not been following the releases, here’s a quick recap:
- Phi-1 was for Python coding
- Phi-1.5 was for reasoning and understanding
- Phi-2 was for language comprehension
- The latest release, Phi-3, targets language understanding and reasoning tasks
Phi-3 also performs well on coding benchmarks, and Microsoft states that you can use Phi-3 for production use-cases as it’s been through rigorous safety post-training.
Comparison of harmful response percentages, measured by the Microsoft AI Red Team, for Phi-3-mini before and after safety alignment.
The smallest one, Phi-3-mini, has 3.8 billion parameters, and it’s going toe-to-toe with the big guns like GPT-3.5 and Mixtral.
Have a look at the benchmarks from the Phi-3 technical report: Phi-3-mini scores around 69% on MMLU and 8.38 on MT-Bench, putting it right alongside Mixtral 8x7B and GPT-3.5.
This level of performance is achieved not by just cramming more data or layers into it, but by getting super smart about the data it’s trained on and how it’s architected.
Instead of just dumping tons of random data into the training pipeline and hoping for the best, the team behind Phi-3-mini focused on what they call a “data-optimal” regime.
Phi-3-mini is built on a transformer decoder architecture — nothing too crazy there — but it’s got 32 attention heads and 32 layers, with a hidden dimension of 3072.
What’s really cool is the blocksparse attention they’ve implemented, which is a method to keep the memory footprint low by applying sparsity to the key-value cache.
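To make the idea concrete, here’s a toy sketch in PyTorch (purely illustrative, not the actual Triton kernel): each block of queries attends to only a subset of earlier key blocks, so the skipped parts of the KV cache never need to be read.

import torch

# Toy block-sparse causal mask: 16 tokens split into blocks of 4.
# Illustrative assumption: each query block attends to its own block
# plus every second earlier block; the remaining KV-cache blocks are skipped.
seq_len, block = 16, 4
n_blocks = seq_len // block

block_mask = torch.zeros(n_blocks, n_blocks, dtype=torch.bool)
for q in range(n_blocks):
    for k in range(q + 1):              # causal: only look backwards
        if (q - k) % 2 == 0:            # local block + every 2nd past block
            block_mask[q, k] = True

# Expand the block-level mask to token level
token_mask = block_mask.repeat_interleave(block, dim=0).repeat_interleave(block, dim=1)
print(token_mask.int())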
They didn’t just stop at making it efficient — they made it fast too.
The training uses a custom Triton kernel based on Flash Attention (because who doesn’t love shaving milliseconds off their compute time?), and for inference, they’ve optimized everything with a paged attention kernel.
So, whether you’re running this thing on a server or ON YOUR PHONE, it’s going to be blazing fast.
Speaking of running it on your phone, that’s where things get really interesting. Phi-3-mini is designed with developers like us in mind.
It’s got a similar block structure and uses the same tokenizer as Llama-2 (with a vocabulary size of 32,064), which means you don’t have to reinvent the wheel to get this thing working in your existing projects.
Plus, it’s small enough to run locally on a modern smartphone, which is honestly wild when you think about what that means for offline AI capabilities.
The Game Changer: Big Brains in Tiny Packages
A language model with 3.8 billion parameters is considered small by today’s standards.
Historically, increasing a model’s size (i.e., adding more parameters) has been the primary way to improve performance.
This is because of scaling laws — larger models trained on more data generally yield better results.
However, larger models are computationally expensive and difficult to deploy on resource-constrained devices like smartphones.
Phi-3-mini defies this trend by achieving comparable performance with a fraction of the parameters.
This means we can now deploy powerful AI models on devices with limited hardware capabilities, such as smartphones, IoT devices, or even edge computing environments.
The great thing is that Phi-3-mini can be quantized to 4 bits, occupying just about 1.8 GB of memory, and it can generate more than 12 tokens per second on an iPhone 14 with an A16 Bionic chip, running fully offline.
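A quick back-of-the-envelope check shows where that number comes from (ignoring quantization overhead such as scales and zero-points):

params = 3.8e9            # Phi-3-mini parameter count
bits_per_param = 4        # 4-bit quantization
total_bytes = params * bits_per_param / 8
print(f"{total_bytes / 1024**3:.2f} GiB")  # ~1.77 GiB, consistent with the ~1.8 GB figure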
The key takeaway is that this model is small enough to be run locally on mobile devices, eliminating the need for constant cloud connectivity.
This is a game-changer for real-time, on-device AI applications.
I’ll explain why you should care as a developer, but let me add a brief and kind request here:
Stay in the loop
Why Should You Care as a Developer?
This is important especially in areas where connectivity is spotty, privacy is crucial, or low latency is required.
Think of rural healthcare, agriculture, battlefield operations, and intelligence gathering.
By embedding a powerful AI directly on the device, we can build applications that are faster, more responsive, and more secure.
User data remains on the device, which is a huge win for privacy-conscious applications.
Additionally, we reduce reliance on cloud infrastructure, cutting down on latency and operational costs.
It’s All About the Data
The breakthrough came from rethinking how data is used during training.
In fact, investing in data quality tends to pay off more reliably than simply adding parameters, improving both raw performance and reasoning ability.
That’s why the team focused on optimizing the quality of the data rather than just increasing its quantity.
Phi-3-mini leverages highly curated datasets and synthetic data:
- Data Curation: Heavily filtered web data, focused on quality over quantity. The training data was selected based on its educational value and its ability to improve the model’s reasoning abilities.
- Synthetic Data: Generated by other large language models, this data helps the smaller model learn patterns and reasoning skills more effectively.
As for the training phases:
- Phase 1 was general knowledge and language understanding from web sources.
- Phase 2 focused on logical reasoning and niche skills, mixing filtered web data with synthetic data.
This data-centric approach allows Phi-3-mini to perform tasks usually reserved for much larger models, making it a powerful tool even with its compact size.
Developer-Friendly Architecture
Phi-3-mini’s architecture is designed to be familiar and easy to work with, especially for developers who have experience with models like Llama-2.
It’s built on a transformer decoder architecture, which is a well-known architecture in the AI community.
Because Phi-3-mini uses a similar block structure and tokenizer as Llama-2, developers can adapt existing tools and packages with minimal effort.
This compatibility means you can get up and running quickly, leveraging existing knowledge and resources.
Here are some details:
- Transformer Decoder: Standard structure with 32 heads and 32 layers.
- Context Length: Default of 4K tokens, with a long-context variant supporting up to 128K via LongRope (Phi-3-mini-128K).
- Hidden Dimension: 3072, providing a good balance between computational efficiency and performance.
- Blocksparse Attention: Optimizes the use of memory and computational resources by applying sparsity patterns, reducing the KV cache needed during inference.
- Custom Kernels: Designed for speed, using Flash Attention for training and optimized kernels for inference.
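If you want to sanity-check those numbers yourself, a quick way is to load the model config from Hugging Face; the repo name and attribute names below assume the standard transformers Phi-3 config for microsoft/Phi-3-mini-4k-instruct.

from transformers import AutoConfig

# Assumes a recent transformers release with native Phi-3 support;
# older versions may require trust_remote_code=True.
config = AutoConfig.from_pretrained("microsoft/Phi-3-mini-4k-instruct")
print(config.num_hidden_layers)     # expected: 32
print(config.num_attention_heads)   # expected: 32
print(config.hidden_size)           # expected: 3072
print(config.vocab_size)            # expected: 32064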
To add, Phi-3-mini is not just a static model — it’s a platform that invites further customization and optimization.
The model is already chat-finetuned, but additional finetuning can be applied using your specific datasets.
Chat template of Phi-3-mini:
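For reference, the instruct variant uses a template along these lines (check the model card on Hugging Face for the exact special tokens):

<|system|>
You are a knowledgeable marketing assistant.<|end|>
<|user|>
What are the most effective strategies for social media marketing?<|end|>
<|assistant|>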
While primarily trained in English, the model’s architecture supports further training for multilingual tasks, making it a candidate for global applications.
By leveraging these capabilities, you can create AI applications that are not just powerful but also tailored to the specific needs of your users.
Running Phi-3-mini with Ollama, OpenAI and Python
Phi-3 is currently available through the Azure AI Studio model catalog, Hugging Face, and Ollama.
Let’s first set up a virtual environment and install libraries. Open your command line interface — this could be your Command Prompt, Terminal, or any other CLI tool you’re comfortable with — and run the following commands:
# Create a project folder and a virtual environment, then install the OpenAI SDK
mkdir phi-3-mini && cd phi-3-mini
python3 -m venv phi-3-mini-env
source phi-3-mini-env/bin/activate
pip3 install openai
Then set up Ollama by following the installation instructions on the official Ollama website (https://ollama.com).
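Once Ollama is installed, pull the Phi-3-mini model; the tag below matches the model name we use in the code later.

# Download the Phi-3-mini weights
ollama pull phi3:mini

# Optional: quick smoke test from the command line
ollama run phi3:mini "Write a one-line greeting."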
We will use the OpenAI client with Ollama here, since Ollama provides an OpenAI-compatible endpoint at "http://localhost:11434/v1".
import openai

client = openai.OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="nokeyneeded",
)
We can now use the OpenAI SDK to generate a response for a conversation.
This request asks the model for effective social media marketing strategies:
MODEL_NAME = "phi3:mini"

response = client.chat.completions.create(
    model=MODEL_NAME,
    temperature=0.7,
    n=1,
    messages=[
        {"role": "system", "content": "You are a knowledgeable marketing assistant."},
        {"role": "user", "content": "What are the most effective strategies for social media marketing?"},
    ],
)
print("Response:")
print(response.choices[0].message.content)
Prompt Engineering
The first message sent to the language model is known as the “system message” or “system prompt,” which sets the overall instructions for the model. You can customize this system prompt to guide the language model in generating output in a specific style or manner. Modify the SYSTEM_MESSAGE below to answer in a particular tone or persona, or take inspiration from other system prompts available online.
In this example, we’ll customize the system message to respond like a social media marketing expert. You can replace the prompt with your preferred style or tone.
SYSTEM_MESSAGE = """
You are a seasoned social media marketing expert with years of experience in helping businesses grow their online presence. Respond with detailed, actionable advice and a professional tone.
"""
USER_MESSAGE = """
What are the most effective strategies for increasing engagement on social media platforms?
"""
response = client.chat.completions.create(
    model=MODEL_NAME,
    temperature=0.7,
    n=1,
    messages=[
        {"role": "system", "content": SYSTEM_MESSAGE},
        {"role": "user", "content": USER_MESSAGE},
    ],
)
print("Response:")
print(response.choices[0].message.content)
Another approach to guide a language model is by using “few-shot” examples — a series of question-and-answer pairs that demonstrate the desired response style.
In the example below, we’re instructing the model to act like a social media marketing consultant by providing a few sample interactions.
The model is then prompted with a new question to generate a response based on the patterns it has learned from the examples.
Feel free to experiment with the SYSTEM_MESSAGE, EXAMPLES, and USER_MESSAGE to fit different scenarios.
SYSTEM_MESSAGE = """
You are a knowledgeable social media marketing consultant.
Instead of giving general advice, you provide specific, actionable recommendations.
"""
EXAMPLES = [
    (
        "How can I increase engagement on Instagram?",
        "Have you tried using Instagram Stories and engaging with followers through polls and questions?",
    ),
    (
        "What's the best time to post on Twitter?",
        "The best time to post on Twitter often depends on your target audience, but generally, early mornings and late afternoons work well.",
    ),
    (
        "How do I grow my LinkedIn network?",
        "Consider sharing industry-relevant articles and engaging with content from leaders in your field to increase visibility.",
    ),
]
USER_MESSAGE = "What kind of content works best for Facebook ads?"
response = client.chat.completions.create(
    model=MODEL_NAME,
    temperature=0.7,
    n=1,
    messages=[
        {"role": "system", "content": SYSTEM_MESSAGE},
        {"role": "user", "content": EXAMPLES[0][0]},
        {"role": "assistant", "content": EXAMPLES[0][1]},
        {"role": "user", "content": EXAMPLES[1][0]},
        {"role": "assistant", "content": EXAMPLES[1][1]},
        {"role": "user", "content": EXAMPLES[2][0]},
        {"role": "assistant", "content": EXAMPLES[2][1]},
        {"role": "user", "content": USER_MESSAGE},
    ],
)
print("Response:")
print(response.choices[0].message.content)
Retrieval Augmented Generation (RAG) enhances the accuracy of a language model for specific domains by first fetching relevant data from a knowledge source and then crafting a response based on that data.
In the following example, we assume you have a local CSV file containing social media marketing statistics.
The code reads the CSV file, searches for relevant data matching the user’s query, and uses that data to inform the generated response.
Since this approach involves additional processing, it might take longer to generate a response.
If the output isn’t well-grounded in the data, consider refining the system prompt or testing with different models. RAG is generally more effective with larger models or those fine-tuned for specific tasks.
import csv
SYSTEM_MESSAGE = """
You are a knowledgeable assistant that answers questions about social media marketing using data from a marketing statistics dataset.
You must use the data set to answer the questions, and you should not provide any information that is not in the provided sources.
"""
USER_MESSAGE = "What is the average engagement rate for Instagram in 2024?"
# Open the CSV and store its rows in a list
with open("social_media_stats.csv", "r") as file:
    reader = csv.reader(file)
    rows = list(reader)

# Normalize the user question: strip punctuation and lowercase it
normalized_message = USER_MESSAGE.lower().replace("?", "").replace("(", " ").replace(")", " ")

# Search the CSV for the user question using a very basic keyword match
# (assumes the first and sixth columns of the hypothetical CSV hold the searchable text)
words = normalized_message.split()
matches = []
for row in rows[1:]:
    # If any query word matches a word in the row, add the row to the matches
    if any(word in row[0].lower().split() for word in words) or any(word in row[5].lower().split() for word in words):
        matches.append(row)

# Format the matches as a markdown table, since language models understand markdown
matches_table = " | ".join(rows[0]) + "\n" + " | ".join(" --- " for _ in range(len(rows[0]))) + "\n"
matches_table += "\n".join(" | ".join(row) for row in matches)
print(f"Found {len(matches)} matches:")
print(matches_table)
# Now we can use the matches to ground the response
response = client.chat.completions.create(
    model=MODEL_NAME,
    temperature=0.7,
    n=1,
    messages=[
        {"role": "system", "content": SYSTEM_MESSAGE},
        {"role": "user", "content": USER_MESSAGE + "\nSources: " + matches_table},
    ],
)
print("Response:")
print(response.choices[0].message.content)
Have a look at the official page and cookbook released by Microsoft.
Wanna read more? Here’s some excellent bonus content for you.
Thank you for stopping by, and being an integral part of our community.
Happy building!