
Tiny Phi 3 Models Deliver GPT Power on Your Phone: Open Source and Ready for Production

2024-08-25


You’ve been watching these language models get bigger and bigger, thinking, “How the hell am I supposed to run a language model on anything other than a supercomputer?”


Everybody wants to maximize capabilities while keeping deployments cost-effective.


Whether you’re optimizing real-time interactions, autonomous systems, or apps that demand low latency, you want models to deliver the speed and efficiency you need.


Also whether you deploy it in the cloud, at the edge, or even on-device, they should give you the flexibility to integrate AI where it matters most.


SLMs such as Phi-3-mini are the answer.


In this article, I will explain what the Phi-3 family is, how Phi-3-mini pulls off big-model performance at a small size, and how to run it locally with Ollama, the OpenAI SDK, and Python.



Let’s GO!



Phi is a family of open AI models developed by Microsoft.


Phi Open Models


Phi Open Models are Microsoft’s suite of highly efficient, small language models (SLMs) designed to deliver exceptional performance with minimal cost and ultra-low latency.


If you have not been following the releases, here’s a quick recap: Phi-1 kicked things off in mid-2023 as a 1.3B-parameter model trained on textbook-quality code data, Phi-1.5 extended the recipe to common-sense reasoning, Phi-2 scaled it up to 2.7 billion parameters, and the Phi-3 family now spans Phi-3-mini (3.8B), Phi-3-small (7B), and Phi-3-medium (14B).



Phi-3 also performs well on coding benchmarks, and Microsoft states that you can use Phi-3 for production use-cases as it’s been through rigorous safety post-training.



Comparison of harmful response percentages, measured by the Microsoft AI Red Team, for phi-3-mini before and after safety alignment.


The smallest one, Phi-3-mini, has 3.8 billion parameters, and it’s going toe-to-toe with the big guns like GPT-3.5 and Mixtral.


Have a look at these benchmarks:



This level of performance is achieved not by just cramming more data or layers into it, but by getting super smart about the data it’s trained on and how it’s architected.


Instead of just dumping tons of random data into the training pipeline and hoping for the best, the team behind Phi-3-mini focused on what they call a “data-optimal” regime.



Phi-3-mini is built on a transformer decoder architecture — nothing too crazy there — but it’s got 32 attention heads and 32 layers, with a hidden dimension of 3072.
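
If you want to sanity-check those numbers yourself, here’s a minimal sketch, assuming you have the transformers library installed and can reach the microsoft/Phi-3-mini-4k-instruct repo on Hugging Face:

from transformers import AutoConfig

# Pull the published config for the 4K-context instruct variant
# (trust_remote_code is only needed on older transformers versions)
config = AutoConfig.from_pretrained(
    "microsoft/Phi-3-mini-4k-instruct", trust_remote_code=True
)

print(config.num_hidden_layers)    # 32 transformer layers
print(config.num_attention_heads)  # 32 attention heads
print(config.hidden_size)          # 3072 hidden dimension
print(config.vocab_size)           # 32064-token vocabulary, same family as Llama-2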


What’s really cool is the blocksparse attention the team built for Phi-3-small, the 7B sibling, which keeps the memory footprint low by applying sparsity over the key-value cache.



They didn’t just stop at making it efficient — they made it fast too.


That attention module is trained with a custom Triton kernel based on Flash Attention (because who doesn’t love shaving milliseconds off their compute time?), and for inference it plugs into an extended paged attention kernel in vLLM.


So, whether you’re running these models on a server or ON YOUR PHONE, they’re going to be blazing fast.


Speaking of running it on your phone, that’s where things get really interesting. Phi-3-mini is designed with developers like us in mind.


It’s got a similar block structure and uses the same tokenizer as Llama-2 (with a vocabulary size of 32,064), which means you don’t have to reinvent the wheel to get this thing working in your existing projects.


Plus, it’s small enough to run locally on a modern smartphone, which is honestly wild when you think about what that means for offline AI capabilities.


The Game Changer: Big Brains in Tiny Packages


A language model with 3.8 billion parameters is considered small by today’s standards.


Historically, increasing a model’s size (i.e., adding more parameters) has been the primary way to improve performance.


This is because of scaling laws — larger models trained on more data generally yield better results.


However, larger models are computationally expensive and difficult to deploy on resource-constrained devices like smartphones.


Phi-3-mini defies this trend by achieving comparable performance with a fraction of the parameters.


This means we can now deploy powerful AI models on devices with limited hardware capabilities, such as smartphones, IoT devices, or even edge computing environments.


The great thing is that Phi-3-mini can be quantized to 4 bits, occupying just 1.8GB of memory, and it’s capable of generating more than 12 tokens per second on an iPhone 14 with an A16 Bionic chip, running fully offline.
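
Those on-device numbers come from a 4-bit build running through an optimized mobile pipeline, but if you just want to see how small the weights get, here’s a rough sketch using 4-bit loading via bitsandbytes in transformers. It assumes a CUDA GPU and the microsoft/Phi-3-mini-4k-instruct checkpoint, and the exact footprint won’t match the phone build:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# NF4 4-bit quantization config -- an illustration only, not the exact scheme
# behind the iPhone numbers reported for Phi-3-mini
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Requires the accelerate and bitsandbytes packages
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-mini-4k-instruct",
    quantization_config=bnb_config,
    device_map="auto",
)

# Rough footprint of the quantized weights, in GB
print(f"{model.get_memory_footprint() / 1e9:.2f} GB")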



The key takeaway is that this model is small enough to be run locally on mobile devices, eliminating the need for constant cloud connectivity.


This is a game-changer for real-time, on-device AI applications.


I’ll explain why you should care as a developer, but first, a brief and kind request:


Stay in the loop


Why Should You Care as a Developer?


This is especially important in areas where connectivity is spotty, privacy is crucial, or low latency is required.


Think of rural healthcare, agriculture, battlefield operations, and intelligence gathering.


By embedding a powerful AI directly on the device, we can build applications that are faster, more responsive, and more secure.


User data remains on the device, which is a huge win for privacy-conscious applications.


Additionally, we reduce reliance on cloud infrastructure, cutting down on latency and operational costs.


It’s All About the Data


The breakthrough came from rethinking how data is used during training.


In fact, investing in data yields significant returns across all applications, typically enhancing performance, insights, and decision-making capabilities.


That’s why the team focused on optimizing the quality of the data rather than just increasing its quantity.


Phi-3-mini leverages highly curated datasets and synthetic data: heavily filtered public web content selected for educational value, plus LLM-generated synthetic data designed to teach reasoning and niche skills.



As for the training phases: training happens in two stages, the first built mostly on filtered web sources to teach general knowledge and language understanding, and the second blending even more heavily filtered web data with synthetic data to sharpen logical reasoning and niche skills.



This data-centric approach allows Phi-3-mini to perform tasks usually reserved for much larger models, making it a powerful tool even with its compact size.


Developer-Friendly Architecture


Phi-3-mini’s architecture is designed to be familiar and easy to work with, especially for developers who have experience with models like Llama-2.


It’s built on a standard transformer decoder, an architecture the AI community knows inside and out.


Because Phi-3-mini uses a similar block structure and tokenizer as Llama-2, developers can adapt existing tools and packages with minimal effort.


This compatibility means you can get up and running quickly, leveraging existing knowledge and resources.


Here are some details: Phi-3-mini ships in two context-length variants, a default 4K-token version and a 128K-token version that uses LongRoPE for long-context support, and both are released as instruction-tuned checkpoints you can pull straight from Hugging Face or Ollama.



What’s more, Phi-3-mini is not just a static model: it’s a platform that invites further customization and optimization.


The model is already chat-finetuned, but additional finetuning can be applied using your specific datasets.



Chat template of Phi-3-mini
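
To see what that template actually looks like, you can render a conversation with the Hugging Face tokenizer. This is a small sketch assuming transformers is installed; when you go through Ollama as we do below, the formatting is applied for you:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")

messages = [
    {"role": "user", "content": "Summarize why small language models matter."},
]

# Render the conversation with Phi-3's chat template; add_generation_prompt
# appends the <|assistant|> tag so the model knows it is its turn to speak
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(prompt)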


While primarily trained in English, the model’s architecture supports further training for multilingual tasks, making it a candidate for global applications.


By leveraging these capabilities, you can create AI applications that are not just powerful but also tailored to the specific needs of your users.


Running Phi-3-mini with Ollama, the OpenAI SDK, and Python


Phi-3 is currently available through the Azure AI Studio model catalog, Hugging Face, and Ollama.



Let’s first set up a virtual environment and install libraries. Open your command line interface — this could be your Command Prompt, Terminal, or any other CLI tool you’re comfortable with — and run the following commands:


# Create a virtual environment  
mkdir phi-3-mini && cd phi-3-mini  
python3 -m venv phi-3-mini-env  
source phi-3-mini-env/bin/activate  
  
pip3 install openai

Then set up Ollama by following the instructions on the Ollama website (ollama.com).
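
With Ollama installed, pull the model once so the weights are available locally; phi3:mini is the same tag we will reference from Python below:

ollama pull phi3:mini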


We will use the OpenAI client with Ollama here, since Ollama provides an OpenAI-compatible endpoint at http://localhost:11434/v1.

import openai  
  
client = openai.OpenAI(  
    base_url="http://localhost:11434/v1",  
    api_key="nokeyneeded",  
)

We can now use the OpenAI SDK to generate a response for a conversation.


This request asks our marketing assistant for effective social media strategies:

MODEL_NAME = "phi3:mini"  
  
response = client.chat.completions.create(  
    model=MODEL_NAME,  
    temperature=0.7,  
    n=1,  
    messages=[  
        {"role": "system", "content": "You are a knowledgeable marketing assistant."},  
        {"role": "user", "content": "What are the most effective strategies for social media marketing?"},  
    ],  
)  
  
print("Response:")  
print(response.choices[0].message.content)

Prompt Engineering

The first message sent to the language model is known as the “system message” or “system prompt,” which sets the overall instructions for the model. You can customize this system prompt to guide the language model in generating output in a specific style or manner. Modify the SYSTEM_MESSAGE below to answer in a particular tone or persona, or take inspiration from other system prompts available online.


In this example, we’ll customize the system message to respond like a social media marketing expert. You can replace the prompt with your preferred style or tone.


SYSTEM_MESSAGE = """  
You are a seasoned social media marketing expert with years of experience in helping businesses grow their online presence. Respond with detailed, actionable advice and a professional tone.  
"""  
  
USER_MESSAGE = """  
What are the most effective strategies for increasing engagement on social media platforms?  
"""  
  
response = client.chat.completions.create(
    model=MODEL_NAME,
    temperature=0.7,
    n=1,
    messages=[
        {"role": "system", "content": SYSTEM_MESSAGE},
        {"role": "user", "content": USER_MESSAGE},
    ],
)
  
print("Response:")  
print(response.choices[0].message.content)

Another approach to guide a language model is by using “few-shot” examples — a series of question-and-answer pairs that demonstrate the desired response style.


If you want to build full-stack GenAI SaaS products that people love, don’t miss out on our upcoming cohort-based course. Together, we’ll build, ship, and scale your GenAI product alongside a community of like-minded people!


In the example below, we’re instructing the model to act like a social media marketing consultant by providing a few sample interactions.


The model is then prompted with a new question to generate a response based on the patterns it has learned from the examples.


Feel free to experiment with the SYSTEM_MESSAGE, EXAMPLES, and USER_MESSAGE to fit different scenarios.


SYSTEM_MESSAGE = """  
You are a knowledgeable social media marketing consultant.  
Instead of giving general advice, you provide specific, actionable recommendations.  
"""  
  
EXAMPLES = [  
    (  
        "How can I increase engagement on Instagram?",  
        "Have you tried using Instagram Stories and engaging with followers through polls and questions?"  
    ),  
    (  
        "What's the best time to post on Twitter?",  
        "The best time to post on Twitter often depends on your target audience, but generally, early mornings and late afternoons work well."  
    ),  
    (  
        "How do I grow my LinkedIn network?",  
        "Consider sharing industry-relevant articles and engaging with content from leaders in your field to increase visibility."  
    ),  
]  
  
USER_MESSAGE = "What kind of content works best for Facebook ads?"  
  
response = client.chat.completions.create(  
    model=MODEL_NAME,  
    temperature=0.7,  
    n=1,  
    messages=[  
        {"role": "system", "content": SYSTEM_MESSAGE},  
        {"role": "user", "content": EXAMPLES[0][0]},  
        {"role": "assistant", "content": EXAMPLES[0][1]},  
        {"role": "user", "content": EXAMPLES[1][0]},  
        {"role": "assistant", "content": EXAMPLES[1][1]},  
        {"role": "user", "content": EXAMPLES[2][0]},  
        {"role": "assistant", "content": EXAMPLES[2][1]},  
        {"role": "user", "content": USER_MESSAGE},  
    ],  
)  
  
print("Response:")  
print(response.choices[0].message.content)

Retrieval Augmented Generation (RAG) enhances the accuracy of a language model for specific domains by first fetching relevant data from a knowledge source and then crafting a response based on that data.


In the following example, we assume you have a local CSV file containing social media marketing statistics.


The code reads the CSV file, searches for relevant data matching the user’s query, and uses that data to inform the generated response.


Since this approach involves additional processing, it might take longer to generate a response.


If the output isn’t well-grounded in the data, consider refining the system prompt or testing with different models. RAG is generally more effective with larger models or those fine-tuned for specific tasks.


import csv  
  
SYSTEM_MESSAGE = """  
You are a knowledgeable assistant that answers questions about social media marketing using data from a marketing statistics dataset.  
You must use the data set to answer the questions, and you should not provide any information that is not in the provided sources.  
"""  
  
USER_MESSAGE = "What is the average engagement rate for Instagram in 2024?"  
  
# Open the CSV and store in a list  
with open("social_media_stats.csv", "r") as file:  
    reader = csv.reader(file)  
    rows = list(reader)  
  
# Normalize the user question to replace punctuation and make lowercase  
normalized_message = USER_MESSAGE.lower().replace("?", "").replace("(", " ").replace(")", " ")  
  
# Search the CSV for user question using a very basic search  
words = normalized_message.split()  
matches = []  
for row in rows[1:]:
    # Very basic keyword search against the first and sixth columns,
    # which this sample CSV is assumed to contain (e.g. platform name and notes)
    if len(row) > 5 and (
        any(word in row[0].lower().split() for word in words)
        or any(word in row[5].lower().split() for word in words)
    ):
        matches.append(row)
  
# Format as a markdown table, since language models understand markdown  
matches_table = " | ".join(rows[0]) + "\n" + " | ".join(" --- " for _ in range(len(rows[0]))) + "\n"  
matches_table += "\n".join(" | ".join(row) for row in matches)  
print(f"Found {len(matches)} matches:")  
print(matches_table)  
  
# Now we can use the matches to generate a response  
response = client.chat.completions.create(  
    model=MODEL_NAME,  
    temperature=0.7,  
    n=1,  
    messages=[  
        {"role": "system", "content": SYSTEM_MESSAGE},  
        {"role": "user", "content": USER_MESSAGE + "\nSources: " + matches_table},  
    ],  
)  
  
print("Response:")  
print(response.choices[0].message.content)

Have a look at the official page and cookbook released by Microsoft.


Wanna read more? Here’s some excellent bonus content for you.


Say Hello to ‘Her’: Real-Time AI Voice Agents with 500ms Latency, Now Open Source

Fine-Tune Meta’s Latest AI Model: Customize Llama 3.1 5x Faster with 80% Less Memory

Fine Tuning FLUX: Personalize AI Image Models on Minimal Data for Custom Look and Feel

Data Management with Drizzle ORM, Supabase and Next.js for Web & Mobile Applications


Thank you for stopping by, and being an integral part of our community.


Happy building!