
Llama 3.2-Vision for High-Precision OCR with Ollama

2024-10-31


With the new Llama 3.2 release, Meta seriously leveled up here — now you’ve got vision models (11B and 90B) that don’t just read text but also analyze images, recognize charts, and even caption visuals.


Benchmarks for the vision instruction-tuned models are impressive: Meta reports that the 11B and 90B models are competitive with leading closed models such as Claude 3 Haiku and GPT-4o mini on image-understanding tasks.



Plus, they’ve got smaller, text-only models (1B and 3B) that fit right onto edge and mobile devices, and they’re surprisingly powerful for tasks like summarization and instruction-following.


One of the coolest parts is Llama Stack, which makes working with these models a breeze whether you’re deploying on-prem, in the cloud, or on mobile.


They’ve even optimized everything for Qualcomm, MediaTek, and Arm hardware, so you can run it all locally if you want — super fast and private.


Let me walk you through a local development workflow that you can try now.


Let’s GO!



The Birth of Llama 3.2


Setting up the environment for Llama 3.2-Vision and Ollama


llama3.2-vision requires Ollama 0.4.0, which is currently in pre-release. Here’s how to install it (from your home directory):

curl -L https://github.com/ollama/ollama/releases/download/v0.4.0-rc6/ollama-linux-amd64.tgz -o ollama-040-rc6.tgz  
sudo tar -C /usr -xzf ~/ollama-040-rc6.tgz


After the installation finishes, you can check the installed version:

cd /usr/bin  
./ollama --version



We can now start working with the Llama 3.2-Vision model. If the Ollama server isn’t already running, start it with ollama serve in a separate terminal, then run the following command:

ollama run x/llama3.2-vision:11b



The 11-billion-parameter model provides high accuracy on vision tasks. Let’s see it in action.


Now let’s create a virtual environment and install the required libraries.

mkdir llama32-ollama && cd llama32-ollama  

python3 -m venv llama32-ollama-env  
source llama32-ollama-env/bin/activate  

pip3 install ollama  
pip3 install pillow  
pip3 install ipykernel jupyter


Great. To continue, you can either create a .py file or an .ipynb notebook. I’ll use a Jupyter notebook so I can run code in blocks and inspect the results interactively.


Step 1: Importing Required Libraries


We begin by importing the libraries needed for image processing, base64 encoding, and interacting with the Llama 3.2 Vision model via ollama.

import base64  
import io  
from PIL import Image  
import ollama



Step 2: Encoding the Image to Base64


To send the image to the OCR model, we need to convert it into a base64 string format. This makes it easy to transmit the image data within API requests. Here’s the function to handle this process:

def encode_image_to_base64(image_path: str, format: str = "PNG") -> str:  
    """Encodes an image file to a base64 string.  

    Args:  
        image_path (str): Path to the image file.  
        format (str): Format to save the image in memory (default is PNG).  

    Returns:  
        str: Base64-encoded image.  
    """  
    with Image.open(image_path) as img:  
        buffered = io.BytesIO()  
        img.save(buffered, format=format)  
        return base64.b64encode(buffered.getvalue()).decode('utf-8')
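
As a quick sanity check, you can call the function on any local image and confirm it returns a long base64 string. The path below is the same placeholder used in the main block later on:

# Quick check: encode a sample image and inspect the result (placeholder path)  
sample_b64 = encode_image_to_base64("examples/image.png")  
print(len(sample_b64))           # length of the base64 string  
print(sample_b64[:60] + "...")   # preview the first few characters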





Step 3: Sending the Encoded Image to the OCR Model


With the image encoded in base64, we’re ready to interact with the Llama 3.2 Vision model to perform OCR. The function below sends the base64-encoded image along with a specific prompt and retrieves structured OCR output.

def get_ocr_output_from_image(image_base64: str, model: str = "x/llama3.2-vision:11b") -> str:  
    """Sends an image to the Llama OCR model and returns structured text output.  

    Args:  
        image_base64 (str): Base64-encoded image string.  
        model (str): The model version to use for OCR (default is the 11B Llama 3.2 Vision model pulled earlier).  

    Returns:  
        str: Extracted and structured text from the image.  
    """  
    response = ollama.chat(  
        model=model,  
        messages=[{  
            "role": "user",  
            "content": "The image is a book cover. Output should be in this format - <Name of the Book>: <Name of the Author>. Do not output anything else",  
            "images": [image_base64]  
        }]  
    )  
    return response.get('message', {}).get('content', '').strip()
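
Putting the two functions together in a notebook cell looks something like this. It assumes the x/llama3.2-vision:11b tag pulled earlier is available locally and uses a placeholder image path:

# Encode a book cover and run OCR on it (placeholder path, 11b tag pulled earlier)  
book_cover_b64 = encode_image_to_base64("examples/image.png")  
result = get_ocr_output_from_image(book_cover_b64, model="x/llama3.2-vision:11b")  
print(result)  # expected format: <Name of the Book>: <Name of the Author>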



Step 4: Bringing It All Together in the Main Block


Finally, we set up a main block to use the functions we’ve created. This helps to keep the code modular and allows us to run it directly as a script:

if __name__ == "__main__":  
    image_path = 'examples/image.png'  # Replace with your image path  
    base64_image = encode_image_to_base64(image_path)  
    ocr_text = get_ocr_output_from_image(base64_image)  
    print(ocr_text)
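
If you want to process more than one cover, a minimal extension is to loop over a folder of images. The sketch below assumes a hypothetical examples/ directory containing PNG files:

# Minimal batch sketch: OCR every PNG in a (hypothetical) examples/ folder  
from pathlib import Path  

for img_path in sorted(Path("examples").glob("*.png")):  
    b64 = encode_image_to_base64(str(img_path))  
    print(f"{img_path.name} -> {get_ocr_output_from_image(b64)}")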



There are plenty of interesting examples to try.



Let us know what you find out in the comments!


Full Code


Here’s the complete code:

import base64  
import io  
from PIL import Image  
import ollama  

def encode_image_to_base64(image_path: str, format: str = "PNG") -> str:  
    """Encodes an image file to a base64 string.  

    Args:  
        image_path (str): Path to the image file.  
        format (str): Format to save the image in memory (default is PNG).  

    Returns:  
        str: Base64-encoded image.  
    """  
    with Image.open(image_path) as img:  
        buffered = io.BytesIO()  
        img.save(buffered, format=format)  
        return base64.b64encode(buffered.getvalue()).decode('utf-8')  

def get_ocr_output_from_image(image_base64: str, model: str = "x/llama3.2-vision:11b") -> str:  
    """Sends an image to the Llama OCR model and returns structured text output.  

    Args:  
        image_base64 (str): Base64-encoded image string.  
        model (str): The model version to use for OCR (default is the 11B Llama 3.2 Vision model pulled earlier).  

    Returns:  
        str: Extracted and structured text from the image.  
    """  
    response = ollama.chat(  
        model=model,  
        messages=[{  
            "role": "user",  
            "content": "The image is a book cover. Output should be in this format - <Name of the Book>: <Name of the Author>. Do not output anything else",  
            "images": [image_base64]  
        }]  
    )  
    return response.get('message', {}).get('content', '').strip()  

if __name__ == "__main__":  
    image_path = 'examples/image.png'  # Replace with your image path  
    base64_image = encode_image_to_base64(image_path)  
    ocr_text = get_ocr_output_from_image(base64_image)  
    print(ocr_text)


Now, with Llama 3.2-Vision, you can handle everything in one go — analyzing the image, recognizing the text, and structuring the output — without needing to switch between multiple models.


This upgrade simplifies the workflow significantly, making it faster and more efficient.


Bonus Content: Building with AI


And don’t forget to have a look at some practitioner resources that we published recently: