
GOT-OCR2.0 in Action: Optical Character Recognition Applications and Code Examples

2024-10-31


I’ve been diving into GOT-OCR2.0 lately, and it’s pretty impressive.


I thought I’d walk you through some code examples and share what I’ve learned so far, since it could be a key component for some of your projects.


GOT-OCR2.0 stands for General OCR Theory 2.0, and it’s a fresh take on optical character recognition.


Traditional OCR systems (what they call OCR-1.0) usually involve complex pipelines with multiple modules — think element detection, region cropping, character recognition, and so on.


Each of these modules can be a pain to maintain and optimize.


GOT-OCR2.0 simplifies this by introducing an end-to-end architecture. It’s built on an encoder-decoder paradigm: a vision encoder compresses the image into a compact sequence of tokens, and a language-model decoder generates the recognized text (or formatted markup) directly from those tokens.




What’s really awesome about GOT-OCR2.0 is how it streamlines everything into a single model.


No more dealing with complicated pipelines or multiple modules — it’s all unified.
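To get a feel for how little glue code a unified model needs, here’s a minimal sketch using the weights published on Hugging Face. The repo id (ucaslcl/GOT-OCR2_0) and the custom chat() API are taken from the public model card, so treat them as assumptions and verify them before relying on this:

# Minimal sketch (assumptions): weights published as "ucaslcl/GOT-OCR2_0" on Hugging Face
# and a custom chat() API exposed via trust_remote_code; verify against the model card.
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ucaslcl/GOT-OCR2_0", trust_remote_code=True)
model = AutoModel.from_pretrained(
    "ucaslcl/GOT-OCR2_0",
    trust_remote_code=True,
    low_cpu_mem_usage=True,
    device_map="cuda",
    use_safetensors=True,
    pad_token_id=tokenizer.eos_token_id,
)
model = model.eval().cuda()

# One call: image in, recognized text out. No detection/cropping/recognition modules to glue together.
print(model.chat(tokenizer, "/path/to/image.png", ocr_type="ocr"))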


And it’s not just limited to text recognition.


This thing can handle sheet music, mathematical formulas, charts, and even geometric shapes. Plus, it supports outputs in LaTeX and Markdown, which is super handy for documentation and academic papers.
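For example, the same demo script we’ll use in Part 1 below can be asked for formatted output instead of plain text. Here’s a hedged sketch, assuming the repo’s --type format option (double-check the README for the exact flags); the image path is a placeholder:

# Hedged sketch: same demo script as in Part 1, but requesting formatted (LaTeX/Markdown-style) output.
# "--type format" is taken from the repo's demo options; verify against the current README.
import subprocess

ocr_dir = "/path/to/GOT-OCR2.0/GOT-OCR-2.0-master"  # adjust to your clone
command = [
    "python3",
    "GOT/demo/run_ocr_2.0.py",
    "--model-name", "./GOT_weights/",
    "--image-file", "images/formula.png",  # hypothetical image containing math
    "--type", "format",                    # formatted output instead of plain "ocr"
]
result = subprocess.run(command, capture_output=True, text=True, check=True, cwd=ocr_dir)
print(result.stdout)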



From a product development standpoint, there are some real advantages here: one model to deploy, monitor, and version instead of a chain of detection and recognition modules, and output formats (plain text, LaTeX, Markdown) that slot straight into downstream tooling.



Let’s see it in action!



Setting up the environment


Setting up GOT-OCR2.0 is straightforward. Here’s how you can get it running in your environment.


Prerequisites: Python 3.10, conda (or another virtual-environment manager), and an NVIDIA GPU with CUDA if you want the optional Flash-Attention speedup.



First, clone the repository and set up a virtual or conda environment:

git clone https://github.com/Ucas-HaoranWei/GOT-OCR2.0.git  
cd GOT-OCR2.0  
  
conda create -n got python=3.10 -y  
conda activate got


Then install the GOT-OCR2.0 package

pip install -e .


You can also install Flash-Attention, which accelerates the attention mechanism in transformer models.

pip install ninja  
pip install flash-attn --no-build-isolation


You’ll also need the pretrained GOT weights; the demo script used below expects them under ./GOT_weights/ in the repo. With that in place, you’re ready to use GOT-OCR2.0.


Part 1: Running the OCR Script and Capturing Raw Output


Here’s how you can get the raw output for a given image:

import os  
import subprocess  
  
def run_ocr(client, model_name, image_path):  
    # Define the OCR script directory  
    ocr_dir = "/path/to/GOT-OCR2.0/GOT-OCR-2.0-master"  
      
    # Get the relative path of the image file  
    rel_image_path = os.path.relpath(image_path, ocr_dir)  
  
    command = [  
        "python3",  
        "GOT/demo/run_ocr_2.0.py",  
        "--model-name",  
        "./GOT_weights/",  
        "--image-file",  
        rel_image_path,  
        "--type",  
        "ocr"  
    ]  
      
    try:  
        result = subprocess.run(  
            command,   
            capture_output=True,   
            text=True,   
            check=True,   
            cwd=ocr_dir  
        )  
        raw_output = result.stdout.strip()
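        # ...continued in Part 2 (extract the OCR text) and Part 3 (clean it up with an LLM) below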


In this first part, we’re setting up and executing the OCR script provided by GOT-OCR2.0 to process the image and capture the raw OCR output.





Then we construct a list called command with all the arguments needed to run the OCR script: the Python interpreter, the path to the demo script (GOT/demo/run_ocr_2.0.py), the --model-name pointing at the downloaded weights in ./GOT_weights/, the --image-file path relative to the repo, and --type ocr for plain-text recognition.



Then we use subprocess.run to execute the OCR command: capture_output=True and text=True capture stdout as a string, check=True raises an exception if the script fails, and cwd=ocr_dir runs it from the repo directory so the relative paths resolve.



For example, for an image like the following (a screenshot of the Ollama Llama 3.1 library page alongside a terminal session):



You will get the following output:

lana 3.1 bb Home( Open Web u https: / / ella ma. com/ library/ llama   
3.1 bb 88888888888888888888 888888888888888888  Blog Disc or d G   
it Hub Q Search models Models Sign in Download oll on a run l   
loma 3.1 bb > > > tell me a do d joke about computers Here' s one:   
Why did the computer goto the doctor?  Because it had a virus!  Get   
it?  > > > Hello my name is Dan Nice to meet you, Dan. So, I' ve got a   
bit of a\u201c byte\u201d of info on you now. . . ( sorry, couldn' t  
resist another dod joke)  > > > What is my name?  Your name is Dan!  
Me' ve already established that!  > > > Send a message. ( ? for help)   
Read me Meta Llama 3.1 405 B


Which is not very useful on its own, and that’s where you can use an LLM like Claude to clean it up and extract the meaningful parts.


Part 2: Extracting OCR Text and Preparing the Prompt


We will prepare a utility function and a prompt to extract text from raw OCR output:

import re

def extract_ocr_text(raw_output):
    # Find the OCR text starting from "<|im_start|>assistant\n\n"
    match = re.search(r'<\|im_start\|>assistant\n\n(.*?)(?=<\|im_end\|>|$)', raw_output, re.DOTALL)
    if match:
        return match.group(1).strip()
    else:
        return "No OCR text found"

# Extract the OCR text (this call happens inside run_ocr, after raw_output is captured)
ocr_text = extract_ocr_text(raw_output)
  
prompt = f"""  
The following text is raw OCR output from an image. Please extract any meaningful text or code snippets from it,   
ignoring any noise or irrelevant information. If it's code, format it properly. If it's text, clean it up and   
present it in a readable format.  
  
Raw OCR output:  
{ocr_text}  
  
Output should include ONLY extracted content. No intro, no outro, no formatting, no comments, no explanations, no nothing:  
"""


In the second part, we process the raw OCR output to extract meaningful content and prepare it for further refinement.
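If you want to sanity-check the regex in isolation, here’s a tiny made-up example. The <|im_start|>/<|im_end|> markers mimic the chat-template tokens the demo script prints around its answer; the sample string is invented for illustration:

# Uses the extract_ocr_text helper defined above; the sample string is made up.
sample = "<|im_start|>user\n<image>\nOCR: <|im_end|><|im_start|>assistant\n\nHello, GOT-OCR2.0!<|im_end|>"
print(extract_ocr_text(sample))  # -> Hello, GOT-OCR2.0!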



The prompt instructs the language model to extract any meaningful text or code snippets from the raw OCR output, ignore noise and irrelevant fragments, format code properly if present, and present cleaned-up text in a readable form.



We include the extracted OCR text in the prompt so the language model knows what to work with, and we specify that the output should include only the extracted content, with no additional comments or explanations.


Part 3: Using the Language Model to Clean Up Text and Returning Results


We can now call the LLM. This snippet picks up inside run_ocr’s try block from Part 1, using the client and model_name passed into the function (the messages.create call matches the Anthropic SDK, so client here would be something like anthropic.Anthropic()).

        message = client.messages.create(
            max_tokens=1024,
            messages=[
                {
                    "role": "user",
                    "content": prompt,
                }
            ],
            model=model_name,
        )

        extracted_content = message.content[0].text.strip()

        return {
            "raw_ocr": ocr_text,
            "extracted_content": extracted_content
        }
    except subprocess.CalledProcessError as e:
        print(f"Error running OCR on {image_path}: {e}")
        return None


Which will return:

"ocr_text": {  
    "raw_ocr": "lana 3.1 bb Home( Open Web u https: / / ella ma. com/ library/ llama 3.1 bb 88888888888888888888 888888888888888888  Blog Disc or d G it Hub Q Search models Models Sign in Download oll on a run l loma 3.1 bb > > > tell me a do d joke about computers Here' s one:  Why did the computer goto the doctor?  Because it had a virus!  Get it?  > > > Hello my name is Dan Nice to meet you, Dan. So, I' ve got a bit of a\u201c byte\u201d of info on you now. . . ( sorry, couldn' t resist another dod joke)  > > > What is my name?  Your name is Dan! Me' ve already established that!  > > > Send a message. ( ? for help)  Read me Meta Llama 3.1 405 B",  
    "extracted_content": "Q: Why did the computer go to the doctor?\nA: Because it had a virus!\n\nHello my name is Dan\n\nYour name is Dan! We've already established that!"  
  }


In the final part, we refine the extracted OCR text with the LLM and return both the raw OCR output and the cleaned-up content.
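To tie it all together, here’s a hedged usage sketch of the full pipeline. The Anthropic client, the model id, and the image path are all assumptions for illustration, so substitute your own:

# Hypothetical end-to-end usage of run_ocr from Parts 1-3.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
result = run_ocr(
    client,
    model_name="claude-3-5-sonnet-latest",  # assumed model id; use any Claude model you have access to
    image_path="/path/to/GOT-OCR2.0/GOT-OCR-2.0-master/images/screenshot.png",  # hypothetical image
)

if result:
    print(result["extracted_content"])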


The result is close to what we want, and to truly understand its capabilities, you need to run experiments on your own data.


And let’s not forget the community and ongoing developments.


As more engineers and researchers adopt and contribute to GOT-OCR2.0, we’ll see continuous improvements and innovative features that keep pushing the boundaries of what’s possible with OCR technology.


Feel free to reach out if you hit any snags or have cool ideas to share.


Bonus Content: Building with AI


And don’t forget to have a look at some practitioner resources that we published recently:


Llama 3.2-Vision for High-Precision OCR with Ollama

LitServe: FastAPI on Steroids for Serving AI Models — Tutorial with Llama 3.2 Vision

Run FLUX Models Locally on Your Mac!

GOT-OCR2.0 in Action: Optical Character Recognition Applications and Code Examples


Thank you for stopping by, and being an integral part of our community.


Happy building!