
Qwen2.5-Coder, Cosmos Tokenizer, OpenCoder, and New SentenceTransformers: Great Times for Open Source

2024-11-13


I want to highlight some standout open-source advancements that have really caught my eye:

Qwen2.5-Coder: a state-of-the-art open code LLM series from Alibaba Cloud that rivals GPT-4
Cosmos Tokenizer: NVIDIA's neural tokenizers for efficient image and video compression
OpenCoder: a fully open-source code LLM family trained on 2.5 trillion tokens
SentenceTransformers: up to a 4x CPU inference speedup in the latest release

Let's dive in!


Qwen2.5-Coder Series: Open-Sourcing a SOTA Code LLM Rivaling GPT-4


Alibaba Cloud announced the open-source release of the Qwen2.5-Coder series, a family of models billed as "Powerful, Diverse, and Practical" and dedicated to propelling the evolution of open code large language models (LLMs).


The flagship model, Qwen2.5-Coder-32B-Instruct, sets a new benchmark as the state-of-the-art (SOTA) open-source code model, matching the coding capabilities of GPT-4. It excels in general-purpose and mathematical reasoning.



Expanding upon previous releases of 1.5B and 7B models, they introduced four additional model sizes: 0.5B, 3B, 14B, and 32B. Qwen2.5-Coder now accommodates a wide spectrum of developer requirements, covering six mainstream model sizes.


They have also explored the applicability of Qwen2.5-Coder in real-world scenarios, including code assistants and artifact generation.


Practical examples highlight the model’s potential in enhancing developer productivity and code quality.
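If you want to kick the tires locally, here is a minimal sketch using Hugging Face transformers. The model ID matches the Hub release, but the prompt and generation settings are purely illustrative, and the 32B checkpoint needs substantial GPU memory (the smaller sizes follow the same pattern).

```python
# Minimal sketch: chat with Qwen2.5-Coder-32B-Instruct via Hugging Face transformers.
# Swap in a smaller checkpoint (e.g. Qwen/Qwen2.5-Coder-7B-Instruct) for modest hardware.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2.5-Coder-32B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

messages = [
    {"role": "system", "content": "You are a helpful coding assistant."},
    {"role": "user", "content": "Write a Python function that checks whether a string is a palindrome."},
]

# Build the chat prompt and generate a completion.
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output_ids = model.generate(inputs, max_new_tokens=256)

# Strip the prompt tokens before decoding the model's answer.
print(tokenizer.decode(output_ids[0][inputs.shape[-1]:], skip_special_tokens=True))
```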


Benchmark Achievements






You can find more details, including the full benchmark results, on GitHub.


Cosmos Tokenizer: Advanced Neural Tokenizers for Efficient Image and Video Compression


NVIDIA's Cosmos Tokenizer is a comprehensive suite of neural tokenizers designed for both images and videos.


You can now convert raw visual data into efficient, compressed representations.



By discovering latent spaces through unsupervised learning, these tokenizers facilitate large-scale model training and reduce computational demands during inference.
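Here is a rough sketch of what encoding a short clip looks like, based on my reading of the NVIDIA/Cosmos-Tokenizer README. The class names, checkpoint layout, and the DV4x8x8 variant name are assumptions on my part, so check the repository for the exact interface before running this.

```python
# Rough sketch (assumed interface from the NVIDIA/Cosmos-Tokenizer README):
# encode a short video clip into discrete tokens, then decode it back to pixels.
import torch
from cosmos_tokenizer.video_lib import CausalVideoTokenizer

# Assumed variant name: discrete video tokenizer with 4x temporal / 8x8 spatial compression.
model_name = "Cosmos-Tokenizer-DV4x8x8"

# Checkpoints are assumed to be TorchScript files downloaded from the official release.
encoder = CausalVideoTokenizer(checkpoint_enc=f"pretrained_ckpts/{model_name}/encoder.jit")
decoder = CausalVideoTokenizer(checkpoint_dec=f"pretrained_ckpts/{model_name}/decoder.jit")

# Dummy clip: batch of 1, RGB, 9 frames, 512x512 pixels.
clip = torch.randn(1, 3, 9, 512, 512, dtype=torch.bfloat16, device="cuda")

indices, codes = encoder.encode(clip)     # compressed discrete token indices
reconstruction = decoder.decode(indices)  # back to pixel space
print(indices.shape, reconstruction.shape)
```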


Types of Tokenizers:

Continuous tokenizers (Cosmos-Tokenizer-CI for images, Cosmos-Tokenizer-CV for videos) encode inputs into continuous latent embeddings, suited to diffusion-style generative models.
Discrete tokenizers (Cosmos-Tokenizer-DI for images, Cosmos-Tokenizer-DV for videos) encode inputs into discrete token indices, suited to autoregressive transformers.

Key Features:




Performance Highlights:




Evaluation and Resources:



More information is available in NVIDIA's official blog post.




OpenCoder: A Fully Open-Source Code LLM Trained on 2.5T Tokens


OpenCoder introduces a new family of open-source code language models, including base and chat models at 1.5B and 8B parameter scales.


Supporting both English and Chinese languages, OpenCoder is trained from scratch on an extensive dataset of 2.5 trillion tokens, comprising 90% raw code and 10% code-related web data.


The model reaches performance levels comparable to leading code LLMs.
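A minimal sketch for trying the 8B chat model with transformers is below. I believe the checkpoints are published under the infly organization on Hugging Face, but double-check the model ID against the official announcement before running it.

```python
# Minimal sketch: prompt the OpenCoder 8B chat model via Hugging Face transformers.
# The Hub ID "infly/OpenCoder-8B-Instruct" is my best recollection of the release name.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "infly/OpenCoder-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto", trust_remote_code=True
)

messages = [{"role": "user", "content": "Write a quicksort implementation in Python."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(output_ids[0][inputs.shape[-1]:], skip_special_tokens=True))
```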



Key Contributions:



More information is available in the official announcement.


SentenceTransformers Accelerates CPU Inference with 4x Speed Boost


The latest release of SentenceTransformers introduces significant performance enhancements, delivering up to a 4x speedup on CPU inference using OpenVINO’s int8 static quantization.


This update optimizes both training and inference workflows for developers working with large-scale natural language processing tasks.
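Here is a small sketch of what the new OpenVINO backend looks like in practice. The backend="openvino" argument is the main piece from the release; the model choice and sentences are just placeholders, and the release notes also describe exporting and loading int8 statically quantized OpenVINO models, which I have not shown here.

```python
# Small sketch: load a Sentence Transformers model with the OpenVINO backend
# for faster CPU inference (requires the openvino extra: pip install "sentence-transformers[openvino]").
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2", backend="openvino")

sentences = [
    "Open-source models are improving quickly.",
    "CPU inference can be surprisingly fast with quantization.",
]
embeddings = model.encode(sentences)
print(embeddings.shape)  # (2, 384) for this model
```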



Key Enhancements:



You can find more info on GitHub.


Bonus Content: Building with AI


And don’t forget to have a look at some practitioner resources that we published recently:


Llama 3.2-Vision for High-Precision OCR with Ollama

LitServe: FastAPI on Steroids for Serving AI Models — Tutorial with Llama 3.2 Vision

Fine Tuning FLUX: Personalize AI Image Models on Minimal Data for Custom Look and Feel



Thank you for stopping by and for being an integral part of our community.


Happy building!