Qwen2.5-Omni: An Introduction

In this article, we explore Qwen2.5-Omni, a multimodal generative AI model that can accept text, image, video, and audio as inputs while outputting both text and audio. ...
In this article, we fine-tune the SmolVLM-256M model for receipt OCR on the SROIE v2 dataset, after generating the ground truth data with the QwenVL-2B model. ...
In this article, we explore Gemma 3. We start with the need for Gemma 3, its architecture and multimodal capabilities, and carry out inference using Hugging Face. ...
In this article, we cover the SmolVLM model by Hugging Face. It is a compact 2.2B parameter model for vision understanding. ...
In this article, we build a simple Gradio application with Qwen2.5-VL for image captioning, video captioning, and object detection. ...