Video Summarizer Using Qwen2.5-Omni

Qwen2.5-Omni is an end-to-end multimodal model. It can accept text, images, videos, and audio as input while generating text and natural speech as output. Given its strong capabilities, we will build a simple video summarizer using Qwen2.5-Omni 3B. We will use the model from Hugging Face and build the UI with Gradio.

Jump to Download Code

Figure 1. Video summarization demo using Qwen2.5-Omni 3B.

This is going to be a small code-focused article. We will entirely focus on building the simple application while keeping aside the conceptual parts.

You can refer to the introductory article on Qwen2.5-Omni to know more about it.

What will we cover while building the video summarizer application using Qwen2.5-Omni 3B?

Setting up the environment and installing the necessary packages.
Coding the video summarizer application. This includes the approach and logic to handle long videos.
Running experiments using the built application.

Project Directory Structure

Following is the directory structure that we follow.

├── input
│   ├── video_1.mp4
│   └── video_2.mp4
└── video_summarizer.py

We have a single script, video_summarizer.py which contains all the code.
The input directory contains the videos for which we will generate the summary.

Download Code

Installing the Requirements

We need the framework and specific library related installations. Along with that, we also need to install Flash Attention 2, which is necessary to use the speedups and GPU memory saving techniques in Ampere GPUs and above.

It is recommended to install these requirements in a new Anaconda/Miniconda environment.

PyTorch with CUDA 12.4

pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu124

Hugging Face Transformers, Accelerate, and BitsAndBytes

pip install -U transformers accelerate bitsandbytes

OpenCV and Qwen Utils for Image/Video Processing

pip install -U opencv-python qwen-omni-utils[decord]

Gradio for UI

pip install gradio

Flash-Attention

pip install flash-attn --no-build-isolation

Coding the Video Summarizer Application

Our application will take a video uploaded by the user, process it in segments (chunks), and generate an overall summary. The UI will provide real-time feedback on the progress and display the results.

All the experiments shown in this article were run on a 10GB RTX 3080 GPU.

Overall Approach

Let’s outline the overall approach that we will follow while creating the Qwen2.5-Omni 3B video summarizer application.

Configuration and Model Loading

First, we define constants for the model ID (Qwen/Qwen2.5-Omni-3B) and flags for enabling Flash Attention 2 and 4-bit quantization.
Next, we load the Qwen2.5-Omni 3B model and its processor from Hugging Face. We apply the specified optimizations (4-bit quantization via BitsAndBytesConfig and Flash Attention 2 via attn_implementation). This step is crucial for running a relatively large model on consumer hardware.

System Prompts

We define two system prompts: one (SYSTEM_PROMPT_ANALYTICS) to guide the model in analyzing individual video chunks, and another (SYSTEM_PROMPT_SUMMARY) to guide it in generating the final summary from the collected chunk analyses.

Video Chunking (save_video_chunk_for_model)

Figure 2. Video upload space, time chunking, and analyze button in the UI.

Processing longer chunks of video at once is GPU intensive (check the Qwen Omni requirements). To mitigate this, we will break the input video into smaller chunks.
A helper function, save_video_chunk_for_model, will use OpenCV (cv2) to extract a segment of the video (defined by start and end frames) and save it as a temporary MP4 file. We will then feed this temporary file to the Qwen model.

Main Analysis Logic (analyze_video)

Figure 3. Log accumulation text bot in the Qwen2.5-Omni 3B video summarization UI.

This is the core Gradio event handler function, implemented as a Python generator (yield) to stream updates to the UI.
State Reset: Crucially, at the beginning of each call, it resets key state variables to ensure that subsequent analyses of different videos don’t interfere with each other.
Input Validation: Checks if a video is uploaded and if the chosen chunk duration is within a reasonable range.
Video Processing Loop:
- Opens the video file using cv2.VideoCapture.
- Calculates the number of chunks based on the video’s FPS and the user-defined chunk duration.
- Iterates through each chunk:
  - Updates a progress text in the UI showing the current chunk being processed and the overall percentage complete.
  - Saves the current video chunk to a temporary file using save_video_chunk_for_model.
  - Constructs a multimodal conversation input for Qwen, including the video chunk and a text prompt (using SYSTEM_PROMPT_ANALYTICS) asking it to describe the segment.
  - Uses qwen_omni_utils.process_mm_info and the processor to prepare the inputs for the model.
  - Calls model.generate() to get the textual analysis for the current chunk.
  - Appends this chunk’s analysis to an accumulating log (current_accumulated_log) which is displayed in one of the UI textboxes.
  - Stores the clean analysis for later use in the final summary.
  - Deletes the temporary video chunk file.

Final Summary Generation

Figure 4. Finally summary text box in the video summarizer UI.

After all chunks are processed, the collected individual analyses are concatenated.
This concatenated text is then used to form a new conversation input for Qwen, this time with SYSTEM_PROMPT_SUMMARY, asking for an in-depth overall summary.
To stream the final summary token by token, transformers.TextIteratorStreamer is used.

Error Handling and Cleanup

A try…except…finally block ensures that any errors during processing are caught and reported in the UI, and that temporary files and directories are cleaned up. It also attempts to clear the PyTorch CUDA cache to free up GPU memory for subsequent runs.

Gradio User Interface

gr.Textbox (Accumulated Chunk Analysis Log): This displays a running log of the analysis for each processed chunk and appends new information as it becomes available.
gr.Textbox (Overall Progress & Status): This shows text updates like “Processing Chunk X/Y (Z%)”, “Generating Final Summary…”, or “Analysis Complete!”.
gr.Textbox (Final In-depth Summary): For streaming the final summary generated by the model, token by token.
The analyze_btn_main.click event links the button to the analyze_video function, specifying the input and output components.
demo.queue().launch() is used to start the Gradio app, enabling proper handling of the generator function and streaming updates.

Explanation of Key Code Sections

Let’s focus on the key code sections.

Model Loading and Configuration

The initial part of the script handles loading the Qwen/Qwen2.5-Omni-3B model.

# Configuration.
MODEL_ID = 'Qwen/Qwen2.5-Omni-3B'
USE_FLASH_ATTENTION_2 = True
USE_4BIT_QUANTIZATION = True

# Model loading.
print('Loading model... This might take a while.')
quantization_config = None
if USE_4BIT_QUANTIZATION:
    quantization_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type='nf4',
        bnb_4bit_compute_dtype=(
            torch.bfloat16 if torch.cuda.is_bf16_supported()
            else torch.float16
        ),
        bnb_4bit_use_double_quant=True,
    )
model_kwargs = {'device_map': 'auto'}
if quantization_config:
    model_kwargs['quantization_config'] = quantization_config
if USE_FLASH_ATTENTION_2:
    if torch.cuda.is_bf16_supported():
        model_kwargs['torch_dtype'] = torch.bfloat16
    else:
        model_kwargs['torch_dtype'] = torch.float16
    model_kwargs['attn_implementation'] = 'flash_attention_2'
else:
    model_kwargs['torch_dtype'] = 'auto'

try:
    model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
        MODEL_ID, **model_kwargs
    )
    processor = Qwen2_5OmniProcessor.from_pretrained(MODEL_ID)
    print('Model loaded successfully.')
except Exception as e:
    print(f"Error loading model: {e}. Exiting.")
    exit()

Here we apply 4-bit quantization using BitsAndBytesConfig and enable Flash Attention 2 by setting attn_implementation='flash_attention_2' and the appropriate torch_dtype. These are crucial for making such a model runnable on systems with limited VRAM.

The Main Analysis Function: analyze_video

The analyze_video gets triggered when the user clicks the “Analyze Video!” button.

def analyze_video(
    video_path_main,
    chunk_duration_ui,
    use_audio_in_video_flag
):
    # Explicit State Reset for each call.
    run_temp_dir = None
    cap = None
    all_individual_analyses_for_summary = []
    current_accumulated_log = ''
    # End State Reset.

    # Initial UI clear and status
    yield {
        accumulated_chunk_text_out: 'Awaiting video upload...',
        # ... other initial yields ...
    }
    # ... (input validation) ...
    try:
        run_temp_dir = tempfile.mkdtemp(prefix='video_run_')
        # ... (video capture and chunking loop) ...
        for i in range(num_iterations):
            # ... (save chunk, prepare inputs) ...
            
            # Process chunk with Qwen (Chunk Analysis).
            # ... (conversation_chunk, inputs_chunk setup) ...
            generated_ids_chunk = model.generate(
                **inputs_chunk, return_audio=False, # ...
            )
            # ... (decode response_text_for_chunk) ...
            current_accumulated_log += response_text_for_chunk + '\n'
            all_individual_analyses_for_summary.append(
                # ... formatted chunk analysis ...
            )
            yield {accumulated_chunk_text_out: current_accumulated_log}
            # ... (cleanup chunk file) ...
        
        # ... (after loop, prepare for final summary) ...

        if not all_individual_analyses_for_summary:
            # ... (handle no content case) ...
        else:
            # TextIteratorStreamer for Final Summary.
            # ... (setup inputs_for_summary, streamer, generation_kwargs) ...
            thread = threading.Thread(
                target=model.generate, kwargs=generation_kwargs
            )
            thread.start()
            generated_summary_text = ''
            for new_text_token in streamer:
                generated_summary_text += new_text_token
                yield {final_summary_out: generated_summary_text}
            thread.join()
            yield {final_summary_out: generated_summary_text} # Final assurance
    # ... (except and finally blocks) ...

Key aspects here are:

Generator Function: We use yield to make this a generator, allowing Gradio to receive multiple updates and stream them to the UI components.
State Reset: The variables at the top are reset each time, crucial for handling multiple video uploads correctly.
Looping through Chunks: The for i in range(num_iterations): loop processes the video segment by segment.
Updating accumulated_chunk_text_out: Inside the loop, after each chunk is analyzed, we append the chunk to current_accumulated_log, and yield this growing log to the UI.
Streaming Final Summary: We use TextIteratorStreamer and threading.Thread for the final summary. The for new_text_token in streamer: loop effectively pulls tokens from the background generation process and yields them to update the final_summary_out textbox in real-time.
finally Block: This ensures that resources like the video capture object (cap) are released and the temporary directory (run_temp_dir) is deleted, even if errors occur. The torch.cuda.empty_cache() call helps manage GPU memory between runs.

Gradio Interface Definition

We construct the UI using gr.Blocks.

# Gradio Interface.
orange_theme = gr.themes.Monochrome( # ... theme definition ... )

with gr.Blocks(theme=orange_theme) as demo:
    # ... (Markdown titles) ...
    with gr.Row(): # Input controls
        with gr.Column(scale=1):
            video_input_main = gr.Video(...)
            chunk_slider_main = gr.Slider(...)
            use_audio_main = gr.Checkbox(...)
            analyze_btn_main = gr.Button(...)
    with gr.Row(): # Output areas
        with gr.Column(scale=1): # Accumulated log
            gr.Markdown('### 📜 Accumulated Chunk Analysis Log')
            accumulated_chunk_text_out = gr.Textbox(...)
        with gr.Column(scale=1): # Progress and Final Summary
            gr.Markdown('### 📊 Overall Progress & Final Summary')
            overall_progress_text_out = gr.Textbox(...)
            final_summary_out = gr.Textbox(...)

    outputs_for_click = [ # List of components to update
        accumulated_chunk_text_out,
        overall_progress_text_out,
        final_summary_out
    ]
    analyze_btn_main.click(
        fn=analyze_video, # Changed function name to match code
        inputs=[video_input_main, chunk_slider_main, use_audio_main],
        outputs=outputs_for_click
    )

if __name__ == '__main__':
    # ... (launch app) ...

Layout: We use gr.Row and gr.Column to organize the input and output components.
outputs_for_click: This list explicitly tells Gradio which components the analyze_video function will be yielding updates to. The keys in the dictionaries yielded by analyze_video must correspond to these component objects.

This structure allows for a responsive UI that provides continuous feedback to the user while the video is being processed in the background.

Inference Experiments and Results for Qwen2.5-Omni 3B Video Summarizer

Let’s jump into creating video summaries using our video summarizer. We will test a few videos here and analyze how well our pipeline works.

The first video is a traffic intersection overview.

Video 1. Summarizing a traffic intersection scene using Qwen2.5-OMni 3B model.

Here, we choose a chunk duration of 5 seconds. You can choose a larger time chunk if you have more VRAM.

We can see that the video was divided into four parts. After accumulating the logs, the program gave a final summary, which is fairly accurate.

Let’s check the result on a bit longer 30-minute video.

Video 2. Summarizing an indor retail scene using the Qwen Omni model.

Initially, the model starts generating the chunk summaries correctly. However, around the third chunk, it starts telling these scenes are from the Minecraft video game. For some of the chunks later on, it goes on to describe them correctly. However, most of them are wrong, which makes the final summary partly wrong as well.

Note: Although we are chunking the videos into 5 second time duration, OOM (Out Of Memory) errors can still occur during the final summary generation. If the video is extremely long, say more than 2 minutes, then it will be around 100-170 chunks. All these accumulated chunk summaries will go the final summary generator, leading to a larger user input which can result in OOM.

Now, one final test. We will take an extremely simple video of two people walking and test our system.

Video 3. Using Qwen2.5-Omni 3B to summarize a video of two people walking in a snowy forest.

Interestingly, the result is mostly wrong in this case. The model hallucinates, concluding that these are corrupted pixels and only describes the snowy forest once. Pointing out the exact reason behind this is difficult. It might be that running the model in full precision (FP16/BF16) will give correct results. But I have not tested that at the moment.

Feel free to play around with your own videos and see where our system falters.

Improvements and Future Plans

We can make several improvements to this system. Instead of just a video summarizer, we can build a complete open-source video analytics platform like Azure Vision Studio powered by VLMs and speech synthesis.

Finding frames pertaining to a specific scenario/instance/incident that a user searches through natural language.
Adding time stamps to where these incidents occur.
Adding speech capabilities using Qwen2.5-Omni to use the full spectrum of features of the model.
Using the audio tracks present in videos to create more dense, in-depth summaries.
Finding the issues with current approach and why the model misunderstands the frames at times.

However, Gradio might pose some limitations in terms of UI for such an advanced application. It most probably will need a full-fledged UI.

Summary and Conclusion

In this article, we built a simple video summarizer using the Qwen2.5-Omni 3B model. It is a simple application, and we observed where the model performs well and where it fails. We also discussed some of the future improvements.

If you have any questions, thoughts, or suggestions, please leave them in the comment section. I will surely address them.

You can contact me using the Contact section. You can also find me on LinkedIn, and X.

Liked it? Take a second to support Sovit Ranjan Rath on Patreon!

Video Summarizer Using Qwen2.5-Omni

What will we cover while building the video summarizer application using Qwen2.5-Omni 3B?

Project Directory Structure

Download Code

Installing the Requirements

Coding the Video Summarizer Application

Overall Approach

Configuration and Model Loading

System Prompts

Video Chunking (save_video_chunk_for_model)

Main Analysis Logic (analyze_video)

Final Summary Generation

Error Handling and Cleanup

Gradio User Interface

Explanation of Key Code Sections

Model Loading and Configuration

The Main Analysis Function: analyze_video

Gradio Interface Definition

Inference Experiments and Results for Qwen2.5-Omni 3B Video Summarizer

Improvements and Future Plans

Summary and Conclusion

Leave a Reply Cancel reply