Getting Started with GLM-4.6V

In this article, we will cover the GLM-4.6V Vision Language Model. The GLM-4.6V and GLM-4.6V-Flash are the two latest models in the GLM Vision family by z.ai. Here, we will discuss the capabilities of the models and carry out inference for various tasks using the Hugging Face Transformers library.

Figure 1. Image-to-HTML demo using GLM-4.6V-Flash.

Both of these GLM Vision models, although smaller than some of their counterparts such as Qwen3-VL and Step3, deliver strong performance. We will uncover some of these details by discussing the important points from the official technical post.

We will cover the following points while discussing GLM-4.6V:

  • What are the different model sizes in the latest GLM vision family?
  • What makes these models stand out from similar models?
  • How to run inference with Hugging Face Transformers using GLM-4.6V?
  • Creating a Gradio application with GLM-4.6V for OCR and image-to-HTML use cases.

What is GLM-4.6V and Why is it Unique?

In this section, we summarize the most important points from the official technical post about the model.

The GLM-4.6V series of models is the latest in the multimodal model series from z.ai. They move beyond simple image captioning and VQA tasks, towards fully agentic and tool calling capabilities.

The GLM-4.6V series contains two models:

  • GLM-4.6V (106B): The foundation model, aimed at the best response quality and meant to be deployed on cloud GPUs and clusters.
  • GLM-4.6V-Flash (9B): The lightweight model, which is meant for local deployment.

Going further, in this article, we will focus on the smaller version of the model to create a local VLM application.

Both models have a 128K context length, enabling them to process long documents and hold lengthy chats with users.

Native Multimodal Tool Use

The models support native multimodal tool calling. This means that we do not need a separate LLM plus other vision models in the loop to achieve a multimodal workflow. This reduces friction, as the same model can produce text output, handle image/video/document input, and call tools whenever necessary.

The GLM-4.6V models can handle images, screenshots, and document pages. No need to convert them to text first. The models can also understand the outputs from the tools, such as rendered webpages, search results, and statistical charts.

This closes the loop: instead of chaining multiple models, a single model handles everything from perception to reasoning and execution.

Figure 2. GLM-4.6V is capable of multiple tool calling for visual understanding.

This is one step closer to a multimodal agentic workflow that we can run today, locally.

Core Capabilities

The technical post by the authors lays out several strengths of the models.

Rich Text Understanding

The GLM-4.6V models can accept research papers, reports, and slide decks as inputs to generate structured output.

Figure 3. Rich text understanding with GLM-4.6V.

It can understand tables, charts, and images in the documents. Furthermore, it can call tools internally to crop images and carry out a visual audit that checks the relevance of the components.

Visual Web Search

The models also support end-to-end multimodal web search, which allows them to:

  • Carry out intent recognition and search planning
  • Comprehend the mix of text and visual input returned by web search tools

Figure 4. Visual web search with GLM-4.6V.

Front End Replication + Visual Interaction

The GLM-4.6V models are adept at generating HTML mockups from screenshots of web pages. This can speed up the replication of UI elements. They can also generate high-quality CSS and JS code while accepting natural language edits on screenshots.

Figure 5. Front-end replication with GLM-4.6V.

Long Context Understanding

The 128K context length of the models allows them to process huge documents, say, a 150-page PDF, a 200-page slide deck, or up to 1 hour of video.

Figure 6. Long context understanding with GLM-4.6V.

This opens up broad possibilities for teaching, research, and auditing applications.
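As a rough sanity check on these numbers (the per-page token budget is our own arithmetic, not a figure from the post), spreading the 128K context across a 150-page PDF leaves about 850 tokens per page, which is a plausible range for dense text:

```python
context_tokens = 128_000  # the models' context length
pdf_pages = 150           # document size cited in the post
tokens_per_page = context_tokens / pdf_pages
print(round(tokens_per_page))  # → 853
```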

GLM-4.6V Benchmarks

GLM-4.6V has been evaluated across 20+ multimodal benchmarks, including:

  • MMBench
  • MathVista
  • OCRBench

It achieves state-of-the-art performance among open-source models of comparable scale in multimodal reasoning, logical reasoning, and long-context understanding.

Figure 7. GLM-4.6V benchmarks.

Under-the-Hood Improvements

Several technical choices enable these capabilities:

  • Continual pretraining on large-scale long-context image-text data
  • Visual-language compression alignment (inspired by Glyph)
  • Billion-scale multimodal world knowledge datasets
  • Agentic synthetic data training
  • Reinforcement Learning for multimodal tool invocation
  • A “Visual Feedback Loop” allowing self-correction in UI/code scenarios

These improvements push GLM-4.6V toward robust multimodal agent behavior rather than isolated task execution.

From the next section onward, we will jump into the inference using GLM-4.6V.

Directory Structure

The following is the directory structure we have for the inference experiments.

.
├── html_code
│   ├── generated_20260221_151716.html
│   └── generated_20260221_205900.html
├── input
│   ├── image_1.png
│   ├── image_2.png
│   └── image_2_resized.png
├── app.py
├── glm_v_infer.py
└── requirements.txt
  • We have a CLI executable script, glm_v_infer.py, and a Gradio application, app.py.
  • The input directory contains the images we can use to experiment with our GLM-4.6V inference pipeline.
  • The Gradio application supports an “image-to-HTML” mode. For this, all the generated HTML code is stored in the html_code directory.

All the code and requirements files are available for download in the form of a zip file.

Download Code

Installing Dependencies

We can install the necessary libraries and frameworks using the requirements file.

pip install -r requirements.txt
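The exact contents of requirements.txt ship with the downloadable zip. Based on the imports used later in this article, a minimal set would look roughly like the following (the precise packages and pins in the zip may differ):

```
transformers
torch
accelerate
bitsandbytes
gradio
```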

Inference Using GLM-4.6V

Let’s jump into the codebase now. We will cover the glm_v_infer.py script in detail here. The app.py containing the code for the Gradio application is mostly an extension of the logic from the former, with a few additional components.

The code discussed below is present in the glm_v_infer.py file.

Import Statements

The first code block covers all the imports that we need.

from transformers import (
    AutoProcessor, 
    Glm4vForConditionalGeneration,
    BitsAndBytesConfig,
    TextIteratorStreamer
)
from threading import Thread

import torch
import argparse

We are importing BitsAndBytesConfig because we provide the option to load the model in either INT4 or FP16 format. TextIteratorStreamer and Thread will help us manage the streaming text output.

Helper Functions to Load Model and Manage Prompts

The following code block contains the functions to load the model and manage the prompt messages for the model.

def get_quant_cofig():
    quantization_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.bfloat16,
        bnb_4bit_quant_type='nf4',
        bnb_4bit_use_double_quant=True,
    )

    return quantization_config

def load_model(model_path, stream=False, quantization_config=None):
    processor = AutoProcessor.from_pretrained(model_path)

    model = Glm4vForConditionalGeneration.from_pretrained(
        pretrained_model_name_or_path=model_path,
        quantization_config=quantization_config,
        # torch_dtype='auto',
        device_map='auto'
    )

    if stream:
        streamer = TextIteratorStreamer(
                processor.tokenizer, 
                skip_prompt=True, 
                skip_special_tokens=False
            )
    else:
        streamer = None

    return processor, model, streamer

def get_prompt(image_path, user_input):
    messages = [
        {
            'role': 'user',
            'content': [
                {
                    'type': 'image',
                    'url': image_path
                },
                {
                    'type': 'text',
                    'text': user_input
                }
            ],
        }
    ]

    return messages

The model gets loaded in INT4 format when a quantization_config is passed (i.e., it is not None). Also, a command line argument controls whether we initialize the text streamer or not.

The get_prompt function appends the user prompt and the image to the messages list. We can pass these as command line arguments.

Functions to Run Inference

We have two inference functions, one for non-streaming output and another for streaming output.

def run_inference(processor, model, image_path, user_input):
    messages = get_prompt(image_path, user_input)

    inputs = processor.apply_chat_template(
        messages,
        tokenize=True,
        add_generation_prompt=True,
        return_dict=True,
        return_tensors='pt'
    ).to(model.device)

    inputs.pop('token_type_ids', None)
    
    generated_ids = model.generate(
        **inputs, 
        max_new_tokens=8192
    )

    output_text = processor.decode(
        generated_ids[0][inputs['input_ids'].shape[1]:], 
        skip_special_tokens=False
        )
    return output_text

def run_inference_stream(processor, model, image_path, user_input, streamer=None):
    messages = get_prompt(image_path, user_input)

    inputs = processor.apply_chat_template(
        messages,
        tokenize=True,
        add_generation_prompt=True,
        return_dict=True,
        return_tensors='pt'
    ).to(model.device)

    inputs.pop('token_type_ids', None)
    
    generate_kwargs = dict(
        **inputs,
        streamer=streamer,
        max_new_tokens=8192,
    )   

    thread = Thread(target=model.generate, kwargs=generate_kwargs)
    thread.start()

    for new_text in streamer:
        yield new_text

In run_inference_stream, model.generate runs in a background thread while the streamer yields text chunks as they arrive. The non-streaming path remains useful in environments where streaming output is not supported.
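The thread-plus-streamer pattern used in run_inference_stream can be illustrated with plain Python. ToyStreamer and produce_tokens below are hypothetical stand-ins for TextIteratorStreamer and model.generate respectively, not part of Transformers:

```python
from queue import Queue
from threading import Thread

SENTINEL = object()

class ToyStreamer:
    """Hypothetical stand-in for TextIteratorStreamer: a queue-backed iterator."""
    def __init__(self):
        self.queue = Queue()

    def put(self, text):
        self.queue.put(text)

    def end(self):
        self.queue.put(SENTINEL)

    def __iter__(self):
        while True:
            item = self.queue.get()  # blocks until the producer pushes a chunk
            if item is SENTINEL:
                break
            yield item

def produce_tokens(streamer):
    # Stands in for model.generate(streamer=streamer, ...): it pushes decoded
    # chunks into the streamer as they are produced, then signals completion.
    for chunk in ["GLM", "-4.6V", " streams", " text"]:
        streamer.put(chunk)
    streamer.end()

streamer = ToyStreamer()
thread = Thread(target=produce_tokens, kwargs={"streamer": streamer})
thread.start()

pieces = [chunk for chunk in streamer]  # the main thread consumes as chunks arrive
thread.join()
print("".join(pieces))  # → GLM-4.6V streams text
```

This is exactly why generate must run on a separate thread: iterating the streamer on the main thread would otherwise deadlock, since nothing would be producing chunks.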

The Main Block

Finally, the main code block.

if __name__ == '__main__':
    MODEL_PATH = 'zai-org/GLM-4.6V-Flash'

    parser = argparse.ArgumentParser(description='GLM-4.6V-Flash Inference')
    parser.add_argument(
        '--model_path', 
        type=str, 
        default=MODEL_PATH, 
        help='Path/Hugging Face model identifier for GLM-4.6V-Flash'
    )
    parser.add_argument(
        '--int4', 
        action='store_true', 
        help='Use 4-bit quantization for inference' 
    )
    parser.add_argument(
        '--input',
        type=str,
        default='input/image_1.png',
        help='Path to the input image for inference'
    )
    parser.add_argument(
        '--prompt',
        type=str,
        default='describe this image',
        help='User input for inference'
    )
    parser.add_argument(
        '--stream',
        dest='stream',
        action='store_true',
        help='Enable streaming output'
    )

    args = parser.parse_args()
    
    # Load model.
    quantization_config = get_quant_cofig() if args.int4 else None

    processor, model, streamer = load_model(
        args.model_path, 
        stream=args.stream, 
        quantization_config=quantization_config
    )

    # Run inference.
    if not args.stream:
        output = run_inference(
            processor, 
            model, 
            args.input, 
            args.prompt, 
        )
        print(output)
    else:
        output = run_inference_stream(
            processor, 
            model, 
            args.input, 
            args.prompt, 
            streamer
        )
        for new_text in output:
            print(new_text, end='', flush=True)

We are specifically using the GLM-4.6V-Flash model here for inference. When loaded in quantized format, it can run within a 10GB VRAM budget.
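A rough back-of-the-envelope check makes the 10GB figure plausible (this is our own estimate, not an official number): NF4 stores weights at roughly 4 bits per parameter, so a 9B-parameter model needs about 4.5 GB for weights alone, leaving headroom for activations and the KV cache.

```python
params = 9e9           # GLM-4.6V-Flash has roughly 9 billion parameters
bytes_per_param = 0.5  # NF4: ~4 bits per weight (quantization overhead ignored)
weight_gb = params * bytes_per_param / 1e9
print(f"~{weight_gb:.1f} GB for weights alone")  # → ~4.5 GB for weights alone
```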

Depending on whether the user passes the --stream argument or not, we have two different print logics at the end.

Running the CLI Inference with GLM-4.6V-Flash

The following experiments were run on an RTX 3080 GPU with 10GB of VRAM.

To use the default image and prompt, we can simply execute the following command for streaming output with INT4 quantization.

python glm_v_infer.py --stream --int4

By default, the model just describes the image.

We can choose a different image and a custom prompt as well.

python glm_v_infer.py --stream --int4 --input input/image_2.png --prompt "describe the image and what does it show"
Figure 8. CLI streaming output for image description with GLM-4.6V.

If you have access to ~20GB VRAM, you can also run the model in full precision.

Running the Gradio Application

The Gradio application (code contained in app.py) is an extension of the CLI script. There are a few additional components, such as the following:

  • There are three tabs – image description, OCR, and image-to-HTML.
  • The OCR tab contains a default prompt for giving the OCR text only.
  • Similarly, the Image to HTML tab contains a default prompt. The script contains logic to extract the HTML code, which is then saved in the html_code directory.
  • Finally, there is a toggle to hide/show the thinking text.
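The HTML-extraction step mentioned above might look like the following sketch (the actual logic in app.py may differ), pulling the first fenced html block out of the model's response:

```python
import re

# GLM-4.6V typically wraps generated code in a fenced html block; the fence
# marker is built programmatically so this snippet renders cleanly here.
fence = chr(96) * 3  # three backticks

def extract_html(response: str) -> str:
    """Return the contents of the first fenced html block, if present;
    otherwise fall back to the raw response."""
    pattern = fence + r"html\s*(.*?)" + fence
    match = re.search(pattern, response, flags=re.DOTALL)
    return match.group(1).strip() if match else response.strip()

response = f"Here is the page:\n{fence}html\n<html><body>Hi</body></html>\n{fence}"
print(extract_html(response))  # → <html><body>Hi</body></html>
```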

Execute the following in the terminal and open the http://127.0.0.1:7860/ URL in the browser.

python app.py

The following video shows the OCR output using the GLM-4.6V Gradio application.

Video 1. Image to OCR with Gradio application using GLM-4.6V.

The next video shows the image-to-HTML functionality.

Video 2. Image to HTML code with Gradio application using GLM-4.6V.

We can achieve better results by running the model in FP16/BF16. Also, you can load the larger model from the dropdown if you have enough VRAM.

Also, note that we are not strictly analyzing the results of the model here, but rather creating a simple workflow for different experiments.

Summary and Conclusion

In this article, we covered the GLM-4.6V series of models. We discussed the important concepts from the technical blog post, the special features, and the benchmarks. Next, we created a simple inference workflow with GLM-4.6V-Flash along with a Gradio application. In the next article, we will focus on creating more advanced vision-reasoning agentic workflows and fine-tuning the model.

If you have any questions, thoughts, or suggestions, please leave them in the comment section. I will surely address them.

You can contact me using the Contact section. You can also find me on LinkedIn, and X.
