Image-to-3D: Incremental Optimizations for VRAM, Multi-Mesh Output, and UI Improvements



This is the third article in the Image-to-3D series. In the first two, we covered image-to-mesh generation and then extended the pipeline to include texture generation. This article focuses on practical and incremental optimizations for image-to-3D. These include VRAM requirements, generating multiple meshes and textures from a single image using prompts, and minor yet meaningful UI improvements. None of these changes is huge on its own, but together they noticeably improve the workflow and user experience.

Multi-object image to 3D mesh + texture generation with optimizations.
Figure 1. Multi-object image to 3D mesh + texture generation with optimizations.

In the last article, we established that the entire pipeline requires ~29GB VRAM. That is a hefty requirement: anything above 24GB typically means renting a 32GB GPU in the cloud if you can’t run the pipeline locally. So, our aim is to bring the requirement under 24GB while still running all models on the GPU, accepting a slight increase in runtime as the trade-off.

What are we going to cover in image-to-3D with incremental improvements?

  • What changes do we need to make to bring down the VRAM usage?
  • How do we generate multiple 3D meshes and textures from a single image with prompts?
  • How do we tweak the Gradio UI to make the user experience better?
  • Covering the codebase.
  • Running a few inference experiments and analyzing the results.

Project Directory Structure

The following is the project directory structure.

├── bg_removed
│   ├── 20251130_211318
│   ...
│   └── 20251130_212839
├── BiRefNet
│   ├── evaluation
│   ├── models
│   ...
│   ├── sub.sh
│   ├── test.sh
│   ├── train.py
│   ├── train.sh
│   ├── train_test.sh
│   └── utils.py
├── birefnet_weights
│   └── BiRefNet-general-epoch_244.pth
├── build_setup
│   └── Hunyuan3D-2
├── cropped_images
│   ├── 20251130_211318
│   ...
│   └── 20251130_212839
├── input
│   ├── image_1.jpg
│   ...
│   └── image_5.jpg
├── outputs
│   ├── 20251130_211318
│   ...
│   └── 20251130_212839
├── hunyuan3d_final_req.txt
├── image_to_texture.py
├── README.md
└── setup.sh
  • The bg_removed, cropped_images, and outputs directories contain the cleaned images (images whose background is removed), the cropped objects after detection, and the output 3D .glb files, respectively. For each run, we create a timestamped subdirectory because there can be multiple objects, crops, and 3D meshes for each image. So, with this approach, we do not lose any data.
  • The BiRefNet directory is the cloned BiRefNet repository. The birefnet_weights directory contains the pretrained checkpoint that we will use for background removal.
  • All the experimental images are present in the input directory.
  • image_to_texture.py is the executable Python file that we will run to start the application.
  • All the setup is handled by setup.sh and hunyuan3d_final_req.txt. The setup.sh file creates the build_setup directory and installs everything necessary for the Hunyuan3D and BiRefNet models. The final, version-specific requirements are installed from hunyuan3d_final_req.txt. The entire setup process is automated via the shell script.

The executable Python file and input images are provided via a zip file in the download section. If you wish to run the code locally, please install the requirements by following the necessary steps.

Download Code

Setup and Installing Dependencies

Using a virtual environment with Python 3.10 is a hard requirement to set up everything successfully. Anaconda/Miniconda is recommended to create the virtual environment.

First, create a virtual environment with Python 3.10 and install PyTorch with CUDA. The code in this article used PyTorch 2.9.1. However, future versions should work without issues.

Second, run the setup.sh file.

sh setup.sh

This will install everything that is necessary.

Third, create the birefnet_weights directory, download the weights from here, and copy them into the directory.

This completes the entire setup.
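As a quick sanity check before launching the app, a short script like the following (hypothetical, not part of the download) can confirm that the expected files and directories are in place:

```python
import os

def check_setup(required, root='.'):
    """Return the required paths that are missing under root."""
    return [p for p in required if not os.path.exists(os.path.join(root, p))]

# Paths the pipeline expects after running setup.sh and downloading the weights.
required = [
    'BiRefNet',
    'birefnet_weights/BiRefNet-general-epoch_244.pth',
    'input',
    'image_to_texture.py',
]

missing = check_setup(required)
print('Setup looks complete.' if not missing else f'Missing: {missing}')
```

If anything is reported missing, re-run setup.sh or re-download the BiRefNet checkpoint before continuing.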

What are We Doing to Achieve Incremental Optimizations for Image-to-3D?

We are primarily targeting three optimization processes for the image-to-3D pipeline.

Image-to-3D pipeline with optimizations for VRAM reduction, multi-object generation, and UI improvements.
Figure 2. Image-to-3D pipeline with optimizations for VRAM reduction, multi-object generation, and UI improvements.

Reducing VRAM Requirements

The original code that we used in the previous image-to-texture generation article required ~29GB VRAM when everything was loaded onto the GPU. To tackle this, we use offloading: once a model finishes its stage of the pipeline, we delete it and release the CUDA memory and cache associated with it.

Most models require less than 6GB VRAM, with the exception of the texture generation model, which needs ~18.8GB VRAM. So, the entire pipeline can be run with a 20GB VRAM GPU. An RTX 4000 ADA is enough for this, which is easily accessible via platforms like Runpod.

Of course, this offloading technique increases the execution time of the entire pipeline. However, it also allows us to run the models with a cheaper system.
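The pattern itself is simple. The sketch below illustrates it with a stand-in object so it runs without a GPU; in the actual script the object is a PyTorch model and we additionally call torch.cuda.empty_cache() after garbage collection:

```python
import gc

class DummyModel:
    """Stand-in for a loaded pipeline component (e.g., the shape model)."""
    def run(self):
        return 'stage output'

# Load: the real script assigns the model to a module-level global.
model = DummyModel()
result = model.run()  # run this stage while the model is resident

# Unload: drop the reference and force a collection so the memory
# can actually be reclaimed before the next model loads.
del model
model = None
gc.collect()
# torch.cuda.empty_cache()  # in the real pipeline, return cached CUDA blocks

print(result)
```

Because only one large model is resident at a time, peak VRAM is set by the single largest component rather than by the sum of all of them.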

Multi-Mesh Generation from a Single Image with Prompts

The pipeline supports user prompts: a user can upload an image containing multiple objects and specify which object to generate a 3D mesh and texture for. Previously, however, only single-object generations were possible. We change that to handle almost as many objects as needed via prompting.

Now, users can provide the names of multiple objects and generate 3D meshes and textures for all of them. We will see this working in the demo section.

UI Improvements

We also include a few minor UI improvements. Up to eight 3D meshes are now visible in the UI. If fewer objects have meshes and textures generated, only that many viewers are populated. If more than eight are generated, the first eight are shown in the UI; all the resulting files are still stored in the timestamped output directories.

Code for Incremental Improvements for Image-to-3D Generation

Let’s jump into the code now.

The logical flow of the code remains the same. All the changes that we discussed above are structural. So, we don’t have to go through the entire explanation in depth.

If you wish to get a detailed introductory walkthrough of the code, please visit the image-to-3D mesh article. It covers the initial explanation of the code in detail.

The entire code is present in the image_to_texture.py file.

Import Statements, Handling Global Variables, and Argument Parsers

The following code block handles all the necessary imports, defines the global variables that we will need along the way, and sets up the argument parser for command line arguments.

import sys
import argparse
import gc
import torch
import ast
import os
import cv2
import shutil
import datetime
import gradio as gr
from PIL import Image
from torchvision import transforms

sys.path.append('BiRefNet')

from image_proc import refine_foreground
from models.birefnet import BiRefNet
from utils import check_state_dict

from transformers import Qwen3VLForConditionalGeneration, AutoProcessor
from hy3dgen.shapegen import (
    Hunyuan3DDiTFlowMatchingPipeline, 
    FaceReducer, 
    FloaterRemover, 
    DegenerateFaceRemover, 
    MeshlibCleaner
)
from hy3dgen.texgen import Hunyuan3DPaintPipeline

# Global variables for models to allow loading/unloading
qwen_model = None
qwen_processor = None
birefnet = None
pipeline_shape = None
pipeline_texture = None

def parse_args():
    parser = argparse.ArgumentParser(description="Image to 3D mesh + texture pipeline")
    parser.add_argument('--birefnet_device', type=str, default='cuda', choices=['cuda', 'cpu'], help='Device for BiRefNet')
    parser.add_argument('--qwen_device', type=str, default='cuda', choices=['cuda', 'cpu'], help='Device for Qwen3-VL')
    return parser.parse_args()

args = parse_args()

num_inference_steps = 50
seed = 42

Along with the imports, we also add the BiRefNet directory to the Python path to use the necessary modules.

The global variables for the several models allow us to load and unload each model to and from the GPU while maintaining a single global state.

We have two command line arguments as well. If necessary, we can load the BiRefNet and Qwen3-VL models on the CPU. However, most of the time, that might not be necessary as they require less than 5GB VRAM and also follow the loading/unloading protocol to save VRAM.

Other than that, we define the number of inference steps for the 3D mesh generation and a seed for reproducibility.
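To see how the two flags behave, here is a hypothetical re-creation of the parser with the arguments passed explicitly, equivalent to running python image_to_texture.py --qwen_device cpu:

```python
import argparse

# Mirrors parse_args() in image_to_texture.py.
parser = argparse.ArgumentParser(description='Image to 3D mesh + texture pipeline')
parser.add_argument('--birefnet_device', type=str, default='cuda', choices=['cuda', 'cpu'])
parser.add_argument('--qwen_device', type=str, default='cuda', choices=['cuda', 'cpu'])

# Equivalent to: python image_to_texture.py --qwen_device cpu
args = parser.parse_args(['--qwen_device', 'cpu'])
print(args.birefnet_device, args.qwen_device)  # cuda cpu
```

With no flags supplied, both models default to 'cuda' and rely on the loading/unloading protocol to keep VRAM in check.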

Loading and Unloading Models To and From the GPU

The next code block defines the functions to load and unload all the models to and from GPU memory.

def cleanup_memory():
    gc.collect()
    torch.cuda.empty_cache()

def load_qwen(device):
    global qwen_model, qwen_processor
    print(f"Loading Qwen3-VL on {device}...")
    model_id = 'Qwen/Qwen3-VL-2B-Instruct'
    qwen_model = Qwen3VLForConditionalGeneration.from_pretrained(
        model_id,
        dtype=torch.bfloat16,
        device_map=device
    )
    qwen_processor = AutoProcessor.from_pretrained(model_id)

def unload_qwen():
    global qwen_model, qwen_processor
    print("Unloading Qwen3-VL...")
    del qwen_model
    del qwen_processor
    qwen_model = None
    qwen_processor = None
    cleanup_memory()

def load_birefnet(device):
    global birefnet
    print(f"Loading BiRefNet on {device}...")
    model_name = 'BiRefNet'
    birefnet = BiRefNet(bb_pretrained=False)
    state_dict = torch.load(
        'birefnet_weights/BiRefNet-general-epoch_244.pth', 
        map_location=device
    )
    state_dict = check_state_dict(state_dict)
    birefnet.load_state_dict(state_dict)
    if device == 'cuda':
        torch.set_float32_matmul_precision(['high', 'highest'][0])
    birefnet.to(device)
    birefnet.eval()
    if device == 'cuda':
        birefnet.half()
    print('BiRefNet is ready to use.')

def unload_birefnet():
    global birefnet
    print("Unloading BiRefNet...")
    del birefnet
    birefnet = None
    cleanup_memory()

def load_hunyuan_shape():
    global pipeline_shape
    print("Loading Hunyuan3D Shape model on cuda...")
    pipeline_shape = Hunyuan3DDiTFlowMatchingPipeline.from_pretrained(
        'tencent/Hunyuan3D-2mini',
        subfolder='hunyuan3d-dit-v2-mini',
        use_safetensors=True,
        device='cuda'
    )

def unload_hunyuan_shape():
    global pipeline_shape
    print("Unloading Hunyuan3D Shape model...")
    del pipeline_shape
    pipeline_shape = None
    cleanup_memory()

def load_hunyuan_texture():
    global pipeline_texture
    print("Loading Hunyuan3D Texture model on cuda...")
    pipeline_texture = Hunyuan3DPaintPipeline.from_pretrained(
        'tencent/Hunyuan3D-2',
        device='cuda',
    )

def unload_hunyuan_texture():
    global pipeline_texture
    print("Unloading Hunyuan3D Texture model...")
    del pipeline_texture
    pipeline_texture = None
    cleanup_memory()

Each model has its own loading and unloading function. After the operation ends for a particular component in the pipeline, we unload the model from memory and call garbage collection to free up the CUDA memory.

Helper Functions for Image-to-3D Operations

There are several helper functions that we need along the way for the entire pipeline.

# BiRefNet image transforms.
def get_transform_image(model_name='BiRefNet'):
    return transforms.Compose([
        transforms.Resize((1024, 1024) if '_HR' not in model_name else (2048, 2048)),
        transforms.ToTensor(),
        transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
    ])

def qwen_object_boxes(model, processor, image_path, prompt):
    """Use Qwen3-VL to generate bounding boxes for natural-language prompts."""
    messages = [{
        'role': 'user',
        'content': [
            {'type': 'image', 'image': image_path},
            {'type': 'text', 'text': prompt},
        ],
    }]

    inputs = processor.apply_chat_template(
        messages,
        tokenize=True,
        add_generation_prompt=True,
        return_dict=True,
        return_tensors='pt'
    ).to(model.device)

    generated = model.generate(**inputs, max_new_tokens=4096)
    trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated)]
    decoded = processor.batch_decode(trimmed, skip_special_tokens=True)[0]

    # Parse Qwen output.
    json_str = decoded[8:-3] if decoded.startswith('```json') else decoded
    detections = ast.literal_eval(json_str)
    return detections

def crop_dets(image_path, detections, save_dir):
    """
    Crop the detection area of objects and save them.
    """
    image_bgr = cv2.imread(image_path)
    image_rgb = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2RGB)
    h, w, _ = image_bgr.shape

    print(f"Detections: {detections}")
        
    count = 0
    for i, det in enumerate(detections):
        box = det['bbox_2d']
        x1 = int(box[0] / 1000 * w)
        y1 = int(box[1] / 1000 * h)
        x2 = int(box[2] / 1000 * w)
        y2 = int(box[3] / 1000 * h)

        crop = image_rgb[y1:y2, x1:x2]
        crop_bgr = cv2.cvtColor(crop, cv2.COLOR_RGB2BGR)
        cv2.imwrite(os.path.join(save_dir, f'crop_{i}.png'), crop_bgr)
        count += 1
    
    return count

def remove_bg(image_path, device):
    """Feed image to BiRefNet for background removal."""
    # Assumes BiRefNet is already loaded
    
    image = Image.open(image_path)
    transform_image = get_transform_image()
    input_images = transform_image(image).unsqueeze(0).to(device)
    if device == 'cuda':
        input_images = input_images.half()
    
    # Prediction
    with torch.no_grad():
        preds = birefnet(input_images)[-1].sigmoid().cpu()
    pred = preds[0].squeeze()
    pred_pil = transforms.ToPILImage()(pred)
    pred_pil = pred_pil.resize(image.size)
    image_masked = refine_foreground(image, pred_pil)
    image_masked.putalpha(pred_pil)
    
    return image_masked

def setup_directories():
    """Create timestamped directories for this run."""
    timestamp = datetime.datetime.now().strftime("%Y%m%d_%H%M%S")
    
    run_outdir = os.path.join('outputs', timestamp)
    crop_dir = os.path.join('cropped_images', timestamp)
    bg_dir = os.path.join('bg_removed', timestamp)
    
    for d in [run_outdir, crop_dir, bg_dir]:
        os.makedirs(d, exist_ok=True)
        
    return run_outdir, crop_dir, bg_dir

def image_to_3d(text, image_path, do_texture):
    run_outdir, crop_dir, bg_dir = setup_directories()
    
    fix_holes = False
    
    # Object Detection & Cropping
    images_to_process = []
    
    if len(text) > 0:
        load_qwen(args.qwen_device)
        prompt = f"Locate every instance that belongs to the following categories: {text}. Report bbox coordinates in JSON format."
    
        detections = qwen_object_boxes(qwen_model, qwen_processor, image_path, prompt)
        print(f"Qwen3-VL detections: {len(detections)} objects")
        unload_qwen()
        
        num_crops = crop_dets(image_path, detections, crop_dir)
        if num_crops > 0:
            for f in sorted(os.listdir(crop_dir)):
                images_to_process.append(os.path.join(crop_dir, f))
    
    # If no prompt or no detections, use original image
    if not images_to_process:
        images_to_process.append(image_path)

    # Background Removal
    load_birefnet(args.birefnet_device)
    processed_images = []
    for i, img_path in enumerate(images_to_process):
        image_masked = remove_bg(img_path, args.birefnet_device)
        save_path = os.path.join(bg_dir, f'bg_removed_{i}.png')
        image_masked.save(save_path)
        processed_images.append(save_path)
    unload_birefnet()

    # Shape Generation
    load_hunyuan_shape()
    meshes = []
    for img_path in processed_images:
        mesh = pipeline_shape(
            image=img_path,
            num_inference_steps=num_inference_steps,
            generator=torch.manual_seed(seed)
        )[0]
        meshes.append(mesh)
    unload_hunyuan_shape()

    # Mesh Processing & Texture Generation
    final_paths = []
    
    if do_texture:
        load_hunyuan_texture()

    for i, mesh in enumerate(meshes):
        mesh = FloaterRemover()(mesh)
        mesh = DegenerateFaceRemover()(mesh)
        if fix_holes:
            mesh = MeshlibCleaner()(mesh)
        mesh = FaceReducer()(mesh)
        
        if do_texture:
            mesh = pipeline_texture(mesh, Image.open(processed_images[i]))
        
        save_path = os.path.join(run_outdir, f'model_{i}.glb')
        mesh.export(save_path)
        final_paths.append(save_path)

    if do_texture:
        unload_hunyuan_texture()

    # Pad with None to match fixed output count (8)
    while len(final_paths) < 8:
        final_paths.append(None)
        
    return final_paths[:8]

We have helper functions for:

  • Defining the image transforms for BiRefNet background removal operations.
  • Detecting objects using the Qwen3-VL 2B model.
  • Cropping the detected objects for further processing.
  • Removing the background using BiRefNet.
  • Setting up directories for each run.
  • And finally, the image_to_3d function that combines all the operations.
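One detail worth isolating from the helpers above: Qwen3-VL reports bbox_2d coordinates on a 0–1000 normalized grid, so crop_dets() scales them by the actual image dimensions before cropping. A minimal standalone version of that conversion (the function name is ours):

```python
def scale_bbox(box, w, h):
    """Convert a Qwen3-VL 0-1000 normalized box to pixel coordinates."""
    x1 = int(box[0] / 1000 * w)
    y1 = int(box[1] / 1000 * h)
    x2 = int(box[2] / 1000 * w)
    y2 = int(box[3] / 1000 * h)
    return x1, y1, x2, y2

# A detection covering the central half of a 1920x1080 image.
print(scale_bbox([250, 250, 750, 750], 1920, 1080))  # (480, 270, 1440, 810)
```

Skipping this rescaling is a common source of empty or misplaced crops, since the raw coordinates rarely fall inside a small image.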

Running the Gradio Application

In the end, we define the Gradio blocks and launch the application.

with gr.Blocks() as demo:
    gr.Markdown("# Image to 3D Mesh + Texture Pipeline")
    
    with gr.Row():
        with gr.Column(scale=1):
            input_image = gr.Image(type='filepath', label="Input Image")
            prompt = gr.Text(label="Object Prompt (Optional, e.g., 'cup, spoon')")
            do_texture = gr.Checkbox(label="Generate Texture", value=True)
            submit_btn = gr.Button("Generate 3D Models")
        
        with gr.Column(scale=3):
            with gr.Row():
                out1 = gr.Model3D(label="Result 1", height=600)
                out2 = gr.Model3D(label="Result 2", height=600)
            with gr.Row():
                out3 = gr.Model3D(label="Result 3", height=600)
                out4 = gr.Model3D(label="Result 4", height=600)
            with gr.Row():
                out5 = gr.Model3D(label="Result 5", height=600)
                out6 = gr.Model3D(label="Result 6", height=600)
            with gr.Row():
                out7 = gr.Model3D(label="Result 7", height=600)
                out8 = gr.Model3D(label="Result 8", height=600)

    submit_btn.click(
        fn=image_to_3d,
        inputs=[prompt, input_image, do_texture],
        outputs=[out1, out2, out3, out4, out5, out6, out7, out8]
    )

if __name__ == "__main__":
    demo.launch(share=True)

Note: When multiple objects are detected and cropped for the image-to-3D operation, we show up to 8 objects in the UI. If more are generated, they are still stored in the timestamped output directory for post-analysis, even though only the first 8 appear in the UI.
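This fixed-slot behavior comes from the padding at the end of image_to_3d(), shown here in isolation (the helper name is ours; the logic matches the script):

```python
def pad_for_ui(paths, slots=8):
    """Pad with None up to `slots` entries, truncating anything beyond."""
    paths = list(paths)
    while len(paths) < slots:
        paths.append(None)  # renders as an empty Model3D viewer in Gradio
    return paths[:slots]

print(pad_for_ui(['model_0.glb', 'model_1.glb']))
```

Gradio expects exactly as many return values as there are output components wired to the click event, so padding shorter result lists with None keeps the callback signature valid regardless of how many objects were detected.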

Running Inference and Experiments

We can simply execute the script to start the application.

python image_to_texture.py

The default UI looks like the following.

Gradio UI for image-to-3D.
Figure 3. Gradio UI for image-to-3D.

We have a text box where the user can optionally type the name of the objects to detect, crop, and convert to 3D with texturing.

There is also an option to just generate 3D meshes without the texturing pipeline. This can run in less than 6GB of VRAM.

Here are some experiments and results. We are not discussing the quality of the result here. The primary aim is to check how well the end-to-end image-to-3D pipeline with optimizations works when applying the texturing pipeline on top of the 3D meshes. Future articles will focus more on the quality of results and improving everything overall.

Video 1. Image-to-3D mesh with the improved pipeline.

In the above video, we give the prompt to extract multiple objects and not apply the texturing pipeline on the 3D meshes. As we can see, the multiple output boxes are working quite well.

Video 2. Generating image-to-3D mesh + texture with optimized pipeline for a porcelain structure.

The above shows the entire workflow for a porcelain structure with intricate details.

Video 3. Image-to-3D for connected objects.

This is similar to the previous result; it shows how well the background removal works even when the objects of interest are connected to each other.

Video 4. The complete image-to-3D pipeline with optimizations and multi-object UI visualization.

In the final experiment, we check how the pipeline works when prompting for 8 different objects. As we can see, the results are quite good.

Note: There might be cases where object detection via Qwen3-VL 2B falls short of expectations because of scene complexity. In such situations, switching to the 4B or 8B model should give better results.

Summary and Conclusion

In this article, we focused on the optimization of the image-to-3D pipeline for generating 3D meshes and applying textures. We covered VRAM optimizations, the generation of multiple 3D objects from a single image with prompting, and creating a better UI. We can take this project a lot further. Hopefully, we can cover these updates in future articles.

If you have any questions, thoughts, or suggestions, please leave them in the comment section. I will surely address them.

You can contact me using the Contact section. You can also find me on LinkedIn, and X.
