This is the third article in the Image-to-3D series. In the first two, we covered image-to-mesh generation and then extended the pipeline to include texture generation. This article focuses on practical, incremental optimizations for image-to-3D: reducing VRAM requirements, generating multiple meshes and textures from a single image using prompts, and minor yet meaningful UI improvements. None of these changes is huge on its own, but together they noticeably improve the workflow and user experience.
In the last article, we established that the entire pipeline requires ~29GB VRAM. That is a hefty requirement: since consumer GPUs top out at 24GB, anything above that means renting a 32GB GPU in the cloud if you can't run locally. So, our aim is to bring the VRAM requirement down to under 24GB while keeping all models on the GPU, accepting a slight increase in runtime as the trade-off.
What are we going to cover in image-to-3D with incremental improvements?
- What changes do we need to make to bring down the VRAM usage?
- How do we generate multiple 3D meshes and textures from a single image with prompts?
- How do we tweak the Gradio UI to make the user experience better?
- Covering the codebase.
- Running a few inference experiments and analyzing the results.
Project Directory Structure
The following is the project directory structure.
├── bg_removed
│   ├── 20251130_211318
│   ...
│   └── 20251130_212839
├── BiRefNet
│   ├── evaluation
│   ├── models
│   ...
│   ├── sub.sh
│   ├── test.sh
│   ├── train.py
│   ├── train.sh
│   ├── train_test.sh
│   └── utils.py
├── birefnet_weights
│   └── BiRefNet-general-epoch_244.pth
├── build_setup
│   └── Hunyuan3D-2
├── cropped_images
│   ├── 20251130_211318
│   ...
│   └── 20251130_212839
├── input
│   ├── image_1.jpg
│   ...
│   └── image_5.jpg
├── outputs
│   ├── 20251130_211318
│   ...
│   └── 20251130_212839
├── hunyuan3d_final_req.txt
├── image_to_texture.py
├── README.md
└── setup.sh
- The `bg_removed`, `cropped_images`, and `outputs` directories contain the cleaned images (images whose background is removed), the cropped objects after detection, and the output 3D `.glb` files, respectively. For each run, we create a timestamped subdirectory because there can be multiple objects, crops, and 3D meshes for each image. With this approach, we do not lose any data.
- The `BiRefNet` directory is the cloned BiRefNet repository. The `birefnet_weights` directory contains the pretrained checkpoint that we will use for background removal.
- All the experimental images are present in the `input` directory.
- `image_to_texture.py` is the executable Python file that we will run to start the application.
- All the setup is handled by `setup.sh` and `hunyuan3d_final_req.txt`. The `setup.sh` file creates the `build_setup` directory and installs everything necessary for the Hunyuan3D and BiRefNet models. The version-specific requirements are pinned in `hunyuan3d_final_req.txt`. The entire setup process is automated via the shell script.
The executable Python file and input images are provided via a zip file in the download section. If you wish to run the code locally, please install the requirements by following the necessary steps.
Download Code
Setup and Installing Dependencies
Using a virtual environment with Python 3.10 is a hard requirement to set up everything successfully. Anaconda/Miniconda is recommended to create the virtual environment.
First, create a virtual environment with Python 3.10 and install PyTorch with CUDA. The code in this article used PyTorch 2.9.1. However, future versions should work without issues.
Second, run the setup.sh file.
sh setup.sh
This will install everything that is necessary.
Third, create the birefnet_weights directory, download the weights from here, and copy them into the directory.
This completes the entire setup.
What Are We Doing to Achieve Incremental Optimizations for Image-to-3D?
We are primarily targeting three optimization processes for the image-to-3D pipeline.

Reducing VRAM Requirements
The original code that we used in the previous image-to-texture generation article required ~29GB VRAM when everything was loaded onto the GPU. To tackle this issue, we use offloading techniques. Here, once the operation of a certain model completes, we delete the model and release the CUDA memory & cache associated with it.
Most models require less than 6GB VRAM, with the exception of the texture generation model, which needs ~18.8GB VRAM. So, the entire pipeline can run on a 20GB GPU. An RTX 4000 Ada is enough for this, and it is easily accessible via platforms like Runpod.
Of course, this offloading technique increases the execution time of the entire pipeline. However, it also allows us to run the models with a cheaper system.
Multi-Mesh Generation from a Single Image with Prompts
The pipeline supports user prompts, allowing users to upload an image with multiple objects and provide a prompt naming the object for which to generate a 3D mesh and texture. However, previously, only single-object generations were possible. We change that to "almost as many objects as possible" via prompting.
Now, users can provide the names of multiple objects to generate 3D meshes and textures for all of them. We will see that working in the demo section.
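Internally, the comma-separated names the user types are wrapped in a detection instruction for Qwen3-VL. A small sketch of that templating (the function name is ours; the template string matches the one used in `image_to_texture.py`):

```python
def build_detection_prompt(text: str) -> str:
    """Wrap the user's comma-separated object names in the detection
    instruction that the pipeline sends to Qwen3-VL."""
    return (
        f"Locate every instance that belongs to the following categories: {text}. "
        "Report bbox coordinates in JSON format."
    )

print(build_detection_prompt("cup, spoon"))
```

Because the instruction asks for every instance of every listed category, a single prompt such as "cup, spoon" can yield several bounding boxes, and therefore several meshes.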
UI Improvements
We also include a few minor UI improvements. Up to 8 3D meshes/objects are now visible in the UI. If meshes and textures are generated for fewer objects, only that many slots are filled. If there are more than 8 objects, the first 8 are visible in the UI; however, all the resulting files are still stored in the timestamped output directories.
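Because Gradio expects a fixed number of outputs, the pipeline pads its result list with `None` before returning it to the UI. A standalone sketch of that padding (the helper name is illustrative; the article's script does this inline):

```python
def pad_to_slots(paths, num_slots=8):
    """Pad the list of generated .glb paths with None so the UI always
    receives exactly `num_slots` outputs; truncate if there are more."""
    padded = list(paths) + [None] * max(0, num_slots - len(paths))
    return padded[:num_slots]

print(pad_to_slots(["model_0.glb", "model_1.glb"]))
# → ['model_0.glb', 'model_1.glb', None, None, None, None, None, None]
```

A `None` output simply leaves the corresponding `Model3D` component empty, which is why unused slots do not clutter the interface.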
Code for Incremental Improvements for Image-to-3D Generation
Let’s jump into the code now.
The logical flow of the code remains the same. All the changes that we discussed above are structural. So, we don’t have to go through the entire explanation in depth.
If you wish to get a detailed introductory walkthrough of the code, please visit the image-to-3D mesh article. It covers the initial explanation of the code in detail.
The entire code is present in the image_to_texture.py file.
Import Statements, Handling Global Variables, and Argument Parsers
The following code block handles all the necessary imports, defining the global variables that we will need along the way, and defining the argument parser for command line arguments.
import sys
import argparse
import gc
import torch
import ast
import os
import cv2
import shutil
import datetime
import gradio as gr
from PIL import Image
from torchvision import transforms
sys.path.append('BiRefNet')
from image_proc import refine_foreground
from models.birefnet import BiRefNet
from utils import check_state_dict
from transformers import Qwen3VLForConditionalGeneration, AutoProcessor
from hy3dgen.shapegen import (
Hunyuan3DDiTFlowMatchingPipeline,
FaceReducer,
FloaterRemover,
DegenerateFaceRemover,
MeshlibCleaner
)
from hy3dgen.texgen import Hunyuan3DPaintPipeline
# Global variables for models to allow loading/unloading
qwen_model = None
qwen_processor = None
birefnet = None
pipeline_shape = None
pipeline_texture = None
def parse_args():
parser = argparse.ArgumentParser(description="Image to 3D mesh + texture pipeline")
parser.add_argument('--birefnet_device', type=str, default='cuda', choices=['cuda', 'cpu'], help='Device for BiRefNet')
parser.add_argument('--qwen_device', type=str, default='cuda', choices=['cuda', 'cpu'], help='Device for Qwen3-VL')
return parser.parse_args()
args = parse_args()
num_inference_steps = 50
seed = 42
Along with the imports, we also add the BiRefNet directory to the Python path to use the necessary modules.
The global variables for the several models will allow us to load and unload the model to and from the GPU. This is necessary to maintain a global state.
We have two command line arguments as well. If necessary, we can load the BiRefNet and Qwen3-VL models on the CPU. However, most of the time, that might not be necessary as they require less than 5GB VRAM and also follow the loading/unloading protocol to save VRAM.
Other than that, we define the number of inference steps for the 3D mesh generation and a seed for reproducibility.
Loading and Unloading Model To and From GPU
The next code block defines the functions to load and unload all the models to and from GPU memory.
def cleanup_memory():
gc.collect()
torch.cuda.empty_cache()
def load_qwen(device):
global qwen_model, qwen_processor
print(f"Loading Qwen3-VL on {device}...")
model_id = 'Qwen/Qwen3-VL-2B-Instruct'
qwen_model = Qwen3VLForConditionalGeneration.from_pretrained(
model_id,
dtype=torch.bfloat16,
device_map=device
)
qwen_processor = AutoProcessor.from_pretrained(model_id)
def unload_qwen():
global qwen_model, qwen_processor
print("Unloading Qwen3-VL...")
del qwen_model
del qwen_processor
qwen_model = None
qwen_processor = None
cleanup_memory()
def load_birefnet(device):
global birefnet
print(f"Loading BiRefNet on {device}...")
model_name = 'BiRefNet'
birefnet = BiRefNet(bb_pretrained=False)
state_dict = torch.load(
'birefnet_weights/BiRefNet-general-epoch_244.pth',
map_location=device
)
state_dict = check_state_dict(state_dict)
birefnet.load_state_dict(state_dict)
if device == 'cuda':
torch.set_float32_matmul_precision(['high', 'highest'][0])
birefnet.to(device)
birefnet.eval()
if device == 'cuda':
birefnet.half()
print('BiRefNet is ready to use.')
def unload_birefnet():
global birefnet
print("Unloading BiRefNet...")
del birefnet
birefnet = None
cleanup_memory()
def load_hunyuan_shape():
global pipeline_shape
print("Loading Hunyuan3D Shape model on cuda...")
pipeline_shape = Hunyuan3DDiTFlowMatchingPipeline.from_pretrained(
'tencent/Hunyuan3D-2mini',
subfolder='hunyuan3d-dit-v2-mini',
use_safetensors=True,
device='cuda'
)
def unload_hunyuan_shape():
global pipeline_shape
print("Unloading Hunyuan3D Shape model...")
del pipeline_shape
pipeline_shape = None
cleanup_memory()
def load_hunyuan_texture():
global pipeline_texture
print("Loading Hunyuan3D Texture model on cuda...")
pipeline_texture = Hunyuan3DPaintPipeline.from_pretrained(
'tencent/Hunyuan3D-2',
device='cuda',
)
def unload_hunyuan_texture():
global pipeline_texture
print("Unloading Hunyuan3D Texture model...")
del pipeline_texture
pipeline_texture = None
cleanup_memory()
Each model has its own loading and unloading function. After the operation ends for a particular component in the pipeline, we unload the model from memory and call garbage collection to free up the CUDA memory.
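To verify that unloading actually frees memory between stages, a small logging helper can be dropped in after each `unload_*` call. This is a hypothetical helper, not part of the article's script, and it degrades gracefully when no GPU (or no `torch`) is available:

```python
# Hypothetical helper to log allocated VRAM between pipeline stages.
try:
    import torch
except ImportError:
    torch = None

def report_vram(stage: str):
    """Print and return allocated CUDA memory in GB, or None without a GPU."""
    if torch is None or not torch.cuda.is_available():
        print(f"[{stage}] no CUDA device available")
        return None
    allocated_gb = torch.cuda.memory_allocated() / 1024**3
    print(f"[{stage}] allocated: {allocated_gb:.2f} GB")
    return allocated_gb

report_vram("after unload_birefnet")
```

Calling it before and after an unload makes any leaked allocation immediately visible in the logs.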
Helper Functions for Image-to-3D Operations
There are several helper functions that we need along the way for the entire pipeline.
# BiRefNet image transforms.
def get_transform_image(model_name='BiRefNet'):
return transforms.Compose([
transforms.Resize((1024, 1024) if '_HR' not in model_name else (2048, 2048)),
transforms.ToTensor(),
transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
])
def qwen_object_boxes(model, processor, image_path, prompt):
"""Use Qwen3-VL to generate bounding boxes for natural-language prompts."""
messages = [{
'role': 'user',
'content': [
{'type': 'image', 'image': image_path},
{'type': 'text', 'text': prompt},
],
}]
inputs = processor.apply_chat_template(
messages,
tokenize=True,
add_generation_prompt=True,
return_dict=True,
return_tensors='pt'
).to(model.device)
generated = model.generate(**inputs, max_new_tokens=4096)
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated)]
decoded = processor.batch_decode(trimmed, skip_special_tokens=True)[0]
# Parse Qwen output.
json_str = decoded[8:-3] if decoded.startswith('```json') else decoded
detections = ast.literal_eval(json_str)
return detections
def crop_dets(image_path, detections, save_dir):
"""
Crop the detection area of objects and save them.
"""
image_bgr = cv2.imread(image_path)
image_rgb = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2RGB)
h, w, _ = image_bgr.shape
print(f"Detections: {detections}")
count = 0
for i, det in enumerate(detections):
box = det['bbox_2d']
x1 = int(box[0] / 1000 * w)
y1 = int(box[1] / 1000 * h)
x2 = int(box[2] / 1000 * w)
y2 = int(box[3] / 1000 * h)
crop = image_rgb[y1:y2, x1:x2]
crop_bgr = cv2.cvtColor(crop, cv2.COLOR_RGB2BGR)
cv2.imwrite(os.path.join(save_dir, f'crop_{i}.png'), crop_bgr)
count += 1
return count
def remove_bg(image_path, device):
"""Feed image to BiRefNet for background removal."""
# Assumes BiRefNet is already loaded
image = Image.open(image_path)
transform_image = get_transform_image()
input_images = transform_image(image).unsqueeze(0).to(device)
if device == 'cuda':
input_images = input_images.half()
# Prediction
with torch.no_grad():
preds = birefnet(input_images)[-1].sigmoid().cpu()
pred = preds[0].squeeze()
pred_pil = transforms.ToPILImage()(pred)
pred_pil = pred_pil.resize(image.size)
image_masked = refine_foreground(image, pred_pil)
image_masked.putalpha(pred_pil)
return image_masked
def setup_directories():
"""Create timestamped directories for this run."""
timestamp = datetime.datetime.now().strftime("%Y%m%d_%H%M%S")
run_outdir = os.path.join('outputs', timestamp)
crop_dir = os.path.join('cropped_images', timestamp)
bg_dir = os.path.join('bg_removed', timestamp)
for d in [run_outdir, crop_dir, bg_dir]:
os.makedirs(d, exist_ok=True)
return run_outdir, crop_dir, bg_dir
def image_to_3d(text, image_path, do_texture):
run_outdir, crop_dir, bg_dir = setup_directories()
fix_holes = False
# Object Detection & Cropping
images_to_process = []
if len(text) > 0:
load_qwen(args.qwen_device)
prompt = f"Locate every instance that belongs to the following categories: {text}. Report bbox coordinates in JSON format."
detections = qwen_object_boxes(qwen_model, qwen_processor, image_path, prompt)
print(f"Qwen3-VL detections: {len(detections)} objects")
unload_qwen()
num_crops = crop_dets(image_path, detections, crop_dir)
if num_crops > 0:
for f in sorted(os.listdir(crop_dir)):
images_to_process.append(os.path.join(crop_dir, f))
# If no prompt or no detections, use original image
if not images_to_process:
images_to_process.append(image_path)
# Background Removal
load_birefnet(args.birefnet_device)
processed_images = []
for i, img_path in enumerate(images_to_process):
image_masked = remove_bg(img_path, args.birefnet_device)
save_path = os.path.join(bg_dir, f'bg_removed_{i}.png')
image_masked.save(save_path)
processed_images.append(save_path)
unload_birefnet()
# Shape Generation
load_hunyuan_shape()
meshes = []
for img_path in processed_images:
mesh = pipeline_shape(
image=img_path,
num_inference_steps=num_inference_steps,
generator=torch.manual_seed(seed)
)[0]
meshes.append(mesh)
unload_hunyuan_shape()
# Mesh Processing & Texture Generation
final_paths = []
if do_texture:
load_hunyuan_texture()
for i, mesh in enumerate(meshes):
mesh = FloaterRemover()(mesh)
mesh = DegenerateFaceRemover()(mesh)
if fix_holes:
mesh = MeshlibCleaner()(mesh)
mesh = FaceReducer()(mesh)
if do_texture:
mesh = pipeline_texture(mesh, Image.open(processed_images[i]))
save_path = os.path.join(run_outdir, f'model_{i}.glb')
mesh.export(save_path)
final_paths.append(save_path)
if do_texture:
unload_hunyuan_texture()
# Pad with None to match fixed output count (8)
while len(final_paths) < 8:
final_paths.append(None)
return final_paths[:8]
We have helper functions for:
- Defining the image transforms for BiRefNet background removal operations.
- Detecting objects using the Qwen3-VL 2B model.
- Cropping the detected objects for further processing.
- Removing the background using BiRefNet.
- Setting up directories for each run.
- And finally, the `image_to_3d` function that combines all the operations.
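One detail worth isolating from `crop_dets` is the coordinate conversion: the bounding boxes parsed from Qwen's output are on a 0-1000 normalized grid and must be rescaled to the actual image size. A self-contained sketch of that conversion (the function name is ours; the arithmetic matches the cropping code above):

```python
def scale_bbox(box, width, height):
    """Convert a bbox on a 0-1000 normalized grid into integer pixel
    coordinates (x1, y1, x2, y2) for an image of the given size."""
    x1 = int(box[0] / 1000 * width)
    y1 = int(box[1] / 1000 * height)
    x2 = int(box[2] / 1000 * width)
    y2 = int(box[3] / 1000 * height)
    return x1, y1, x2, y2

print(scale_bbox([100, 200, 500, 800], width=2000, height=1000))
# → (200, 200, 1000, 800)
```

Keeping the boxes normalized until crop time means the detections stay valid regardless of the resolution the image was resized to for the vision-language model.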
Running the Gradio Application
In the end, we define the Gradio blocks and launch the application.
with gr.Blocks() as demo:
gr.Markdown("# Image to 3D Mesh + Texture Pipeline")
with gr.Row():
with gr.Column(scale=1):
input_image = gr.Image(type='filepath', label="Input Image")
prompt = gr.Text(label="Object Prompt (Optional, e.g., 'cup, spoon')")
do_texture = gr.Checkbox(label="Generate Texture", value=True)
submit_btn = gr.Button("Generate 3D Models")
with gr.Column(scale=3):
with gr.Row():
out1 = gr.Model3D(label="Result 1", height=600)
out2 = gr.Model3D(label="Result 2", height=600)
with gr.Row():
out3 = gr.Model3D(label="Result 3", height=600)
out4 = gr.Model3D(label="Result 4", height=600)
with gr.Row():
out5 = gr.Model3D(label="Result 5", height=600)
out6 = gr.Model3D(label="Result 6", height=600)
with gr.Row():
out7 = gr.Model3D(label="Result 7", height=600)
out8 = gr.Model3D(label="Result 8", height=600)
submit_btn.click(
fn=image_to_3d,
inputs=[prompt, input_image, do_texture],
outputs=[out1, out2, out3, out4, out5, out6, out7, out8]
)
if __name__ == "__main__":
demo.launch(share=True)
Note: When multiple objects are detected and cropped for the image-to-3D operation, we show up to 8 objects in the UI. If there are more than that, they will be stored in the output directory for post-analysis.
Running Inference and Experiments
We can simply execute the script to start the application.
python image_to_texture.py
The default UI looks like the following.
We have a text box where the user can optionally type the name of the objects to detect, crop, and convert to 3D with texturing.
There is also an option to just generate 3D meshes without the texturing pipeline. This can run in less than 6GB of VRAM.
Here are some experiments and results. We are not discussing the quality of the result here. The primary aim is to check how well the end-to-end image-to-3D pipeline with optimizations works when applying the texturing pipeline on top of the 3D meshes. Future articles will focus more on the quality of results and improving everything overall.
In the above video, we give the prompt to extract multiple objects and not apply the texturing pipeline on the 3D meshes. As we can see, the multiple output boxes are working quite well.
The above shows the entire workflow for a porcelain structure with intricate details.
This is similar to the previous result; it shows how well the background removal works even when the objects of interest are connected to each other.
In the final experiment, we check how the pipeline works when prompting for 8 different objects. As we can see, the results are quite good.
Note: There might be cases where the object detection via Qwen3-VL 2B is not up to expectations because of the complexity of the scene. In such situations, simply using the 4B or 8B model will surely give better results.
Summary and Conclusion
In this article, we focused on the optimization of the image-to-3D pipeline for generating 3D meshes and applying textures. We covered VRAM optimizations, the generation of multiple 3D objects from a single image with prompting, and creating a better UI. We can take this project a lot further. Hopefully, we can cover these updates in future articles.
If you have any questions, thoughts, or suggestions, please leave them in the comment section. I will surely address them.
You can contact me using the Contact section. You can also find me on LinkedIn, and X.