
June 03, 2025

Remote VAEs for Decoding with Inference Endpoints


Have you ever watched your GPU struggle, stutter, or crash while running high-resolution image generation models? Frustrating, right? What if you could offload the heaviest part of the pipeline to free up your GPU without losing speed or quality?

Enter remote Variational Autoencoders (VAEs). By moving the decoding step to a remote inference endpoint, you can create stunning images while keeping your own hardware cool and efficient. How does this work, and why does it matter for AI-generated content?

 

Why Use Remote VAEs?

VAE decoding is memory-hungry. Diffusion models like Stable Diffusion and Flux can consume nearly all of your VRAM during the decoding step, making high-resolution images practically unfeasible on inexpensive GPUs. Offloading and tiling are common workarounds, but they come with trade-offs:

  • Offloading slows things down with data transfers between devices. 
  • Tiling can reduce image quality by splitting the image into fragments that may not blend seamlessly.

So what's the third option? Remote VAEs. We offload decoding to a powerful remote endpoint and let dedicated servers do the compute. The result? Lower GPU memory usage, faster image production, and no bottlenecks.

 

Setting Up Remote VAEs

Let's set everything up before writing any code. Install diffusers from the main branch if you haven't already:

pip install git+https://github.com/huggingface/diffusers@main

 

Now let's build our Stable Diffusion pipeline for remote VAEs. The pipeline normally loads a VAE model on your local system, but we will skip it and use an external endpoint instead.

from diffusers import StableDiffusionPipeline
import torch

pipe = StableDiffusionPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
    vae=None,  # no local VAE, we'll use remote decoding
).to("cuda")

That's it! Your local setup is ready! Let's create some pictures. 

 

Basic Example of Remote VAE Decoding 

Start with a basic example: decoding a random latent tensor with a remote VAE. Instead of decoding on your GPU, we send the tensor to the remote inference endpoint and let it handle everything.

import torch
from diffusers.utils.remote_utils import remote_decode

video = remote_decode(
    endpoint="https://your-endpoint-url",
    tensor=torch.randn([1, 16, 3, 40, 64], dtype=torch.float16),
    output_type="mp4",
)

with open("output_video.mp4", "wb") as f:
    f.write(video)

Pretty nice, huh? This is particularly beneficial for video models like HunyuanVideo, where decoding dozens of frames locally would overwhelm most consumer GPUs.
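To make that concrete, here is a minimal sketch of what the full video flow could look like with HunyuanVideo. Treat it as a sketch under assumptions: the endpoint URL is a placeholder, and the pipeline is loaded without its VAE so only latents are produced locally.

from diffusers import HunyuanVideoPipeline
from diffusers.utils.remote_utils import remote_decode
import torch

# Load the pipeline without a VAE; generation stops at the latent stage.
pipe = HunyuanVideoPipeline.from_pretrained(
    "hunyuanvideo-community/HunyuanVideo",
    vae=None,
    torch_dtype=torch.float16,
).to("cuda")

# With output_type="latent", .frames holds the latent video tensor.
latent = pipe(
    prompt="A cat walks on the grass, realistic",
    height=320,
    width=512,
    num_frames=61,
    output_type="latent",
).frames

# Decode the latent frames remotely and receive MP4 bytes back.
video = remote_decode(
    endpoint="https://your-endpoint-url",  # placeholder, as above
    tensor=latent,
    output_type="mp4",
)

with open("hunyuan_output.mp4", "wb") as f:
    f.write(video)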

 

Generating Images Using Remote VAEs 

Generating real pictures is more interesting than random noise! This works with many models, but here we will use Stable Diffusion v1.5.

prompt = "A futuristic city skyline at sunset, high detail, cinematic"
latent = pipe(prompt=prompt, output_type="latent").images

image = remote_decode(
    endpoint="https://your-endpoint-url",
    tensor=latent,
    scaling_factor=0.18215,  # SD v1.5's VAE scaling factor
)

image.save("output.jpg")

 

Did you notice anything interesting? The VAE is remote, so your GPU never touches the decode step. This is particularly useful for 4K and higher resolutions, where local memory is usually the bottleneck.

If you are testing AI models, try Flux: 

from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-schnell",
    torch_dtype=torch.bfloat16,
    vae=None,
).to("cuda")

latent = pipe(prompt="A cyberpunk warrior", output_type="latent").images

Once you have the latent representation, just send it to the remote endpoint for decoding.
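Note that Flux uses a different VAE than SD v1.5, so the decode call needs different parameters. Here is a sketch, assuming a Flux-compatible endpoint (the URL is a placeholder) and Flux's published VAE scaling and shift factors; the packed Flux latents also need the target height and width to be unpacked on the server side.

image = remote_decode(
    endpoint="https://your-flux-endpoint-url",  # placeholder endpoint
    tensor=latent,
    height=1024,            # target resolution, needed to unpack Flux latents
    width=1024,
    scaling_factor=0.3611,  # Flux VAE scaling factor
    shift_factor=0.1159,    # Flux VAE shift factor
)

image.save("flux_output.jpg")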

 

Queueing Remote VAE Requests

Ever tried generating a batch of images and felt like it took forever? Queueing helps. Instead of blocking on each decode, we can hand latents to a background worker so generation and decoding overlap.

Here is one way to do it with Python's queue and threading modules:

import queue
import threading

q = queue.Queue()

def decode_worker():
    count = 0
    while True:
        latent = q.get()
        if latent is None:  # sentinel: shut the worker down
            break
        image = remote_decode(
            endpoint="https://your-endpoint-url",
            tensor=latent,
            scaling_factor=0.18215,  # assuming SD v1.5 latents, as above
        )
        image.save(f"output_{count}.jpg")  # unique filename per image
        count += 1
        q.task_done()

thread = threading.Thread(target=decode_worker, daemon=True)
thread.start()

 

Add latent representations to the queue whenever you produce them:

latent = pipe(prompt="A futuristic city skyline", output_type="latent").images
q.put(latent)

Just like that, you can generate and decode many images without overwhelming your machine.
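One thing the worker above still needs is a clean shutdown. A minimal way to finish, assuming the queue and worker defined earlier: wait for all pending decodes, then send the None sentinel.

prompts = ["A futuristic city skyline", "A cyberpunk warrior", "A misty forest"]
for p in prompts:
    q.put(pipe(prompt=p, output_type="latent").images)

q.join()     # block until every queued latent has been decoded
q.put(None)  # sentinel tells the worker to exit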

 

Advantages of Using Remote VAEs

You most likely understand by now why remote VAEs matter. To recap:

  1. No VRAM struggles: remote decoding prevents GPU memory overload.
  2. Faster throughput: decoding overlaps with generation instead of blocking it.
  3. Better scalability: create images, videos, and animations without local memory constraints.
  4. High-resolution output: 4K images? No issue.

 

Conclusion

Is Remote VAE worth switching to? Yes, if you regularly hit VRAM limits, suffer from sluggish inference times, or want to run diffusion models more efficiently. Its flexibility and scalability make it a great fit for artists, developers, and researchers.

The best part? This is only the start. This technique may lead to optimized VAE endpoints for higher-resolution photos and videos.

Try it with different models and tell me: what will you make with remote VAEs?
