How-to Guide

How to use DeepSeek V3.5 for production workloads.

Deploy DeepSeek's open-weights models for high-throughput production workloads with load balancing, caching, and fallback strategies. The widely shared "V3.5" spec (1M context at $0.15/$0.60) is unconfirmed.

TL;DR

"DeepSeek V3.5" is reportedly a 1M-context model priced at $0.15/$0.60 per million input/output tokens, which would make it the cheapest capable model available. That model and price point are unconfirmed: no DeepSeek V3.5 appears in [DeepSeek's official release notes or pricing](https://api-docs.deepseek.com/quick_start/pricing-details-usd), where the real lineup runs V3, V3.2, and V4. The deployment patterns below still hold for whichever DeepSeek model you actually run: vLLM serving, request batching, caching, and fallback chains that keep costs down without sacrificing reliability.

Key takeaways

  • Pricing: Reportedly $0.15/$0.60 per million input/output tokens, which would be about 33x cheaper than GPT-5.5 on input and 50x on output ($5.00/$30.00); the price point is unconfirmed against any real DeepSeek model
  • Context: 1M tokens claimed, with 236B parameters (37B active per token). Both figures are unverified, and the parameter count doesn't match the real architecture: 236B is DeepSeek-V2's total, while 37B active belongs to DeepSeek-V3 (671B total)
  • MoE: [Mixture-of-Experts architecture](https://arxiv.org/abs/2412.19437) for efficient inference (confirmed for the real DeepSeek family)
  • Open weights: [Self-host for even lower cost](https://huggingface.co/deepseek-ai/DeepSeek-V3); no data leaves your infra
  • Production: vLLM + continuous batching for a large throughput improvement (often cited around 10x, though it's workload-dependent)

Analysis

Journalist's Take

Cheap, capable AI models have become the story of the year, and the rumour mill keeps feeding it. The latest version doing the rounds is a "DeepSeek V3.5" supposedly offering a million tokens of context for $0.15 in and $0.60 out, a price that, if real, would undercut the big American models by an order of magnitude.

Here's the catch worth saying plainly before anyone reworks a budget around it: there is no DeepSeek V3.5. DeepSeek's published lineup is V3, the V3.2 update released in December 2025, and V4. The "$0.15/$0.60 at 1M context" combination matches none of them. The real numbers are close enough to feel plausible (V3 sits near $0.14/$0.28 at 128K context, and V4 Flash reportedly hits $0.14/$0.28 at 1M), which is exactly why a fabricated spec spreads so easily.

So treat the model name and headline price in this guide as unconfirmed. What is solid is the engineering. DeepSeek's open-weight models are genuinely cheap to run and genuinely self-hostable, and the production architecture for serving them (batching, caching, sensible fallbacks) is the same regardless of which version you pick. That's the part worth your time.

The rest of this piece walks through that architecture. Swap in a real model id (DeepSeek V3.2 or V4) where the samples say deepseek-v3.5, and the patterns carry over cleanly.

Prerequisites

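  • GPU server with 8x A100 80GB or 4x H100 (for self-hosting; this sizing follows the unconfirmed spec, so re-check GPU memory against the real checkpoint you choose)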
  • OR: DeepSeek API key (for managed access)
  • Python 3.10+, vLLM, transformers
  • Docker for containerised deployment
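
A quick way to sanity-check the GPU line in the list above is to compare raw weight size against total GPU memory. The sketch below is back-of-envelope arithmetic only: it counts weights at one byte per parameter for FP8 and ignores KV cache, activations, and framework overhead, so a real deployment needs headroom on top. The 685B figure is the parameter count published for the DeepSeek-V3 repository; the function name and the 80GB-per-H100 figure are assumptions for illustration.

```python
def weights_fit(params_billion: float, bytes_per_param: float,
                gpus: int, gb_per_gpu: float) -> tuple[float, float, bool]:
    """Return (weight_gb, total_gpu_gb, fits) for a rough sizing check."""
    # 1 billion parameters at 1 byte each is roughly 1 GB of weights
    weight_gb = params_billion * bytes_per_param
    total_gb = gpus * gb_per_gpu
    return weight_gb, total_gb, weight_gb < total_gb


# Assumed configurations: both use 80 GB cards
for label, gpus, gb in [("8x A100 80GB", 8, 80), ("4x H100 80GB", 4, 80)]:
    weight_gb, total_gb, fits = weights_fit(685, 1.0, gpus, gb)
    print(f"{label}: weights ~{weight_gb:.0f} GB vs {total_gb:.0f} GB -> fits={fits}")
```

Under these assumptions neither configuration holds the full V3 FP8 weights, which is a cue to size hardware against the checkpoint you actually plan to serve.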

Step-by-Step Framework

Step 1: API Access (Quickest Start)

Start here if you just want a working call. Note the model id below uses deepseek-v3.5; substitute a real one (such as the V3.2 or V4 id from DeepSeek's docs) before you ship.

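# deepseek_api.py
import os

from openai import OpenAI

# Read the key from the environment rather than hard-coding it.
# DEEPSEEK_API_KEY is an assumed variable name; use whatever your secrets setup provides.
client = OpenAI(
    api_key=os.environ["DEEPSEEK_API_KEY"],
    base_url="https://api.deepseek.com/v1",
)

response = client.chat.completions.create(
    model="deepseek-v3.5",  # placeholder id; substitute a real one from DeepSeek's docs
    messages=[{"role": "user", "content": "Explain quantum computing"}],
    max_tokens=1000,
)
print(response.choices[0].message.content)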

Step 2: Self-Host with vLLM

Self-hosting is where the cost story gets real, since DeepSeek's weights are open and nothing leaves your own infrastructure. One caveat on the download line: the deepseek-ai/DeepSeek-V3.5 path and the ~475GB figure are tied to a model that doesn't exist publicly. Point this at a real checkpoint and re-check the size: the published DeepSeek-V3 repository holds 685B parameters (671B for the main model plus a 14B multi-token-prediction module), so its FP8 weights come to roughly 685GB, well above the 475GB quoted here. vLLM does support the DeepSeek family.

# Install vLLM with DeepSeek support (DeepSeek-V3 needs vLLM 0.6.6 or later)
pip install "vllm>=0.6.6"

# Download model (this is large, ~475GB for FP8)
huggingface-cli download deepseek-ai/DeepSeek-V3.5 --local-dir ./deepseek-v3.5

# Launch server with tensor parallelism
python -m vllm.entrypoints.openai.api_server \
  --model ./deepseek-v3.5 \
  --tensor-parallel-size 8 \
  --pipeline-parallel-size 1 \
  --max-num-seqs 256 \
  --max-model-len 65536 \
  --quantization fp8 \
  --port 8000
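
Once the server is up, vLLM exposes an OpenAI-compatible HTTP API, so a smoke test needs nothing beyond the standard library. This sketch assumes the server is reachable at localhost:8000 (matching --port in the launch command) and that the model is addressed by the same path passed to --model, which is how vLLM names a model when no served name is set. The function names are illustrative; adjust host, port, and model if your launch differs.

```python
import json
import urllib.request


def build_chat_request(prompt: str, host: str = "localhost", port: int = 8000,
                       model: str = "./deepseek-v3.5") -> urllib.request.Request:
    """Build a POST to the OpenAI-compatible chat endpoint of a local vLLM server."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 100,
    }
    return urllib.request.Request(
        url=f"http://{host}:{port}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def smoke_test(prompt: str = "Say hello") -> str:
    """Send one request and return the reply text (requires the server to be running)."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=60) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Call smoke_test() after the server reports it is ready; a connection error at this point usually means the model is still loading or the port is wrong.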

Step 3: Implement Request Batching

Batching is the single biggest lever for throughput once you're past prototyping. vLLM's continuous batching handles this at the serving layer, but if you're sitting in front of a managed API, a client-side batcher like this groups requests by size or a short deadline before firing them off.

# batching.py
import asyncio
import time

from openai import AsyncOpenAI


class BatchedDeepSeek:
    def __init__(self, base_url: str, api_key: str):
        self.client = AsyncOpenAI(base_url=base_url, api_key=api_key)
        self.batch_size = 32
        self.max_wait_ms = 50
        self.queue = asyncio.Queue()
        self._task = None  # keep a reference so the processor task isn't garbage-collected

    async def submit(self, request_id: str, messages: list) -> str:
        future = asyncio.get_running_loop().create_future()
        await self.queue.put((request_id, messages, future))
        return await future

    async def _complete_one(self, messages: list) -> str:
        # The chat completions endpoint takes one conversation per call,
        # so each queued request becomes its own API call.
        response = await self.client.chat.completions.create(
            model="deepseek-v3.5",
            messages=messages,
            max_tokens=1000,
        )
        return response.choices[0].message.content

    async def _batch_processor(self):
        while True:
            # Block until at least one request arrives, then start the deadline
            batch = [await self.queue.get()]
            deadline = time.monotonic() + self.max_wait_ms / 1000

            # Collect requests until batch is full or deadline
            while len(batch) < self.batch_size:
                timeout = max(0, deadline - time.monotonic())
                try:
                    item = await asyncio.wait_for(self.queue.get(), timeout=timeout)
                    batch.append(item)
                except asyncio.TimeoutError:
                    break

            # Execute batch: fire the calls concurrently, keeping per-request outcomes
            outcomes = await asyncio.gather(
                *(self._complete_one(messages) for _, messages, _ in batch),
                return_exceptions=True,
            )

            # Resolve each caller's future with its own result or error
            for (_, _, future), outcome in zip(batch, outcomes):
                if future.done():
                    continue
                if isinstance(outcome, BaseException):
                    future.set_exception(outcome)
                else:
                    future.set_result(outcome)

    async def start(self):
        self._task = asyncio.create_task(self._batch_processor())
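
To see the size-or-deadline collection step in isolation, here is a minimal sketch that swaps the API call for a stub coroutine. Everything named here (collect_batch, fake_model, the 20 ms wait) is illustrative and not part of any library; the point is only to show how a batch closes when it is full or when the wait runs out.

```python
import asyncio
import time


async def collect_batch(queue: asyncio.Queue, batch_size: int, max_wait_s: float) -> list:
    """Wait for one item, then gather more until the batch is full or time is up."""
    batch = [await queue.get()]
    deadline = time.monotonic() + max_wait_s
    while len(batch) < batch_size:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            batch.append(await asyncio.wait_for(queue.get(), timeout=remaining))
        except asyncio.TimeoutError:
            break
    return batch


async def fake_model(prompt: str) -> str:
    """Stand-in for a real API call."""
    await asyncio.sleep(0.01)
    return prompt.upper()


async def demo() -> list:
    queue: asyncio.Queue = asyncio.Queue()
    for prompt in ["a", "b", "c", "d", "e"]:
        queue.put_nowait(prompt)
    batches = []
    while not queue.empty():
        batch = await collect_batch(queue, batch_size=3, max_wait_s=0.02)
        results = await asyncio.gather(*(fake_model(p) for p in batch))
        batches.append((len(batch), results))
    return batches


print(asyncio.run(demo()))  # [(3, ['A', 'B', 'C']), (2, ['D', 'E'])]
```

The first batch closes because it hit the size limit; the second closes because the deadline expired with only two items waiting.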

Step 4: Response Caching

A lot of production traffic is the same question asked over and over. Cache the answer and you stop paying for it twice. This decorator keys on the message payload plus parameters, checks Redis first, and only hits the model on a miss.

# caching.py
import hashlib
import json
import os
from functools import wraps

import redis.asyncio as redis  # async client, so cache lookups don't block the event loop
from openai import AsyncOpenAI

cache = redis.Redis(host="localhost", port=6379, db=0)
CACHE_TTL = 3600  # 1 hour

# DEEPSEEK_API_KEY is an assumed environment variable name
client = AsyncOpenAI(
    api_key=os.environ["DEEPSEEK_API_KEY"],
    base_url="https://api.deepseek.com/v1",
)

def cached_llm_call(ttl_seconds: int = CACHE_TTL):
    def decorator(func):
        @wraps(func)
        async def wrapper(messages, **kwargs):
            # Create cache key from messages + params
            cache_data = json.dumps({"messages": messages, **kwargs}, sort_keys=True)
            cache_key = f"llm:{hashlib.sha256(cache_data.encode()).hexdigest()}"

            # Check cache
            cached = await cache.get(cache_key)
            if cached is not None:
                return json.loads(cached)

            # Call LLM
            result = await func(messages, **kwargs)

            # Cache result
            await cache.setex(cache_key, ttl_seconds, json.dumps(result))
            return result
        return wrapper
    return decorator

@cached_llm_call(ttl_seconds=3600)
async def deepseek_call(messages, **kwargs):
    response = await client.chat.completions.create(
        model="deepseek-v3.5",
        messages=messages,
        **kwargs
    )
    return response.choices[0].message.content
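
The key derivation is the part of the cache most worth checking by hand, since a key that is too loose returns wrong answers and one that is too strict never hits. This sketch isolates it with the standard library only; cache_key is an illustrative helper name, and the parameters passed in are arbitrary examples.

```python
import hashlib
import json


def cache_key(messages: list, **params) -> str:
    """Derive a deterministic key from the messages plus call parameters."""
    payload = json.dumps({"messages": messages, **params}, sort_keys=True)
    return "llm:" + hashlib.sha256(payload.encode("utf-8")).hexdigest()


msgs = [{"role": "user", "content": "Explain quantum computing"}]

same_a = cache_key(msgs, max_tokens=1000, temperature=0)
same_b = cache_key(msgs, temperature=0, max_tokens=1000)  # keyword order swapped
different = cache_key(msgs, max_tokens=500, temperature=0)

print(same_a == same_b)     # True: sort_keys makes the key order-independent
print(same_a == different)  # False: changing a parameter changes the key
```

Note that an exact-match key only helps when requests are byte-for-byte identical after serialisation; a trailing space in the prompt produces a different key.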

Step 5: Fallback Chain

One provider will eventually have a bad day, so don't bet the whole system on it. This chain tries DeepSeek first, then falls back to MiniMax M3 (a real 1M-context model launched in June 2026) and OpenRouter in order of priority, returning the first response that succeeds.

# fallback.py
import asyncio
import os

from openai import AsyncOpenAI


def _client(base_url: str, key_env: str) -> AsyncOpenAI:
    # Each provider needs its own key; the environment variable names are assumptions
    return AsyncOpenAI(base_url=base_url, api_key=os.environ[key_env])


class FallbackLLM:
    def __init__(self):
        self.providers = [
            {"name": "deepseek", "client": _client("https://api.deepseek.com/v1", "DEEPSEEK_API_KEY"), "model": "deepseek-v3.5", "priority": 1},
            {"name": "minimax", "client": _client("https://api.minimax.chat/v1", "MINIMAX_API_KEY"), "model": "minimax-m3", "priority": 2},
            {"name": "openrouter", "client": _client("https://openrouter.ai/api/v1", "OPENROUTER_API_KEY"), "model": "deepseek-v3.5", "priority": 3}
        ]

    async def complete(self, messages, max_tokens=1000, timeout=30):
        # Try providers in priority order and return the first successful reply
        for provider in sorted(self.providers, key=lambda x: x["priority"]):
            try:
                response = await asyncio.wait_for(
                    provider["client"].chat.completions.create(
                        model=provider["model"],
                        messages=messages,
                        max_tokens=max_tokens
                    ),
                    timeout=timeout
                )
                print(f"Response from {provider['name']}")
                return response.choices[0].message.content
            except Exception as e:
                # Timeouts and API errors both land here; move on to the next provider
                print(f"{provider['name']} failed: {e}")
                continue

        raise RuntimeError("All providers failed")
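
The fallback behaviour is easy to verify without any API keys by substituting stub coroutines for the providers. The sketch below is a stripped-down version of the same loop under that assumption: flaky, slow, and healthy are hypothetical stand-ins, and the 50 ms timeout is chosen only to keep the demo fast.

```python
import asyncio


async def flaky(messages):
    raise ConnectionError("provider down")


async def slow(messages):
    await asyncio.sleep(1)  # longer than the timeout used below
    return "too late"


async def healthy(messages):
    return "ok from backup"


async def complete_with_fallback(providers, messages, timeout=0.05):
    """Try each (name, coroutine function) in order; return the first success."""
    errors = []
    for name, call in providers:
        try:
            reply = await asyncio.wait_for(call(messages), timeout=timeout)
            return name, reply
        except Exception as exc:  # timeout or provider error: try the next one
            errors.append(f"{name}: {type(exc).__name__}")
    raise RuntimeError("All providers failed: " + ", ".join(errors))


providers = [("primary", flaky), ("secondary", slow), ("tertiary", healthy)]
result = asyncio.run(complete_with_fallback(providers, [{"role": "user", "content": "hi"}]))
print(result)  # ('tertiary', 'ok from backup')
```

The first provider fails outright, the second exceeds the timeout, and the third answers, which is the sequence a production chain needs to survive.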

Do/Don't

Do | Don't
--- | ---
Use vLLM with FP8 quantisation for serving | Run FP16 without 16x A100 80GB
Implement response caching for repeated queries | Call the API for identical requests
Use batching for high-throughput scenarios | Send one request at a time
Set up fallback to other providers | Rely on a single provider
Monitor token usage and latency per request | Deploy without usage monitoring
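
For the monitoring row, a thin wrapper around each call is often enough to start with. This is a minimal sketch, assuming the response carries an OpenAI-style usage block with prompt_tokens and completion_tokens; the stub returns a plain dict, whereas the real client exposes the same fields as attributes (response.usage.prompt_tokens). CallRecord, timed_call, and stub_llm are illustrative names.

```python
import asyncio
import time
from dataclasses import dataclass


@dataclass
class CallRecord:
    latency_s: float
    prompt_tokens: int
    completion_tokens: int


async def timed_call(call, *args, **kwargs):
    """Await an LLM call and return (response, CallRecord)."""
    start = time.perf_counter()
    response = await call(*args, **kwargs)
    latency = time.perf_counter() - start
    usage = response["usage"]
    return response, CallRecord(latency, usage["prompt_tokens"], usage["completion_tokens"])


async def stub_llm(messages):
    """Stand-in that mimics the usage block of an OpenAI-style response."""
    await asyncio.sleep(0.01)
    return {"text": "hello", "usage": {"prompt_tokens": 12, "completion_tokens": 3}}


_, record = asyncio.run(timed_call(stub_llm, [{"role": "user", "content": "hi"}]))
print(record.prompt_tokens, record.completion_tokens, f"{record.latency_s:.3f}s")
```

In practice the record would go to whatever metrics system you already run rather than to stdout.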

Cost Comparison

A note before reading the table: the DeepSeek column uses the unconfirmed $0.15/$0.60 figure, so treat that column and the resulting savings as illustrative, not gospel. The competitor prices are accurate: GPT-5.5 is $5.00/$30.00 and Claude Sonnet 4.6 is $3.00/$15.00 per million input/output tokens. Run the comparison again with a real DeepSeek price (V3 at roughly $0.14/$0.28, for instance) before you quote any of it to a finance team.

Usage | DeepSeek V3.5 | GPT-5.5 | Claude Sonnet 4.6 | Savings
--- | --- | --- | --- | ---
1M input tokens | $0.15 | $5.00 | $3.00 | 20-33x
1M output tokens | $0.60 | $30.00 | $15.00 | 25-50x
10M input + 10M output tokens/day | $225/mo | $10,500/mo | $5,400/mo | 24-47x

The daily-volume row is the most speculative line in the table: it assumes 10M input plus 10M output tokens every day over a 30-day month, real traffic is rarely split that evenly, and the DeepSeek baseline rests on the unconfirmed V3.5 price. Useful as a rough sense of the gap, not a quote.
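
Monthly figures like these are easy to get wrong, so it helps to recompute them from the per-token prices under one explicit assumption. The sketch below assumes 10M input plus 10M output tokens per day over a 30-day month; the prices are the per-million rates listed above, with the DeepSeek entry still the unconfirmed $0.15/$0.60. The function name and volume split are illustrative, so swap in your own traffic mix.

```python
# Prices in USD per million tokens (input, output).
# The DeepSeek entry is the unconfirmed $0.15/$0.60 figure.
PRICES = {
    "DeepSeek V3.5": (0.15, 0.60),
    "GPT-5.5": (5.00, 30.00),
    "Claude Sonnet 4.6": (3.00, 15.00),
}


def monthly_cost(price_in, price_out, in_m_per_day, out_m_per_day, days=30):
    """Monthly spend given daily volume in millions of tokens."""
    return days * (in_m_per_day * price_in + out_m_per_day * price_out)


# Assumed volume: 10M input + 10M output tokens per day
costs = {name: monthly_cost(p_in, p_out, 10, 10) for name, (p_in, p_out) in PRICES.items()}
baseline = costs["DeepSeek V3.5"]
for name, cost in costs.items():
    print(f"{name}: ${cost:,.0f}/mo ({cost / baseline:.1f}x)")
```

On these assumptions the totals come to $225, $10,500, and $5,400 per month, a gap of roughly 24x to 47x.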

Conclusion

The real takeaway survives the fact-check even if the headline model doesn't. DeepSeek's open-weight models are among the cheapest capable options around, and the architecture in this guide (vLLM with FP8 quantisation, request batching, response caching, and a fallback chain) is what makes them dependable in production. Just build it around a model that actually exists: check DeepSeek's own docs for the current V3.2 or V4 ids and pricing, then run your own cost numbers before you commit. The gap between open-weight and proprietary pricing is real and large; the specific "V3.5 at $0.15/$0.60" framing is not something I'd bank on yet.


AI Kick Start is an Illawarra-based AI studio in Figtree, helping businesses across Wollongong, Shellharbour and Kiama and right across Australia put AI to work.
