
February 24, 2025

DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence


 

DeepSeek-Coder-V2 is a major step forward for open-source code intelligence, challenging the dominance of closed-source models in AI-powered coding assistance. It delivers strong performance in code generation, code completion, and chat-based interactions, and ships in several model variants aimed at different use cases. This article walks through those variants and provides step-by-step instructions for running DeepSeek-Coder-V2 locally.

 

Model Variants 

DeepSeek-Coder-V2-Lite-Base 

A base model trained specifically for code generation tasks. It excels at code completion and insertion, making it a good fit for developers who want AI assistance in writing clean, efficient code.

 

DeepSeek-Coder-V2-Lite-Instruct 

Built on the Lite-Base model, this variant adds instruction tuning to improve response quality for specific prompts. It is well suited to interactive coding tasks and chat-based AI development.

 

DeepSeek-Coder-V2-Base 

A more powerful variant designed for larger-scale code generation and analysis, the Base model is well suited to professional software developers tackling complex programming problems.

 

DeepSeek-Coder-V2-Instruct 

Designed for chat-based interactions and instruction-following tasks, this is the most advanced variant. It interprets and generates code with high accuracy and competes strongly with closed-source solutions.

 

How to Run Locally 

To run DeepSeek-Coder-V2 locally, make sure your system meets the hardware requirements. In particular, running DeepSeek-Coder-V2 in BF16 format requires 80GB*8 GPUs for optimal inference performance.
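
A quick way to see what your machine can handle is to check GPU count and memory before downloading a checkpoint. The snippet below is a minimal sketch using PyTorch's CUDA utilities; the 640GB figure simply mirrors the 80GB*8 guideline above and is not an official threshold.

import torch

# Rough check of local GPU resources before choosing a model variant.
if not torch.cuda.is_available():
    print("No CUDA GPUs detected; the full models will not run locally on this machine.")
else:
    total_gb = 0.0
    for i in range(torch.cuda.device_count()):
        props = torch.cuda.get_device_properties(i)
        mem_gb = props.total_memory / 1024 ** 3
        total_gb += mem_gb
        print(f"GPU {i}: {props.name}, {mem_gb:.0f} GB")
    if total_gb < 640:
        # Assumption: with less than 80GB*8 in total, stick to the Lite variants for BF16 inference.
        print("Less than 80GB*8 total; consider a Lite variant instead of the full models in BF16.")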

 

Inference with Huggingface's Transformers 

You can run inference on DeepSeek-Coder-V2 with Huggingface's Transformers library. Here are a few examples:

 

Code Completion:

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Load the Lite-Base tokenizer and model in BF16 on the GPU
tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-Coder-V2-Lite-Base", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("deepseek-ai/DeepSeek-Coder-V2-Lite-Base", trust_remote_code=True, torch_dtype=torch.bfloat16).cuda()

# Prompt the base model with a comment describing the code to generate
input_text = "#write a quick sort algorithm"
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_length=128)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))

 

Code Insertion:

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-Coder-V2-Lite-Base", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("deepseek-ai/DeepSeek-Coder-V2-Lite-Base", trust_remote_code=True, torch_dtype=torch.bfloat16).cuda()

input_text = """<|fim▁begin|>def quick_sort(arr):
    if len(arr) <= 1:
        return arr
    pivot = arr[0]
    left = []
    right = []
<|fim▁hole|>
        if arr[i] < pivot:
            left.append(arr[i])
        else:
            right.append(arr[i])
    return quick_sort(left) + [pivot] + quick_sort(right)<|fim▁end|>"""

inputs = tokenizer(input_text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_length=128)

print(tokenizer.decode(outputs[0], skip_special_tokens=True)[len(input_text):])

 

Chat Completion:

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct", trust_remote_code=True, torch_dtype=torch.bfloat16).cuda()

messages=[
    {'role': 'user', 'content': "write a quick sort algorithm in python."}
]

inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
outputs = model.generate(inputs, max_new_tokens=512, do_sample=False, top_k=50, top_p=0.95, num_return_sequences=1, eos_token_id=tokenizer.eos_token_id)

print(tokenizer.decode(outputs[0][len(inputs[0]):], skip_special_tokens=True))

 

The full chat template is located in tokenizer_config.json in the Huggingface model repository.
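
If you want to inspect or reuse that template directly, the tokenizer exposes it after loading. The sketch below is a minimal example, assuming a recent Transformers release that exposes tokenizer.chat_template and supports tokenize=False:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct", trust_remote_code=True)

# The raw Jinja chat template stored in tokenizer_config.json
print(tokenizer.chat_template)

# Render a sample conversation to plain text without tokenizing it
messages = [{'role': 'user', 'content': "write a quick sort algorithm in python."}]
print(tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False))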

 

Inference with SGLang (Recommended)

For greater efficiency, SGLang supports MLA optimizations, FP8 (W8A8), FP8 KV Cache, and Torch Compile. The following commands launch an OpenAI API-compatible server:

 

Run Server with BF16 and Tensor Parallelism = 8

python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-Coder-V2-Instruct --tp 8 --trust-remote-code

 

Enable Torch Compilation (Takes a Few Minutes)

python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct --trust-remote-code --enable-torch-compile

 

Run Server with FP8, Tensor Parallelism = 8, and FP8 KV Cache

python3 -m sglang.launch_server --model neuralmagic/DeepSeek-Coder-V2-Instruct-FP8 --tp 8 --trust-remote-code --kv-cache-dtype fp8_e5m2

 

Query the API

import openai

client = openai.Client(
    base_url="http://127.0.0.1:30000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="default",
    messages=[
        {"role": "system", "content": "You are a helpful AI assistant"},
        {"role": "user", "content": "List 3 countries and their capitals."},
    ],
    temperature=0,
    max_tokens=64,
)

print(response)
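
Because the server follows the OpenAI chat completions protocol, you can also stream tokens as they are generated. This is a minimal sketch against the same local endpoint, assuming the server accepts stream=True (as OpenAI-compatible servers generally do):

import openai

client = openai.Client(
    base_url="http://127.0.0.1:30000/v1", api_key="EMPTY")

# Request a streamed response and print tokens as they arrive
stream = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "write a quick sort algorithm in python."}],
    temperature=0,
    max_tokens=256,
    stream=True,
)

for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)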

 

Inference with vLLM (Recommended)

To use vLLM for model inference, merge this pull request into your vLLM codebase.

 

Example Code

from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

max_model_len, tp_size = 8192, 1
model_name = "deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_name)
llm = LLM(model=model_name, tensor_parallel_size=tp_size, max_model_len=max_model_len, trust_remote_code=True, enforce_eager=True)

sampling_params = SamplingParams(temperature=0.3, max_tokens=256, stop_token_ids=[tokenizer.eos_token_id])

messages_list = [
    [{"role": "user", "content": "Who are you?"}],
    [{"role": "user", "content": "write a quick sort algorithm in python."}],
    [{"role": "user", "content": "Write a piece of quicksort code in C++."}],
]

prompt_token_ids = [tokenizer.apply_chat_template(messages, add_generation_prompt=True) for messages in messages_list]
outputs = llm.generate(prompt_token_ids=prompt_token_ids, sampling_params=sampling_params)

generated_text = [output.outputs[0].text for output in outputs]
print(generated_text)

 

Conclusion

DeepSeek-Coder-V2 is a strong open-source alternative to commercial AI coding assistants. Its variants cover code generation, code completion, intelligent code assistance, and chat-based interactions. Whether you run it through Huggingface Transformers, SGLang, or vLLM, DeepSeek-Coder-V2 gives developers access to high-quality code intelligence, breaking down the barrier that closed-source models have held in AI-driven software development.
