
June 23, 2025
Introducing HELMET: Holistically Evaluating Long-context Language Models
Have you ever wondered whether your language model of choice really understands lengthy documents, or just pretends to? I used to believe that longer was simply better: more information and a larger context window should mean a better model, right? But that's not always true. Models like GPT-4o and Claude-3 can handle hundreds of thousands of tokens, and the long-context language model (LCLM) field has exploded as a result. Yet accepting an extended input is one thing; actually performing well on it is another.
That's why I'm discussing HELMET today.
Why Bigger Isn't Always Better in Language Models
Let's talk about how these models are actually used. When you're summarizing legal documents or asking a model to draw conclusions from medical studies, you need reliability. Traditional benchmarks? They haven't kept up.
I often see SCROLLS or "needle-in-a-haystack" tests cited as evidence. They may look good on paper, but they tell us little about real use: perplexity scores and synthetic tasks don't reflect how a model handles complex work like citation-grounded generation or multi-document reasoning.
We needed something more. Something realistic.
HELMET: A Real Solution to a Real Problem
When I first looked at HELMET (How to Evaluate Long-context Models Effectively and Thoroughly), it felt different. It doesn't just evaluate models; it pushes them.
HELMET doesn't lean on synthetic proxies or fluff. Instead, it tests models on realistic tasks, including summarization, retrieval-augmented generation (RAG), generation with citations, passage re-ranking, and more.
And it's not just diverse; it's controllable. You can adjust the context length, testing a model at 8K tokens or pushing it past 128K. The best part? It supports both base and instruction-tuned models.
Running HELMET from Scratch
If you're like me, you'll want to evaluate your model locally and see how it compares. Good news: HELMET makes this straightforward.
First, get the code:
git clone https://github.com/princeton-nlp/HELMET
cd HELMET
pip install -r requirements.txt
To make this concrete, let's use the RAG task. Here's an example configuration:
input_max_length: 131072                 # maximum input context, here 128K tokens
datasets: kilt_nq                        # Natural Questions from the KILT benchmark
generation_max_length: 20                # answers are short, so cap generation at 20 tokens
test_files: data/kilt/nq-dev.jsonl
demo_files: data/kilt/nq-train.jsonl     # where the in-context demonstrations come from
use_chat_template: true                  # apply the model's chat template (instruction-tuned model)
max_test_samples: 100                    # evaluate on 100 test examples
shots: 2                                 # two in-context demonstrations
stop_new_line: true                      # stop generation at a newline
model_name_or_path: meta-llama/Llama-3.1-8B-Instruct
use_tgi_serving: false                   # run the model locally rather than through a TGI endpoint
To kick off the evaluation, just run:
python eval.py --config configs/rag.yaml --model_name_or_path meta-llama/Llama-3.1-8B-Instruct
You don't need a huge cluster to get started. And if you want faster inference, HELMET also supports serving backends such as vLLM and Hugging Face Text Generation Inference (TGI).
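One thing I found handy: because the context length is just a config field, you can sweep it to see how a model holds up as its input grows, something that becomes relevant in the findings below. The script that follows is a minimal sketch of that idea, not an official HELMET utility: it assumes the configs/rag.yaml shown above, the sweep values and generated file names are my own, and it needs PyYAML installed.

import subprocess
from pathlib import Path

import yaml  # PyYAML, used to read and rewrite the config file

BASE_CONFIG = Path("configs/rag.yaml")
CONTEXT_LENGTHS = [8192, 32768, 65536, 131072]  # 8K up to 128K tokens

base = yaml.safe_load(BASE_CONFIG.read_text())

for length in CONTEXT_LENGTHS:
    # Clone the base config with a different maximum input length.
    cfg = dict(base, input_max_length=length)
    cfg_path = Path(f"configs/rag_{length}.yaml")
    cfg_path.write_text(yaml.safe_dump(cfg))

    # Same invocation as the single run above, pointed at the generated config.
    subprocess.run(
        ["python", "eval.py",
         "--config", str(cfg_path),
         "--model_name_or_path", "meta-llama/Llama-3.1-8B-Instruct"],
        check=True,
    )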
What We Learned from Running HELMET
Now for the good part: results. After running HELMET on 59 models, the researchers observed something surprising. On some tasks, smaller open-source models held their own, but open-source models lagged on harder tasks like generation with citations.
What struck me most was how differently models behaved as the input length grew. As the context increased, even the strongest models degraded, notably on re-ranking. That matters: we can't assume that a model with a million-token context window will always give better answers.
And there's no single winner. Some models excel at summarization, others at in-context learning (ICL) or retrieval. HELMET showed that performance on these tasks does not necessarily correlate, which is exactly why evaluating across multiple axes matters (the sketch below shows one way to check that on your own runs).
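If you want to test that claim yourself, a quick rank correlation between two task categories is enough. The helper below is a hypothetical sketch: the function name is mine, the score lists are ones you would fill in from your own results (one entry per model, in the same model order), and it assumes SciPy is available.

from scipy.stats import spearmanr

def task_agreement(scores_task_a, scores_task_b):
    # Spearman rank correlation between two per-model score lists.
    # Values near 1.0 mean one task predicts the other; values near 0
    # support the point above that tasks have to be evaluated separately.
    rho, p_value = spearmanr(scores_task_a, scores_task_b)
    return rho, p_value

# Example usage once you have collected scores, e.g.:
# task_agreement(rag_scores, rerank_scores)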
So, How Do You Fit HELMET into Your Workflow?
If you're building or fine-tuning an LCLM, start with the Recall and RAG tasks. They're quick to run and give a reliable signal on performance during development.
Better still, you can compare your model against the 59 models that have already been benchmarked. Check HELMET's leaderboard instead of reinventing the wheel.
And it scales: HELMET supports multiple serving setups, from local testing to deployment on Intel Gaudi accelerators.
What's Next? HELMET + LongProc = Next-Level Evaluation
Think HELMET is impressive? I'd recommend pairing it with LongProc for an even fuller picture.
HELMET evaluates models on long-context inputs, whereas LongProc evaluates long-form procedural generation: think multi-step tasks that require the model to produce 8,000 tokens or more of output.
The aim? Combine HELMET and LongProc into a comprehensive LCLM evaluation suite that covers both long inputs and long outputs.
Conclusion
HELMET is a game-changer for evaluating language models that need to read, interpret, and reason over lengthy texts. It's open source, it's proven, and teams at places like Microsoft and AI21 already use it. Try HELMET, and you'll never look at long-context evaluation the same way.