
Small AI Model Outperforms GPT-5.2 at Detecting CEO Evasion

Eva-4B is a 4-billion-parameter language model that detects when corporate executives evade questions during earnings calls, outperforming much larger general-purpose models while running on consumer hardware.


What It Is

Eva-4B is a specialized language model designed to detect when corporate executives dodge questions during earnings calls. Built on the Qwen3 architecture with just 4 billion parameters, it analyzes Q&A exchanges and classifies responses into three categories: direct answers, intermediate evasion, and fully evasive responses. The classification follows the Rasiah framework, an academic standard for measuring answer quality in corporate communications.
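
The three-way scheme can be represented as a simple mapping. The label strings below are illustrative only, not the model's actual output vocabulary:

```python
# Illustrative three-way evasion labels following the Rasiah-style scheme
# described above; the exact label strings Eva-4B emits may differ.
EVASION_LABELS = {
    0: "direct answer",
    1: "intermediate evasion",
    2: "fully evasive",
}

def describe(label_id: int) -> str:
    """Map a numeric class id to its human-readable category."""
    return EVASION_LABELS.get(label_id, "unknown")

print(describe(2))  # fully evasive
```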

The model was trained on 30,000 labeled samples where both Claude Opus and Gemini agreed on the evasion classification, creating a high-quality dataset focused specifically on financial communication patterns. Despite its compact size, Eva-4B achieves 81.5% accuracy on this task - outperforming GPT-5.2’s 80.5% while requiring a fraction of the computational resources.

This represents a practical example of domain-specific fine-tuning beating general-purpose models at specialized tasks. Rather than relying on massive parameter counts and broad training, Eva-4B demonstrates how targeted training data and task-specific optimization can produce superior results for narrow applications.

Why It Matters

Financial analysts, journalists, and investors spend countless hours reviewing earnings call transcripts to extract meaningful information. When executives deflect questions about declining margins, regulatory issues, or strategic missteps, it often signals problems worth investigating further. Automating this detection process could save research teams significant time while surfacing red flags that might otherwise go unnoticed.

The model’s efficiency advantage matters beyond just cost savings. Running a 4B parameter model locally means processing entire quarters of earnings transcripts without API rate limits, cloud costs, or data privacy concerns. Hedge funds and research firms can analyze thousands of calls to identify patterns in evasive behavior across industries or time periods.
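
A local batch pipeline over a quarter's worth of transcripts might look like the sketch below. `classify_exchange` stands in for a call to Eva-4B; the stub classifier here is a hypothetical placeholder, not the model itself:

```python
from typing import Callable, Iterable

def scan_transcripts(
    exchanges: Iterable[tuple],
    classify_exchange: Callable[[str, str], str],
) -> list:
    """Run an evasion classifier over (question, answer) pairs and
    return only the exchanges not classified as direct answers."""
    flagged = []
    for question, answer in exchanges:
        label = classify_exchange(question, answer)
        if label != "direct":
            flagged.append((question, answer, label))
    return flagged

# Stub classifier for illustration; in practice this would wrap Eva-4B.
def stub_classifier(question: str, answer: str) -> str:
    return "evasive" if "long-term" in answer else "direct"

calls = [
    ("What caused the margin decline?", "We focus on long-term value."),
    ("What was Q3 revenue?", "Revenue was $2.1B, down 15%."),
]
print(scan_transcripts(calls, stub_classifier))
```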

More broadly, Eva-4B illustrates an important trend in AI development: specialized models trained for specific domains often outperform general-purpose giants. This challenges the assumption that bigger always means better and suggests that organizations with narrow use cases might achieve better results by fine-tuning smaller models rather than relying on expensive API calls to frontier models.

Getting Started

The easiest way to test Eva-4B is through the hosted demo at https://huggingface.co/spaces/FutureMa/financial-evasion-detection. Users can paste question-answer pairs from earnings transcripts and receive immediate classification results.

For developers wanting to integrate the model into analysis pipelines, the model is available at https://huggingface.co/FutureMa/Eva-4B. Here’s a basic implementation:


from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the model and tokenizer from the Hugging Face Hub
model = AutoModelForCausalLM.from_pretrained("FutureMa/Eva-4B")
tokenizer = AutoTokenizer.from_pretrained("FutureMa/Eva-4B")

question = "What caused the 15% revenue decline this quarter?"
answer = "We remain focused on long-term value creation and strategic initiatives."

# Format the exchange as a classification prompt and generate a label
prompt = f"Question: {question}\nAnswer: {answer}\nClassification:"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
result = tokenizer.decode(outputs[0], skip_special_tokens=True)
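
Because the model returns free text, downstream code needs to normalize the generation into one of the three categories. A minimal sketch, assuming the output ends with one of the category names (the keyword list is an assumption; check the exact output format against the model card):

```python
# Normalize free-text model output into one of the three Rasiah-style
# categories; the keyword list here is an assumption, not the model's
# documented vocabulary.
CATEGORIES = ["direct", "intermediate", "evasive"]

def parse_classification(generated: str) -> str:
    """Return the first category keyword found after 'Classification:'."""
    tail = generated.rsplit("Classification:", 1)[-1].lower()
    for category in CATEGORIES:
        if category in tail:
            return category
    return "unknown"

print(parse_classification("...\nClassification: fully evasive"))  # evasive
```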

The model runs comfortably on consumer GPUs with 8GB VRAM, making it accessible for individual researchers and small teams without enterprise infrastructure.

Context

Traditional approaches to detecting evasive language relied on linguistic features like word count, hedge words, or sentiment analysis. These rule-based systems struggled with the nuanced ways executives can technically answer questions while revealing nothing substantive. Large language models improved detection but introduced new problems: high inference costs, API dependencies, and inconsistent performance on financial jargon.
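
For contrast, a rule-based detector of the kind described above takes only a few lines. The hedge-word list below is illustrative, and its brittleness is exactly why such systems struggled:

```python
# A toy rule-based evasion scorer of the pre-LLM variety: count hedge
# words against a fixed list. The word list is illustrative only.
HEDGE_WORDS = {"focused", "strategic", "long-term", "committed", "initiatives"}

def hedge_score(answer: str) -> int:
    """Count hedge-word occurrences in a lowercased, punctuation-stripped answer."""
    tokens = answer.lower().replace(",", " ").replace(".", " ").split()
    return sum(1 for token in tokens if token in HEDGE_WORDS)

answer = "We remain focused on long-term value creation and strategic initiatives."
print(hedge_score(answer))  # 4
```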

Eva-4B’s training methodology - using agreement between Claude Opus and Gemini as ground truth - represents an interesting approach to dataset creation. This “model consensus” labeling reduces individual model biases while scaling annotation beyond what human labelers could reasonably accomplish.
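
The consensus-labeling step can be sketched as a simple filter: keep only samples where both annotator models agree. The data structures here are assumptions about the pipeline, not the authors' published code:

```python
# Keep only samples where two annotator models agree on the label,
# mimicking the Claude Opus / Gemini consensus filter described above.
def consensus_filter(samples):
    """samples: iterable of (text, label_a, label_b) triples."""
    return [(text, label_a) for text, label_a, label_b in samples
            if label_a == label_b]

raw = [
    ("We remain focused on long-term value.", "evasive", "evasive"),
    ("Revenue fell 15% due to churn.", "direct", "intermediate"),
]
print(consensus_filter(raw))  # [('We remain focused on long-term value.', 'evasive')]
```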

Limitations remain. The model was trained specifically on earnings calls, so performance on other corporate communications like press releases or investor presentations may vary. The three-category classification also simplifies what is often a spectrum of evasiveness. Some answers might be partially direct while still omitting crucial details.

Alternative approaches include using GPT-4 or Claude with carefully crafted prompts, though this incurs higher costs per analysis. Open-source options like fine-tuned Llama models could offer similar specialization, but Eva-4B’s public availability and documented performance make it a practical starting point for teams exploring automated financial analysis.