Small AI Model Outperforms GPT-5.2 at Detecting CEO Evasion
What It Is
Eva-4B is a specialized language model designed to detect when corporate executives dodge questions during earnings calls. Built on the Qwen3 architecture with just 4 billion parameters, it analyzes Q&A exchanges and classifies responses into three categories: direct answers, intermediate evasion, and fully evasive responses. The classification follows the Rasiah framework, an academic standard for measuring answer quality in corporate communications.
The model was trained on 30,000 labeled samples where both Claude Opus and Gemini agreed on the evasion classification, creating a high-quality dataset focused specifically on financial communication patterns. Despite its compact size, Eva-4B achieves 81.5% accuracy on this task - outperforming GPT-5.2’s 80.5% while requiring a fraction of the computational resources.
This represents a practical example of domain-specific fine-tuning beating general-purpose models at specialized tasks. Rather than relying on massive parameter counts and broad training, Eva-4B demonstrates how targeted training data and task-specific optimization can produce superior results for narrow applications.
Why It Matters
Financial analysts, journalists, and investors spend countless hours reviewing earnings call transcripts to extract meaningful information. When executives deflect questions about declining margins, regulatory issues, or strategic missteps, it often signals problems worth investigating further. Automating this detection process could save research teams significant time while surfacing red flags that might otherwise go unnoticed.
The model’s efficiency advantage matters beyond just cost savings. Running a 4B parameter model locally means processing entire quarters of earnings transcripts without API rate limits, cloud costs, or data privacy concerns. Hedge funds and research firms can analyze thousands of calls to identify patterns in evasive behavior across industries or time periods.
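In practice, scoring a quarter's worth of transcripts is just a loop over Q&A pairs. Here is a minimal sketch of that pipeline, where classify() is a stand-in for a call into Eva-4B (stubbed with a trivial heuristic so the sketch runs) and the transcript structure is an illustrative assumption:

```python
# Sketch: batch-scoring Q&A pairs from earnings transcripts.
# `classify` is a placeholder for a real Eva-4B call; the
# transcript structure and label names are assumptions.

def classify(question: str, answer: str) -> str:
    # Stub heuristic purely so the sketch runs end to end;
    # a real implementation would prompt the model.
    return "evasive" if "long-term" in answer else "direct"

transcripts = {
    "ACME Q3": [
        ("What caused the 15% revenue decline?",
         "We remain focused on long-term value creation."),
        ("How many units shipped?",
         "We shipped 1.2 million units."),
    ],
}

# Collect non-direct answers as red flags for analyst review
flags = []
for call, qa_pairs in transcripts.items():
    for question, answer in qa_pairs:
        label = classify(question, answer)
        if label != "direct":
            flags.append((call, question, label))

print(flags)
```

Because everything runs locally, the same loop scales to thousands of calls with no per-request cost.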
More broadly, Eva-4B illustrates an important trend in AI development: specialized models trained for specific domains often outperform general-purpose giants. This challenges the assumption that bigger always means better and suggests that organizations with narrow use cases might achieve better results by fine-tuning smaller models rather than relying on expensive API calls to frontier models.
Getting Started
The easiest way to test Eva-4B is through the hosted demo at https://huggingface.co/spaces/FutureMa/financial-evasion-detection. Users can paste question-answer pairs from earnings transcripts and receive immediate classification results.
For developers wanting to integrate the model into analysis pipelines, the model is available at https://huggingface.co/FutureMa/Eva-4B. Here’s a basic implementation:
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the model and tokenizer from the Hugging Face Hub
model = AutoModelForCausalLM.from_pretrained("FutureMa/Eva-4B")
tokenizer = AutoTokenizer.from_pretrained("FutureMa/Eva-4B")

# A Q&A exchange pulled from an earnings call transcript
question = "What caused the 15% revenue decline this quarter?"
answer = "We remain focused on long-term value creation and strategic initiatives."

# Prompt the model to classify the answer
prompt = f"Question: {question}\nAnswer: {answer}\nClassification:"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
result = tokenizer.decode(outputs[0], skip_special_tokens=True)
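The raw generation echoes the prompt, so the label still has to be extracted from the continuation. A minimal parsing sketch, assuming the model emits one of the three category names after "Classification:" (the exact output strings may differ; verify against the model card):

```python
# Sketch: pull a category label out of generated text.
# The label strings are assumptions based on the three
# categories described above; check the model card.

LABELS = ("direct", "intermediate", "fully evasive")

def extract_label(generated: str) -> str:
    # Take the text after the last "Classification:" marker
    # and match it against the known categories.
    tail = generated.rsplit("Classification:", 1)[-1].strip().lower()
    for label in LABELS:
        if tail.startswith(label):
            return label
    return "unknown"

sample = "Question: ...\nAnswer: ...\nClassification: fully evasive"
print(extract_label(sample))  # fully evasive
```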
The model runs comfortably on consumer GPUs with 8GB VRAM, making it accessible for individual researchers and small teams without enterprise infrastructure.
Context
Traditional approaches to detecting evasive language relied on linguistic features like word count, hedge words, or sentiment analysis. These rule-based systems struggled with the nuanced ways executives can technically answer questions while revealing nothing substantive. Large language models improved detection but introduced new problems: high inference costs, API dependencies, and inconsistent performance on financial jargon.
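For contrast, a rule-based baseline of the kind described above fits in a few lines: count hedge words against answer length. The word list and scoring here are illustrative, not from any published system, and show exactly why such heuristics miss substantive non-answers phrased without hedges:

```python
# Sketch of a traditional hedge-word heuristic for evasion.
# Word list and scoring are illustrative assumptions.

HEDGES = {"focused", "strategic", "long-term", "committed",
          "initiatives", "believe", "confident", "positioned"}

def hedge_score(answer: str) -> float:
    # Fraction of words that are hedges; higher = more evasive
    words = [w.strip(".,").lower() for w in answer.split()]
    if not words:
        return 0.0
    return sum(w in HEDGES for w in words) / len(words)

answer = "We remain focused on long-term value creation and strategic initiatives."
print(round(hedge_score(answer), 2))
```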
Eva-4B’s training methodology - using agreement between Claude Opus and Gemini as ground truth - represents an interesting approach to dataset creation. This “model consensus” labeling reduces individual model biases while scaling annotation beyond what human labelers could reasonably accomplish.
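The consensus-filtering step itself is straightforward. A minimal sketch, assuming each candidate sample carries one label from each annotating model (the field names and label values are hypothetical):

```python
# Sketch: keep only samples where both annotating models agree.
# Record fields and label values are illustrative assumptions.

raw = [
    {"qa": "pair-1", "opus": "evasive", "gemini": "evasive"},
    {"qa": "pair-2", "opus": "direct", "gemini": "intermediate"},
    {"qa": "pair-3", "opus": "direct", "gemini": "direct"},
]

# Disagreements are dropped, trading dataset size for label quality
dataset = [
    {"qa": r["qa"], "label": r["opus"]}
    for r in raw
    if r["opus"] == r["gemini"]
]

print(len(dataset))  # 2 of 3 candidate samples survive
```

The design trade-off is explicit: discarding disagreements shrinks the pool of candidates, but the surviving labels carry two models' worth of agreement.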
Limitations remain. The model was trained specifically on earnings calls, so performance on other corporate communications like press releases or investor presentations may vary. The three-category classification also simplifies what is often a spectrum of evasiveness. Some answers might be partially direct while still omitting crucial details.
Alternative approaches include using GPT-4 or Claude with carefully crafted prompts, though this incurs higher costs per analysis. Open-source options like fine-tuned Llama models could offer similar specialization, but Eva-4B’s public availability and documented performance make it a practical starting point for teams exploring automated financial analysis.