claude by Promptsicle Team

Claude Excels at Fake Cases, Fails at Date Checks

Claude demonstrates strong performance detecting fabricated legal cases but struggles with basic date validation tasks, revealing inconsistent reasoning.

Claude’s Paradox: Spotting Fake Cases, Missing Dates

In a recent benchmark test, Claude 3.5 Sonnet achieved 96% accuracy detecting fabricated legal citations while simultaneously failing to identify basic date inconsistencies in the same documents. This striking performance gap reveals a fundamental quirk in how large language models process different types of information.

The Hallucination Detection Paradox

Anthropic’s Claude models demonstrate exceptional ability to verify legal citations, cross-reference case law, and identify non-existent court decisions. When presented with a brief containing “Johnson v. Smith, 847 F.3d 492 (9th Cir. 2019)” alongside legitimate cases, Claude flags the fabrication within seconds. The model cross-references citation formats, court jurisdictions, and reporter volumes with remarkable precision.

Yet this same model often overlooks temporal contradictions that human readers catch immediately. A document stating “the contract signed on March 15, 2023 expired after its two-year term on January 10, 2024” passes without comment. Claude processes the individual dates as valid but fails to notice that a two-year term beginning on March 15, 2023 would run until March 2025, not January 2024.

This paradox extends beyond dates. Claude excels at detecting:

  • Fabricated academic papers and DOIs
  • Non-existent product model numbers
  • Fake GitHub repositories and URLs
  • Invented technical specifications

But struggles with:

  • Timeline inconsistencies across paragraphs
  • Contradictory numerical data in tables
  • Impossible geographic claims (“driving from Boston to Miami in 4 hours”)
  • Basic arithmetic errors in financial projections
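A claim like the Boston-to-Miami drive can be checked with one division. The sketch below is only illustrative: the function names are hypothetical, the 1,500-mile road distance is a rough assumption rather than a measured figure, and the 80 mph ceiling on average speed is an arbitrary threshold chosen for the example.

```python
def implied_speed_mph(distance_miles: float, hours: float) -> float:
    """Average speed a trip would require, in miles per hour."""
    if hours <= 0:
        raise ValueError("hours must be positive")
    return distance_miles / hours


def is_plausible_drive(distance_miles: float, hours: float,
                       max_avg_mph: float = 80.0) -> bool:
    """True if the trip is possible at or below the assumed average-speed ceiling."""
    return implied_speed_mph(distance_miles, hours) <= max_avg_mph


# Assumption: roughly 1,500 road miles between Boston and Miami.
BOSTON_TO_MIAMI_MILES = 1500

print(implied_speed_mph(BOSTON_TO_MIAMI_MILES, 4))   # 375.0
print(is_plausible_drive(BOSTON_TO_MIAMI_MILES, 4))  # False
```

The arithmetic is trivial once the numbers are pulled out of the text, which is the point: the hard part for a model is noticing that the check is needed at all.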

Why Pattern Recognition Diverges

The explanation lies in how transformer models encode different information types. Legal citations follow rigid structural patterns—specific reporter formats, predictable volume numbering, established court hierarchies. Claude’s training data contains millions of properly formatted citations, creating strong statistical signals for what constitutes a valid reference.
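To make "rigid structural patterns" concrete, here is a minimal sketch of how much of a citation can be captured by a single regular expression. The pattern is a simplified assumption covering only Federal Reporter citations with a circuit court in parentheses; real citation grammars are far broader, and `parse_citation` is a hypothetical helper, not part of any library.

```python
import re

# Simplified, assumed pattern: "<volume> F.3d <page> (<court> <year>)".
CITATION_RE = re.compile(
    r"(?P<volume>\d+)\s+(?P<reporter>F\.(?:2d|3d|4th)?)\s+(?P<page>\d+)\s+"
    r"\((?P<court>\d{1,2}(?:st|nd|rd|th)\s+Cir\.|D\.C\.\s+Cir\.|Fed\.\s+Cir\.)"
    r"\s+(?P<year>\d{4})\)"
)


def parse_citation(text):
    """Return the citation's fields as a dict, or None if the format does not match."""
    match = CITATION_RE.search(text)
    return match.groupdict() if match else None


fields = parse_citation("Johnson v. Smith, 847 F.3d 492 (9th Cir. 2019)")
print(fields)
```

A well-formed citation is not necessarily a real one. The example string parses cleanly even though the article treats it as fabricated, so confirming that a case exists still requires a lookup against an actual case-law database.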

Dates and temporal logic require a different kind of processing. While Claude recognizes date formats reliably, calculating intervals and detecting contradictions demands multi-step reasoning across the context window. The model must:

  1. Extract both dates from potentially distant text locations
  2. Perform calendar arithmetic
  3. Compare results against stated durations
  4. Flag discrepancies
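The four steps above can be sketched as ordinary code, which shows how mechanical each one is when done deterministically. This is a rough illustration using only the standard library: it assumes English month names, dates written like "March 15, 2023", and a term phrased as "two-year" or "2-year". The helper names are hypothetical.

```python
import calendar
import re
from datetime import date, datetime

MONTHS = "|".join(calendar.month_name[1:])
DATE_RE = re.compile(rf"(?:{MONTHS}) \d{{1,2}}, \d{{4}}")
WORD_NUMBERS = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5}
TERM_RE = re.compile(r"\b(\d+|one|two|three|four|five)-year\b")


def add_years(d: date, years: int) -> date:
    """Shift a date by whole years, mapping Feb 29 to Feb 28 when needed."""
    try:
        return d.replace(year=d.year + years)
    except ValueError:
        return d.replace(year=d.year + years, day=28)


def check_term(sentence: str):
    """Compare a stated end date against start date plus the stated term."""
    # Step 1: extract both dates and the stated duration.
    dates = [datetime.strptime(m, "%B %d, %Y").date()
             for m in DATE_RE.findall(sentence)]
    term = TERM_RE.search(sentence)
    if len(dates) < 2 or term is None:
        return None  # not enough information to check
    start, stated_end = dates[0], dates[1]
    raw = term.group(1)
    years = int(raw) if raw.isdigit() else WORD_NUMBERS[raw]
    # Step 2: perform the calendar arithmetic.
    expected_end = add_years(start, years)
    # Steps 3 and 4: compare against the stated date and flag a mismatch.
    return {
        "expected_end": expected_end,
        "stated_end": stated_end,
        "consistent": expected_end == stated_end,
    }


result = check_term(
    "the contract signed on March 15, 2023 expired after its "
    "two-year term on January 10, 2024"
)
print(result)
```

For the contract sentence quoted earlier, the expected end date comes out as March 15, 2025, so the `consistent` flag is `False`.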

Each step introduces potential failure points. Research from Stanford’s AI Lab demonstrates that LLMs maintain 89% accuracy on single-step factual verification but drop to 34% on three-step logical chains involving the same facts.

# Example prompt exposing the paradox
prompt = """
Review this passage for errors:

The Supreme Court decided Brown v. Board of Education in 1954.
The case Miller v. California, 413 U.S. 15 (1973) established 
the three-part obscenity test. The plaintiff filed on June 1, 2023
and the case was resolved after a standard 18-month discovery 
period on August 15, 2024.
"""
# Claude flags the date math error inconsistently
# but reliably verifies both case citations

Implications for AI-Assisted Work

This selective blindness matters significantly for professionals relying on AI fact-checking. Legal researchers using Claude to verify briefs receive excellent citation validation but cannot trust timeline verification without manual review. The same applies to financial analysts, journalists, and researchers across domains.

Several AI companies now acknowledge these limitations. OpenAI’s documentation for GPT-4 explicitly warns against using the model for mathematical verification without external tools. Anthropic’s own testing reveals Claude performs best when tasks align with pattern-matching rather than multi-step reasoning.

The legal technology sector has responded by developing hybrid systems. Tools like Casetext’s CoCounsel and Thomson Reuters’ Westlaw Precision combine LLM citation checking with rule-based date validation engines. These systems route each verification task to the processing method that suits it: statistical pattern matching for citations, deterministic algorithms for temporal logic.
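The routing idea can be illustrated with a small dispatcher. This is a hypothetical sketch of the general pattern, not a description of how any commercial product works internally; the `Claim` type, the handler functions, and the `kind` labels are all invented for the example, and the handlers are placeholders that return a label rather than calling a model or a rules engine.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Claim:
    kind: str  # assumed labels: "citation" or "date_interval"
    text: str


def check_citation_with_llm(claim: Claim) -> str:
    # Placeholder: a real system would send the citation to an LLM here.
    return f"LLM citation check: {claim.text}"


def check_dates_with_rules(claim: Claim) -> str:
    # Placeholder: a real system would run deterministic date arithmetic here.
    return f"rule-based date check: {claim.text}"


ROUTES: Dict[str, Callable[[Claim], str]] = {
    "citation": check_citation_with_llm,
    "date_interval": check_dates_with_rules,
}


def route(claim: Claim) -> str:
    """Send a claim to the matching checker, or to manual review if none fits."""
    handler = ROUTES.get(claim.kind)
    if handler is None:
        return f"manual review: {claim.text}"
    return handler(claim)


print(route(Claim("citation", "Miller v. California, 413 U.S. 15 (1973)")))
print(route(Claim("date_interval", "June 1, 2023 to August 15, 2024")))
print(route(Claim("geography", "Boston to Miami in 4 hours")))
```

Keeping the routing table explicit means an unrecognized claim type falls through to manual review instead of being silently handed to whichever checker happens to be the default.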

Building Better Verification Workflows

Teams working with Claude and similar models should implement layered verification:

For high-reliability tasks, use Claude to flag potential issues, then apply specialized validation. Python’s dateutil library handles temporal logic that trips up LLMs:

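from datetime import datetime

from dateutil.relativedelta import relativedelta

filing_date = datetime(2023, 6, 1)
resolution_date = datetime(2024, 8, 15)
claimed_months = 18  # the "18-month" period stated in the passage

# relativedelta splits the gap into years, months, and days
delta = relativedelta(resolution_date, filing_date)
actual_months = delta.years * 12 + delta.months

if actual_months != claimed_months:
    # Here: 14 months and 14 days elapsed, not the claimed 18 months
    print(f"Discrepancy: claimed {claimed_months} months, "
          f"actual {actual_months} months, {delta.days} days")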

For citation verification, Claude’s pattern recognition excels. Cross-reference the citations it flags against a source such as https://case.law/ or https://scholar.google.com/.

Understanding where AI models shine versus stumble transforms them from unreliable assistants into powerful specialized tools. Claude’s paradox isn’t a flaw—it’s a feature map showing exactly which tasks suit statistical learning and which require traditional computation.