Your “Best” Estimator Might Be Lying: The Hidden Freedom That Can Flip Economic Results

An econometrics control panel showing how weighting and moment choices can change an estimated result under misspecification.

Economists often rely on statistical models that are over-identified—meaning the model implies more testable conditions than the number of parameters being estimated. In textbooks, this comes with a neat promise: if the model is right, use the most efficient method; if it’s wrong, diagnostic tests (like the famous Hansen J test) should raise alarms.

But Isaiah Andrews, Jiafeng Chen, and Otávio Tecchio argue that real-world practice looks very different—and for good reason: models are almost always at least a little wrong. And once you accept that, a lot of standard “best practice” logic collapses.

The core message: when the model is wrong, estimators stop chasing the same target

In a correctly specified world, different estimation methods mainly differ in efficiency (how much statistical noise you get). But under misspecification (the normal case), something deeper happens:

Different estimators can imply different “targets” (estimands).

So it’s not just “Estimator A is noisier than Estimator B.” It can be:

Estimator A and Estimator B are answering different questions, even if the researcher thinks they’re estimating the same parameter.

This matters a lot for moment-based methods like GMM, where researchers choose:

  • which moments to include,

  • how to weight them,

  • whether to use “hand-picked” weights or identity weights,

  • whether to use “optimal” weights.

Under misspecification, these choices don’t just affect precision—they can change what you’re estimating.
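To make this concrete, here is a minimal illustrative sketch (my own example, not from the paper): two moment conditions that both claim to identify the same parameter mu, but whose population means actually differ. Under misspecification, the GMM estimate is no longer a noisy version of one true value—the weight matrix decides which blend of the two conflicting moments you report.

```python
import numpy as np

rng = np.random.default_rng(0)

# Misspecified "model": both moments claim to pin down the same mu,
# but their population means differ (1.0 vs 1.5) -- the model is wrong.
x1 = rng.normal(1.0, 1.0, 5_000)
x2 = rng.normal(1.5, 1.0, 5_000)

def gmm_mu(W):
    """Closed-form GMM for scalar mu with moment vector
    g(mu) = (mean(x1) - mu, mean(x2) - mu) and weight matrix W:
    minimizes g(mu)' W g(mu)."""
    m = np.array([x1.mean(), x2.mean()])
    ones = np.ones(2)
    # First-order condition: ones' W (m - mu * ones) = 0
    return (ones @ W @ m) / (ones @ W @ ones)

identity = np.eye(2)
favor_first = np.diag([10.0, 1.0])  # trust moment 1 ten times more

mu_id = gmm_mu(identity)      # splits the difference between the two moments
mu_f1 = gmm_mu(favor_first)   # pulled toward moment 1's population mean
```

With a correct model, both weightings would converge to the same limit; here they converge to genuinely different numbers, so the weighting choice is choosing the estimand, not just the precision.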

What economists actually do (and why it’s not crazy)

The authors review papers in the American Economic Review (2020–2024) using GMM-like methods and find patterns that would look “wrong” in a perfect-model world:

  • Many papers don’t report results using efficient weighting (even though that would be recommended if the model were correct).

  • Many papers use only a subset of moments (effectively ignoring some implications of the model).

  • Formal specification tests are rare—many papers neither run a J-test nor report the J-statistic.

  • Informal “eyeball tests” (plots comparing model vs data moments) are common.

This isn’t necessarily incompetence. It’s often a tacit admission that:

  • the model is an approximation,

  • some moments are more trustworthy than others,

  • researchers are trying to get the “least-bad” answer.

The uncomfortable tradeoff: flexibility creates “researcher degrees of freedom”

Here’s the problem: once you allow yourself to choose moments/weights because you suspect misspecification, you create room for results to vary depending on researcher judgment.

In the best case, that’s honest disagreement about what’s most credible.

In the worst case, it becomes a route to cherry-picking—what the authors describe as the potential for “weight hacking.”

A new interpretation of Hansen’s J-statistic: it measures how much your results can be “steered”

The paper’s standout contribution is a new theoretical result connecting the J-statistic to how much estimates can move when you vary weighting choices—without having to change the data.

Intuition (without math):

  • The J-statistic is usually treated as a “pass/fail” test of whether the over-identifying restrictions hold.

  • The authors show it can also be read as a fragility gauge:

    • If the J-statistic is larger, there’s a wider range of plausible estimates you can obtain by changing weights while keeping standard errors in a similar ballpark.

    • In other words, a bigger J means more room for alternative conclusions.

They go further. In a realistic regime where misspecification is about the same size as sampling noise, a sufficiently “flexible” researcher can push the t-statistic for almost any hypothesis up to a level tied to the J-statistic. In other words, when J is large, statistical significance becomes surprisingly “engineerable.”
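The “fragility gauge” reading can be illustrated numerically. The sketch below (my own construction, not the paper’s) takes two moments that both claim to identify a mean, varies how far apart their population values are, and computes both the Hansen J-statistic under efficient weighting and the range of point estimates attainable by tilting diagonal weights from one moment to the other:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5_000

def j_and_spread(gap):
    """Two moments that both claim to identify mu, with population
    means differing by `gap` (the degree of misspecification).
    Returns the Hansen J-statistic under efficient weighting and the
    spread of estimates across a family of diagonal weight matrices."""
    x = np.column_stack([rng.normal(0.0, 1.0, n),
                         rng.normal(gap, 1.0, n)])
    m = x.mean(axis=0)
    S = np.cov(x, rowvar=False)     # estimated variance of the moments
    W = np.linalg.inv(S)            # "efficient" weight matrix
    ones = np.ones(2)
    mu_eff = (ones @ W @ m) / (ones @ W @ ones)
    g = m - mu_eff
    J = n * g @ W @ g               # Hansen J-statistic
    # Estimates as diagonal weights tilt from moment 2 toward moment 1:
    ests = [w * m[0] + (1 - w) * m[1] for w in (0.01, 0.5, 0.99)]
    return J, max(ests) - min(ests)

J_small, spread_small = j_and_spread(gap=0.05)
J_big, spread_big = j_and_spread(gap=0.60)
# Larger misspecification -> larger J AND a wider "steerable" range.
```

The two quantities move together: the same gap between moments that inflates J is exactly the room a researcher has to move the estimate by reweighting, which is the intuition behind reading J as a fragility measure.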

Practical guidelines: what researchers should do differently

The authors recommend four concrete improvements for more transparent empirical work:

  1. Separate “econometric correctness” from “statistical fit.”

    A model can fit the data poorly yet still be useful for some economic object—or fit the data “okay” but be wrong about the policy parameter you care about.

  2. Be explicit about what your estimator implies under misspecification.

    If you choose non-standard moments or non-optimal weights, say what that choice is buying you (credibility? robustness? interpretability?) and what it changes.

  3. Use misspecification-robust standard errors.

    Standard GMM standard errors can be wrong when the model doesn’t hold exactly. Robust alternatives exist, and bootstrapping often works without imposing “fake correctness.”
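One simple way to get standard errors that don’t assume the model is exactly true is a nonparametric bootstrap: resample observations, re-estimate, and take the spread of the estimates. The sketch below (an illustrative example of the general idea, not the paper’s specific procedure) bootstraps an identity-weighted GMM estimate in a deliberately misspecified two-moment setup:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 2_000
# Misspecified setup: two moments that both claim to target "mu"
# but have different population means (1.0 vs 1.5).
x = np.column_stack([rng.normal(1.0, 1.0, n),
                     rng.normal(1.5, 2.0, n)])

def mu_hat(data):
    # Identity-weighted GMM here reduces to averaging the two moment means.
    return data.mean(axis=0).mean()

# Nonparametric bootstrap: resample rows with replacement, re-estimate.
# This targets the variability of the estimator around its own
# probability limit, without assuming the model holds exactly.
B = 1_000
idx = rng.integers(0, n, size=(B, n))
boot = np.array([mu_hat(x[i]) for i in idx])
se_boot = boot.std(ddof=1)
```

The resulting `se_boot` describes the sampling variability of what this estimator actually computes, which remains meaningful even though no single “true mu” satisfies both moments.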

  4. Report J-statistics even if you don’t do J-tests.

    Even if you believe “the model is never exactly true,” the J-statistic is still valuable because it summarizes how much estimation choices could move the answer.

The takeaway for readers

If you’re reading empirical papers: don’t just ask “Is it significant?”

Ask:

  • How much did the result depend on moment/weight choices?

  • Did the authors report diagnostics that quantify fragility (like J-statistics)?

  • Are standard errors robust to misspecification?

  • Is the estimand clear, or is it implicitly shifting with the estimator?

Because in the real world, the purpose of an estimator is what it does—and under misspecification, what it does may be very different from what you think.

source: https://arxiv.org/pdf/2508.13076
