Contra Sample Splitting
Marek Kirejczyk discussed a negative trend in software development called <a href="https://blog.daftcode.pl/hype-driven-development-3469fc2e9b22#.1di3jw7xp">Hype Driven Development</a>. I'm here to argue that the same thing happens in data analysis, econometrics, and academia.
I'll give two examples: the p-value and sample splitting. My real focus is to convince the reader that sample splitting is a trendy trick that is in fact bad for analysis.
The standardization of the .05 alpha level as the threshold for a "successful" p-value test was never logically derived; it was always a professional norm. Essentially, it was all hype. Now the profession is <a href="https://www.amstat.org/asa/files/pdfs/P-ValueStatement.pdf">finally starting to change</a>.
Now there's another such technique: <a href="http://www.ssc.wisc.edu/~bhansen/papers/ecnmt_00.pdf">sample splitting</a>. I've seen it used in a few contexts:
- Threshold Estimation
- Interaction Modelling
- Robustness Testing
- Sample splitting is not real forward-testing or meta-analysis. The following occur in real cross-sample analysis, but not in sample splitting:
  - Reduced exposure to measurement bias.
  - Reduced exposure to cherry-picking specifications.
  - Implicit robustness checks due to various unexpected and/or unobserved differences.
- Sample splitting may cause certain independent variables to appear in one subsample and not the other.
- Sample splitting may cause certain cross-relations between independent variables to exist in one subsample and not the other.
  - Even if both variables are present in both subsamples, the significance of a cross-correlation may be lost.
  - This may be due to the absence of variance-reducing covariates in one subsample or the other.
- Sample splitting may cause an analyst to miss variables which are significant but have high variance.
  - Using the full/pooled/aggregated sample allows the analyst to identify the coefficient more precisely.
  - If you are really attached to sample splitting, consider pooling afterward and adopting the pooled coefficient, especially when the confidence intervals from the two subsamples are consistent.
- Since you're not doing real cross-sample analysis (see the first point above), you're really just checking for variable significance with a smaller n.
  - That's fine, but you don't need the whole sample-splitting process to do it.
- Why only two subsamples? A thorough test would use f(n) levels of splitting and check for robustness at each level.
  - f(n) works like this: start with one sample, then split into two, then three, and so on until each subsample has only 2 observations, at which point the number of subsamples is n/2.
  - The point of this is to check the cross-sample durability of variable significance, which is supposedly the lauded quality of sample splitting.
  - But the same comprehensive durability check is implicit in the ordinary p-value of the specification in the aggregated sample.
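The precision point above can be illustrated with a simulation. This is a minimal pure-Python sketch, not anything from the sample-splitting literature: the true slope (0.2), the noise level, and the sample size are all assumptions chosen so that a real but high-variance effect tends to clear the conventional t threshold in the pooled sample while looking weaker in each half.

```python
# Sketch: a true but noisy effect can be significant in the pooled
# sample yet look insignificant in each half. All numbers here are
# illustrative assumptions, not estimates from any real dataset.
import math
import random

def slope_t_stat(x, y):
    """t-statistic for the slope in a simple OLS regression of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx                      # slope estimate
    a = my - b * mx                    # intercept estimate
    resid_ss = sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y))
    se = math.sqrt(resid_ss / (n - 2) / sxx)
    return b / se

random.seed(0)
n = 200
x = [random.gauss(0, 1) for _ in range(n)]
# True slope 0.2 buried in noise with sd 1.5: a "significant but
# high-variance" coefficient of the kind discussed above.
y = [0.2 * xi + random.gauss(0, 1.5) for xi in x]

t_full = slope_t_stat(x, y)
t_half_a = slope_t_stat(x[: n // 2], y[: n // 2])
t_half_b = slope_t_stat(x[n // 2 :], y[n // 2 :])
print(f"pooled |t| = {abs(t_full):.2f}")    # tends to be larger
print(f"half A |t| = {abs(t_half_a):.2f}")  # each half has roughly
print(f"half B |t| = {abs(t_half_b):.2f}")  # 1/sqrt(2) the t-stat
```

Because the standard error shrinks roughly with the square root of n, halving the sample cuts the expected t-statistic by about 1/sqrt(2), which is exactly how a marginal-but-real coefficient slips below significance in one or both subsamples.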
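The f(n) ladder described above can be sketched in a few lines. This is a hypothetical harness, not an established procedure: `is_significant` is a stand-in for whatever test the analyst would actually run on each subsample, and the contiguous partitioning is one arbitrary choice among many.

```python
# Sketch of the f(n) ladder: split the sample into k = 1, 2, 3, ...
# pieces, stopping when each piece would have fewer than 2
# observations (so the final level has n/2 subsamples), and re-check
# significance at every level.
from typing import Callable, Sequence

def split_into(data: Sequence, k: int) -> list:
    """Partition data into k nearly equal contiguous subsamples."""
    size, rem = divmod(len(data), k)
    parts, start = [], 0
    for i in range(k):
        end = start + size + (1 if i < rem else 0)
        parts.append(list(data[start:end]))
        start = end
    return parts

def durability_profile(data: Sequence, is_significant: Callable) -> dict:
    """For each split count k up to n/2, the fraction of subsamples in
    which the variable of interest remains significant."""
    n = len(data)
    profile = {}
    for k in range(1, n // 2 + 1):
        parts = split_into(data, k)
        profile[k] = sum(bool(is_significant(p)) for p in parts) / k
    return profile
```

In practice `is_significant` would re-estimate the full specification on the subsample and test the coefficient of interest; a durability profile that stays near 1.0 across levels is the "cross-sample durability" that two-way splitting only probes at a single, arbitrary level.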
Tangentially: Becker and Stigler discuss the existence of fashions and fads in scientific doctrine in the classic <a href="http://www.jstor.org.mutex.gmu.edu/stable/1807222">De Gustibus Non Est Disputandum</a>.