A Cross-Validation Statistical Framework for Asymmetric Data Integration
The proliferation of biobanks and large public clinical data sets enables their integration with a smaller amount of locally gathered data for the purposes of parameter estimation and model prediction. However, public data sets may be subject to context-dependent confounders and the protocols behind their generation are often opaque; naively integrating all external data sets equally can bias estimates and lead to spurious conclusions. Weighted data integration is a potential solution, but current methods still require subjective specification of weights and can become computationally intractable. Under the assumption that local data are generated from the set of unknown true parameters, we propose a novel weighted integration method that uses the external data to minimize the leave-one-out cross-validation (LOOCV) error on the local data. We demonstrate how the optimization of LOOCV errors for linear and Cox proportional hazards models can be rewritten as functions of external data set integration weights. Significant reductions in estimation error and prediction error are shown using simulation studies mimicking the heterogeneity of clinical data as well as a real-world example using kidney transplant patients from the Scientific Registry of Transplant Recipients.
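The core idea, choosing an external-data weight by how well the combined fit predicts held-out local observations, can be illustrated with a minimal sketch. This is not the authors' algorithm: it assumes a single external data set, a linear model, brute-force refitting for each held-out point, and a grid search over one weight in [0, 1], whereas the abstract describes rewriting the LOOCV error as a function of the weights for several external data sets. The function names `weighted_ls`, `local_loocv_error`, and `select_weight` are hypothetical.

```python
import numpy as np


def weighted_ls(X, y, sample_weights):
    """Weighted least squares via the normal equations."""
    A = X.T @ (sample_weights[:, None] * X)
    b = X.T @ (sample_weights * y)
    return np.linalg.solve(A, b)


def local_loocv_error(w, X_loc, y_loc, X_ext, y_ext):
    """Mean squared LOOCV error on the local data only.

    Each local observation is held out in turn; the model is fit on the
    remaining local rows (weight 1) plus all external rows (weight w).
    """
    n = len(y_loc)
    errors = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        X = np.vstack([X_loc[keep], X_ext])
        y = np.concatenate([y_loc[keep], y_ext])
        sw = np.concatenate([np.ones(n - 1), np.full(len(y_ext), w)])
        beta = weighted_ls(X, y, sw)
        errors[i] = (y_loc[i] - X_loc[i] @ beta) ** 2
    return errors.mean()


def select_weight(X_loc, y_loc, X_ext, y_ext, grid=None):
    """Pick the external weight on a grid that minimizes local LOOCV error.

    Assumption: a simple grid search stands in for a proper optimizer.
    """
    if grid is None:
        grid = np.linspace(0.0, 1.0, 21)
    scores = np.array(
        [local_loocv_error(w, X_loc, y_loc, X_ext, y_ext) for w in grid]
    )
    return float(grid[int(np.argmin(scores))]), scores


if __name__ == "__main__":
    # Simulated data (illustrative only): the external set follows
    # slightly different coefficients than the local set.
    rng = np.random.default_rng(0)
    beta_local = np.array([1.0, -2.0, 0.5])
    beta_external = beta_local + np.array([0.3, 0.0, -0.3])
    X_loc = rng.normal(size=(20, 3))
    y_loc = X_loc @ beta_local + rng.normal(size=20)
    X_ext = rng.normal(size=(200, 3))
    y_ext = X_ext @ beta_external + rng.normal(size=200)
    w_best, scores = select_weight(X_loc, y_loc, X_ext, y_ext)
    print("selected external weight:", w_best)
```

A weight of 0 reduces to fitting on local data alone, and a weight of 1 to naive pooling; the selected weight can only match or improve on both in terms of local LOOCV error, since both endpoints are on the grid.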
- Published: May 07, 2022
- Vol/Issue: 79(2)
- Pages: 1280-1292