Journal article · Open Access · May 07, 2022

A Cross-Validation Statistical Framework for Asymmetric Data Integration

Biometrics Vol. 79 No. 2 pp. 1280-1292 · JSTOR
DOI: 10.1111/biom.13685
Abstract
The proliferation of biobanks and large public clinical data sets enables their integration with a smaller amount of locally gathered data for the purposes of parameter estimation and model prediction. However, public data sets may be subject to context-dependent confounders and the protocols behind their generation are often opaque; naively integrating all external data sets equally can bias estimates and lead to spurious conclusions. Weighted data integration is a potential solution, but current methods still require subjective specifications of weights and can become computationally intractable. Under the assumption that local data are generated from the set of unknown true parameters, we propose a novel weighted integration method based upon using the external data to minimize the local data leave-one-out cross validation (LOOCV) error. We demonstrate how the optimization of LOOCV errors for linear and Cox proportional hazards models can be rewritten as functions of external data set integration weights. Significant reductions in estimation error and prediction error are shown using simulation studies mimicking the heterogeneity of clinical data as well as a real-world example using kidney transplant patients from the Scientific Registry of Transplant Recipients.
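The weighting idea for the linear model can be illustrated with a small sketch. This is not the authors' algorithm: it assumes a one-covariate, no-intercept linear model, weights restricted to a coarse grid on [0, 1], and simulated data, and every function name here (`simulate`, `weighted_stats`, `loocv_error`, `loocv_error_refit`) is hypothetical. Local observations always carry weight 1; each external data set gets one weight, and the weights are chosen to minimize the leave-one-out error on the local data only. For least squares, that error can be computed without refitting, because each leave-one-out residual equals the ordinary residual divided by one minus the point's leverage.

```python
import itertools
import random


def simulate(n, slope, rng):
    """Draw n (x, y) pairs from y = slope * x + standard normal noise."""
    data = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        data.append((x, slope * x + rng.gauss(0.0, 1.0)))
    return data


def weighted_stats(local, externals, w):
    """Weighted sums of x*x and x*y; local points always carry weight 1."""
    sxx = sum(x * x for x, _ in local)
    sxy = sum(x * y for x, y in local)
    for wk, ext in zip(w, externals):
        sxx += wk * sum(x * x for x, _ in ext)
        sxy += wk * sum(x * y for x, y in ext)
    return sxx, sxy


def loocv_error(w, local, externals):
    """Mean squared LOOCV error on the local data, using the leverage shortcut."""
    sxx, sxy = weighted_stats(local, externals, w)
    beta = sxy / sxx  # weighted least-squares slope
    total = 0.0
    for x, y in local:
        leverage = x * x / sxx
        # Leave-one-out residual = ordinary residual / (1 - leverage).
        total += ((y - x * beta) / (1.0 - leverage)) ** 2
    return total / len(local)


def loocv_error_refit(w, local, externals):
    """Same quantity by explicitly refitting without each local point (a check)."""
    total = 0.0
    for i, (xi, yi) in enumerate(local):
        kept = local[:i] + local[i + 1:]
        sxx, sxy = weighted_stats(kept, externals, w)
        total += (yi - xi * sxy / sxx) ** 2
    return total / len(local)


# Assumed toy setting: the local slope is 2; one external set agrees, one does not.
rng = random.Random(0)
local = simulate(30, 2.0, rng)
externals = [simulate(300, 2.0, rng), simulate(300, 5.0, rng)]

# Grid search over the two external weights (an assumption made for simplicity).
grid = [i / 10 for i in range(11)]
best_w = min(itertools.product(grid, repeat=2),
             key=lambda w: loocv_error(w, local, externals))
best_err = loocv_error(best_w, local, externals)
print("selected weights:", best_w, "LOOCV error:", round(best_err, 4))
```

The grid search is used only to keep the sketch dependency-free; with more external data sets a bound-constrained numerical optimizer would be the more practical choice. The Cox proportional hazards case mentioned in the abstract is not covered by this example.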
Details
Published: May 07, 2022
Vol/Issue: 79(2)
Pages: 1280-1292
Funding: National Institutes of Health (Award: 5T32CA083654)
Cite This Article
Lam Tran, Kevin He, Di Wang, et al. (2022). A Cross-Validation Statistical Framework for Asymmetric Data Integration. Biometrics, 79(2), 1280-1292. https://doi.org/10.1111/biom.13685