XGBoost prediction model for osteoarthritis risk based on community big data
Abstract:
Objective To build an osteoarthritis risk warning model from community medical big data using machine learning, to provide a quantitative tool for the early warning of osteoarthritis in the community, and to support more efficient management of osteoarthritis prevention and treatment in the elderly.
Methods Health records, health examination data, and diagnosis and treatment data from six community health service centres in Shanghai, covering January 1, 2019 to December 31, 2019, were integrated into an original database of more than 40 000 samples and 126 variables. After data pre-processing and compound feature selection to choose the model features, the XGBoost algorithm was used to construct a risk assessment model for osteoarthritis.
Results Fourteen features were selected for the model, including whether the diet balances meat and vegetables, height, weight, body mass index (BMI), duration of each exercise session, total cholesterol, high-density lipoprotein, low-density lipoprotein, hypertension, and limb trauma. The five most important features were high-density lipoprotein, total cholesterol, BMI, low-density lipoprotein, and drinking frequency, each with a feature importance above 0.1. The XGBoost risk assessment model was built with osteoarthritis status (yes/no) as the output variable and the 14 selected features as input variables. After training with eightfold cross-validation, the model reached on the test set an accuracy of 92%, a precision of 71%, a recall of 65%, an F1_score of 0.68, an area under the receiver operating characteristic curve (AUC) of 0.82, and a KS value of 0.48.
Conclusion This study used community medical big data to build an osteoarthritis risk warning model. The overall fit of the model and the plausibility of its features are good. The model provides a tool for the early warning of osteoarthritis in the community and supports early diagnosis and treatment.
Key words:
- Community
- Osteoarthritis
- Risk prediction
- Big data
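The abstract reports an AUC and a KS value without defining them. As a minimal sketch, assuming the usual definitions (KS as the largest gap between the true-positive rate and the false-positive rate over all score thresholds, AUC as the area under the ROC curve), both can be computed from predicted scores as follows. The function name `ks_and_auc` and the toy labels and scores are illustrative only; they are not taken from the study.

```python
def ks_and_auc(labels, scores):
    """Return (KS, AUC) for binary labels (1 = positive) and predicted scores."""
    pos = sum(labels)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("need at least one positive and one negative sample")

    # Sort by descending score so the decision threshold sweeps downwards.
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])

    tp = fp = 0
    ks = auc = 0.0
    prev_tpr = prev_fpr = 0.0
    i = 0
    while i < len(ranked):
        # Consume every sample that shares the current score (handles ties).
        j = i
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            if ranked[j][1] == 1:
                tp += 1
            else:
                fp += 1
            j += 1
        tpr, fpr = tp / pos, fp / neg
        ks = max(ks, abs(tpr - fpr))
        # Trapezoid rule between consecutive ROC points.
        auc += (fpr - prev_fpr) * (tpr + prev_tpr) / 2
        prev_tpr, prev_fpr = tpr, fpr
        i = j
    return ks, auc


# Toy data, invented for illustration (not from the study).
toy_labels = [1, 1, 0, 1, 0, 0]
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
ks, auc = ks_and_auc(toy_labels, toy_scores)
print(f"KS = {ks:.3f}, AUC = {auc:.3f}")
```

Both statistics depend only on how the scores rank positive against negative samples, so neither requires choosing a classification threshold, unlike accuracy, precision, and recall.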
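Table 1. Descriptive analysis of the features entered into the osteoarthritis risk warning model

| Variable | No osteoarthritis | Osteoarthritis |
|---|---|---|
| Sex [n (%)] | | |
| Female | 21 414 (57.9) | 2 115 (64.1) |
| Male | 15 550 (42.1) | 1 182 (35.9) |
| Smoking status [n (%)] | | |
| Non-smoker | 27 326 (73.9) | 2 283 (69.2) |
| Former smoker | 1 817 (4.9) | 249 (7.6) |
| Smoker | 3 503 (9.5) | 325 (9.9) |
| Missing | 4 318 (11.7) | 440 (13.3) |
| Drinking status [n (%)] | | |
| Non-drinker | 25 761 (69.7) | 2 472 (75.0) |
| Former drinker | 4 364 (11.8) | 210 (6.4) |
| Drinker | 1 930 (5.2) | 152 (4.6) |
| Missing | 4 909 (13.3) | 463 (14.0) |
| Diet balanced between meat and vegetables [n (%)] | | |
| No | 5 787 (15.7) | 794 (24.1) |
| Yes | 31 177 (84.3) | 2 503 (75.9) |
| Hypertension [n (%)] | | |
| No | 14 301 (38.7) | 922 (28.0) |
| Yes | 22 663 (61.3) | 2 375 (72.0) |
| Limb trauma [n (%)] | | |
| No | 34 283 (92.7) | 2 784 (84.4) |
| Yes | 2 681 (7.3) | 513 (15.6) |
| Age (mean ± SD, years) | 72.5±7.7 | 71.3±6.7 |
| Height (mean ± SD, cm) | 159.8±8.4 | 159.1±8.4 |
| Weight (mean ± SD, kg) | 63.7±10.8 | 64.3±10.9 |
| BMI (mean ± SD) | 24.8±3.4 | 25.2±3.4 |
| Duration of each exercise session (mean ± SD, min) | 40.8±20.2 | 38.7±14.4 |
| High-density lipoprotein (mean ± SD, mmol/L) | 1.3±0.3 | 1.5±0.3 |
| Low-density lipoprotein (mean ± SD, mmol/L) | 2.9±0.9 | 3.0±0.9 |
| Total cholesterol (mean ± SD, mmol/L) | 4.8±1.1 | 5.2±1.0 |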
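The group sizes in Table 1 give the class balance of the full database, which is useful context when reading an accuracy figure. The short calculation below uses only the sex rows of Table 1. It describes the whole database; the composition of the test set is not reported, so treating the test set as having the same balance would be an assumption.

```python
# Counts from the sex rows of Table 1 (female + male in each group).
no_oa = 21_414 + 15_550   # participants without osteoarthritis
oa = 2_115 + 1_182        # participants with osteoarthritis
total = no_oa + oa

prevalence = oa / total
# Accuracy of a trivial rule that labels everyone "no osteoarthritis".
all_negative_accuracy = no_oa / total

print(f"n = {total}")
print(f"osteoarthritis prevalence = {prevalence:.1%}")
print(f"all-negative baseline accuracy = {all_negative_accuracy:.1%}")
```

With roughly 8% positives in the database, precision, recall, AUC, and KS carry more information about the model than accuracy alone.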
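Table 2. Evaluation indices of the osteoarthritis risk warning model

| Dataset | Accuracy (%) | Precision (%) | Recall (%) | F1_score | KS value | AUC |
|---|---|---|---|---|---|---|
| Training set | 94 | 69 | 71 | 0.69 | 0.42 | 0.86 |
| Test set | 92 | 71 | 65 | 0.68 | 0.48 | 0.82 |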
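The F1_score column in Table 2 can be related to the precision and recall columns through the standard definition F1 = 2PR / (P + R). The sketch below recomputes it from the rounded percentages in the table; the helper name `f1_score` is illustrative, and because the inputs are already rounded to whole percentages, agreement with the reported values is only expected to within about 0.01.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1_score) as listed in Table 2.
table2 = {
    "training set": (0.69, 0.71, 0.69),
    "test set": (0.71, 0.65, 0.68),
}

for name, (p, r, reported) in table2.items():
    print(f"{name}: recomputed F1 = {f1_score(p, r):.3f}, reported = {reported}")
```

The test-set value matches the reported 0.68 after rounding. The training-set value recomputes to just under 0.70, which is compatible with the reported 0.69 if the unrounded precision and recall were slightly lower than the tabulated percentages.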