Comparative study on the effectiveness of two machine learning algorithms in constructing risk assessment models of coronary heart disease in the elderly
Abstract:
Objective To construct risk assessment models for coronary heart disease in the elderly using machine learning algorithms, and to compare the performance of logistic regression and eXtreme Gradient Boosting (XGBoost) in predicting coronary heart disease risk among community-dwelling elderly people, with the aim of providing a more efficient health management method for the prevention and control of coronary heart disease in the elderly.
Methods Records from 47 community health service centers in the Pudong area from January to December 2019 were extracted from the regional health information platform of the Shanghai Pudong Health Development Research Institute. Using the Python pandas library, 80 000 physical examination records of elderly people were included to build the models. Twenty-seven variables were selected through feature engineering, and logistic regression and XGBoost were each used to construct a risk assessment model.
Results The optimal parameters of the XGBoost model were learning_rate=0.1, tree depth=8, minimum child node weight=5, and number of cycles=50. The optimal parameters of the logistic model were C=1, class_weight=None, max_iter=100, and solver=newton-cg. The accuracies of XGBoost and logistic regression were 0.82 and 0.71, and the areas under the receiver operating characteristic curve were 0.85 and 0.80, respectively. The feature importance distributions of the two models differed markedly. In the XGBoost model, importance was concentrated in a few features, with the top nine features accounting for 94.2% of the total importance. In the logistic model, importance was spread more evenly, with the top nine features accounting for 59.5%.
Conclusion The coronary heart disease risk assessment models built on community physical examination data of the elderly showed good stability. The XGBoost model performed better than the logistic regression model and can serve as a methodological reference for coronary heart disease risk assessment of the elderly in the community.
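The reported optimal parameters can be written down as a small configuration sketch. The keyword names below follow `xgboost.XGBClassifier` and `sklearn.linear_model.LogisticRegression`; the abstract does not say which wrappers were used, so the mapping from its informal labels ("tree depth", "number of cycles") to these keywords is an assumption, and `describe` is a hypothetical helper for display only.

```python
# Hyperparameters as reported in the abstract.
# The keyword names are those of xgboost.XGBClassifier and
# sklearn.linear_model.LogisticRegression; mapping the abstract's
# labels onto them is an assumption, not something the source states.
XGB_PARAMS = {
    "learning_rate": 0.1,    # reported directly
    "max_depth": 8,          # "tree depth"
    "min_child_weight": 5,   # "minimum child node weight"
    "n_estimators": 50,      # "number of cycles" (assumed: boosting rounds)
}

LOGIT_PARAMS = {
    "C": 1.0,                # inverse regularization strength
    "class_weight": None,    # no reweighting of classes
    "max_iter": 100,
    "solver": "newton-cg",
}


def describe(name, params):
    """Format one parameter set as a single readable line."""
    body = ", ".join(f"{key}={value!r}" for key, value in params.items())
    return f"{name}({body})"


print(describe("XGBClassifier", XGB_PARAMS))
print(describe("LogisticRegression", LOGIT_PARAMS))
```

If those libraries are installed, the dictionaries could be unpacked into the constructors, for example `LogisticRegression(**LOGIT_PARAMS)`.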
Key words:
- Community
- Coronary heart disease
- Risk assessment
- Big data
- Machine learning
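The evaluation metrics named in this study (ACC, TNR, TPR, F1 score, KS, AUC) can all be derived from true labels and predicted probabilities. The sketch below is a minimal pure-Python illustration on made-up toy data; the 0.5 decision threshold and the definition of KS as the largest gap between the true positive rate and the false positive rate are assumptions, since this section does not state how the authors computed them.

```python
def threshold_metrics(y_true, y_score, threshold=0.5):
    """ACC, TPR, TNR and F1 at a fixed decision threshold (assumed 0.5)."""
    tp = fp = tn = fn = 0
    for label, score in zip(y_true, y_score):
        predicted = 1 if score >= threshold else 0
        if predicted == 1 and label == 1:
            tp += 1
        elif predicted == 1 and label == 0:
            fp += 1
        elif predicted == 0 and label == 0:
            tn += 1
        else:
            fn += 1
    return {
        "ACC": (tp + tn) / (tp + fp + tn + fn),
        "TPR": tp / (tp + fn),          # sensitivity
        "TNR": tn / (tn + fp),          # specificity
        "F1": 2 * tp / (2 * tp + fp + fn),
    }


def split_scores(y_true, y_score):
    """Separate scores of positive and negative cases."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    return pos, neg


def auc(y_true, y_score):
    """AUC as the share of positive/negative pairs ranked correctly."""
    pos, neg = split_scores(y_true, y_score)
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def ks_statistic(y_true, y_score):
    """KS as the largest TPR minus FPR gap over all score thresholds."""
    pos, neg = split_scores(y_true, y_score)
    best = 0.0
    for t in sorted(set(y_score)):
        tpr = sum(s >= t for s in pos) / len(pos)
        fpr = sum(s >= t for s in neg) / len(neg)
        best = max(best, tpr - fpr)
    return best


# Hypothetical toy data, not taken from the study.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.6, 0.4, 0.7, 0.3, 0.2, 0.1]
print(threshold_metrics(labels, scores))
print(auc(labels, scores), ks_statistic(labels, scores))
```

Unlike ACC, TPR, TNR and F1, the AUC and KS values do not depend on a single threshold, which is why a model can rank cases well (high AUC) while still showing modest specificity at the chosen cut-off.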
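Table 1 Comparison of the two models' prediction results for coronary heart disease risk assessment in the elderly

| Dataset | Model | ACC | TNR | TPR | KS | F1 score | AUC |
|---|---|---|---|---|---|---|---|
| Validation set | Logistic | 0.72 | 0.71 | 0.72 | 0.47 | 0.74 | 0.80 |
| Validation set | XGBoost | 0.83 | 0.70 | 0.87 | 0.51 | 0.75 | 0.86 |
| Test set | Logistic | 0.71 | 0.70 | 0.71 | 0.46 | 0.73 | 0.80 |
| Test set | XGBoost | 0.82 | 0.68 | 0.87 | 0.50 | 0.75 | 0.85 |

Table 2 Feature importance scores of the two models

| Tier | Rank | XGBoost feature | Importance score | Logistic feature | Importance score |
|---|---|---|---|---|---|
| Tier 1 | 1 | Hypertension (yes/no)^a | 0.7176 | Hyperlipidemia (yes/no)^a | 0.0856 |
| Tier 1 | 2 | Cerebrovascular disease (yes/no) | 0.1607 | Right-side systolic blood pressure | 0.0852 |
| Tier 1 | 3 | Hyperlipidemia (yes/no)^a | 0.0140 | Triglycerides | 0.0812 |
| Tier 1 | 4 | Urine glucose^a | 0.0126 | Chronic kidney disease (yes/no)^a | 0.0648 |
| Tier 1 | 5 | Uric acid | 0.0122 | Sex | 0.0587 |
| Tier 1 | 6 | Drinking frequency | 0.0069 | Waist circumference | 0.0571 |
| Tier 1 | 7 | Age | 0.0064 | Hypertension (yes/no)^a | 0.0541 |
| Tier 1 | 8 | Diabetes (yes/no) | 0.0062 | Right-side diastolic blood pressure | 0.0541 |
| Tier 1 | 9 | Chronic kidney disease (yes/no)^a | 0.0052 | Urine glucose^a | 0.0537 |
| Tier 2 | 10 | Sex | 0.0049 | Cerebrovascular disease (yes/no) | 0.0514 |
| Tier 2 | 11 | Smoking status^a | 0.0042 | Dietary habit: oily food | 0.0513 |
| Tier 2 | 12 | Total cholesterol | 0.0041 | Uric acid | 0.0477 |
| Tier 2 | 13 | Urea | 0.0039 | Low-density lipoprotein^a | 0.0412 |
| Tier 2 | 14 | Urinary microalbumin | 0.0038 | Age | 0.0409 |
| Tier 2 | 15 | Right-side systolic blood pressure | 0.0036 | Drinking frequency | 0.0338 |
| Tier 2 | 16 | Low-density lipoprotein^a | 0.0036 | Diabetes (yes/no) | 0.0282 |
| Tier 2 | 17 | Triglycerides | 0.0036 | Glucose | 0.0279 |
| Tier 2 | 18 | Right-side diastolic blood pressure | 0.0035 | Smoking status^a | 0.0193 |
| Tier 3 | 19 | Glucose | 0.0033 | Creatinine^a | 0.0129 |
| Tier 3 | 20 | Dietary habit: balanced meat and vegetables^a | 0.0032 | Urinary microalbumin | 0.0110 |
| Tier 3 | 21 | High-density lipoprotein^a | 0.0030 | Dietary habit: balanced meat and vegetables^a | 0.0097 |
| Tier 3 | 22 | BMI^a | 0.0029 | B-mode ultrasound result^a | 0.0082 |
| Tier 3 | 23 | B-mode ultrasound result^a | 0.0027 | High-density lipoprotein^a | 0.0077 |
| Tier 3 | 24 | Dietary habit: salty food^a | 0.0027 | Urea | 0.0077 |
| Tier 3 | 25 | Creatinine^a | 0.0026 | BMI^a | 0.0048 |
| Tier 3 | 26 | Waist circumference | 0.0026 | Total cholesterol | 0.0013 |
| Tier 3 | 27 | Dietary habit: oily food | 0.0001 | Dietary habit: salty food^a | 0.0003 |

Note: ^a marks features that fall in the same tier in both models.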
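The concentration of importance described in the abstract can be checked directly from the scores in Table 2. The sketch below sums the nine largest scores for each model; the list values are copied from the table in rank order, while `top_k_share` is a hypothetical helper name and the scores are treated as already normalized (each column sums to roughly 1).

```python
# Importance scores from Table 2, in rank order (27 features per model).
XGB_SCORES = [
    0.7176, 0.1607, 0.0140, 0.0126, 0.0122, 0.0069, 0.0064, 0.0062, 0.0052,
    0.0049, 0.0042, 0.0041, 0.0039, 0.0038, 0.0036, 0.0036, 0.0036, 0.0035,
    0.0033, 0.0032, 0.0030, 0.0029, 0.0027, 0.0027, 0.0026, 0.0026, 0.0001,
]
LOGIT_SCORES = [
    0.0856, 0.0852, 0.0812, 0.0648, 0.0587, 0.0571, 0.0541, 0.0541, 0.0537,
    0.0514, 0.0513, 0.0477, 0.0412, 0.0409, 0.0338, 0.0282, 0.0279, 0.0193,
    0.0129, 0.0110, 0.0097, 0.0082, 0.0077, 0.0077, 0.0048, 0.0013, 0.0003,
]


def top_k_share(scores, k):
    """Sum of the k largest importance scores."""
    return sum(sorted(scores, reverse=True)[:k])


for name, scores in (("XGBoost", XGB_SCORES), ("Logistic", LOGIT_SCORES)):
    print(f"{name}: top-9 sum = {top_k_share(scores, 9):.4f}, "
          f"all 27 = {sum(scores):.4f}")
```

The two top-nine sums correspond to the 94.2% and 59.5% figures quoted in the abstract, up to rounding of the published scores.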