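Comparative study on the effectiveness of two machine learning algorithms in constructing risk assessment models of coronary heart disease in the elderly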

王晓丽; 施天行; 彭德荣; 王朝昕; 王慧; 石建伟; 俞文雅

doi:10.16766/j.cnki.issn.1674-4152.001852


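1. Pudong Institute for Health Development, Shanghai 201208, China
2. Pengpu Xincun Community Health Service Center, Jing'an District, Shanghai 200025, China
3. School of Public Health, Shanghai Jiao Tong University School of Medicine, Shanghai 200025, China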

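Funding:

National Natural Science Foundation of China, General Program 71774116

2018 National Key Research and Development Program of China 2018YFC2000700

Shanghai Pudong New Area 2019 Health Science and Technology Project PW2019A-42

Shanghai Pujiang Talent Program 2019PJC072

Shanghai Health System Outstanding Young Talent Program 2018YQ52

Shanghai Community Health Association 2019 Community Research Project 201940052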

Corresponding author:
俞文雅 (Yu Wenya), E-mail: jsjyyuwenya@sina.cn

Chinese Library Classification number: R541.4
Publication history
- Received: 2020-10-30
- Published online: 2022-02-16

Comparative study on the effectiveness of two machine learning algorithms in constructing risk assessment models of coronary heart disease in the elderly

1. Pudong Institute for Health Development, Shanghai 201208, China

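Abstract

Objective To construct risk assessment models for coronary heart disease in the elderly using machine learning algorithms, and to compare how well logistic regression (logistic) and eXtreme Gradient Boosting (XGBoost) models predict coronary heart disease risk in community-dwelling elderly people, with the aim of providing a more efficient health management approach for the prevention and control of coronary heart disease in the elderly. Methods A total of 80,000 physical examination records of elderly people from 47 community health service centers in the Pudong area in 2019 were extracted to build the models. Twenty-seven variables were selected through feature engineering, and the logistic and XGBoost algorithms were used to construct the risk assessment models. Results The optimal parameters of the XGBoost model were learning_rate=0.1, tree depth=8, minimum child weight=5 and number of cycles=50; those of the logistic model were C=1, class_weight=None, max_iter=100 and solver=newton-cg. The accuracies of XGBoost and logistic were 0.82 and 0.71, and the areas under the receiver operating characteristic curve were 0.85 and 0.80, respectively. The feature importance distributions of the two models differed markedly: in the XGBoost model, importance was concentrated in a few features, with the top nine features summing to 94.2%, whereas in the logistic model it was more evenly spread, with the top nine features summing to 59.5%. Conclusion The coronary heart disease risk assessment models built on physical examination data of community-dwelling elderly people were relatively stable, and the XGBoost model outperformed the logistic model; it can serve as a methodological reference for coronary heart disease risk assessment in the community-dwelling elderly.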
Keywords: Community; Coronary heart disease; Risk assessment; Big data; Machine learning
Abstract: Objective To establish risk assessment models for coronary heart disease in the elderly based on machine learning algorithms, to compare the effectiveness of logistic regression and XGBoost in predicting coronary heart disease risk in the elderly, and thereby to provide a more efficient health management method for the prevention of coronary heart disease in the elderly. Methods Data records from 47 community health service centers in the Pudong area from January to December 2019 were extracted from the regional health information platform of the Pudong Institute for Health Development, Shanghai. Using the Python pandas library, 80,000 physical examination records of elderly people were included to build the models. Twenty-seven variables were selected by feature engineering, and logistic regression and XGBoost were each used to construct a model. Results The optimal parameters of the XGBoost model were learning_rate=0.1, tree depth=8, minimum child weight=5 and number of cycles=50. The optimal parameters of the logistic model were C=1, class_weight=None, max_iter=100 and solver=newton-cg. On the test set, the accuracies of XGBoost and logistic regression were 0.82 and 0.71, and the areas under the receiver operating characteristic curve were 0.85 and 0.80, respectively. The feature importance of the XGBoost model was concentrated in a few features, with the top nine features accounting for 94.2% of the total importance, whereas that of the logistic model was relatively evenly distributed, with the top nine features accounting for 59.5%. Conclusion The coronary heart disease risk assessment models based on physical examination data of the elderly in the community showed good stability, and the model constructed with XGBoost performed better than the logistic regression model; it can provide a method for coronary heart disease risk assessment of the elderly in the community.
Keywords: Community; Coronary heart disease; Risk assessment; Big data; Machine learning
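The hyperparameters reported in the abstract can be written down as keyword-argument dictionaries. This is a minimal sketch, not the authors' code: the mapping of "tree depth", "minimum child weight" and "number of cycles" onto `max_depth`, `min_child_weight` and `n_estimators` (the argument names of the xgboost scikit-learn wrapper) is an assumption, and the helper `describe` is hypothetical.

```python
# Hyperparameters reported in the abstract, as keyword-argument dicts.
# ASSUMPTION: "tree depth", "minimum child weight" and "number of cycles"
# are mapped to max_depth, min_child_weight and n_estimators; the paper
# does not state the library argument names for these three settings.
XGB_PARAMS = {
    "learning_rate": 0.1,
    "max_depth": 8,
    "min_child_weight": 5,
    "n_estimators": 50,
}

# The logistic regression settings are given in the abstract under these
# names, which match scikit-learn's LogisticRegression arguments.
LOGISTIC_PARAMS = {
    "C": 1,
    "class_weight": None,
    "max_iter": 100,
    "solver": "newton-cg",
}


def describe(name, params):
    """Render a parameter dict as a single constructor-like line."""
    settings = ", ".join(f"{key}={value!r}" for key, value in params.items())
    return f"{name}({settings})"


print(describe("XGBClassifier", XGB_PARAMS))
print(describe("LogisticRegression", LOGISTIC_PARAMS))
```

With xgboost and scikit-learn installed, the dictionaries could presumably be passed on as `XGBClassifier(**XGB_PARAMS)` and `LogisticRegression(**LOGISTIC_PARAMS)`.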


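Table 1 Comparison of the prediction results of the two models for coronary heart disease risk assessment in the elderly

Dataset	Model	ACC	TNR	TPR	KS	F1_score	AUC
Validation set	Logistic model	0.72	0.71	0.72	0.47	0.74	0.80
Validation set	XGBoost model	0.83	0.70	0.87	0.51	0.75	0.86
Test set	Logistic model	0.71	0.70	0.71	0.46	0.73	0.80
Test set	XGBoost model	0.82	0.68	0.87	0.50	0.75	0.85
Note: ACC, accuracy; TNR, true negative rate; TPR, true positive rate; KS, Kolmogorov-Smirnov statistic; AUC, area under the receiver operating characteristic curve.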
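For readers unfamiliar with the column headings in Table 1, the sketch below shows the standard textbook definitions of ACC, TNR, TPR, F1, KS and AUC for a binary classifier. The page does not state the formulas, the decision threshold, or the averaging the authors used, so this is an illustration under those assumptions; the labels and scores are hypothetical toy values, not study data.

```python
def threshold_metrics(y_true, scores, threshold=0.5):
    """ACC, TNR, TPR and F1 after thresholding scores (threshold is assumed)."""
    y_pred = [1 if s >= threshold else 0 for s in scores]
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tpr = tp / (tp + fn)                      # sensitivity / recall
    tnr = tn / (tn + fp)                      # specificity
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = 2 * precision * tpr / (precision + tpr) if precision + tpr else 0.0
    return {"ACC": (tp + tn) / len(y_true), "TNR": tnr, "TPR": tpr, "F1": f1}


def ks_statistic(y_true, scores):
    """Largest gap between TPR and FPR over all candidate thresholds."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    best = 0.0
    for thr in sorted(set(scores)):
        tp = sum(1 for t, s in zip(y_true, scores) if t == 1 and s >= thr)
        fp = sum(1 for t, s in zip(y_true, scores) if t == 0 and s >= thr)
        best = max(best, tp / n_pos - fp / n_neg)
    return best


def auc(y_true, scores):
    """Chance that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 1 = has coronary heart disease, 0 = does not.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.6, 0.4, 0.7, 0.3, 0.2, 0.1]

print(threshold_metrics(y_true, scores))
print("KS:", ks_statistic(y_true, scores))
print("AUC:", auc(y_true, scores))
```

ACC, TNR, TPR and F1 depend on the chosen threshold, whereas KS and AUC summarize the ranking over all thresholds, which is why the two groups of columns can move differently between models.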


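Table 2 Feature importance scores of the two models

Feature tier	Rank	XGBoost feature	XGBoost importance score	Logistic feature	Logistic importance score
Tier 1	1	Hypertension (yes/no)^a	0.7176	Hyperlipidemia (yes/no)^a	0.0856
Tier 1	2	Cerebrovascular disease (yes/no)	0.1607	Right-side systolic blood pressure	0.0852
Tier 1	3	Hyperlipidemia (yes/no)^a	0.0140	Triglycerides	0.0812
Tier 1	4	Urine glucose^a	0.0126	Chronic kidney disease (yes/no)^a	0.0648
Tier 1	5	Uric acid	0.0122	Sex	0.0587
Tier 1	6	Drinking frequency	0.0069	Waist circumference	0.0571
Tier 1	7	Age	0.0064	Hypertension (yes/no)^a	0.0541
Tier 1	8	Diabetes (yes/no)	0.0062	Right-side diastolic blood pressure	0.0541
Tier 1	9	Chronic kidney disease (yes/no)^a	0.0052	Urine glucose^a	0.0537
Tier 2	10	Sex	0.0049	Cerebrovascular disease (yes/no)	0.0514
Tier 2	11	Smoking status^a	0.0042	Dietary habit_oil preference	0.0513
Tier 2	12	Total cholesterol	0.0041	Uric acid	0.0477
Tier 2	13	Urea	0.0039	Low-density lipoprotein^a	0.0412
Tier 2	14	Urinary microalbumin	0.0038	Age	0.0409
Tier 2	15	Right-side systolic blood pressure	0.0036	Drinking frequency	0.0338
Tier 2	16	Low-density lipoprotein^a	0.0036	Diabetes (yes/no)	0.0282
Tier 2	17	Triglycerides	0.0036	Glucose	0.0279
Tier 2	18	Right-side diastolic blood pressure	0.0035	Smoking status^a	0.0193
Tier 3	19	Glucose	0.0033	Creatinine^a	0.0129
Tier 3	20	Dietary habit_balanced meat and vegetables^a	0.0032	Urinary microalbumin	0.0110
Tier 3	21	High-density lipoprotein^a	0.0030	Dietary habit_balanced meat and vegetables^a	0.0097
Tier 3	22	BMI^a	0.0029	B-mode ultrasound result^a	0.0082
Tier 3	23	B-mode ultrasound result^a	0.0027	High-density lipoprotein^a	0.0077
Tier 3	24	Dietary habit_salt preference^a	0.0027	Urea	0.0077
Tier 3	25	Creatinine^a	0.0026	BMI^a	0.0048
Tier 3	26	Waist circumference	0.0026	Total cholesterol	0.0013
Tier 3	27	Dietary habit_oil preference	0.0001	Dietary habit_salt preference^a	0.0003
Note: ^a marks features that fall in the same tier in both models.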
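The top-nine importance shares quoted in the abstract (94.2% for XGBoost, 59.5% for logistic) can be checked against the scores in Table 2. The sketch below transcribes the 27 scores per model in rank order and divides the sum of the first nine by the column total. Normalizing by the column total is an assumption, made because the published scores are rounded and sum to roughly, not exactly, 1; the function name `top_share` is hypothetical.

```python
# Importance scores transcribed from Table 2, in rank order (1 to 27).
XGB_IMPORTANCE = [
    0.7176, 0.1607, 0.0140, 0.0126, 0.0122, 0.0069, 0.0064, 0.0062, 0.0052,
    0.0049, 0.0042, 0.0041, 0.0039, 0.0038, 0.0036, 0.0036, 0.0036, 0.0035,
    0.0033, 0.0032, 0.0030, 0.0029, 0.0027, 0.0027, 0.0026, 0.0026, 0.0001,
]
LOGISTIC_IMPORTANCE = [
    0.0856, 0.0852, 0.0812, 0.0648, 0.0587, 0.0571, 0.0541, 0.0541, 0.0537,
    0.0514, 0.0513, 0.0477, 0.0412, 0.0409, 0.0338, 0.0282, 0.0279, 0.0193,
    0.0129, 0.0110, 0.0097, 0.0082, 0.0077, 0.0077, 0.0048, 0.0013, 0.0003,
]


def top_share(importances, k=9):
    """Share of total importance carried by the k highest-ranked features.

    ASSUMPTION: the share is normalized by the column total, since the
    rounded scores in the table do not sum to exactly 1.
    """
    ranked = sorted(importances, reverse=True)
    return sum(ranked[:k]) / sum(ranked)


xgb_pct = round(100 * top_share(XGB_IMPORTANCE), 1)
logistic_pct = round(100 * top_share(LOGISTIC_IMPORTANCE), 1)
print(f"XGBoost top-9 share:  {xgb_pct}%")       # 94.2%
print(f"Logistic top-9 share: {logistic_pct}%")  # 59.5%
```

Both values agree with the abstract, which supports the reading that XGBoost relies heavily on a handful of features (hypertension alone scores 0.7176) while the logistic model spreads importance more evenly.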

