heart disease prediction using machine learning research paper


Open access
Published: 07 October 2024

A proposed technique for predicting heart disease using machine learning algorithms and an explainable AI method

Hosam El-Sofany 1 ,
Belgacem Bouallegue 1 , 2 &
Yasser M. Abd El-Latif 3

Scientific Reports volume 14, Article number: 23277 (2024)


Computational science
Disease prevention
Health services

One of the critical issues in medical data analysis is accurately predicting a patient’s risk of heart disease (HD), which is vital for early intervention and reducing mortality rates. Early detection allows for timely treatment and continuous monitoring by healthcare providers, but physicians cannot supervise patients constantly, and heart disease detection is not always accurate. By offering a more solid foundation for prediction and decision-making based on data provided by healthcare sectors worldwide, machine learning (ML) could help physicians with the prediction and detection of HD. This study aims to use different feature selection strategies to produce an accurate ML algorithm for early heart disease prediction. We selected features using the chi-square, ANOVA, and mutual information methods, which yielded three feature subsets: SF-1, SF-2, and SF-3. The study employed ten machine learning algorithms to determine the most accurate technique and the best-fitting feature subset. The classification algorithms used include support vector machines (SVM), XGBoost, bagging, decision trees (DT), and random forests (RF). We evaluated the proposed heart disease prediction technique using a private dataset, a public dataset, and different cross-validation methods. We used the Synthetic Minority Oversampling Technique (SMOTE) to balance the class distribution of the data and to identify the machine learning algorithm that achieves the most accurate heart disease predictions. With the proposed method, healthcare providers might identify early-stage heart disease quickly and cheaply. We used the most effective ML algorithm to create a mobile app that instantly predicts heart disease from the input symptoms.
The experimental results demonstrated that the XGBoost algorithm performed optimally when applied to the combined datasets and the SF-2 feature subset. It had 97.57% accuracy, 96.61% sensitivity, 90.48% specificity, 95.00% precision, a 92.68% F1 score, and a 98% AUC. We have developed an explainable AI method based on SHAP approaches to understand how the system makes its final predictions.


Introduction.

Globally, heart diseases consistently rank as the leading cause of death 1 . According to the World Health Organization, heart disease and stroke account for 17.5 million deaths worldwide each year. More than 75% of deaths caused by heart diseases occur in low- and middle-income countries. In addition, heart attacks and strokes are responsible for 80% of all fatalities caused by cardiovascular diseases (CVDs) 2 . The diagnosis of heart disease often follows from observing the patient’s symptoms and conducting a physical examination. Smoking, age, a family history of heart disease, high cholesterol levels, inactivity, high blood pressure, obesity, diabetes, and stress are some risk factors for cardiovascular disease 3 . Lifestyle modifications, such as quitting smoking, losing weight, exercising, and managing stress, may help to reduce some of these risk factors. Heart disease is diagnosed using medical history, physical examination, blood tests, and tests such as electrocardiograms, echocardiograms, and cardiac MRIs. It can be treated with lifestyle adjustments, drugs, medical procedures such as angioplasty and coronary artery bypass surgery, or implanted devices such as pacemakers and defibrillators 4 . The vast amounts of patient data made easily accessible by the growing number of modern healthcare systems now make it possible to construct prediction models for heart disease. Machine learning is considered a data-sorting approach that analyzes large datasets from various viewpoints and then transforms the results into tangible knowledge 5 .

Several studies have utilized ML algorithms like SVM, artificial neural networks (ANN), DT, LR, and RF to analyze medical data and predict heart diseases. A recent study by 6 used ML models to predict the risk of cardiac disease in a multi-ethnic population. The authors utilized a large dataset of electronic health record data and linked it with socio-demographic information to stratify CVD risks. The models achieved high accuracy in predicting CVD risk in the multi-ethnic population. Similarly, another study by 7 applied a deep learning (DL) algorithm to predict coronary artery disease (CAD). The researchers utilized clinical data and coronary computed tomography angiography (CCTA) images to train the DL model. A study by 8 used different ML models to predict CVD based on clinical data. The models used by the researchers included DTs, K-nearest neighbor (KNN), and RFs. Using these models, the authors reported high accuracy in predicting CVD. Similarly, in a study by 9 , ML techniques were used to determine what factors contribute to heart disease risk. The authors utilized the National Health and Nutrition Examination Survey (NHANES) data to determine risk factors related to coronary heart disease. Another research study 10 examined the effectiveness of various machine learning algorithms in predicting heart diseases. The authors reported that the models achieved high accuracy in predicting heart diseases.

Many researchers use ML classification techniques to predict heart disease. The ML classifiers used in this work have shown promising results in detecting the risk of CVD 11 , 12 , 13 , 14 , 15 , 16 , 17 , 18 , 19 , 20 , 21 , 22 , 23 . The LR, RF, and KNN algorithms have shown high accuracy in classifying the risk of CVD. Ensemble learning techniques, such as bagging, AdaBoost, and voting, have improved the classification accuracy compared to single classifiers. Employing several ML classifiers can enhance the accuracy of CVD risk prediction. We can conduct further research in this area to improve CVD forecasting and diagnosis. ML is a powerful tool for HD prediction. It has the potential to improve patient outcomes by allowing for early detection and personalized treatment. This part starts with a comparison of ten machine learning classifiers for predicting heart disease. The classifiers are Naive Bayes, SVM, voting, XGBoost, AdaBoost, bagging, KNN, DT, RF, and LR (see Table 1 ). The results indicated that ML classifiers could improve heart disease prediction accuracy, with the highest achieved being 97% by 28 using AdaBoost, DT, RF, KNN, and LR on the UCI dataset. Several studies utilized the Cleveland heart disease dataset (CHDD), with accuracies ranging from 77% 32 to 92% 30 using various ML algorithms such as AdaBoost, DT, RF, KNN, LR, SVM, and Naive Bayes. Hence, ML classifiers could improve the certainty of heart disease forecasting, enabling early detection and personalized treatment. Nonetheless, more investigation is essential to validate these classifiers’ accuracy using larger datasets and increase the generalizability and reproducibility of the results.

The objective of the study is to provide an ML approach for heart disease prediction. We evaluated ML algorithms on large, open-access heart disease prediction datasets. This study aims to construct an innovative machine learning technique that can accurately classify several heart disease (HD) datasets and then to evaluate its performance against that of other leading models. One of the key contributions of this research is the use of a private HD dataset: Egyptian specialized hospitals voluntarily provided 200 data samples between 2022 and 2024, from which we were able to collect approximately 13 features per participant. This work addresses the pressing need for early HD prediction in Egypt and Saudi Arabia, where the HD rate is rapidly increasing. We evaluated the proposed model’s performance by applying ML classification algorithms to a combined dataset consisting of both the CHDD and the private dataset. Predicting HD from such a combined dataset is new compared to earlier studies. Using the combined datasets and the SF-2 feature subset, the approach achieved 97.57% accuracy, 96.61% sensitivity, 90.48% specificity, 95.00% precision, a 92.68% F1 score, and a 98% AUC. To understand how the system arrives at its predictions, we developed an explainable artificial intelligence approach based on SHAP. A further contribution is the use of SMOTE to balance the class distribution: the proposed technique is trained on a SMOTE-balanced dataset to improve heart disease prediction performance. We also tuned the hyperparameters of all the ML classifiers applied in this article.
The proposed method achieved 97.57% accuracy with optimized hyperparameters when the combined datasets and the SF-2 feature subset were used. Additionally, to identify the classifier that achieves the most accurate HD prediction, the study assessed ten distinct ML classification algorithms, and the XGBoost technique was identified as a highly accurate classifier for predicting HD. Applying a domain adaptation method demonstrates the adaptability of the proposed app: the approach can be implemented in various environments and communities beyond the initial datasets used in this article.

The proposed study offers several unique contributions that significantly enhance its novelty and relevance in the heart disease prediction field, including:

Comprehensive Feature Selection Methodology : Our research introduces a comprehensive feature selection process using three distinct methods: chi-square, analysis of variance (ANOVA), and mutual information (MI). Unlike prior studies that may rely on a single or less systematic approach, we rigorously evaluated the importance of each feature through these methods, resulting in three specialized feature subsets (SF-1, SF-2, SF-3). This methodological approach guarantees the inclusion of only the most relevant and impactful features in the predictive model, thereby enhancing its accuracy and efficiency.

Evaluation Across Multiple ML Classifiers : We conducted an extensive comparative analysis of ten different ML classifiers, including state-of-the-art algorithms like XGBoost, AdaBoost, and ensemble methods. We identified the optimal classifier-feature combination, a topic not commonly addressed in existing literature, through a broad evaluation across various algorithms and the use of selected feature subsets. We demonstrated the superior performance of the XGBoost classifier with the SF-2 feature subset, highlighting the significance of our feature selection strategy.

Utilization of a Private Health Dataset : In addition to using a publicly available dataset, we employed a private health dataset that has not been explored in previous research. This inclusion of a novel dataset adds a layer of originality to our study, as it allows us to validate the model’s robustness and generalizability across different data sources. This dataset’s results offer fresh perspectives on heart disease prediction, especially in areas where comparable data has not undergone extensive analysis.

We implemented the Synthetic Minority Oversampling Technique (SMOTE) to address the issue of unbalanced data, which is often a challenge in medical datasets. By ensuring balanced training data, our study improves the reliability and accuracy of the predictive models, particularly in detecting early-stage heart disease. This step is crucial for enhancing the practical applicability of the model in real-world scenarios, where data imbalance is common.

Development of an Explainable AI Approach : To our knowledge, the integration of SHAP (Shapley Additive Explanations) methodologies to provide an explainable AI framework in the context of heart disease prediction is a novel contribution. This approach not only enhances the model’s trustworthiness by providing transparency into the prediction process, but also assists healthcare professionals in comprehending the underlying factors that influence the diagnosis.

Practical Application Through a Mobile App : Finally, the development of a mobile application based on the best-performing ML model marks a significant step towards practical, real-world implementation. This app enables users to input symptoms and quickly receive a heart disease prediction, offering a user-friendly, cost-effective tool for early detection. The translation of our research findings into a tangible product underscores the novelty of our study by bridging the gap between theoretical research and practical healthcare solutions.

Figure 1 shows the sequence of steps in the proposed system for predicting heart disease. We first gathered and preprocessed the dataset to remove inconsistencies, for example by replacing null values with average values. We divided the dataset into two groups, a training dataset and a test dataset. Next, we implemented several classification algorithms to determine which one achieved the highest accuracy on these datasets.
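The mean-imputation step mentioned above can be sketched as follows. This is a minimal illustration only: the function name `impute_column_means` and the toy records are assumptions made for the example, not the authors' code or data.

```python
import numpy as np

def impute_column_means(X):
    """Replace NaN entries with the mean of the observed values in the same column."""
    X = np.asarray(X, dtype=float).copy()
    col_means = np.nanmean(X, axis=0)        # column means over non-missing entries
    rows, cols = np.where(np.isnan(X))       # positions of the missing values
    X[rows, cols] = col_means[cols]
    return X

# Hypothetical toy records: columns are (age, chol); NaN marks a missing value.
records = np.array([[63.0, 233.0],
                    [41.0, np.nan],
                    [np.nan, 250.0],
                    [56.0, 210.0]])
clean = impute_column_means(records)
```

Each missing entry is filled from its own column only, so a gap in one feature never borrows values from another.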

The proposed approach sequences for heart disease prediction.

The proposed methodology

This study investigates ML techniques such as Naive Bayes, SVM, voting, XGBoost, AdaBoost, bagging, DT, KNN, RF, and LR classifiers. These algorithms can aid doctors and data analysts in making correct diagnoses of cardiac disease. This article incorporates recent data on cardiovascular illness, as well as relevant journals, research, and publications. The methodology, as in 1 , provides a framework for the suggested model: a set of steps that transform raw data into usable and identifiable data patterns. The proposed approach consists of three stages: data collection, extraction of specific feature values, and data exploration, as shown in Fig. 1 . Depending on the procedures employed, data preprocessing handles missing values, data cleansing, and normalization 2 . We then classified the pre-processed data using the ten classifiers (A1, A2, ..., A10). Finally, after putting the suggested model into practice, we evaluated its performance and accuracy using a range of performance measures. With this model, we developed a Reliable Prediction System for Heart Disease (RPSHD) using a variety of classifiers. The model uses 13 medical factors for prediction, among which are age, sex, cholesterol, blood pressure, and electrocardiography 3 .

Datasets and dataset features

This research employs both the CHDD and a private dataset for heart disease prediction. The CHDD dataset has 303 samples, while the private dataset has 200, and they have the same features. The combined dataset contains 503 records, each with 13 associated features (including demographic, clinical, and laboratory parameters). The features that can be used for heart disease prediction include age, gender, blood pressure, cholesterol levels, electrocardiogram (ECG) readings, chest pain, exercise-induced angina, fasting blood sugar, maximum heart rate achieved, oldpeak, coronary artery, thalassemia, and other clinical and laboratory measurements, as shown in Table 2 . The outcome variable, known as “Target”, takes a binary value that indicates whether or not cardiac disease is present.

Figure 2 shows the percentage distribution of individuals with heart disease in the combined datasets. A total of 503 samples have been gathered; 45.9% of those individuals have been diagnosed with HD, while the remaining 54.1% do not have the disease.

Boxplots are an effective visualization technique for understanding the distribution of data and identifying potential outliers. By applying boxplots to a dataset related to HD, one can get insights into the distribution of a variety of HD-related features or variables. The boxplots of the HD dataset are illustrated in Fig. 3 . Every boxplot we obtained contained outliers. Removing them would lower the median of the data, which might make it harder to detect HD accurately. On the other hand, this method offers more benefits than the others; by identifying heart disease at an early stage, when medical care is most beneficial, this diagnostic could save lives.
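To make the notion of a boxplot outlier concrete, the sketch below flags values outside the conventional 1.5 × IQR fences that many boxplot implementations draw by default. The paper does not state which outlier rule its plots use, so the rule, the function name `iqr_outliers`, and the cholesterol values are all assumptions made for illustration.

```python
import numpy as np

def iqr_outliers(values, k=1.5):
    """Flag points lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    mask = (values < lower) | (values > upper)
    return mask, (lower, upper)

# Hypothetical cholesterol readings; the last one is deliberately extreme.
chol = [210, 225, 233, 240, 250, 262, 564]
mask, fences = iqr_outliers(chol)
```

Flagging outliers is separate from removing them; as the text notes, removal shifts the summary statistics of the data.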

The percentage distribution of heart disease in the Combined dataset.

Boxplots of the combined heart disease dataset.

Dataset preparation

In this research, the collected data were preprocessed. The CHDD has four inaccurate CMV records and two erroneous TS entries. Incorrect data are updated to reflect the best possible values for all fields. Then, StandardScaler is employed to standardize all the features, ensuring that each feature has zero mean and unit variance. By considering the patient’s history of cardiac problems and other medical concerns, an organized, augmented dataset was assembled.

The dataset studied in this research is a combination of the publicly accessible CHDD and the chosen private dataset. Partitioning the combined data into training and test portions allows us to use the holdout validation method. In this study, 25% of the data is in the test dataset, compared to 75% in the training dataset. The mutual information method is used in this research to measure the interdependence of variables. Larger values indicate greater dependency.
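A 75%/25% holdout split followed by StandardScaler standardization can be sketched with scikit-learn as below. The data are synthetic stand-ins with the same shape as the combined dataset (503 records, 13 features). The stratified split, the random seeds, and fitting the scaler on the training portion only are assumptions of this sketch, since the paper does not specify them.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not the CHDD or the private dataset).
rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(503, 13))
y = rng.integers(0, 2, size=503)

# Holdout validation: 75% training, 25% test (stratification is an assumption).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Fit the scaler on the training data only, then apply it to both portions,
# so that no information from the test set leaks into preprocessing.
scaler = StandardScaler().fit(X_train)
X_train_std = scaler.transform(X_train)
X_test_std = scaler.transform(X_test)
```

After this step each training feature has zero mean and unit variance; the test features are close to, but not exactly, standardized because they reuse the training statistics.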

The importance of features provides valuable insights into the relevance and predictive power of each feature in a dataset. Using the mutual information technique, the thalach feature receives the highest importance, 13.65%, while the fbs feature receives the lowest, 1.91%, as illustrated in Fig. 4 .

The importance of the heart disease dataset features.

Feature selection

In this research, we perform feature selection and classification using the Scikit-learn module of Python 20 . Initially, the processed dataset was analyzed using several different ML classifiers, including RF, LR, KNN, bagging, DT, AdaBoost, XGBoost, SVM, voting, and Naive Bayes, which were evaluated for their overall accuracy. In the second step, we used the Seaborn library from Python to create heat maps of correlation matrices and other visualizations of correlations between different sets of data. Thirdly, a wide variety of feature selection methods (FSM) such as analysis of variance (ANOVA), chi-square, and mutual information (MI) were applied. These strategies are explained in Table 3 and are indicated by the acronyms FSM1, FSM2, and FSM3, respectively. Finally, the performance of several algorithms was compared for the identified features. The validity of the analysis was demonstrated using accuracy, specificity, precision, sensitivity, and F1 score. The StandardScaler method was used to standardize every feature before it was passed to the algorithms.

The outcome of different feature selection methods

The ANOVA F-value technique determines an F value and a weight for each feature. Table 4 (a) presents the findings of the ANOVA F test. The EIA, CPT, and OP features contribute the most to the score, while the RES, CM, and FBS features contribute the least. Chi-square is another approach that determines the degree to which every feature relates to the target. Table 4 (b) shows the chi-square outcomes. In this method, the three most significant features are MHR, OP, and CMV, whereas TS, REC, and FBS are the least important ones. The MI technique is used in FSM3. To evaluate the degree of mutual dependency between variables, this approach calculates the mutual information between them. A score of 0 indicates complete independence between the two variables under consideration; a larger number indicates a greater dependence. The MI score results are shown in Table 4 (c). CPT, TS, and CMV receive the three highest dependency scores in this case, whereas FBS and REC score as independent. Table 4 illustrates important factors that can be utilized for predicting the probability of having heart disease. Furthermore, REC, FBS, RBP, and CM all have lower total scores across all three FSMs. Based on these scores, three distinct groups of features were chosen, abbreviated SF-1, SF-2, and SF-3. Table 5 shows the feature sets that were selected for additional investigation.
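The three scoring methods (ANOVA F value, chi-square, and mutual information) are all available in scikit-learn, and a sketch of how they rank features is given below. The data are synthetic, the subset size `k=10` is a hypothetical choice, and the helper `top_k` is introduced only for illustration; the paper's own rule for forming SF-1, SF-2, and SF-3 is given in its Tables 4 and 5 and is not reproduced here.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import chi2, f_classif, mutual_info_classif
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in with 13 features (not the paper's data).
X, y = make_classification(n_samples=503, n_features=13, n_informative=5,
                           n_redundant=2, random_state=0)

# chi2 requires non-negative inputs, so rescale to [0, 1] for that test only.
X_pos = MinMaxScaler().fit_transform(X)

f_scores, _ = f_classif(X, y)                            # ANOVA F value per feature
chi_scores, _ = chi2(X_pos, y)                           # chi-square statistic per feature
mi_scores = mutual_info_classif(X, y, random_state=0)    # estimated MI per feature

def top_k(scores, k=10):
    """Return the indices of the k highest-scoring features, sorted by index."""
    return sorted(np.argsort(scores)[::-1][:k].tolist())

subsets = {
    "ANOVA": top_k(f_scores),
    "chi-square": top_k(chi_scores),
    "MI": top_k(mi_scores),
}
```

The three methods measure different things (variance between classes, dependence on counts, and general statistical dependence), so their rankings usually overlap without being identical, which is why comparing several subsets is informative.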

Based on the research’s assessment of performance criteria (see Table 6 ), we chose the XGBoost classifier with SMOTE using the combined datasets and SF-2 feature subset. We will embed the most accurate technique in a mobile app and deploy the model using a variety of tools and development environments, including Android Studio 14.0, Python 3.10, Spyder, Java 11, and Pickle 5 26 .

The use of SMOTE and SHAP methods

To overcome the problem of imbalanced datasets, ML prediction applications employ the strong Synthetic Minority Oversampling Technique (SMOTE). This technique plays an important role in various applications.

Balancing class distribution : In many prediction tasks, such as medical diagnosis and prediction, the dataset is often imbalanced. This implies that a particular class, typically the one of interest, has a lower representation than the other class. SMOTE interpolates minority class examples to create synthetic minority class samples. This balanced class distribution ensures the prediction model gets enough minority class examples to learn from.

Improving predictive accuracy : In predictive modeling, an imbalanced dataset can cause the model to be biased towards the majority class, leading to poor performance in predicting the minority class. Accurate prediction of the minority class poses a significant challenge. Applying SMOTE trains the model on a more balanced dataset, improving accuracy and predictive performance, particularly for the minority class. This is critical in applications where missing the minority class (e.g., disease cases) can have significant consequences.

Enhancing recall and precision : Predictive models trained on imbalanced datasets often exhibit high precision for the majority class but low recall for the minority class. This means they miss a large portion of the minority class instances, even if the ones they do identify are accurate. SMOTE helps improve recall without sacrificing precision, leading to a more balanced and effective model. In practical terms, this means the model is better at identifying all relevant cases, not just a select few.

Reducing model bias : In prediction applications, a biased model can result in unfair outcomes, especially when the minority class is underrepresented. By exposing the model to a sufficient number of minority class examples during training, SMOTE mitigates this bias. This helps create a more equitable model that makes fairer predictions across all classes.

Improving generalization : Models trained on imbalanced data may perform well on the majority class during training, but they fail to generalize well to new, unseen data, particularly for the minority class. By using SMOTE to create a balanced training set, the model is better equipped to generalize its predictions to new data, leading to more reliable and consistent performance in real-world applications.

Enhancing robustness in deployment : In deployed machine learning applications, robustness is key. Predictive models often face real-world data that is skewed or imbalanced. SMOTE helps create a more robust model that can handle such data more effectively, reducing the risk of failure in production environments. This is crucial for applications like predictive maintenance, where identifying rare but critical failures can prevent costly downtime.
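The interpolation idea behind SMOTE, described in the list above, can be sketched in a few lines of NumPy. This is a simplified illustration, not the imbalanced-learn implementation and not the authors' code: the function name `smote_sketch`, the neighbour count, and the toy minority samples are assumptions.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority-class neighbours."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    k = min(k, n - 1)
    # Pairwise Euclidean distances within the minority class.
    dist = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)                 # a point is not its own neighbour
    neighbours = np.argsort(dist, axis=1)[:, :k]   # indices of the k nearest neighbours
    base = rng.integers(0, n, size=n_new)          # sample to start from
    picked = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))                   # interpolation factor in [0, 1)
    return X_min[base] + gap * (X_min[picked] - X_min[base])

# Hypothetical minority-class samples with two features.
minority = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]])
synthetic = smote_sketch(minority, n_new=6, k=2)
```

Every synthetic point lies on the line segment between two real minority samples. Oversampling of this kind is normally applied to the training portion only, so that the test set still reflects the original class distribution.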

On the other hand, SHAP (Shapley Additive Explanations) is a powerful tool in ML that helps to interpret and explain the predictions made by complex models. The following are the benefits that SHAP offers in ML applications:

Enhanced Transparency : SHAP makes black-box models more transparent, fostering trust among users and stakeholders. This is especially crucial in industries like finance, healthcare, and legal, where understanding model decisions is essential.

Regulatory Compliance : Many industries are subject to regulations that require model decisions to be explainable. SHAP ensures compliance by providing clear, understandable explanations for each decision, facilitating documentation, and sharing with regulators.

Improved user trust and adoption : When end-users understand why a model is making certain predictions, they are more likely to trust and adopt the technology. User interfaces can incorporate SHAP explanations to improve the user friendliness of AI-powered applications.

Actionable Insights : SHAP doesn’t just explain predictions; it also provides actionable insights. For example, in prediction models, SHAP can identify key factors for effective features, allowing doctors to take proactive steps to detect disease.

Facilitates Collaboration : SHAP explanations can bridge the gap between data scientists and non-technical stakeholders, facilitating better communication and collaboration. By providing a common understanding of model behavior, teams can work more effectively together.
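SHAP builds on Shapley values, which attribute a prediction to individual features by averaging each feature's marginal contribution over all orderings. The sketch below computes exact Shapley values by brute force for a tiny, hypothetical scoring function. It does not use the `shap` library, which relies on a background dataset and faster model-specific algorithms, and the toy model `risk_score` with its coefficients is an assumption for illustration, not the paper's classifier.

```python
from itertools import permutations

def shapley_values(predict, x, baseline):
    """Exact Shapley values of predict(x) relative to predict(baseline),
    averaged over every ordering in which features are switched on."""
    n = len(x)
    phi = [0.0] * n
    orders = list(permutations(range(n)))
    for order in orders:
        current = list(baseline)
        previous = predict(current)
        for i in order:
            current[i] = x[i]                 # reveal feature i
            value = predict(current)
            phi[i] += value - previous        # marginal contribution of feature i
            previous = value
    return [p / len(orders) for p in phi]

def risk_score(features):
    """Hypothetical toy model with one interaction term."""
    exang, oldpeak, ca = features
    return 0.2 * exang + 0.1 * oldpeak + 0.15 * ca + 0.05 * exang * ca

x = [1, 2.0, 2]            # instance to explain (illustrative values)
baseline = [0, 0.0, 0]     # reference input
phi = shapley_values(risk_score, x, baseline)
```

The contributions add up to the difference between the prediction for `x` and the prediction for the baseline, which is the additive property that gives SHAP its name. Brute-force enumeration grows factorially with the number of features, so it is practical only for small examples like this one.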

Experimental results and analysis

We use Jupyter Notebook 7 to predict heart diseases from a dataset. It simplifies the visualization of different data relationship graphs in the dataset and facilitates document creation, including live coding. The first step of this research involves cleaning the CHDD using Python’s Pandas and NumPy libraries (version 24.2.0). Next, the StandardScaler method from Python’s Scikit-learn module preprocesses the dataset 34 . The second step calculates the importance of each feature using a feature selection approach, generating three sets of features (SF). Thirdly, we separated the dataset into training and testing sets, using 75% of the data for training and the remaining 25% for testing. Finally, we trained ten distinct ML algorithms on the 75% training portion and selected the method with the best performance to predict heart disease 35 .
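The train-and-compare workflow described above can be sketched as follows. The example uses synthetic data and only five scikit-learn classifiers rather than the paper's ten. `GradientBoostingClassifier` stands in for XGBoost so that the sketch needs no extra package, and all hyperparameters are library defaults or arbitrary choices, not the tuned values from the study.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with 13 features (not the paper's dataset).
X, y = make_classification(n_samples=503, n_features=13, n_informative=6,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

models = {
    "LR": LogisticRegression(max_iter=1000),
    "KNN": KNeighborsClassifier(),
    "DT": DecisionTreeClassifier(random_state=0),
    "RF": RandomForestClassifier(random_state=0),
    "GB (stand-in for XGBoost)": GradientBoostingClassifier(random_state=0),
}

scores = {}
for name, model in models.items():
    # Scaling lives inside the pipeline, so it is fit on the training data only.
    pipeline = make_pipeline(StandardScaler(), model)
    pipeline.fit(X_train, y_train)
    scores[name] = pipeline.score(X_test, y_test)   # holdout accuracy

best = max(scores, key=scores.get)
```

Because the data here are synthetic, the resulting ranking says nothing about which classifier is best for heart disease prediction; the point is only the shape of the comparison loop.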

Performance evaluation

In this subsection, the authors evaluate and explain the proposed system’s performance. The authors presented various algorithms and their comparative performances using evaluation metrics such as accuracy, sensitivity, specificity, and F1-score. We evaluated these performance measures using true positive (TP), true negative (TN), false positive (FP), and false negative (FN) data. The next subsection focuses on these measurements. After this evaluation, we provided the algorithm that produced the best results. Figure 5 illustrates the use of the confusion matrix in assessing the performance of a classification model.

Confusion matrix of the HD dataset using XGBoost and SMOTE.

Figure 5 illustrates the predicted values of TP, FP, TN, and FN for the XGBoost classifier using SMOTE. Each element in this confusion matrix represents the number of cases with a particular combination of actual and predicted class labels. As an illustration, the matrix has 63 cases (TP) of heart disease correctly classified as “heart disease”, 3 cases (FP) without heart disease classified as “heart disease”, 4 cases (FN) of heart disease classified as “no heart disease”, and 66 cases (TN) without heart disease correctly classified as “no heart disease”.

Figure 6 presents the correlation between the important features of SF-2 using SMOTE. The y-axis values include thalach, chol, sex, age, slope, exang, oldpeak, ca, cp, and thal. Correlation coefficients close to −1 or 1 indicate a strong negative or positive relationship between two variables, whereas a coefficient of 0 indicates no linear association. It is essential to keep in mind that correlation can detect only the linear relationship between variables. The prediction for the patient is correlated with each of those variables at a level of at least 70%.
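A correlation matrix of the kind plotted in Figure 6 can be computed with NumPy, as sketched below. Pearson coefficients lie between −1 and 1: values near either extreme indicate a strong linear relationship, and values near 0 indicate little or none. The three variables are synthetic, with a negative dependence between `age` and `thalach` built in by construction, so the numbers are illustrative assumptions and not the paper's data.

```python
import numpy as np

rng = np.random.default_rng(1)
age = rng.normal(54, 9, size=200)
# Constructed so that maximum heart rate falls as age rises, plus noise.
thalach = 200 - 0.9 * age + rng.normal(0, 8, size=200)
# An unrelated variable for contrast.
unrelated = rng.normal(size=200)

# Rows are variables; corr[i, j] is the Pearson correlation of variables i and j.
corr = np.corrcoef(np.vstack([age, thalach, unrelated]))
```

The matrix is symmetric with ones on the diagonal, which is why heat maps such as the Seaborn one mentioned in the paper are often read from one triangle only.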

Correlation between features of SF-2 using SMOTE.

Scatter plot among four selected features in the SF-2.

Figures 7 and 8 show the scatter and density plots among the four selected features in the SF-2 dataset. These scatter and density graphs are beneficial for exploring the relationships and distributions of variables in the HD dataset. They can provide insights into correlation, concentration, outliers, and patterns that may exist among the four variables (exang, cp, ca, and thal).

Density plot among the first four important features in the SF-2.

Accuracy : Accuracy is the percentage of samples that have been correctly classified. It is computed from the confusion matrix using Eq. ( 1 ):

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{1}$$

Sensitivity (or recall ): Sensitivity measures the true positive rate, that is, “the proportion of correctly detected positive samples”. It is determined by the following formula:

$$\text{Sensitivity} = \frac{TP}{TP + FN} \tag{2}$$

Specificity : Specificity measures the true negative rate, that is, the fraction of actual negative cases that are correctly identified. It is determined mathematically by:

$$\text{Specificity} = \frac{TN}{TN + FP} \tag{3}$$

Precision : Precision is the proportion of predicted positives that are truly positive, obtained by comparing the real TP against all predicted positives. Eq. ( 4 ) gives the precision measure:

$$\text{Precision} = \frac{TP}{TP + FP} \tag{4}$$

F-measure : It is a statistical measure used to evaluate the efficacy of a classification model. It is the harmonic mean of precision and recall, giving each of these metrics equal weight. It enables the performance of a model to be described and compared using a single score that takes both the recall and the precision of the model’s predictions into consideration, and is calculated using the following formula:

$$F_1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{5}$$

The performance of a classifier is represented and evaluated using a confusion matrix, as shown in Fig. 5. TP is the number of sick individuals correctly classified into the positive class. TN is the number of healthy people correctly labeled as belonging to the negative class. FP is the number of healthy persons incorrectly diagnosed as sick. FN is the number of sick persons incorrectly predicted as healthy. A comparison of the performance indicators across the 10 ML algorithms is presented in Table 6. These classifiers were applied to the combined dataset using the SF-1, SF-2, and SF-3 feature subsets. With an accuracy of 97.57%, sensitivity of 96.61%, specificity of 90.48%, precision of 95.00%, and F1 score of 92.68% for the SF-2 feature group (see Table 6), the XGBoost classifier had the best overall performance.
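
The five metrics described above follow directly from the four confusion-matrix counts. The following is a minimal sketch in plain Python; the function name `classification_metrics` and the example counts are illustrative assumptions and are not taken from Table 6.

```python
def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Return accuracy, sensitivity, specificity, precision and F1
    computed from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)   # recall: share of sick patients detected
    specificity = tn / (tn + fp)   # share of healthy patients correctly cleared
    precision = tp / (tp + fp)     # share of positive predictions that are correct
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
    }


# Hypothetical counts for illustration only (not the paper's results).
metrics = classification_metrics(tp=45, tn=40, fp=5, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

The sketch assumes every denominator is non-zero; a production version would need to handle a class with no samples or no positive predictions.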

Experimental evaluation of system performance

Table 6 displays the accuracy of each technique on the processed dataset. A4 achieved the highest accuracy on SF-2 (97.57%), followed by its accuracy on SF-1 and SF-3 (93.17% and 94.19%, respectively). A9 achieved an accuracy of 93.07% on all three SFs, putting it in second place. On the other hand, A5 had the lowest accuracy among all classifiers, 85.15% on both SF-1 and SF-3. A3 and A10 likewise had low accuracy on SF-2 and SF-3, at 86.14% and 86.12%, respectively. The other methods had an accuracy between 87.13% and 90.00%. This finding shows that XGBoost with SF-2 is the most effective method for processing the dataset. Figure 9 illustrates the accuracy rates of the ten machine learning techniques, all using SF-2.

This study evaluated the sensitivity of all the algorithms. Table 6 displays the sensitivity scores obtained by the ten ML techniques using SF-1, SF-2, and SF-3. A5 had the lowest sensitivity, 88.14% on SF-3. A8 scored 89.83% on both SF-1 and SF-3. A4 (XGBoost) reported the highest sensitivity, 96.61% on SF-2; A2, A3, A4, A6, and A9 reported the second-highest sensitivity, at 94.92%.

We also analyzed the specificity of each technique, and Table 6 summarizes the results. A3 scored the lowest (73.81%) on SF-2 and SF-3. A4 and A9 scored the highest across all SFs (90.48%). A7 achieved the single best score (92.86%), but on SF-3 only.

The accuracy results of the ten ML Algorithms.

We adopted several strategies to mitigate the risk of overfitting and ensure the real-world applicability of the proposed heart disease prediction model: (1) Cross-validation: We used k-fold cross-validation in Python (using scikit-learn) to assess the generalization performance of the models on multiple subsets of the data. This helps identify models that are less prone to overfitting and provides a more reliable estimate of their true performance. (2) Regularization: We incorporated regularization techniques, such as L1 or L2 regularization, dropout, or early stopping, to prevent the models from becoming overly complex and to reduce the risk of overfitting. (3) Feature engineering: In addition to feature selection methods such as ANOVA, chi-square, and MI, we used feature engineering techniques that capture meaningful relationships and domain knowledge relevant to heart disease prediction. (4) External validation: We validated the models on independent datasets or real-world patient data to assess their performance in a variety of settings and ensure their reliability in clinical practice.
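
The k-fold cross-validation step in item (1) can be illustrated by the way the data are split. The sketch below is written in plain Python rather than scikit-learn so that it is self-contained; the stratification by class, the function name `stratified_k_fold`, and the toy labels are assumptions for illustration, since the study does not state which splitter was used. In scikit-learn, `StratifiedKFold` and `cross_val_score` offer comparable functionality.

```python
import random
from collections import defaultdict


def stratified_k_fold(labels, k=5, seed=0):
    """Yield (train_indices, test_indices) pairs for k folds,
    keeping the class proportions roughly equal in every fold."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled indices round-robin across the folds.
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)

    for i in range(k):
        test_idx = sorted(folds[i])
        train_idx = sorted(j for fold in folds[:i] + folds[i + 1:] for j in fold)
        yield train_idx, test_idx


# Hypothetical labels: 20 patients with heart disease (1), 30 without (0).
labels = [1] * 20 + [0] * 30
splits = list(stratified_k_fold(labels, k=5))
for fold_no, (train_idx, test_idx) in enumerate(splits, start=1):
    positives = sum(labels[i] for i in test_idx)
    print(f"fold {fold_no}: train={len(train_idx)} test={len(test_idx)} positives={positives}")
```

Each model would be fitted on the training indices and scored on the held-out indices, and the k scores averaged to estimate generalization performance.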

This study used a combined dataset (the CHDD and private datasets) for both training and testing, applying a variety of ML techniques for the early recognition of CVD. We trained and tested the ML model on the source and target datasets using a domain adaptation approach. We first trained the proposed HD prediction technique on a private dataset containing 200 cases, and then evaluated the system on the combined dataset of 503 cases. Specifically, we employed ten well-known ML algorithms, namely Naive Bayes, SVM, voting, XGBoost, AdaBoost, bagging, DT, KNN, RF, and LR, denoted by (A1, A2, …, A10) as shown in Table 6, each with a unique set of selected features. The ANOVA F statistic, the chi-square test, and the MI statistic were the statistical methods used to select the features that best predict CVD. We used five evaluation metrics (accuracy, sensitivity, precision, specificity, and F1 score) to compare and rate the performance of the ML techniques that used SMOTE. The experiment showed that algorithm A4 had the highest accuracy (97.57%) for SF-2. Algorithm A9 had the second-highest accuracy (93.17%) for all three SFs shown in Table 6. A4 likewise obtained the highest sensitivity (96.61%) and the best specificity (90.48%) when tested on SF-2, as shown in Table 6. For the F1 score, A4 had the highest score of 92.68% for SF-2 (see Table 6), followed by A9 with 91.57% for SF-1, SF-2, and SF-3, and A6 with 90.24% for SF-2. Because A4 performs best with SF-2, this method is the most reliable technique in terms of accuracy, specificity, and sensitivity. In terms of F1 score, A9 is the most consistent predictive model across all SFs, making it the second-best predictive algorithm overall.
We therefore conclude that XGBoost provides the highest performance and is an effective method for predicting heart disease. Across the ML algorithms, most cases achieved an accuracy between 85.15% and 97.57%.
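
To make the chi-square feature selection step mentioned above more concrete, the sketch below scores categorical features against the class label using the Pearson chi-square statistic of independence. It is an illustrative assumption rather than the study's actual pipeline: the function name `chi_square_score` and the toy data are hypothetical, and library scorers (for example scikit-learn's `chi2` used with `SelectKBest`) may compute the statistic differently.

```python
from collections import Counter


def chi_square_score(feature, target):
    """Pearson chi-square statistic of independence between a
    categorical feature and a categorical target."""
    n = len(feature)
    observed = Counter(zip(feature, target))
    feature_totals = Counter(feature)
    target_totals = Counter(target)

    statistic = 0.0
    for f_value, f_count in feature_totals.items():
        for t_value, t_count in target_totals.items():
            expected = f_count * t_count / n
            statistic += (observed[(f_value, t_value)] - expected) ** 2 / expected
    return statistic


# Hypothetical toy data: 10 patients with heart disease, 10 without.
target = [1] * 10 + [0] * 10
features = {
    "exang": [1] * 8 + [0] * 2 + [0] * 8 + [1] * 2,  # strongly associated
    "noise": [1, 0] * 10,                             # unrelated to the label
}
scores = {name: chi_square_score(values, target) for name, values in features.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(scores, ranking)
```

Features with larger statistics are more strongly associated with the label, so a feature subset can be formed by keeping the top-ranked features.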

The XGBoost classifier’s receiver operating characteristic (ROC) curve with SF-2 is shown in Fig. 10 . This curve shows how well the model works across all classification thresholds, with an AUC of 0.98 (see Table 7 ).

AUC and ROC curve for the XGBoost classifier using SMOTE.
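
The AUC summarizes the ROC curve as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The sketch below computes it by direct pairwise comparison; the function name `roc_auc` and the example scores are hypothetical and unrelated to the values in Table 7. For real datasets, scikit-learn's `roc_auc_score` is a more efficient choice.

```python
def roc_auc(y_true, scores):
    """Area under the ROC curve via pairwise comparison of
    positive and negative scores (ties count as half a win)."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical predicted probabilities for six patients.
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
auc = roc_auc(y_true, scores)
print(f"AUC = {auc:.4f}")
```

Because the AUC is computed from the ranking of scores, it does not depend on any single classification threshold.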

We applied an explainable AI method based on the SHAP library to interpret the model’s decision-making.

Figure 11 shows the SHAP feature importance for the XGBoost model trained with SMOTE.

Explainable AI interpretation of the XGBoost feature importance.
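
SHAP attributes a prediction to individual features using Shapley values. As a rough illustration of the idea, the sketch below computes exact Shapley values by brute force for a hypothetical three-feature risk function. The function `risk_model`, its coefficients, and the all-zero baseline are assumptions made for this example; the `shap` package uses much faster algorithms for tree models such as XGBoost.

```python
from itertools import combinations
from math import factorial


def shapley_values(model, x, baseline):
    """Exact Shapley values of model(x) relative to model(baseline).
    Features outside a coalition are replaced by their baseline value."""
    n = len(x)

    def value(coalition):
        mixed = [x[i] if i in coalition else baseline[i] for i in range(n)]
        return model(mixed)

    contributions = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                coalition = set(subset)
                total += weight * (value(coalition | {i}) - value(coalition))
        contributions.append(total)
    return contributions


def risk_model(features):
    """Hypothetical risk score with one interaction term (illustration only)."""
    exang, cp, ca = features
    return 0.5 * exang + 0.3 * cp + 0.2 * exang * ca


x = [1, 1, 1]          # hypothetical patient
baseline = [0, 0, 0]   # hypothetical reference patient
phi = shapley_values(risk_model, x, baseline)
print(dict(zip(["exang", "cp", "ca"], phi)))
```

The contributions sum to the difference between the prediction for the patient and the prediction for the baseline, and the interaction term is shared equally between the two features involved.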

The goal of this work is to use ML classifiers for HD prediction. The experimental findings showed that the XGBoost algorithm was the most accurate for predicting the occurrence of HD. The following features are classified as important for HD prediction according to the mutual information-based feature selection approach: thalach, chol, oldpeak, age, trestbps, ca, thal, cp, exang, slope, restecg, sex, and fbs. We tuned the hyperparameters and used the SMOTE method to oversample the collected data. The XGBoost technique with SMOTE produced the best results. The study reached its goal of predicting HD with the combined datasets, and the experimental results were 97.57% for accuracy, 96.61% for sensitivity, 90.48% for specificity, 95.00% for precision, 92.68% for F1 score, and 98% for AUC.
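
The SMOTE oversampling step mentioned above creates synthetic minority samples by interpolating between a minority sample and one of its nearest minority neighbours. The sketch below shows that core idea in plain Python; the function name `smote_sketch`, the parameter defaults, and the toy points are assumptions for illustration, and the study does not state which implementation was used (the `SMOTE` class in the imbalanced-learn package is one common option).

```python
import random


def smote_sketch(minority, n_new, k=3, seed=0):
    """Generate n_new synthetic points, each lying on the segment between
    a random minority sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base_idx = rng.randrange(len(minority))
        base = minority[base_idx]
        others = [p for j, p in enumerate(minority) if j != base_idx]
        # Sort the remaining minority samples by squared Euclidean distance.
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic


# Hypothetical minority-class samples with two scaled features.
minority = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (3.0, 3.0)]
synthetic = smote_sketch(minority, n_new=6, k=2, seed=42)
print(synthetic)
```

Oversampling of this kind should be applied to the training folds only, so that synthetic points derived from test samples do not leak into model fitting.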

In the context of heart disease prediction, the high accuracy of 97.57% suggests that the XGBoost model is very reliable in distinguishing between patients who do and do not have heart disease. However, it is critical to interpret what this high accuracy means in clinical practice:

Early Detection: The model’s high sensitivity (96.61%) indicates that it can effectively identify patients in the early stages of heart disease, which is crucial for timely intervention and treatment.

Minimizing False Positives: The specificity (90.48%) indicates a relatively low rate of false positives, implying fewer patients would receive an incorrect heart disease diagnosis, thereby reducing unnecessary anxiety and treatments.

Balanced Prediction: The high F1 score (92.68%) reflects a balance between precision (the share of predicted heart disease cases that are correct) and recall (identifying as many true cases as possible), which is critical for practical applications where both false positives and false negatives have serious consequences.

To understand the breakthrough, the performance of XGBoost must be compared to existing gold standards in heart disease prediction, typically involving established clinical scoring systems or other ML models that have been widely accepted in healthcare.

If XGBoost surpasses existing models: The improvement to 97.57% accuracy would represent a significant advancement, potentially offering a more reliable tool for clinicians.

If comparable to current standards: If this accuracy is only slightly better than or comparable to current methods, the significance of the improvement must be critically evaluated. Factors such as the model’s generalizability, ease of integration into clinical workflows, and interpretability for healthcare professionals become critical in deciding whether it truly represents a breakthrough.

To consider this performance a breakthrough, future work must incorporate the following justifications:

Comparison to Baseline Methods: Show that XGBoost significantly outperforms existing prediction methods in terms of accuracy and other key metrics.

Clinical Impact: Discuss how this improvement could translate into better patient outcomes, such as reduced mortality or morbidity, due to more accurate early diagnosis.

Scalability and Implementation: Describe the potential integration of this model with current medical systems and its applicability to various patient populations in real-world settings.
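
One way to support a claim that a model significantly outperforms a baseline on the same test set is a paired test on their disagreements. The sketch below uses McNemar's test with continuity correction; this test is our choice for illustration and is not a method used in the study, and the function name `mcnemar_test` and the predictions are hypothetical.

```python
from math import erfc, sqrt


def mcnemar_test(y_true, pred_a, pred_b):
    """McNemar's chi-square test (with continuity correction) comparing two
    classifiers evaluated on the same samples. Returns (statistic, p_value)."""
    # b: model A correct and model B wrong; c: model A wrong and model B correct.
    b = sum(1 for t, a, o in zip(y_true, pred_a, pred_b) if a == t and o != t)
    c = sum(1 for t, a, o in zip(y_true, pred_a, pred_b) if a != t and o == t)
    if b + c == 0:
        return 0.0, 1.0  # the two models never disagree
    statistic = max(abs(b - c) - 1, 0) ** 2 / (b + c)
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = erfc(sqrt(statistic / 2))
    return statistic, p_value


# Hypothetical predictions on 40 test patients (all truly positive for brevity).
y_true = [1] * 40
pred_a = [1] * 38 + [0] * 2               # candidate model: 2 errors
pred_b = [1] * 26 + [0] * 12 + [1] * 2    # baseline model: 12 errors
statistic, p_value = mcnemar_test(y_true, pred_a, pred_b)
print(f"statistic = {statistic:.3f}, p = {p_value:.4f}")
```

A small p-value suggests that the difference in error rates is unlikely to be due to chance on this test set; it does not by itself establish clinical benefit.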

The proposed ML-based heart disease prediction technique has the potential to improve healthcare. By allowing early detection and treatment, accurate heart disease prediction can greatly reduce mortality. In resource-constrained situations with limited expert access, doctors can use this tool to diagnose patients. Integrating the technology with electronic health record (EHR) systems would enable real-time risk assessments to improve patient outcomes and decision-making.

This study’s mobile app may help patients, especially those in remote places or without access to healthcare, self-assess. It lets users enter their symptoms and get a quick heart disease risk assessment, pushing them to seek medical care. This tool puts individuals in charge of their health, which may help diagnose and treat heart problems earlier.

The study contributes to the broader goal of digital health improvement by providing a scalable, cost-effective solution for heart disease prediction. By leveraging ML and explainable AI (through SHAP methodologies), the authors have created a system that is not only accurate but also interpretable, ensuring that healthcare professionals can trust and understand the predictions made by the model. This level of transparency is critical for the adoption of AI tools in clinical practice.

Moreover, the use of a mobile app extends the reach of this technology, making it accessible to a larger population. This democratization of healthcare tools aligns with global efforts to improve public health and reduce the burden of cardiovascular diseases, which are a leading cause of death worldwide.

Limitations

Despite the promising results and potential uses of the proposed ML-based technique for heart disease prediction, there are several limitations to consider:

Dataset quality and availability: The performance and reliability of ML models depend on the quality and availability of training and testing datasets. We used the Cleveland heart disease dataset and a private dataset in our study, which may be limited in availability, representativeness, and data quality. This limitation could make it hard to apply the proposed approach to a broader sample drawn from a variety of additional sources.

Imbalanced classes: SMOTE generates synthetic minority class samples to overcome class imbalance, but its effectiveness depends on the dataset and situation. Class imbalance becomes a major issue in heart disease prediction when disease prevalence is low. Class imbalance can cause models to perform well for the majority class but poorly for the minority class, which is often the class of interest. To address this issue, it is essential to discuss and compare alternative approaches to SMOTE for handling class imbalance, such as:

Class Weights: Advantages include (a) being simple to implement, as it involves adjusting the misclassification penalties for different classes; (b) not requiring modification of the dataset or generation of synthetic samples, making it computationally efficient; and (c) being effective in improving the model’s performance on the minority class without introducing additional complexity. The limitations include (a) assuming that the misclassification costs are known and properly specified, which may not always be the case in practice; and (b) being less effective when the decision boundary between classes is highly non-linear or complex.

Ensemble Methods: Advantages include (a) naturally handling class imbalance by aggregating predictions from multiple base models trained on balanced subsets of the data; (b) tending to be robust to noise and overfitting, making them suitable for imbalanced datasets; and (c) capturing complex relationships between features and target variables, improving predictive performance. The limitations include (a) requiring more computational resources and longer training times compared to simpler algorithms; and (b) not providing explicit control over the balance between classes in the final predictions.

Cost-sensitive Learning: Advantages include (a) explicitly considering the costs associated with misclassifying instances from different classes, allowing for fine-tuning of the model’s behavior; and (b) being able to accommodate varying degrees of class imbalance and adjust the decision boundary accordingly. The limitations include (a) requiring knowledge of misclassification costs, which may be difficult to collect or subjective in real-world circumstances; and (b) making model selection more complicated by adding hyperparameters to tune.

Anomaly Detection: Advantages include (a) being usable in situations where the minority class represents rare or unusual events, such as rare heart conditions; and (b) not needing explicit labeling of minority class instances, which means it can be used in either semi-supervised or unsupervised settings. The limitations include (a) assuming that the minority class instances are outliers or deviate significantly from the majority class, which may not always be the case; and (b) struggling to detect subtle or nuanced patterns in the data, particularly when the boundary between normal and abnormal instances is ambiguous.
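
Of the alternatives listed above, class weighting is the simplest to illustrate. The sketch below computes "balanced" weights that are inversely proportional to class frequency, so that each class contributes the same total weight to the training loss. The function name `balanced_class_weights` and the label counts are hypothetical; many libraries expose a similar option, for example a `class_weight="balanced"` setting in scikit-learn estimators or the `scale_pos_weight` parameter in XGBoost.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Hypothetical imbalanced labels: 80 healthy (0), 20 with heart disease (1).
labels = [0] * 80 + [1] * 20
weights = balanced_class_weights(labels)

# Total weight carried by each class after reweighting.
class_mass = {cls: weights[cls] * labels.count(cls) for cls in weights}
print(weights, class_mass)
```

With these weights, misclassifying a minority-class patient costs four times as much as misclassifying a majority-class patient, without altering the dataset itself.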

Algorithm selection: To determine the optimal algorithm for predicting HD, we used a variety of ML techniques. Nonetheless, the selection of algorithms is arbitrary and may affect the outcome. Other algorithms that were not considered in this study might achieve different trade-offs or greater accuracy. As a result, future research should carefully consider and evaluate the ML algorithms selected.

Domain adaptation: The use of domain adaptation techniques demonstrated the proposed system’s adaptability. However, applying the proposed technique to different populations or environments may still face limitations. More research is required to determine the technique’s efficacy in a range of populations with different lifestyles, demographics, and healthcare systems. It is also important to fully address any potential restrictions and difficulties related to domain adaptation.

Missing data: The study does not specify whether the datasets used for ML model training and testing contain missing data. In real-world healthcare, missing data is widespread and can dramatically impair predictive models, leading to erroneous predictions, misdiagnosis, or delayed treatment. Imputation or algorithms that are robust to partial datasets could be used to handle it. Missing data must be addressed to ensure the reliability and generalizability of clinical prediction models.
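
As a simple illustration of the imputation option mentioned above, the sketch below replaces missing values in each column with the median of the observed values. This is a minimal sketch under stated assumptions: the function name `impute_median`, the use of `None` to mark missing entries, and the example records are hypothetical, and the study does not report using this method. scikit-learn's `SimpleImputer` with `strategy="median"` provides similar behaviour.

```python
from statistics import median


def impute_median(rows):
    """Return a copy of rows in which each None is replaced by the
    median of the observed (non-missing) values in the same column."""
    n_cols = len(rows[0])
    medians = []
    for col in range(n_cols):
        observed = [row[col] for row in rows if row[col] is not None]
        medians.append(median(observed))
    return [
        [medians[col] if value is None else value for col, value in enumerate(row)]
        for row in rows
    ]


# Hypothetical patient records with columns (age, chol); None marks a missing value.
records = [
    [63, 233],
    [37, None],
    [None, 204],
    [56, 236],
    [57, 354],
]
completed = impute_median(records)
print(completed)
```

To avoid information leakage, the medians should be estimated on the training data only and then reused to fill gaps in the test data.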

Conclusions and future work

In this study, we employed several feature selection methods and then applied ten machine learning techniques with SMOTE to the selected features. This process allowed us to identify the most significant features for predicting heart disease. Every algorithm generates a unique score based on a different combination of features. We used three feature selection methods (ANOVA, chi-square, and MI) to produce three selected feature groups, namely SF-1, SF-2, and SF-3, respectively. Ten ML classifiers were used to determine the best model and feature subset: Naive Bayes, SVM, voting, XGBoost, AdaBoost, bagging, DT, KNN, RF, and LR. We employed a well-known open-access dataset and several cross-validation procedures to evaluate the selected algorithms and measure the accuracy of the heart disease detection system. XGBoost outperformed all other algorithms. The XGBoost classifier performed best with the SF-2 feature subset, with 97.57% accuracy, 96.61% sensitivity, 90.48% specificity, 95.00% precision, a 92.68% F1 score, and a 98% AUC. We developed an explainable AI method using SHAP techniques to understand how the system predicts its outcomes. Furthermore, the study demonstrated that the proposed system is adaptable using a domain adaptation approach. This work contributes to the field of ML-based HD prediction applications by introducing unique insights and techniques. These findings could aid in the diagnosis and prediction of HD in Egypt and Saudi Arabia.

Finally, the authors are developing a smartphone app that allows users to enter symptoms and predict heart disease quickly and accurately. We will embed the best XGBoost model in the mobile app to predict heart disease and display the result instantly. Because the mobile app performs symptom-based heart disease prediction, we will consider and address the impact of “dark data” during its implementation. Dark data refers to information that exists but remains ungathered or underutilized due to data collection limitations, poor reporting, or ignorance. Unreported heart disease instances are considered dark data when predicting heart disease. Therefore, in our future work, we will examine how dark data impacts the real-world implementation of the proposed mobile app, specifically focusing on: (1) Asymptomatic cases: Patients with early-stage cardiac disease may not exhibit any symptoms. (2) Limited scope for symptom-based prediction: The models consider only a limited set of symptoms. Heart disease can manifest in ways other than typical symptoms, for example in the results of diagnostic tests and imaging examinations, so focusing solely on symptoms may miss critical signs. (3) Data collection: The absence of asymptomatic cases could affect the quality of the dataset used for the app’s prediction model. If most of the training data is symptomatic, the model’s predictions may prioritize symptomatic presentations, thereby intensifying the dark data effect.

To address the limitations imposed by the dark data effect and enhance the real-world applicability of the mobile app for heart disease prediction, we will consider several strategies, such as (1) comprehensive risk assessment: Expand the scope of the predictive model to incorporate additional risk factors beyond symptoms, such as demographic information, medical history, lifestyle factors, and biomarkers. (2) Integration with diagnostic tools: Connect the mobile app to diagnostic tools or wearable devices capable of measuring physiological parameters associated with heart health, such as blood pressure, heart rate variability, or electrocardiogram (ECG) signals. (3) Population screening programs: Partner with healthcare providers or public health agencies to promote population screening programs aimed at identifying individuals with undiagnosed heart disease. (4) Education and awareness campaigns: Launch educational initiatives to raise awareness about the importance of regular cardiovascular screenings, even in the absence of symptoms. The authors will also consider a cost-effectiveness argument for a heart disease prediction app, provide evidence, and consider a variety of factors, including development, implementation, and maintenance costs. Claiming that certain features are cheaper necessitates a thorough evaluation and comparison with alternative approaches to ensure the claim’s validity.

Because many ML-based systems cannot explain their decision-making process, AI-based apps are often treated as black boxes. In this study, we used SHAP and feature importance techniques to explain and interpret the features that were most influential in the decision. In our future work, we plan to expand our research by incorporating other explainable artificial intelligence (XAI) techniques that can improve transparency and interpretability, such as: (1) Partial Dependence Plots (PDPs): PDPs show the link between a feature and the expected outcome while averaging over the other features. By showing each feature separately, we can understand its effect on heart disease prediction. (2) Individual Conditional Expectation (ICE) Plots: ICE plots show how a feature affects each data point, not just the average, so the effects of feature changes on different people can be better understood. (3) Local Interpretable Model-Agnostic Explanations (LIME): To explain predictions, LIME builds local surrogate models around specific examples. By perturbing the input data and observing how the predictions change, LIME approximates the model’s local behavior and gives simple explanations for specific predictions. (4) Rule-based models: Models such as decision trees or rule lists connect input features directly to predictions. These models make the criteria for predicting heart disease explicit, providing transparency. By employing these explainable AI methods, machine learning-based systems for heart disease prediction can provide healthcare professionals and patients with transparent, interpretable, and actionable insights, facilitating informed decision-making and improving trust in AI-driven healthcare applications.
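
The relationship between ICE curves and a partial dependence curve can be sketched in a few lines: each ICE curve varies one feature for a single patient, and the partial dependence curve is the average of those curves. The model `toy_model`, its coefficients, and the data below are hypothetical and serve only to illustrate the computation; scikit-learn's `sklearn.inspection` module offers library implementations.

```python
def ice_curves(model, rows, feature_idx, grid):
    """One curve per row: model output as the chosen feature sweeps the grid
    while the row's other features stay fixed."""
    curves = []
    for row in rows:
        curve = []
        for value in grid:
            modified = list(row)
            modified[feature_idx] = value
            curve.append(model(modified))
        curves.append(curve)
    return curves


def partial_dependence(model, rows, feature_idx, grid):
    """Average of the ICE curves at each grid value."""
    curves = ice_curves(model, rows, feature_idx, grid)
    return [sum(column) / len(column) for column in zip(*curves)]


def toy_model(features):
    """Hypothetical risk score with an interaction between the two features."""
    ca, exang = features
    return 0.1 * ca + 0.3 * ca * exang


rows = [[0, 0], [0, 1]]   # two hypothetical patients: (ca, exang)
grid = [0, 1, 2]          # values of ca to sweep
ice = ice_curves(toy_model, rows, feature_idx=0, grid=grid)
pdp = partial_dependence(toy_model, rows, feature_idx=0, grid=grid)
print(ice, pdp)
```

In this toy case the two ICE curves have different slopes because of the interaction term, which is exactly the kind of per-patient variation that the averaged partial dependence curve hides.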

Data availability

The corresponding author will share study datasets upon reasonable request.

Acknowledgements

The authors extend their appreciation to the Deanship of Research and Graduate Studies at King Khalid University for funding this work through small group research under grant number (RGP1/129/45).

Author information

Authors and affiliations.

College of Computer Science, King Khalid University, Abha, Kingdom of Saudi Arabia

Hosam El-Sofany & Belgacem Bouallegue

Electronics and Micro-Electronics Laboratory (E. μ. E. L), Faculty of Sciences of Monastir, University of Monastir, Monastir, Tunisia

Belgacem Bouallegue

Faculty of Science, Ain Shams University, Cairo, Egypt

Yasser M. Abd El-Latif

Contributions

Hosam El-Sofany: Create the original concept for the research. Methodology, design, and implementation. Writing, reviewing, and editing. Proofreading and checking against plagiarism using the iThenticate program provided by King Khalid University. Belgacem Bouallegue: Methodology, design, writing, reviewing, and editing. Yasser M. Abd El-Latif: Methodology, design, writing, reviewing, and editing.

Corresponding author

Correspondence to Hosam El-Sofany.

Ethics declarations

Competing interests.

The authors declare no competing interests.

Ethical approval

This study did not involve human participants, human tissue, or any personal data. The dataset used for this research is publicly available and anonymized, and no ethical approval or informed consent was required.

Additional information

Publisher’s note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/ .

About this article

Cite this article.

El-Sofany, H., Bouallegue, B. & El-Latif, Y.M.A. A proposed technique for predicting heart disease using machine learning algorithms and an explainable AI method. Sci Rep 14 , 23277 (2024). https://doi.org/10.1038/s41598-024-74656-2

Received : 19 February 2024

Accepted : 27 September 2024

Published : 07 October 2024

DOI : https://doi.org/10.1038/s41598-024-74656-2

Machine learning
Heart diseases
ML algorithms

Machine Learning models for heart disease prediction and dietary lifestyle change therapy recommendation: a systematic review

Open access
Published: 19 December 2024
Volume 4 , article number 113 , ( 2024 )

Francis Adoba Ekle ORCID: orcid.org/0000-0001-9027-2357 1 ,
Vincent Shidali 2 ,
Richard Emoche Ochogwu 3 &
Igoche Bernard Igoche 4

Introduction

Several medical decision support systems for heart disease prediction have been developed by different researchers in today's digital and artificial intelligence-driven society to simplify and ensure effective diagnosis by utilising machine learning (ML) algorithms.

This study carries out a systematic comparative review of the performance of various supervised learning ML models for heart disease prediction and proposes a blueprint for a Dietary Approaches to Stop Hypertension (DASH) lifestyle change therapy recommendation system for heart disease.

The authors sourced 61 articles from the Google Scholar and PubMed databases, each of which compared more than one supervised learning ML algorithm for heart disease prediction. A content-based filtering recommendation technique was used to design the proposed system blueprint.

Comparatively, the Voting Ensembles Classifier (VEC) algorithm demonstrated the highest accuracy. This rests on the fact that, although each base model may slightly overfit or underfit the data, their errors can cancel out when the models are combined, producing predictions that are more accurate and stable. VEC's more reliable predictions can in turn improve the overall efficiency of healthcare management. Lastly, this study presents the blueprint of the proposed dietary therapy recommendation system for heart disease.

This research offers an extensive summary of the comparative performance of different supervised learning ML algorithms for heart disease prediction and also proposes a dietary lifestyle change therapy recommendation system framework. The information on comparative performance can aid researchers in choosing a suitable ML algorithm for their research, and the proposed system can act as a dietary therapy support tool for cardiologists when fully implemented.

1 Introduction

According to the World Health Organization (WHO), three quarters of heart disease deaths globally occur in low- and middle-income nations [ 1 ]. Heart disease is associated with risk factors such as high blood pressure, high lipids or cholesterol, high glucose, excessive alcohol intake, smoking, obesity, and being overweight. To allow timely treatment or management of the disease, it is imperative to develop an accurate and efficient method for early diagnosis of its risk or presence [ 2 ]. In the healthcare sector, ML has become a popular technique for handling massive volumes of data. Researchers and experts in artificial intelligence (AI) analyze vast and complex medical data using a range of data mining and machine learning techniques, helping medical professionals diagnose and predict a variety of ailments [ 3 , 4 , 5 ].

Data mining with ML techniques refers to the extraction of information and knowledge from large databases in a variety of disciplines, such as business, security, medicine, and education. One of the most rapidly expanding areas of AI is ML, and the health sector is one of the most important sectors where ML algorithms are particularly useful. This is due to the field's unprecedented generation of data. ML is a subdivision of AI. Its main focus is to design and implement systems or applications with the ability to acquire knowledge and make predictions and classifications based on experience. ML algorithms are trained using training data to create a model. The accuracy of the model is tested using the test dataset. The trained model uses the new input data to predict the outcome of a new event. Also, the model, using the ML algorithm, finds concealed patterns in the input data. It makes predictions for new instances of the datasets based on the effectiveness of the trained model. The data is cleaned, and missing values are filled or replaced in the process known as preprocessing of the dataset before it is used for training and testing the model [ 6 , 7 ].

ML algorithms fall into three main classes: supervised, unsupervised, and reinforcement learning. In supervised learning, the model is trained on labeled data, that is, input data paired with its known results. The data is divided into training and test sets. The training data trains the model, while the test data serves as new instances for measuring the accuracy, error rate, specificity, recall, and precision of the trained model. Supervised learning ML algorithms are mostly used for regression and classification tasks. Examples are Naive Bayes (NB), Decision Tree (DT), Support Vector Machine (SVM), Artificial Neural Network (ANN), Random Forest (RF), K-Nearest Neighbors (kNN), Linear Regression (LIR), Logistic Regression (LOR), Deep Neural Network (DNN), AdaBoost (AB), Extreme Gradient Boosting (XGB), and Voting Ensembles Classifier (VEC). The second class is unsupervised learning, which trains the model on unlabeled, unclassified data to find concealed patterns; the expected output is the pattern hidden in the dataset, which the model can then identify in new instances. Examples are K-Means and other clustering algorithms. The third class, reinforcement learning, uses neither labeled datasets nor their associated results; instead, the model learns from the experience of previously performed tasks and improves after each one. Examples are the algorithms that control factory robots and house-cleaning robots [ 6 , 7 , 8 , 9 , 10 ]. This review focuses on supervised learning ML algorithms for the prediction of heart disease.
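The division of labeled data into training and test sets described above can be sketched as follows. This is a minimal illustration only: the function name `train_test_split`, the 80/20 ratio, and the blood pressure records are assumptions made for the example and do not come from any of the reviewed studies.

```python
import random

def train_test_split(records, test_fraction=0.2, seed=0):
    """Shuffle labeled records and split them into training and test lists."""
    rng = random.Random(seed)      # fixed seed so the split is reproducible
    shuffled = records[:]          # copy, so the caller's list is not modified
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical labeled records: (systolic blood pressure, label), 1 = heart disease
records = [(118, 0), (122, 0), (135, 0), (141, 1), (150, 1),
           (128, 0), (160, 1), (145, 1), (119, 0), (155, 1)]
train, test = train_test_split(records, test_fraction=0.2, seed=42)
print(len(train), len(test))
```

The model would be fitted on `train` only, and `test` would be held back to estimate how it behaves on unseen cases.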

Heart disease, also known as cardiovascular disease, refers to several types of situations and conditions that can affect the correct functioning of the heart. Prediction of the presence or risk of heart disease can have an enormous effect in the health domain as well as in individuals' lives. Examples of different kinds of heart diseases are [ 11 ]:

Cardiac Arrest: This is the sudden and unexpected loss of heart function, breathing, and consciousness. It can be fatal or temporary.

High Blood Pressure (Hypertension): This condition arises when the force of the blood against the artery walls is too high. According to the American Heart Association [ 12 ], Stage 1 hypertension occurs when the systolic pressure (upper number) is between 130 and 139 mm Hg or the diastolic pressure (lower number) is between 80 and 89 mm Hg. Stage 2 hypertension is when the systolic pressure is 140 mm Hg or higher, or the diastolic pressure is 90 mm Hg or higher. A hypertensive crisis is when the systolic pressure is higher than 180 mm Hg and/or the diastolic pressure is higher than 120 mm Hg.

Stroke: This condition arises when the brain is injured by a break or disruption in its blood supply.

Peripheral Artery Disease: This is a condition in which narrowed blood vessels reduce the flow of blood to the limbs.

Coronary Artery Disease: This condition arises when the heart's main blood vessels are damaged, so that the heart does not receive adequate blood flow.

Congenital Heart Disease: This condition occurs when a person is born with one or more differences in the structure of the heart, which may be inherited from a parent.

Congestive Heart Failure: This is a persistent condition in which the heart does not pump blood as well as it should.
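The blood pressure thresholds quoted under High Blood Pressure above translate directly into a rule-based check. The sketch below is illustrative only: the function name `bp_category` is hypothetical, and readings below the Stage 1 range are grouped under a single label because the text does not subdivide them.

```python
def bp_category(systolic, diastolic):
    """Map a blood pressure reading (mm Hg) to the categories quoted in the text."""
    # Check the most severe category first so that it takes precedence.
    if systolic > 180 or diastolic > 120:
        return "hypertensive crisis"
    if systolic >= 140 or diastolic >= 90:
        return "stage 2 hypertension"
    if systolic >= 130 or diastolic >= 80:
        return "stage 1 hypertension"
    return "below stage 1"   # assumption: the text does not subdivide this range

for reading in [(118, 76), (135, 75), (125, 85), (150, 85), (185, 100)]:
    print(reading, bp_category(*reading))
```

Because either number can trigger a category, a reading such as 125/85 is classed as Stage 1 on the diastolic value alone.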

Because of the heart's importance, other organs in the human body can stop functioning efficiently if it is not functioning properly. Furthermore, as a result of stressful lives, urbanization, and poor eating habits, the risk of heart disease is increasing. Therefore, accurate and efficient methods for the early identification of heart disease risk or presence, followed by remedy, can potentially reduce the chances of developing a heart-related disease. Addressing and controlling the risk factors associated with heart disease can further mitigate these risks. However, despite the increasing applicability and effectiveness of supervised learning ML algorithms in predictive illness modeling, little research has conducted a thorough and systematic assessment of published studies employing various supervised learning ML algorithms for heart disease prediction [ 10 , 13 ].

1.1 Motivations

The urgent global health challenge of heart disease drives this research, which has two primary motivations. First, there is a need for an accurate and efficient method for early diagnosis of heart disease, because it is often asymptomatic. Second, recommending locally accessible DASH or heart-healthy diets to patients at risk of or diagnosed with heart disease is important for the following reasons [ 14 ]:

Clinical Evidence: DASH diets have been widely studied and shown to lower blood pressure, low-density lipoproteins, triglyceride levels, and cholesterol, which are significant risk factors for heart disease. They also improve overall heart health.

Customization: Local availability makes it simpler to follow the diet. Patients will find it easier to stick to a diet over time if they can easily obtain and incorporate foods accepted in their culture into their daily meals.

Sustainability: Promoting locally available foods promotes sustainability and helps the environment by lowering the need for processed or imported goods. This lessens the carbon footprint caused by the transportation of food and promotes regional agriculture.

Cultural Relevance: Dietary practices frequently have strong cultural roots. In addition to respecting cultural preferences, recommending locally accessible heart-healthy diets increases the likelihood that dietary recommendations will be accepted and followed.

Affordability: Foods that are available locally are frequently less expensive than imports or specialty items. Patients, especially those with limited financial resources, can therefore adopt and maintain a heart-healthy diet more easily.

Community Support: Community support networks can be fostered by promoting heart-healthy diets that are readily available locally. Patients can support one another in making dietary changes by exchanging recipes, cooking tips, and experiences.

Prevention and Management: Heart-healthy diets are essential for preventing heart disease as well as helping to manage it when it already exists. By suggesting these diets to individuals who are at risk of heart disease, medical professionals can take early action and possibly stop cardiovascular issues from developing.

Comprehensive Approach: A complete strategy for heart disease prevention and management combines dietary advice with other lifestyle changes, like consistent exercise and quitting smoking.

1.2 Main contributions

The key contributions of this research are as follows:

Conduct a systematic comparative review of the performance of various supervised learning machine learning models for predicting heart disease.

Develop a formal framework for recommending indigenous DASH diets in the right quantity of daily calories for those diagnosed or at risk of heart disease using a content-based filtering recommendation technique.

The results of this study will help scholars, researchers, and data scientists in the domain of AI to identify the most competent supervised learning ML algorithm for heart disease prediction. The proposed system, when fully implemented, will be used by cardiologists and dieticians to recommend dietary lifestyle change therapy for heart disease. The remainder of this paper is structured as follows: Sect. 2 discusses various existing supervised learning ML algorithms. Section 3 explains the classifier performance metrics. Section 4 reviews and compares supervised ML models for heart disease prediction. Section 5 outlines the proposed dietary lifestyle change therapy recommendation system. Sections 6 and 7 present the results and discussion, respectively. Section 8 concludes the paper and offers directions for future research.

2 Supervised learning machine learning algorithms

Jordan and Mitchell [ 15 ] defined ML as an application of AI in which computers use techniques (algorithms) embodied in software to learn from data and adapt with experience. ML generally refers to the modifications in systems that execute tasks associated with AI. Examples of such tasks include diagnosis, recognition, planning, prediction, robot control, classification, and clustering [ 16 ].

The modifications could be upgrades to already-functioning systems or the creation of whole new systems. ML algorithms for developing models are organized into three basic classes based on the desired outcome or function of the algorithm [ 6 ]. The basic ML algorithm classes include: supervised, unsupervised, and reinforcement learning [ 17 ].

The scope of this study is limited to reviewing the comparative performance of various supervised learning ML algorithms for heart disease prediction. Additionally, it proposes a lifestyle change therapy recommendation system that emphasizes heart-healthy or DASH diets, utilizing a content-based filtering recommendation technique.

Supervised learning ML algorithms are trained on a labeled dataset. The data is subdivided into training and testing datasets: the training dataset is used to train the model, while the testing dataset is used to test the trained model. The most commonly used supervised learning ML algorithms for heart disease prediction are:

Artificial Neural Networks (ANN): Olaniyi et al. [ 18 ] described an ANN as an information processing system whose structure and operating principles are inspired by the brain and central nervous system of humans and animals. An ANN is composed of a large number of basic units known as neurons that act in tandem. These neurons communicate by sending information in the form of activation signals to each other through directed connections. ANNs have become increasingly popular in recent years due to their capacity to learn quickly in real time and their ease of implementation, and they are used in a variety of fields, including health (e.g., disease diagnosis and prediction) and text classification (e.g., detection of hate speech in combination with Natural Language Processing). Choosing the right hyperparameters and activation function can significantly improve an ANN's results on classification and prediction tasks. The basic structure of an ANN consists of an input layer, one or more hidden layers (two in this instance) with multiple interconnected nodes, and an output layer, as shown in Fig. 1 .

Multi-Layer Perceptron (MLP): An MLP is a type of ANN with multiple hidden layers. Each perceptron accepts input from other perceptrons, weights each input, and passes the result to the next layer; the output of the last hidden layer is fed to the output layer. The error value is derived from the predicted value and the actual value [ 19 , 20 ]. The goal is to lower this error, so backpropagation from the output layer back through the hidden layers is repeated until the error value is minimized. After each backpropagation update, the input is fed forward through the network again. When suitable hyperparameters are chosen, MLP networks perform well on more difficult tasks.
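The feed-forward and backpropagation cycle described above can be illustrated with a very small network. The sketch below assumes one hidden layer with two sigmoid neurons, no bias terms, a squared-error loss, and a single training example; the initial weights, learning rate, and function names are all hypothetical choices made for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, w_hidden, w_out):
    """Return the hidden activations and the network output for one input vector."""
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in w_hidden]
    output = sigmoid(sum(w * h for w, h in zip(w_out, hidden)))
    return hidden, output

def train_step(x, target, w_hidden, w_out, lr=0.5):
    """One backpropagation update for the loss 0.5 * (output - target) ** 2."""
    hidden, output = forward(x, w_hidden, w_out)
    delta_out = (output - target) * output * (1.0 - output)
    # Hidden-layer error terms use the output weights before they are updated.
    delta_hidden = [delta_out * w_out[j] * hidden[j] * (1.0 - hidden[j])
                    for j in range(len(hidden))]
    for j in range(len(w_out)):
        w_out[j] -= lr * delta_out * hidden[j]
    for j, row in enumerate(w_hidden):
        for i in range(len(row)):
            row[i] -= lr * delta_hidden[j] * x[i]
    return 0.5 * (output - target) ** 2   # loss measured before the update

# Hypothetical input, target, and starting weights
x, target = [1.0, 0.5], 1.0
w_hidden = [[0.1, -0.2], [0.3, 0.4]]
w_out = [0.2, -0.1]
losses = [train_step(x, target, w_hidden, w_out) for _ in range(200)]
print(losses[0], losses[-1])
```

Each call feeds the input forward, measures the error, and propagates it backwards to adjust the weights, so the recorded loss shrinks over the repeated updates.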

Decision Tree (DT): Although it can be used to solve regression and classification problems, Shah et al. [ 6 ] note that DT is a popular data mining machine learning technique that is primarily used for classification tasks. This method divides a population into segments that resemble branches and joins them to form an inverted tree with leaf, internal, and root nodes [ 21 ]. DT does not need a complex parametric structure to handle large, complex, and sophisticated data sets because it is non-parametric. DT learning is a discrete-valued target function approximation method in which a DT is used to represent the learned function. Alternatively, learned trees can be shown as a sequence of if–then rules to make them easier for humans to understand. The DT model makes analysis based on three basic nodes, namely:

Root node: Primary node, upon which all other nodes operate.

Interior node: Manages the dataset's numerous characteristics or properties.

Leaf node: represents the test result, or in classification problems, the class of the target variable.

Fig. 1 A simple basic structure of an ANN with two hidden layers and multiple interconnected nodes

The DT algorithm uses the most important indicators (features) to separate the data into two or more similar sets. The entropy of each feature is calculated, and the dataset is split so that the predictors with the lowest entropy are at the top of the tree. Eq. ( 1 ) gives the formula for the entropy of an attribute.

\[E = -\sum_{i=1}^{c} p_{i} \log_{2} p_{i} \tag{1}\]

In Eq. ( 1 ), \(c\) is the total number of classes and \(p_{i}\) is the probability that a sample at a given node belongs to class \(i\) . The basic structure of a DT, showing the three types of node, is shown in Fig. 2 .

Fig. 2 Structure of a Decision Tree showing the root node, interior nodes represented by a rectangle, and leaf nodes represented by an oval shape
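The entropy referred to in Eq. ( 1 ) can be computed in a few lines. The sketch below assumes the standard Shannon entropy with a base-2 logarithm and compares two hypothetical candidate splits by the size-weighted entropy of the groups they produce; the labels and the helper name `split_entropy` are illustrative.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return sum(-(n / total) * math.log2(n / total)
               for n in Counter(labels).values())

def split_entropy(groups):
    """Size-weighted average entropy of the groups produced by a split."""
    total = sum(len(g) for g in groups)
    return sum(len(g) / total * entropy(g) for g in groups)

# Hypothetical node with three diseased (1) and three healthy (0) patients
split_a = [[1, 1, 1], [0, 0, 0]]   # separates the classes perfectly
split_b = [[1, 1, 0], [1, 0, 0]]   # leaves both groups mixed
print(split_entropy(split_a), split_entropy(split_b))
```

The split with the lower weighted entropy, here `split_a`, would be preferred near the top of the tree.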

Support Vector Machine (SVM): An SVM algorithm performs classification by constructing an N-dimensional hyperplane that divides data into classes as efficiently as possible [ 22 ]. SVM-trained models have a strong relationship with ANN. In reality, a sigmoid kernel SVM model is equivalent to a two-layer perceptron ANN. Figure 3 shows the basic illustration of an SVM algorithm.

K-Nearest Neighbors (kNN): According to Shah et al. [ 6 ], kNN is a data classification method that tries to figure out what group a data point belongs to by looking at the data points in its immediate vicinity. Because it does not develop a model of the data beforehand, kNN is an example of a lazy learner method. Only when prompted to poll the data point's neighbors does it do computations. This makes the kNN ML algorithm very simple to implement for data mining and knowledge processing. Examples of similarity metric functions used for calculating distance between data points are: Pearson correlation, Euclidean distance, Jaccard similarity coefficient, simple matching coefficient, and vector cosine similarity.
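A minimal sketch of kNN classification with the Euclidean distance follows. The feature pairs (systolic blood pressure, cholesterol) and labels are invented for illustration, and features are left unscaled for brevity, although scaling is usually advisable in practice.

```python
import math
from collections import Counter

def euclidean(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def knn_predict(train, query, k=3):
    """Label a query point by majority vote among its k nearest training points."""
    # All distance computation happens at query time: kNN builds no model beforehand.
    neighbors = sorted(train, key=lambda item: euclidean(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Hypothetical training data: ((systolic pressure, cholesterol), label)
train = [((120, 180), 0), ((118, 190), 0), ((125, 200), 0),
         ((150, 260), 1), ((160, 280), 1), ((155, 270), 1)]
print(knn_predict(train, (152, 265)), knn_predict(train, (121, 185)))
```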

Linear Regression (LIR): LIR is used to estimate real, continuous values rather than discrete ones, according to Abdi [ 17 ]. By fitting the optimal line, it builds a relationship between the dependent and independent variables. This best-fit line is also called a regression line and is represented by the linear equation shown in Eq. ( 2 ).

\[y = mx + c \tag{2}\]

In Eq. ( 2 ), \(m\) is the gradient of the line, \(c\) is the y-intercept, i.e., the point at which the line crosses the y-axis, and \(x\) and \(y\) are the independent and dependent variables, respectively.
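The gradient \(m\) and intercept \(c\) of the best-fit line can be estimated with ordinary least squares. The sketch below assumes a single independent variable and uses made-up data that lie exactly on a line, so the recovered values are easy to check.

```python
def fit_line(xs, ys):
    """Least-squares estimates of the gradient m and intercept c of y = m*x + c."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    c = mean_y - m * mean_x
    return m, c

xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]          # generated from y = 2x + 1
m, c = fit_line(xs, ys)
print(m, c)
```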

Logistic Regression (LOR): LOR is a supervised learning ML technique that, despite its name, is used mainly for classification. It predicts the class of categorical data using probability. To predict the outcome, the input features are combined linearly using coefficient values and passed through a logistic (sigmoid) function. The coefficients are estimated by maximum likelihood estimation, and the resulting probability lies in the range 0 to 1, indicating how likely an event is to occur. The task becomes a classification task when a decision threshold is applied. LOR can be binary, meaning there are only two classes; multinomial, meaning there are more than two classes with no ordering; or ordinal, meaning there are more than two classes with ordering. It is a fairly simple supervised learning ML model to develop, and it can produce good results in classification and prediction tasks [ 10 ].
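The prediction step of a binary logistic regression model can be sketched as below. The coefficients and intercept are hypothetical values chosen for illustration rather than fitted ones; in practice they would come from maximum likelihood estimation on training data.

```python
import math

def predict_proba(features, coefficients, intercept):
    """Probability of the positive class: sigmoid of a linear combination."""
    z = intercept + sum(c * f for c, f in zip(coefficients, features))
    return 1.0 / (1.0 + math.exp(-z))

def predict_class(features, coefficients, intercept, threshold=0.5):
    """Apply a decision threshold to turn the probability into a class label."""
    return 1 if predict_proba(features, coefficients, intercept) >= threshold else 0

# Hypothetical model over (systolic pressure, cholesterol)
coefficients, intercept = [0.04, 0.01], -8.0
print(predict_class([150, 260], coefficients, intercept),
      predict_class([110, 170], coefficients, intercept))
```

Raising the `threshold` argument trades fewer false positives for more false negatives, which matters when a missed diagnosis is costly.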

Naive Bayes (NB): NB is a supervised machine learning technique for regression and classification that is based on the Bayes theorem and a strong (naive) assumption of independence among predictors [ 11 ]. To put it simply, an NB classifier assumes that the presence of one attribute in a class is unrelated to the presence of any other attribute. For example, a fruit is categorized as an orange if it is round, green, and has a diameter of roughly 4 inches. Even though some of these traits depend on one another or on the existence of other traits, an NB classifier treats each of them as independent. NB is a simple, efficient, and user-friendly algorithm that can handle complicated and non-linear data. Mathematically, the Bayes theorem is written as shown in Eq. ( 3 ):

\[P(A/B) = \frac{P(B/A)\,P(A)}{P(B)} \tag{3}\]

In Eq. ( 3 ), \(P(A/B)\) is the posterior probability of the class given the predictor, \(P(B)\) is the prior probability of the predictor, \(P(A)\) is the prior probability of the class, and \(P(B/A)\) is the likelihood, i.e., the probability of the predictor given the class.
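A short numerical illustration of the Bayes theorem follows. All probabilities are invented for the example: a 10% prior probability of disease, and a symptom that appears in 80% of diseased and 20% of healthy patients. The evidence term \(P(B)\) is obtained from the law of total probability.

```python
def posterior(prior_a, likelihood_b_given_a, evidence_b):
    """Bayes theorem: P(A/B) = P(B/A) * P(A) / P(B)."""
    return likelihood_b_given_a * prior_a / evidence_b

# Hypothetical probabilities
p_disease = 0.10
p_symptom_given_disease = 0.80
p_symptom_given_healthy = 0.20

# P(B) via the law of total probability
p_symptom = (p_symptom_given_disease * p_disease
             + p_symptom_given_healthy * (1.0 - p_disease))
p_disease_given_symptom = posterior(p_disease, p_symptom_given_disease, p_symptom)
print(round(p_disease_given_symptom, 4))
```

An NB classifier applies the same rule to several predictors at once by multiplying their individual likelihoods, which is where the independence assumption enters.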

Random Forest (RF): RF works by assembling numerous DTs into a forest [ 23 ]. Each individual tree produces a class prediction, and the class with the most votes becomes the model's prediction. An RF classifier with more trees has a higher chance of improving accuracy. RF can be used for both regression and classification tasks, but it performs better on classification tasks, and it can overcome shortcomings of DT and other ML techniques, such as overfitting and sensitivity to missing values.

Deep Neural Network (DNN): A DNN is an ANN with multiple layers between the input and output layers. It can be used as both a feed-forward and a feedback (recurrent) ANN. It computes the likelihood of each output by passing the input through the layers up to the output layer. It works in the same way as the MLP, but it can have many computation cycles and a very large number of hidden layers. It attempts to find the mathematical transformation, linear or non-linear, that maps the input to the outcome. It can handle more complex tasks such as image and voice recognition and video, audio, and text summarization. DNN algorithms include recurrent and convolutional neural networks, long short-term memory (LSTM) networks, and Bi-LSTM. A DNN model can overfit because of its very large number of hidden layers. DNNs are widely used in many domains because of their ability to learn complex, nonlinear problems and produce better results [ 10 ].

AdaBoost (AB): AB, short for adaptive boosting, is a supervised learning ML algorithm used in combination with other ML algorithms to enhance their performance. The outputs of the other algorithms, called weak learners, are combined into a weighted sum that represents the final output of the boosted classifier. AB is adaptive in the sense that subsequent weak learners are adjusted in favor of the examples misclassified by previous classifiers. AB is sensitive to outliers and noisy datasets, although in some tasks it can be less vulnerable to overfitting than other supervised learning ML algorithms. The individual base classifiers may be weak, but as long as each one performs slightly better than random guessing, the final model can be shown to converge to a strong classifier [ 24 ].

Extreme Gradient Boosting (XGB): Chen and Guestrin [ 25 ] presented their paper on the XGB ML algorithm at the SIGKDD Conference in 2016, and it took the ML community by storm. The XGB ML algorithm was developed as a research project at the University of Washington. XGB is a DT-based ensemble ML algorithm that uses a gradient-boosting framework. Gradient boosting uses the gradient descent technique to reduce errors in sequentially trained models. The XGB algorithm uses tree pruning, parallel processing, regularization, and handling of missing values to avoid overfitting and bias [ 25 ].

Bagging (BA): The BA classifier is an ensemble ML algorithm that fits each base classifier on a random subset of the original data and then combines their predictions (either by averaging or by voting) to form a final prediction. The base classifiers are trained in parallel, each on a training dataset created by randomly drawing with replacement from the original training dataset. The RF ML algorithm works similarly to BA, except that only a subset of the features, i.e., independent variables, is selected for each model [ 26 ].
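The way AB shifts attention towards misclassified examples can be shown with a single boosting round. The text does not spell out the update rule, so the sketch below assumes the standard discrete AdaBoost formulation with labels in {-1, +1}; the labels, predictions, and function name are hypothetical.

```python
import math

def adaboost_round(weights, labels, predictions):
    """One discrete AdaBoost round; requires a weighted error strictly between 0 and 1."""
    # Weighted error of the weak learner on the current example weights
    error = sum(w for w, y, h in zip(weights, labels, predictions) if y != h)
    alpha = 0.5 * math.log((1.0 - error) / error)   # the learner's vote weight
    # Misclassified examples (y * h == -1) are scaled up, correct ones scaled down.
    updated = [w * math.exp(-alpha * y * h)
               for w, y, h in zip(weights, labels, predictions)]
    total = sum(updated)
    return alpha, [w / total for w in updated]

labels      = [+1, +1, -1, -1, +1]
predictions = [+1, -1, -1, -1, +1]   # the weak learner misclassifies the second example
weights = [0.2] * 5
alpha, new_weights = adaboost_round(weights, labels, predictions)
print(alpha, new_weights)
```

After the update, the single misclassified example carries half of the total weight, so the next weak learner is pushed to classify it correctly.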
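A compact sketch of bagging is given below: each base model is fitted on a bootstrap sample drawn with replacement, and predictions are combined by majority vote. The base learner here is a deliberately simple threshold rule on one feature, chosen only to keep the example short; it, the data, and the function names are assumptions and not a method from the reviewed studies.

```python
import random
from collections import Counter

def bootstrap_sample(data, rng):
    """Draw len(data) records with replacement from the original training data."""
    return [rng.choice(data) for _ in data]

def fit_stump(sample):
    """Hypothetical weak learner: threshold midway between the two class means."""
    xs0 = [x for x, y in sample if y == 0]
    xs1 = [x for x, y in sample if y == 1]
    if not xs0 or not xs1:                 # degenerate sample containing one class
        only = 1 if xs1 else 0
        return lambda x: only
    threshold = (sum(xs0) / len(xs0) + sum(xs1) / len(xs1)) / 2
    return lambda x: 1 if x >= threshold else 0

def bagging_fit(data, n_models=11, seed=0):
    rng = random.Random(seed)
    return [fit_stump(bootstrap_sample(data, rng)) for _ in range(n_models)]

def bagging_predict(models, x):
    """Combine the base models' predictions by majority vote."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]

# Hypothetical one-feature data: values 1-5 are class 0, values 6-10 are class 1
data = [(x, 0) for x in range(1, 6)] + [(x, 1) for x in range(6, 11)]
models = bagging_fit(data)
print(bagging_predict(models, 1), bagging_predict(models, 10))
```

An odd number of base models avoids tied votes in the two-class case.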

Voting Ensembles Classifier (VEC): Similar to the RF ML algorithm, the VEC classifier trains multiple base models. The final result is obtained by combining the independent predictions or classifications from each classifier through voting. The real difference lies in the base models: the VEC model does not require them to be homogeneous; in other words, the final result can be obtained from base learners of different types [ 26 ]. The VEC algorithm employs both hard and soft voting mechanisms. Hard voting predicts the final class label as the one predicted most frequently by the trained base classifiers, while soft voting predicts the class label from the average of the predicted class probabilities. Mathematically, hard voting is represented by Eq. ( 4 ).

\[\hat{y} = \mathrm{mode}\{A_{1}(x), A_{2}(x), A_{3}(x)\} \tag{4}\]

In Eq. ( 4 ), \(\hat{y}\) is the predicted class label of the target variable, \(A_{1}\) , \(A_{2}\) , and \(A_{3}\) are the base classifiers used in the VEC classifier, and \(x\) is a data point or record. Figure 4 shows the framework of the VEC algorithm using NB, RF, and DT as the base models.

Genetic Algorithm (GA): GA is an evolutionary ML algorithm that carries out an optimization process inspired by the biological theory of evolution, using inheritance, mutation, natural selection, and crossover with a binary representation and simple operators based on genetic mutation and recombination. GA is also known as a stochastic global optimization algorithm that provides a method for programs to automatically improve their parameters [ 27 , 28 ].

Fig. 3 A basic diagram of how the support vector machine works. The SVM has identified a hyperplane (a line) that maximizes the separation between the ‘circle’ and ‘triangle’ classes

Fig. 4 The VEC algorithm with NB, RF, and DT as base models. P N , P R , P D , and P F denote the predictions of the NB classifier, the RF classifier, the DT classifier, and the final prediction of the VEC model, respectively. The dataset used to train the model is called the Training Set, and the dataset used to test the model is called New Data
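The difference between hard and soft voting can be made concrete with three base classifiers, as in the NB, RF, and DT arrangement described above. The class probabilities below are invented for illustration, and the function names are hypothetical.

```python
from collections import Counter

def hard_vote(predicted_labels):
    """Return the class label predicted most often by the base classifiers."""
    return Counter(predicted_labels).most_common(1)[0][0]

def soft_vote(predicted_probabilities):
    """Return the class whose predicted probability, averaged over classifiers, is highest."""
    n_models = len(predicted_probabilities)
    n_classes = len(predicted_probabilities[0])
    averages = [sum(p[k] for p in predicted_probabilities) / n_models
                for k in range(n_classes)]
    return max(range(n_classes), key=lambda k: averages[k])

# Hypothetical [P(class 0), P(class 1)] outputs from NB, RF, and DT for one patient
probabilities = [[0.45, 0.55], [0.90, 0.10], [0.40, 0.60]]
labels = [max(range(2), key=lambda k: p[k]) for p in probabilities]
print(hard_vote(labels), soft_vote(probabilities))
```

In this example two of the three classifiers lean weakly towards class 1, so hard voting returns 1, while the one confident vote for class 0 dominates the average and soft voting returns 0.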

3 Classifier performance metrics

The predictive ability of ML algorithms is usually assessed using the confusion matrix [ 7 ]. In the ML research field, the confusion matrix is also known as the contingency matrix or error matrix. The principal components of a binary confusion matrix are shown in Table 1 . In Table 1 , Positive (P) is the number of actual positive cases in the dataset, while Negative (N) is the number of actual negative cases in the dataset. True positives (TP) are the positive cases that the classifier correctly identifies. Similarly, true negatives (TN) are the negative cases that the algorithm correctly identifies. False positives (FP) are the negative cases that the algorithm incorrectly classifies as positive, and false negatives (FN) are the positive cases that the algorithm incorrectly classifies as negative [ 29 ].

Accuracy, precision, error rate, recall, specificity, and false positive rate (FPR), which are based on the confusion matrix, are usually used to evaluate the performance of ML algorithms. In this research, we used accuracy as the performance evaluation metric for evaluating supervised learning ML algorithms for heart disease prediction.

Accuracy (ACC): ACC is computed as the number of correct diagnoses divided by the total number of cases. The best accuracy value is 1.0 (100%), while the worst value is 0.0. Mathematically, it is given in Eq. ( 5 ) [ 7 ].

\[ACC = \frac{TP + TN}{TP + TN + FP + FN} = \frac{TP + TN}{P + N} \tag{5}\]
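The confusion matrix counts and the accuracy derived from them can be computed as follows. The actual and predicted labels are invented for illustration, with 1 denoting a positive (heart disease) case.

```python
def confusion_counts(actual, predicted):
    """Count TP, TN, FP, and FN for binary labels (1 = positive, 0 = negative)."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    return tp, tn, fp, fn

def accuracy(tp, tn, fp, fn):
    """Correct predictions divided by all predictions."""
    return (tp + tn) / (tp + tn + fp + fn)

# Hypothetical test-set labels
actual    = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 0, 1, 0, 1, 0, 0, 1]
tp, tn, fp, fn = confusion_counts(actual, predicted)
print(tp, tn, fp, fn, accuracy(tp, tn, fp, fn))
```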

4 Review of supervised learning machine learning models for heart disease prediction

This section presents a comparative review of the performance of various supervised learning ML models for heart disease prediction, with emphasis on the accuracy of the models. We obtained 61 articles published from 2015 to 2024 on heart disease prediction that used more than one supervised learning ML algorithm by searching the Google Scholar and PubMed databases with the search terms “heart disease prediction” and “machine learning”, and “heart disease risk prediction” and “machine learning”. Table 2 compares the performance of the supervised learning ML algorithms covered in this review, with emphasis on the accuracy metric.

4.1 Heart disease therapy

This subsection gives a concise overview of the treatment and management of heart disease, which largely include [ 87 , 88 , 89 ]:

Lifestyle changes: Heart disease can be managed, treated, or even prevented when individuals make certain lifestyle changes. These include eating a low-fat and low-sodium (salt) diet, known as the Dietary Approaches to Stop Hypertension (DASH) or heart-healthy diet; getting at least 30 min of moderate exercise at least 5 days a week; quitting smoking; and limiting alcohol consumption to no more than 1 bottle per day for females and no more than 2 bottles per day for males.

Medications: If lifestyle changes alone aren't sufficient to remedy the situation, the cardiologist may recommend medications to control or manage the heart disease. The kind of medication that is prescribed to a person will depend on the kind of heart disease they have.

An Invasive Approach or Surgery: If lifestyle and medication approaches aren't sufficient to treat or manage the heart disease, the cardiologist may recommend specific procedures or surgery. The type of surgery or procedure will depend on the kind of heart disease and the degree of injury the heart disease has caused to the heart.

In this study, the scope of the lifestyle changes heart disease therapy we are proposing will be limited to dietary lifestyle change.

5 Proposed dietary lifestyle change therapy recommendation system

In this section, we summarize the proposed system for recommending dietary lifestyle change therapy for heart disease. We focus mainly on the dietary aspect of lifestyle changes because, according to the Center for Science in the Public Interest, 4 of the top 10 leading causes of death in the United States (heart disease, cancer, stroke, and type 2 diabetes), including heart disease, the leading cause of death globally, are directly connected to diet [ 2 , 90 , 91 , 92 , 93 ]. A heart-healthy diet is a major way to prevent, slow, and recover from heart or cardiovascular disease [ 91 ].

Recommendation systems are software tools used for effective information filtering. They are also used by different domain experts as decision-support tools. They provide customized information or services that suit a user’s needs in a particular context, which makes them intelligent, personalized software [ 94 , 95 , 96 ]. There are several types of recommendation techniques, including content-based filtering, collaborative filtering, and demographic filtering. In the content-based filtering recommendation technique, recommendations are made using the extracted features of the information or services that suit the user's needs or preferences, based on the information provided by the user or obtained from the user’s profile [ 97 , 98 ]. In this study, in addition to the comparative review of the performance of supervised learning ML models for heart disease prediction, we also propose a content-based filtering recommendation system blueprint for recommending dietary lifestyle changes for heart disease therapy to users diagnosed with heart disease.

Dietary lifestyle change therapy recommendations for heart disease will primarily focus on suggesting locally available heart-healthy or DASH diets. These diets, which include vegetables, fruits, lean protein, whole grains, and low-fat dairy, will be tailored to the right calorie quantities for individuals diagnosed with heart disease, taking into account their sex, age, and physical activity level from among the risk factors provided by the user [ 90 , 91 , 92 , 93 , 99 , 100 ].

Sex, age, and physical activity level are extracted from the heart disease risk factors provided by the user. This information is necessary for predicting heart disease and for determining the estimated daily quantity of diet calories for individuals diagnosed with the condition. Calories are a measure of the energy that food supplies. For each unique user, the sex, age, and physical activity level (PAL) risk factors determine the daily quantity of calories of the DASH or heart-healthy diet to be recommended. This quantity is filtered using a content-based filtering recommendation technique from the values shown in Table 3, which are taken from the 2015–2020 Dietary Guidelines for Americans [ 100 ]. The 3 categories of physical activity level used in Table 3 for each age-sex group are: sedentary (Sed), moderately active (ModA), and active (Act). Figure 5 shows the blueprint for the proposed dietary lifestyle change therapy recommendation system for heart disease.
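The calorie filtering step described above can be sketched as a lookup keyed on sex, age band, and PAL. This is a minimal illustration in which simple attribute matching stands in for the content-based filter. The function name, the age band, and the calorie values in `SAMPLE_TABLE` are placeholders assumed for the example and are not the rows of Table 3.

```python
from typing import Dict, Tuple

# Key: (sex, (min_age, max_age), physical activity level) -> daily calories.
CalorieTable = Dict[Tuple[str, Tuple[int, int], str], int]

# PLACEHOLDER rows for illustration only; replace with the rows of Table 3.
SAMPLE_TABLE: CalorieTable = {
    ("female", (31, 50), "Sed"): 1800,
    ("female", (31, 50), "ModA"): 2000,
    ("female", (31, 50), "Act"): 2200,
    ("male", (31, 50), "Sed"): 2200,
    ("male", (31, 50), "ModA"): 2400,
    ("male", (31, 50), "Act"): 2800,
}


def recommend_daily_calories(sex: str, age: int, pal: str,
                             table: CalorieTable) -> int:
    """Return the daily calorie quantity matching the user's profile."""
    for (row_sex, (min_age, max_age), row_pal), kcal in table.items():
        if row_sex == sex and min_age <= age <= max_age and row_pal == pal:
            return kcal
    raise LookupError(f"no table row for sex={sex!r}, age={age}, pal={pal!r}")


# A moderately active 45-year-old female user (hypothetical profile).
print(recommend_daily_calories("female", 45, "ModA", SAMPLE_TABLE))
```

Passing the table in as an argument keeps the lookup logic separate from the guideline values, so the rows can be updated without touching the code.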

Blueprint for the Proposed Dietary Lifestyle Change Therapy Recommendation System for Heart Disease. As shown in Fig. 5, the trained ML model accepts heart disease risk factors from the web user through the web user interface and predicts whether heart disease is present. For users diagnosed with heart disease, the system then recommends dietary lifestyle change therapy using a content-based filtering recommendation technique, which filters the right quantity of daily calories required by each unique user. The dietary lifestyle change therapy recommendation module focuses mainly on recommending a locally available DASH or heart-healthy diet, in the right quantity of daily calories, for at least a week. It does so by filtering on the sex, age, and physical activity level risk factors among the heart disease risk factors submitted by the user.
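The control flow of the blueprint (predict first, recommend only on a positive prediction) can be sketched as follows. This is an assumption-laden outline, not the proposed system's implementation: `predict_and_recommend`, `toy_predict`, `toy_recommend`, and the `risk_score` field are hypothetical stand-ins for the trained ML model, the recommendation module, and the submitted risk factors.

```python
from typing import Any, Callable, Dict, Mapping, Optional


def predict_and_recommend(
    risk_factors: Mapping[str, Any],
    predict: Callable[[Mapping[str, Any]], bool],
    recommend: Callable[[Mapping[str, Any]], str],
) -> Dict[str, Optional[object]]:
    """Predict heart disease, then recommend a diet only if it is predicted."""
    has_disease = bool(predict(risk_factors))
    plan = None
    if has_disease:
        # Only sex, age, and physical activity level feed the diet filter.
        profile = {key: risk_factors[key] for key in ("sex", "age", "pal")}
        plan = recommend(profile)
    return {"heart_disease": has_disease, "diet_plan": plan}


def toy_predict(factors: Mapping[str, Any]) -> bool:
    # Toy threshold on a hypothetical score; NOT a trained model or a
    # clinical criterion.
    return factors["risk_score"] >= 0.5


def toy_recommend(profile: Mapping[str, Any]) -> str:
    # Placeholder for the calorie filtering and diet selection step.
    return f"DASH plan for {profile['sex']}, age {profile['age']}, {profile['pal']}"


user = {"sex": "male", "age": 52, "pal": "Sed", "risk_score": 0.8}
print(predict_and_recommend(user, toy_predict, toy_recommend))
```

Injecting the predictor and recommender as callables mirrors the separation between the ML model and the recommendation module in the blueprint, so either part can be replaced independently.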

The final dataset for the review contained 61 published articles, each of which executed more than one variant of supervised learning ML algorithms for heart disease prediction. The 61 selected published articles were reviewed in terms of the techniques of ML algorithms used as well as performance evaluation metrics with emphasis on the accuracies of the implemented supervised learning ML models.

In Table 2 , the article references, year of publication, number of data instances, datasets used, and the supervised learning ML algorithms used to predict heart disease are stated, together with their accuracies in percentages. For each article, the best-performing supervised learning ML algorithm is also stated in this table. Across the 61 reviewed articles, 67 algorithms were found to display superior accuracy, because five of the articles have more than one ML algorithm tied at the highest accuracy. In [ 37 ], three algorithms (kNN, RF, and DT, out of 6) each reached the same highest accuracy of 100%. In [ 67 ], two algorithms (LOR and SVM, out of 6) tied for the highest accuracy; in [ 76 ], two (SVM and RF, out of 4); in [ 83 ], two (DT and RF, out of 4); and in [ 86 ], two (kNN and RF, out of 6).
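The count of 67 follows from the ties: 56 articles contribute one best algorithm each, [ 37 ] contributes three, and the other four tied articles contribute two each (56 + 3 + 8 = 67). The sketch below shows a tie-aware selection and reproduces that tally. The accuracies for SVM, NB, and LOR in the sample article are hypothetical; only the three-way tie at 100% comes from the text.

```python
def best_algorithms(accuracies: dict) -> list:
    """Return every algorithm that shares the highest accuracy."""
    top = max(accuracies.values())
    return sorted(name for name, acc in accuracies.items() if acc == top)


# Three-way tie as reported for [37]; the other three rows are hypothetical.
article_37 = {"kNN": 100.0, "RF": 100.0, "DT": 100.0,
              "SVM": 97.1, "NB": 88.0, "LOR": 86.3}
print(best_algorithms(article_37))

# Number of tied best algorithms in the five articles named in the text.
tied = {"[37]": 3, "[67]": 2, "[76]": 2, "[83]": 2, "[86]": 2}
total_articles = 61
total_best = (total_articles - len(tied)) + sum(tied.values())
print(total_best)
```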

In summation, 61 articles were considered in this review research study, and 67 supervised learning ML algorithms were found to show superior accuracy.

The usage frequency and accuracy of the different supervised learning ML algorithms are compared in Table 4 . DT was used most frequently (in 44 of the 61 articles reviewed in this study), followed by RF (42 articles). Although the VEC supervised learning ML algorithm was among the least frequently used (12 of the 61 articles), it showed the highest average accuracy in percentage (58.3%). It was followed by RF, the second most frequently used algorithm (42 of the 61 articles), with 52.4%. Additionally, Fig. 5 shows the implementation blueprint of the proposed dietary therapy recommendation system for heart disease, and Fig. 6 shows the top 5 machine learning algorithms with the highest average accuracy in percentage.
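One reading of the percentages in Table 4, assumed here rather than stated in the text, is the share of the articles using an algorithm in which that algorithm achieved the top accuracy. Under that assumption, 58.3% of 12 articles corresponds to 7 articles and 52.4% of 42 articles to 22. These counts are inferred from the rounded percentages and are not taken from the table.

```python
def share_best(times_best: int, times_used: int) -> float:
    """Percentage of the articles using an algorithm in which it was best."""
    if times_used <= 0:
        raise ValueError("times_used must be positive")
    return 100.0 * times_best / times_used


# (times best, times used); the "times best" counts are INFERRED, see above.
usage = {"VEC": (7, 12), "RF": (22, 42)}
for name, (best, used) in usage.items():
    print(f"{name}: {share_best(best, used):.1f}%")
```

A share computed over 12 articles is more sensitive to a single study than one computed over 42, so the ranking between VEC and RF should be read with that difference in sample size in mind.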

A Bar Chart showing the top 5 Machine Learning Algorithms that showed the Highest Average Accuracy in Percentage

7 Discussion

To reduce the risk of selection bias, we screened and reviewed only research articles that used more than one supervised learning ML algorithm. The same supervised learning ML method can produce different results in different study environments, so a performance comparison between two algorithms that were used separately in different studies could give inexact results. The first limitation of our study is that we used a high-level classification of supervised learning ML algorithms to evaluate and compare them for heart disease prediction. Subclassifications or modifications of the ML algorithms were not taken into account. For example, DT variants such as the Classification and Regression Tree (CART) method and the Chi-square Automatic Interaction Detection (CHAID) algorithm were all examined under the single DT heading rather than evaluated separately.

Another limitation of this study is that, in comparing different supervised learning ML algorithms, we did not take into account the hyperparameters chosen in the reviewed articles. It has been shown that, with different values for the basic hyperparameters, the same ML algorithm can generate different accuracy outcomes for the same data [ 101 ]. For example, choosing a different activation function for the ANN supervised learning ML algorithm can change the accuracy obtained on the same data. Likewise, an MLP or DNN ML algorithm could generate different outcomes depending on the hidden layer sizes or the learning rate of the network. Finally, this study presents only the blueprint of the proposed dietary therapy recommendation system for heart disease; its implementation is left for further research.
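The effect of a hyperparameter on accuracy can be illustrated with a small, dependency-free example. The sketch swaps in kNN and its neighbour count k in place of the ANN, MLP, and DNN settings mentioned above, purely because kNN fits in a few lines. The one-dimensional data points are made up for the illustration.

```python
from collections import Counter


def knn_predict(train, x, k):
    """Predict the label of x by majority vote among its k nearest points."""
    neighbours = sorted(train, key=lambda item: abs(item[0] - x))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


def accuracy(train, test, k):
    """Fraction of test points classified correctly for a given k."""
    hits = sum(knn_predict(train, x, k) == y for x, y in test)
    return hits / len(test)


# Made-up (feature, label) pairs; 3.0 and 8.0 act as noisy training points.
TRAIN = [(1.0, 0), (2.0, 0), (3.0, 1), (6.0, 1), (7.0, 1), (8.0, 0)]
TEST = [(2.8, 0), (7.9, 1), (1.5, 0), (6.5, 1)]

for k in (1, 3):
    print(f"k={k}: accuracy={accuracy(TRAIN, TEST, k):.2f}")
```

On this toy data, k = 1 follows the two noisy training points and scores 0.50, while k = 3 outvotes them and scores 1.00. The same algorithm and the same data therefore give different accuracies, which is why comparisons across articles that do not report hyperparameters are hard to interpret.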

8 Conclusion and direction for future studies

This study aimed to compare the performance of several supervised learning ML algorithms in predicting heart disease and to propose a dietary lifestyle change therapy recommendation system for heart disease. Because the clinical, demographic, and research scope of the reviewed studies varies greatly, a comparison and evaluation is only possible where a common benchmark for data and research scope exists. As a result, we only included research publications that used more than one supervised learning ML algorithm on the same dataset for heart disease prediction. Regardless of the differences in frequency and performance, the results demonstrate the potential of these classes of supervised learning ML algorithms in predicting heart disease. VEC gave the best result. Its primary advantage is that, by combining the predictions of multiple models, the ensemble can capture different patterns in the data and make more reliable predictions.
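How a voting ensemble combines models can be sketched with hard (majority) voting, one common form of VEC. The reviewed articles may use other voting schemes, and the three prediction lists below are made up so that each model errs on a different patient.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-model label lists into one list of majority labels."""
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*predictions)]


def accuracy(predicted, truth):
    """Fraction of positions where the prediction matches the true label."""
    return sum(p == t for p, t in zip(predicted, truth)) / len(truth)


# Made-up labels for six patients (1 = heart disease, 0 = no heart disease).
truth = [1, 0, 1, 1, 0, 0]
model_a = [1, 0, 0, 1, 0, 0]  # wrong on patient 3
model_b = [1, 1, 1, 1, 0, 0]  # wrong on patient 2
model_c = [1, 0, 1, 0, 0, 0]  # wrong on patient 4

ensemble = majority_vote([model_a, model_b, model_c])
for name, preds in [("A", model_a), ("B", model_b), ("C", model_c),
                    ("vote", ensemble)]:
    print(f"{name}: {accuracy(preds, truth):.2f}")
```

Each individual model is right on 5 of 6 patients, while the vote is right on all 6 because the errors do not overlap. The gain depends on that diversity: models that make the same mistakes would be outvoted by nobody.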

Additionally, although health data is demographic [ 102 ], it was observed from the reviewed literature that most of the clinical decision support systems and models developed for heart disease prediction or heart disease risk prediction are based on the CHDUCIRD dataset. Hence, to obtain more effective and efficient heart disease prediction results, diverse clinical datasets from different demographics and regions should be collected and used for the implementation of supervised learning ML models for heart disease prediction. Lastly, the indigenous DASH diet therapy recommendation system proposed in this research will shift the paradigm from the traditional approach, in which cardiologists recommend DASH or heart-healthy diets without a formal framework for specifying users' daily calorie needs, to a formal system that specifies calorie needs accurately. This change is particularly relevant for low- and middle-income countries. This research designed a formal framework for recommending indigenous DASH diets with the appropriate calorie quantities for individuals diagnosed with or at risk of heart disease. Integrating locally available DASH diets into clinical practice recognizes the importance of economic, cultural, and environmental factors in promoting heart health, and it enhances the effectiveness of dietary interventions in reducing the problem of heart disease. Future work should extend this study to include the implementation of the proposed model.

Data availability

No datasets were generated or analysed during the current study.

World Health Organization. Cardiovascular diseases (CVDs). 2021. https://www.who.int/news-room/fact-sheets/detail/cardiovascular-diseases-(cvds) . Accessed 20 Dec 2021.

World Health Organization. Cardiovascular diseases. https://www.who.int/health-topics/cardiovascular-diseases#tab=tab_1 . Accessed 21 Dec 2021.

Gupta S, Bharti V, Kumar A. A survey on various machine learning algorithms for disease prediction. Int J Recent Technol Eng. 2019;7(6c):84–7.

Yahaya L, Oye ND, Garba EJ. A comprehensive review on heart disease prediction using data mining and machine learning techniques. Am J Artif Intell. 2020;4(1):20–9.

Aliyu A. A hybrid model for predicting malaria using data mining techniques (doctoral dissertation, American University of Nigeria, school of information, technology and computing). http://digitallibrary.aun.edu.ng:8080/xmlui/bitstream/handle/123456789/571/Aminu%20Aliyu.pdf?sequence=1&isAllowed=y

Shah D, Patel S, Bharti SK. Heart disease prediction using machine learning techniques. SN Comput Sci. 2020;1(6):1–6. https://doi.org/10.1007/s42979-020-00365-y .

Uddin S, Khan A, Hossain ME, Moni MA. Comparing different supervised machine learning algorithms for disease prediction. BMC Med Inform Decis Mak. 2019;19(1):1–16.

Cunningham P, Cord M, Delany SJ. Supervised learning. In: Cord M, Cunningham P, editors. Machine learning techniques for multimedia. Cognitive technologies. Berlin: Springer; 2008. p. 21–49.

Fielding AH. An introduction to machine learning methods. In: Fielding AH, editor. Machine learning methods for ecological applications. Boston: Springer; 1999. p. 1–35.

Katarya R, Meena SK. Machine learning techniques for heart disease prediction: a comparative study and analysis. Heal Technol. 2021;11(1):87–97. https://doi.org/10.1007/s12553-020-00505-7 .

Diwakar M, Tripathi A, Joshi K, Memoria M, Singh P. Latest trends on heart disease prediction using machine learning and image fusion. Mater Today Proc. 2021;37:3213–8.

American Heart Association. High blood pressure. https://www.heart.org/en/health-topics/high-blood-pressure . Accessed 16 Jan 2022.

Palaniappan S, Awang R. Intelligent heart disease prediction system using data mining techniques. In: 2008 IEEE/ACS international conference on computer systems and applications. Ieee: New York. 2008. pp. 108–115.

Joyce BT, Wu D, Hou L, Dai Q, Castaneda SF, Gallo LC, Talavera GA, Sotres-Alvarez D, Van Horn L, Beasley JM, Khambaty T. DASH diet and prevalent metabolic syndrome in the Hispanic community health study/study of Latinos. Prev Med Rep. 2019;15: 100950.

Jordan MI, Mitchell TM. Machine learning: trends, perspectives, and prospects. Science. 2015;349(6245):255–60.

Goodfellow I, Bengio Y, Courville A. Machine learning basics. Deep Learn. 2016;1(7):98–164.

Abdi, A. Three types of machine learning algorithms. University of Twente. 2016. https://www.researchgate.net/publication/310674228_Three_types_of_Machine_Learning_Algorithms . Accessed 15 Oct 2022.

Olaniyi EO, Oyedotun OK, Adnan K. Heart diseases diagnosis using neural networks arbitration. Int J Int Syst Appl. 2015;7(12):72.

Dangare C, Apte S. A data mining approach for prediction of heart disease using neural networks. Int J Comput Engin Technol (IJCET). 2012;3(3):30–40.

Ghodekar N, Chandran P, Pawar S. Heart disease prediction using classification techniques: a comparative study. Int J Manag IT Engin. 2019;9(7):77–89.

Karthiga AS, Mary MS, Yogasini M. Early prediction of heart disease using decision tree algorithm. Int J Adv Res Basic Engin Sci Technol. 2017;3(3):1–6.

Ayodele TO. Types of machine learning algorithms. New Adv Machine Learn. 2010;3:19–48.

Tan P-N, Steinbach M, Kumar V. Introduction to data mining. London: Pearson; 2006.

Nwaneli CU. Changing trend in coronary heart disease in Nigeria. Afrimedic J. 2010;1(1):1–4.

Chen T, Guestrin C. Xgboost: A scalable tree boosting system. In: Proceedings of the 22nd acm sigkdd international conference on knowledge discovery and data mining. 2016; pp.785–794.

Emakhu J, Shrestha S, Arslanturk S. Prediction system for heart disease based on ensemble classifiers. In: Proceedings of the 5th NA international conference on industrial engineering and operations management, Detroit, Michigan, USA. 2020; pp. 10–14.

Kumar M, Husain M, Upreti N, Gupta D. Genetic algorithm: Review and application. 2010. https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3529843

Brownlee J. Simple genetic algorithm from scratch in python. In: Machine learning mastery. 2021. https://machinelearningmastery.com/simple-genetic-algorithm-from-scratch-in-python/ . Accessed 15 Jan 2022.

Chicco D, Jurman G. The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation. BMC Genom. 2020;21(1):1–3. https://doi.org/10.1186/s12864-019-6413-7 .

Yadav KK, Sharma A, Badholia A. Heart disease prediction using machine learning techniques. Inf Technol Ind. 2021;9(1):207–14.

Abdar M, Kalhori SR, Sutikno T, Subroto IM, Arji G. Comparing performance of data mining algorithms in prediction heart diseases. Int J Electr Comput Engin. 2015. https://doi.org/10.11591/ijece.v5i6.pp1569-1576 .

Salhi DE, Tari A, Kechadi MT. Using machine learning for heart disease prediction. In: Salhi DE, Tari A, Kechadi MT, editors. CSA. Singapore: Springer; 2020. p. 70–81.

Amin MS, Chiam YK, Varathan KD. Identification of significant features and data mining techniques in predicting heart disease. Telemat Inform. 2019;36:82–93.

Kondababu A, Siddhartha V, Kumar BB, Penumutchi B. A comparative study on machine learning based heart disease prediction. In: Materials today: proceedings. 2021. https://www.sciencedirect.com/science/article/pii/S2214785321005666

Tougui I, Jilbab A, El Mhamdi J. Heart disease classification using data mining tools and machine learning techniques. Heal Technol. 2020;10:1137–44. https://doi.org/10.1007/s12553-020-00438-1 .

Singh YK, Sinha N, Singh SK. Heart disease prediction system using random forest. In: Singh YK, Sinha N, Singh SK, editors. International conference on advances in computing and data sciences. Singapore: Springer; 2016. p. 613–23.

Ali MM, Paul BK, Ahmed K, Bui FM, Quinn JM, Moni MA. Heart disease prediction using supervised machine learning algorithms performance analysis and comparison. Comput Biol Med. 2021;136: 104672.

Tarawneh M, Embarak O. Hybrid approach for heart disease prediction using data mining techniques. In: International conference on emerging internetworking, data & web technologies; Springer: Cham. 2019. pp. 447–454.

Maini E, Venkateswarlu B, Maini B, Marwaha D. Machine learning–based heart disease prediction system for Indian population: An exploratory study done in South India. Med J Armed Forces India. 2021. https://www.sciencedirect.com/science/article/abs/pii/S0377123720302148

Arumugam K, Naved M, Shinde PP, Leiva-Chauca O, Huaman-Osorio A, Gonzales-Yanac T. Multiple disease prediction using Machine learning algorithms. In: Materials today: proceedings. 2021. https://www.sciencedirect.com/science/article/abs/pii/S2214785321052202

Panda D, Dash SR. Predictive system: comparison of classification techniques for effective prediction of heart disease. In: Panda D, Dash SR, editors. Smart intelligent computing and applications. Singapore: Springer; 2020. p. 203–13.

Jindal H, Agrawal S, Khera R, Jain R, Nagrath P. Heart disease prediction using machine learning algorithms. In: Jindal H, Agrawal S, Khera R, Jain R, Nagrath P, editors. IOP conference series: materials science and engineering. Bristol: IOP Publishing; 2021. p. 012072.

Ansari MF, AlankarKaur B, Kaur H. A prediction of heart disease using machine learning algorithms. In: Ansari MF, AlankarKaur B, Kaur H, editors. International conference on image processing and capsule networks. Cham: Springer; 2020. p. 497–504.

Rani P, Kumar R, Ahmed NM, Jain A. A decision support system for heart disease prediction based upon machine learning. J Reliable Intell Environ. 2021;25:1–3. https://doi.org/10.1007/S40860-021-00133-6 .

Kamboj M. Heart disease prediction with machine learning approaches. Int J Sci Res (IJSR). 2018; ISSN 2319–7064.

Nahiduzzaman M, Nayeem MJ, Ahmed MT, Zaman MS. Prediction of heart disease using multi-layer perceptron neural network and support vector machine. In: Nahiduzzaman M, Nayeem MJ, Ahmed MT, Zaman MS, editors. 2019 4th international conference on electrical information and communication technology (EICT). New York: IEEE; 2019. p. 1–6.

Aggarwal R, Pal S. Comparison of Machine Learning Algorithms and Ensemble Technique for Heart Disease Prediction. In: Aggarwal R, Pal (Eds.) International Conference on Intelligent Systems Design and Applications. Springer: Cham. 2020. pp. 1360–1370

Mohapatra S, Dash J, Mohanty S, Hota A. An approach for heart disease prediction using machine learning. In: Mohapatra S, Dash J, Mohanty S, Hota A, editors. Intelligent systems. Singapore: Springer; 2021. p. 1–12.

Khateeb N, Usman M. Efficient heart disease prediction system using K-nearest neighbor classification technique. In: proceedings of the international conference on big data and internet of thing; 2017. pp. 21–26.

Marikani T, Shyamala K. Prediction of heart disease using supervised learning algorithms. Int J Comput Appl. 2017;165(5):41–4.

Mansoor H, Elgendy IY, Segal R, Bavry AA, Bian J. Risk prediction model for in-hospital mortality in women with ST-elevation myocardial infarction: a machine learning approach. Heart Lung. 2017;46(6):405–11.

Pahwa K, Kumar R. Prediction of heart disease using hybrid technique for selecting features. In: Pahwa K, Kumar R, editors. 2017 4th IEEE Uttar Pradesh section international conference on electrical, computer and electronics (UPCON). New York: IEEE; 2017. p. 500–4.

Krishnan S, Geetha S. Prediction of heart disease using machine learning algorithms. In: Krishnan S, Geetha S. (EDs.) 2019 1st international conference on innovations in information and communication technology (ICIICT). IEEE: New York. 2019. pp. 1–5.

Obasi T, Shafiq MO. Towards comparing and using machine learning techniques for detecting and predicting heart attack and diseases. In: Obasi T, Shafiq MO (EDs.) 2019 IEEE international conference on big data (big data); IEEE :New York. 2019. pp. 2393–2402.

Dwivedi AK. Performance evaluation of different machine learning techniques for prediction of heart disease. Neural Comput Appl. 2018;29(10):685–93. https://doi.org/10.1007/s00521-016-2604-1 .

Bahrami B, Shirvani MH. Prediction and diagnosis of heart disease by data mining techniques. J Multidiscip Engin Sci Technol (JMEST). 2015;2(2):164–8.

Beunza JJ, Puertas E, García-Ovejero E, Villalba G, Condes E, Koleva G, Hurtado C, Landecho MF. Comparison of machine learning algorithms for clinical event prediction (risk of coronary heart disease). J Biomed Inform. 2019;97: 103257.

Rajdhan A, Agarwal A, Sai M, Ravi D, Ghuli P. Heart disease prediction using machine learning. Int J Res Technol. 2020;9(04):659–62.

Kavitha M, Gnaneswar G, Dinesh R, Sai YR, Suraj RS. Heart disease prediction using hybrid machine learning model. In: Kavitha M, Gnaneswar G, Dinesh R, Sai YR, Suraj RS, editors. 2021 6th international conference on inventive computation technologies (ICICT). New York: IEEE; 2021. p. 1329–33.

Mohan S, Thirumalai C, Srivastava G. Effective heart disease prediction using hybrid machine learning techniques. IEEE Access. 2019;7:81542–54.

Anitha S, Sridevi N. Heart disease prediction using data mining techniques. J Anal Computation. 2019;13(2):1–9.

Sujatha P, Mahalakshmi K. Performance evaluation of supervised machine learning algorithms in prediction of heart disease. In: Sujatha P, Mahalakshmi K, editors. 2020 IEEE international conference for innovation in technology (INOCON). New York: IEEE; 2020. p. 1–7.

Mehanović D, Mašetić Z, Kečo D. Prediction of heart diseases using majority voting ensemble method. In: Mehanović D, Mašetić Z, Kečo D, editors. International conference on medical and biological engineering. Singapore: Springer; 2019. p. 491–8.

Desai F, Chowdhury D, Kaur R, Peeters M, Arya RC, Wander GS, Gill SS, Buyya R. Health cloud: a system for monitoring health status of heart patients using machine learning and cloud computing. Internet Things. 2022;17: 100485.

Gupta C, Saha A, Reddy NS, Acharya UD. Cardiac disease prediction using supervised machine learning techniques. J Phys Conf Ser. 2022. https://doi.org/10.1088/1742-6596/2161/1/012013/meta .

Qureshi M, Warke N. Application of machine learning for heart disease prediction. In: Proceedings of 2nd international conference on artificial intelligence: advances and applications. Springer: Singapore. 2022. pp. 267–279.

Sarah S, Gourisaria MK, Khare S, Das H. Heart disease prediction using core machine learning techniques—a comparative study. In: Sarah S, Gourisaria MK, Khare S, Das H, editors. Advances in data and information sciences. Singapore: Springer; 2022. p. 247–60.

Garg A, Sharma B, Khan R. Heart disease prediction using machine learning techniques. In: Garg A, Sharma B, Khan R, editors. IOP conference series: materials science and engineering. Bristol: IOP Publishing; 2021. p. 012046.

Jothi KA, Subburam S, Umadevi V, Hemavathy K. Heart disease prediction system using machine learning. In: materials today: proceedings. 2021. https://www.sciencedirect.com/science/article/pii/S2214785320406194

Singh A, Kumar R. Heart disease prediction using machine learning algorithms. In: Singh A, Kumar R, editors. 2020 international conference on electrical and electronics engineering (ICE3). New York: IEEE; 2020. p. 452–7.

Sharma V, Yadav S, Gupta M. Heart disease prediction using machine learning techniques. In: 2020 2nd international conference on advances in computing, communication control and networking (ICACCCN). IEEE: New York. 2020. pp. 177–181.

Swain D, Ballal P, Dolase V, Dash B, Santhappan J. An efficient heart disease prediction system using machine learning. In: machine learning and information processing; Springer: Singapore. 2020. pp. 39–50. https://doi.org/10.1007/978-981-15-1884-3_4

Anbuselvan P. Heart disease prediction using machine learning techniques. Int J Eng Res Technol. 2020;9:515–8.

Dutta P, Paul S, Shaw N, Sen S, Majumder M. Heart disease prediction: a comparative study based on a machine-learning approach. In: Dutta P, Paul S, Shaw N, Sen S, Majumder M, editors. Artificial intelligence and cybersecurity. Boca Raton: CRC Press; 2022. p. 1–18.

Terrada O, Hamida S, Cherradi B, Raihani A, Bouattane O. Supervised machine learning based medical diagnosis support system for prediction of patients with heart disease. Adv Sci Technol Engin Syst J. 2020;5(5):269–77.

Siva Rama Krishna C, Vasanthi M, Hemanth Reddy K, Jaswanth G. Heart disease prediction using machine learning. In: intelligent manufacturing and energy sustainability: proceedings of ICIMES 2022. Springer Nature Singapore: Singapore. 2023. pp. 589–595.

AbdElminaam DS, Mohamed N, Wael H, Khaled A, Moataz A. MLHeartDisPrediction: heart disease prediction using machine learning. J Comput Commun. 2023;2(1):50–65.

Aladeyelu AC, Adekunle GT. Predicting heart disease using machine learning. J Multidiscip Eng Sci Technol. 2023;10(4):15837–15841 https://doi.org/10.21608/jocc.2023.282098 .

Ekle FA, Agu NM, Bakpo FS, Udanor CN, Eneh AH. A machine learning model and application for heart disease prediction using prevalent risk factors in Nigeria. Int J Math Anal Model. 2023;6(2):333–44.

Kadhim MA, Radhi AM. Heart disease classification using optimized machine learning algorithms. Iraqi J Comput Sci Math. 2023;4(2):31–42.

Al Ahdal A, Rakhra M, Rajendran RR, Arslan F, Khder MA, Patel B, Rajagopal BR, Jain R. Monitoring cardiovascular problems in heart patients using machine learning. J Healthc Engin. 2023. https://doi.org/10.1155/2023/9738123 .

Burle R, Gaurkhede S, Dewangan L, Pacharaney U. Prediction of heart disease using machine learning algorithms. In: 2024 IEEE International conference on interdisciplinary approaches in technology and management for social innovation (IATMSI). IEEE: New York. 2024. pp. 1–7.

Yadav AL, Soni K, Khare S. Heart diseases prediction using machine learning. In: 2023 14th international conference on computing communication and networking technologies (ICCCNT). IEEE: New York. 2023. pp. 1–7.

Shrivastava PK, Sharma M, Kumar A. HCBiLSTM: a hybrid model for predicting heart disease using CNN and BiLSTM algorithms. Meas Sens. 2023;25:100657.

Mijwil MM, Faieq AK, Aljanabi M. Early detection of cardiovascular disease utilizing machine learning techniques: evaluating the predictive capabilities of seven algorithms. Iraqi J Comput Sci Math. 2024;5(1):263–76.

Ansari GA, Bhat SS, Ansari MD, Ahmad S, Nazeer J, Eljialy AE. Performance evaluation of machine learning techniques (MLT) for heart disease prediction. Comput Math Method Med. 2023. https://doi.org/10.1155/2023/8191261 .

Wedro B, Davis CP. Heart disease (cardiovascular disease, CVD). In: MedicineNet. 2020. https://www.medicinenet.com/heart_disease_coronary_artery_disease/article.htm . Accessed 15 Jan 2022.

Mayo Clinic Heart disease. 2021. https://www.mayoclinic.org . https://www.mayoclinic.org/diseases-conditions/heart-disease/diagnosis-treatment/drc-20353124 . Accessed Jan 16 2022.

Krans B, Butler N. Balanced Diet. Healthline. 2020. https://www.healthline.com/health/balanced-diet#importance . Accessed 16 Jan 2022.

Omotoye EF, Sanusi AR. Management of hypertension using dietary approach to stop hypertension (DASH) among adults in Ekiti state Nigeria. Niger J Nutr Sci. 2016;37(1):136–43.

World Health Organization. Healthy diet. 2020. https://www.who.int/news-room/fact-sheets/detail/cardiovascular-diseases-(cvds) . Accessed 20 Dec 2021.

Moore TJ, Conlin PR, Ard J, Svetkey LP, DASH collaborative research group. DASH (dietary approaches to stop hypertension) diet is effective treatment for stage 1 isolated systolic hypertension. Hypertension. 2001;38(2):155–8.

Siervo M, Lara J, Chowdhury S, Ashor A, Oggioni C, Mathers JC. Effects of the dietary approach to stop hypertension (DASH) diet on cardiovascular risk factors: a systematic review and meta-analysis. Br J Nutr. 2015;113(1):1–5.

Adomavicius G, Tuzhilin A. Toward the next generation of recommender systems: a survey of the state-of-the-art and possible extensions. IEEE Trans Knowl Data Eng. 2005;17(6):734–49.

Burke R. Hybrid recommender systems: survey and experiments. User Model User Adap Inter. 2002;12(4):331–70. https://doi.org/10.1023/A:1021240730564 .

Pazzani MJ, Billsus D. Content-based recommendation systems. In: Pazzani MJ, Billsus D, editors. The adaptive web. Singapore: Springer; 2007. p. 325–41.

Javed U, Shaukat K, Hameed IA, Iqbal F, Alam TM, Luo S. A review of content-based and context-based recommendation systems. Int J Emerg Technol Learn (iJET). 2021;16(3):274–306.

Ekle FA, Vaatyough FB, Ugba PT. The modelling of a hybrid recommender system for Nigerian University bookshops. Int J Res Comput Appl Robot. 2018;6(6):22–9.

Caudle J. The DASH diet. What to eat on the D.A.S.H. diet?. 2022. https://web.facebook.com/DrJenCaudle/videos/353414823379298 . Accessed 21 Mar 2022.

DietaryGuidelines.gov. 2015–2020 dietary guidelines for Americans. https://health.gov/dietaryguidelines/2015/guidelines/appendix-2/ . Accessed 1 Jan 2022.

Lucic M, Kurach K, Michalski M, Bousquet O, Gelly S. Are GANs created equal? a large-scale study. In: Proceedings of the 32nd international conference on neural information processing systems. Curran Associates Inc.: New York. 2018. pp. 698–707.

Grundy E, Murphy MJ. Demography and public health. In: Detels R, Gulliford M, Karim QA, Tan CC, editors. Oxford textbook of global public health. 6th ed. Oxford: Oxford University Press; 2015. p. 718–35.


Acknowledgements

We thank Ameh Daniel Peter (Ph.D.) of the University of Edinburgh School of Engineering for providing articles used in this research that are not open access, and Akaangee Nathaniel (M.Sc.) of the Nutrition and Dietetic Department, Federal Medical Centre Makurdi, for his dietary advisory role. Lastly, we thank the University of Nigeria, Nsukka (UNN) for providing a conducive environment for carrying out this research.

Availability of supporting data

The articles reviewed in this study were extracted from online databases (i.e., PubMed and Google Scholar).

The authors declare that no funds or grants were received during the preparation of this manuscript.

Author information

Authors and affiliations.

Department of Computer Science, University of Nigeria, Nsukka (UNN), Nigeria

Francis Adoba Ekle

Physician Consultant Cardiologist in the Department of Internal Medicine, Federal Medical Centre, Keffi, Nigeria

Vincent Shidali

Doctoral Student in the Department of Computer Science, University of Nigeria, Nsukka (UNN), Nigeria

Richard Emoche Ochogwu

Doctoral Student in the School of Computing, University of Portsmouth, Portsmouth, UK

Igoche Bernard Igoche


Contributions

E.F.A., S.V., I.B.I. and O.E.R. contributed to the study conception and design. Material preparation and analysis were performed by E.F.A. The first draft of the manuscript was written by E.F.A., and all authors commented on previous versions of the manuscript. E.F.A., S.V., I.B.I. and O.E.R. read and approved the final manuscript.

Corresponding author

Correspondence to Francis Adoba Ekle .

Ethics declarations

Ethics approval and consent to participate.

Not applicable.

Human and animal rights

Consent for publication, competing interests.

The authors declare that there is no actual or potential conflict of interest. The authors also have no relevant financial or non-financial competing interests to disclose.

Additional information

Publisher's note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article

Ekle, F.A., Shidali, V., Ochogwu, R.E. et al. Machine Learning models for heart disease prediction and dietary lifestyle change therapy recommendation: a systematic review. Discov Artif Intell 4 , 113 (2024). https://doi.org/10.1007/s44163-024-00181-w


Received : 10 December 2023

Accepted : 21 October 2024

Published : 19 December 2024

DOI : https://doi.org/10.1007/s44163-024-00181-w


Machine learning
Supervised learning machine learning
Heart disease prediction
Dietary therapy

Heart Disease Prediction Using Machine Learning


Open access
Published: 19 December 2024

Detecting cardiovascular diseases using unsupervised machine learning clustering based on electronic medical records

Ying Hu ORCID: orcid.org/0009-0008-4587-3311 1 , 2 na1 ,
Hai Yan 3 na1 ,
Ming Liu 2 , 6 ,
Jing Gao 4 ,
Lianhong Xie 4 ,
Chunyu Zhang 1 ,
Lili Wei 4 ,
Yinging Ding 5 &
Hong Jiang ORCID: orcid.org/0000-0002-7260-9646 1 , 2

BMC Medical Research Methodology volume 24, Article number: 309 (2024)


Electronic medical records (EMR)-trained machine learning models have potential in CVD risk prediction: by integrating a range of medical data from patients, they can facilitate timely diagnosis and classification of CVDs. We tested the hypothesis that an unsupervised ML approach utilizing EMR could be used to develop a new model for detecting prevalent CVD in clinical settings.

We included 155,894 patients (aged ≥ 18 years) discharged between January 2014 and July 2022 from Xuhui Hospital, Shanghai, China, including 64,916 CVD cases and 90,979 non-CVD cases. K-means clustering was used to generate the clustering models, with k = 2, 4, and 8 as the predetermined numbers of clusters. Bayesian theorem was used to estimate the models’ predictive accuracy.

The overall predictive accuracy of the 2-, 4-, and 8-classification clustering models in the training set was 0.856, 0.8634, and 0.8506, respectively. Similarly, the predictive accuracy of the 2-, 4-, and 8-classification clustering models in the testing set was 0.8598, 0.8659, and 0.8525, respectively. After reducing from 19 dimensions to 2 dimensions by principal component analysis, significant separation was observed for CVD cases and non-CVD cases in both training and testing sets.

Our findings indicate that EMR data can support the development of a robust model for CVD detection through an unsupervised ML approach. Further investigation using a longitudinal design is needed to refine the model for application in clinical settings.


Introduction

Cardiovascular diseases (CVD) are the leading cause of death globally, accounting for approximately 18 million deaths annually [ 1 ], and this number is expected to rise to 23.6 million by 2030. In China, two out of every five deaths are attributed to CVD, and an estimated 330 million people are affected [ 2 ]. Traditional statistics-based prediction tools for future CVD [ 3 ], such as the Framingham Risk Score [ 4 ], Systematic Coronary Risk Evaluation [ 5 ] and QRISK scores [ 6 , 7 ], are commonly used in primary prevention settings. However, these methods use a common set of risk factors, their overall accuracy remains unsatisfactory, and their application to early detection is limited [ 3 , 8 ]. Clinicians diagnose CVD by evaluating the clinical symptoms and signs of patients and using auxiliary diagnostic methods, such as blood tests and imaging (non-invasive and invasive) examinations. These procedures are expensive, time-consuming and often require specialized expertise. Asymptomatic individuals may be overlooked during routine physical examinations or hospitalization for other unrelated diseases. An automated CVD detection tool that helps identify high-risk individuals quickly and accurately is needed.

Machine learning (ML), a technique used to realize artificial intelligence, broadens the scope of traditional statistics by identifying nonlinear relationships and higher-order interactions among numerous variables. It can be categorized into supervised and unsupervised learning [ 8 ]. Supervised ML builds models by associating a certain set of features with known outcomes (labeled data) to predict outcomes for new data; examples include naive Bayes, random forest, Logistic regression, support vector machines (SVM), K-Nearest Neighbor (KNN), artificial neural networks [ 9 ] and genetic algorithms [ 10 ]. Unsupervised ML, on the other hand, focuses on identifying the underlying patterns in unlabeled data, and includes clustering, association and dimensionality reduction. Clustering analysis identifies distinct subgroups within extensive and intricate data. K-means clustering is an unsupervised approach that groups objects into K clusters based on their features. This technique ensures that each data point is assigned to the cluster whose centroid is closest to it [ 11 ]. Dimension reduction maps high-dimensional data to a low-dimensional representation while preserving the inherent variation and structure of the original full-dimensional data. A recent study [ 12 ] employed an unsupervised ML approach, specifically multiple kernel learning-based dimension reduction and K-means clustering, to combine echocardiographic data and clinical parameters to phenotype heart failure patients.

ML has been increasingly utilized to improve the accuracy and speed of CVD prediction and diagnosis [ 13 ]. Nevertheless, the majority of ML-based prediction models are built on community-based populations that share similar features [ 14 , 15 , 16 , 17 ], and the prevalence and severity of CVD may also affect the models’ accuracy, limiting their clinical application [ 8 ]. Importantly, electronic medical records (EMR), the digital version of paper records, were initially introduced in hospitals to improve healthcare efficiency and promote patient care. EMR contain a wide variety of data, such as demographics, diagnoses, medications, and laboratory and imaging tests. With the growing availability of rich, large-sample data recorded in EMR, there is growing interest in translating these data into clinical practice through the application of ongoing machine learning and AI advancements [ 18 ]. EMR-trained ML models have potential in CVD risk prediction: by integrating a range of medical data from patients, they can facilitate timely diagnosis and classification of CVDs [ 19 ]. Nevertheless, few studies have used EMR data to construct CVD prediction models [ 20 , 21 ].

Thus, using EMR data, we employed K-means clustering and Bayesian theorem to construct a model that can accurately identify patients with a high probability of having CVD in clinical settings. K-means clustering was utilized to generate the clustering models, and Bayesian theorem was utilized to estimate their predictive accuracy. Our work provides an example demonstrating the application of EMR-based ML to develop a prediction model for assessing the likelihood of having CVD.

Data source

The study obtained data from the electronic medical record (EMR) system and clinical laboratory information system (LIS) of Xuhui Central Hospital, an affiliate of Fudan University in China. The data consisted of diagnostic information and laboratory test results for adult patients who were discharged from January 2014 to July 2022. This study was performed in accordance with the guidelines of the Declaration of Helsinki. The study design was approved by the Ethics Committee of Shanghai Xuhui Central Hospital (approval no: 2023033), and the institutional review board waived the requirement to obtain the informed consent. The medical record number, gender, age and ICD-10 diagnostic information were extracted from the EMR system using SQL statements. A total of 155 894 patients were included.

The primary outcome of this study was the presence of CVD in each subject. CVD was defined based on the primary symptoms outlined in the International Classification of Diseases, 10th Revision (ICD-10) diagnostic information. These symptoms included “coronary heart disease arrhythmia”, “coronary artery insufficiency”, “coronary heart disease”, “coronary artery slow flow”, “coronary artery bypass surgery status”, “coronary artery stent thrombosis”, “coronary artery stent implantation status”, “coronary artery stenosis”, “coronary artery fistula”, “coronary atherosclerosis”, and “coronary atherosclerotic heart disease” [ 22 ]. Patients exhibiting the aforementioned symptoms were categorized as cases of CVD ( n = 64916) (Table 1 ), while the other patients, who did not display these symptoms, were classified as non-CVD cases ( n = 90979).

We searched the LIS system for various laboratory test results upon admission, including total cholesterol (TC), triglyceride (TG), high-density lipoprotein (HDL), low-density lipoprotein (LDL), blood glucose, creatine kinase (CK), CK-MB isoenzyme (CK-MB), troponin (Tn), myoglobin (Mb), angiotensin (I/II), aldosterone, hemorheology, brain natriuretic peptide (BNP), glycosylated hemoglobin (GHB), homocysteine (HCY), tumor necrosis factor (TNF), interleukin, C-reactive protein (CRP), D-dimer, fibrinogen, creatinine, urea nitrogen, uric acid, glomerular filtration rate (GFR), plasma viscosity, erythrocyte aggregation index, hemoglobin, blood sodium, blood potassium, and other relevant test results.

Data preprocessing and variable selection

During data cleaning, the incomplete, incorrect, inaccurate, and irrelevant parts of the 155 894 patients’ data were identified and then replaced, modified, or deleted. Due to the inherent characteristics of the mining process, the vast majority of data attributes utilized within this method were of a quantitative type, specifically integer or real number data. The analysis eliminated gender as a variable due to its binary nature. Three medical experts with experience in the diagnosis of CVD selected the predictor variables (features) based on a comprehensive review of relevant literature. In addition, features with missing data in ≥ 20% of patients were removed, and features with missing data in < 20% of patients were subjected to multiple imputation. The removed features included angiotensin, aldosterone, brain natriuretic peptide, homocysteine, free triiodothyronine, free tetraiodothyronine, and thyroid stimulating hormone.

The preliminary list focused on 15 variables that are clearly implicated in the pathogenesis of CVD [ 23 ], including blood lipids (TC, TG, HDL, LDL), cardiac markers (CK, CK-MB, Mb, Tn), renal function (creatinine, urea nitrogen, uric acid, GFR) and blood glucose markers (glucose, GHB). Four additional variables that have previously been associated with CVD, but lack robust clinical evidence, were included in this study. These variables included coagulation markers (D-dimer and fibrinogen) as well as other biomarkers (hemoglobin, blood sodium, and blood potassium). Finally, 19 features were selected as input for the ML algorithm. Table 2 shows the description of selected variables. Z-score normalization was used to standardize the numerical variables.
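The two mechanical steps described above, dropping features that are missing in at least 20% of patients and z-score normalizing the rest, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the record layout, the helper names `drop_sparse_features` and `zscore`, and the toy values are hypothetical, and the multiple-imputation step is deliberately left out.

```python
import math

def drop_sparse_features(rows, max_missing=0.20):
    """Keep only features whose share of missing values (None) is below max_missing."""
    n = len(rows)
    kept = [f for f in rows[0]
            if sum(r[f] is None for r in rows) / n < max_missing]
    return [{f: r[f] for f in kept} for r in rows]

def zscore(values):
    """Standardize numbers to mean 0 and (population) standard deviation 1."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / std for v in values]

# Hypothetical toy records; None marks a missing laboratory value.
records = [
    {"TC": 4.0, "LDL": 2.0, "BNP": None},
    {"TC": 5.0, "LDL": 3.0, "BNP": None},
    {"TC": 6.0, "LDL": 4.0, "BNP": 80.0},
    {"TC": 5.0, "LDL": 3.0, "BNP": None},
]

filtered = drop_sparse_features(records)          # BNP is 75% missing, so it is dropped
tc_scaled = zscore([r["TC"] for r in filtered])   # standardized total cholesterol
```

In a real analysis, features that survive the filter but still contain gaps would need imputation before standardization; the toy data avoids that case.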

Statistical machine learning analysis

The entire dataset was randomly split into two non-overlapping sets: a training set (90%, n = 140304) and a testing set (10%, n = 15590). We first ran our unsupervised ML algorithm on the training set to generate the prediction model (i.e., create clusters), and then tested the models using the features of the testing set to assess their ability to accurately infer the class labels for the patients in the testing set. The predictive accuracy of the clusters and models was then estimated using the Bayesian theorem. Principal component analysis (PCA) was additionally employed to reduce the number of features from 19 to 2 dimensions in both the training and testing sets. This allowed for the visualization of the sample results projected onto the first two components [ 24 ]. Because the principal components are the continuous solutions to the discrete cluster membership indicators for K-means clustering, PCA can serve as a tool to evaluate the 2-classification clustering model from a different angle [ 25 ]. The modeling process is depicted in Fig. 1 .
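As a rough illustration of the 90/10 split, the sketch below shuffles patient indices with a fixed seed and cuts them with integer arithmetic. The seed, the shuffle-then-cut strategy, and the function name `split_indices` are assumptions; the paper states only that the split was random and non-overlapping. With 155 894 patients, flooring the 90% share reproduces the reported set sizes.

```python
import random

def split_indices(n, num=9, den=10, seed=0):
    """Shuffle indices 0..n-1 and return (train, test) with floor(n * num / den) in train."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)   # seeded for reproducibility (an assumption)
    n_train = n * num // den
    return idx[:n_train], idx[n_train:]

train_idx, test_idx = split_indices(155894)
print(len(train_idx), len(test_idx))   # 140304 15590
```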

Flow chart of ML approach to establish CVD detection model

K-means clustering and Bayesian theorem

K-means clustering was used to partition the dataset into a fixed number (K) of distinct clusters. We selected k = 2, 4, and 8 as the predetermined numbers of clusters and iterated 1 million times to guarantee the stability of the results. The input of the model was a normalized vector of 19 parameters, and the output was whether CVD was present. We used K-means clustering to group the patients and then used the clusters to separate patients with CVD from those without. Ideally, in each of the three clustering models (2-, 4-, and 8-classification), patients with CVD would fall into some clusters while patients without CVD would fall into the others. In reality, this ideal state cannot be achieved. Our data covered only the major symptoms of patients who were diagnosed at a given time in the hospital, and may represent only their condition at that moment. Furthermore, not all of the 19 features are strongly related to CVD pathogenesis. In practice, the more uneven the ratio of CVD to non-CVD cases within a cluster, the better that cluster discriminates whether CVD is present, and the more such clusters a clustering model contains, the better the model as a whole discriminates whether CVD is present.
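For readers unfamiliar with the algorithm, a minimal sketch of K-means (Lloyd's algorithm) is given below. It is not the authors' implementation: the random initialization, the stopping rule (stop when assignments no longer change, rather than running a fixed 1 million iterations), and the two-dimensional toy points are all assumptions made for illustration.

```python
import random

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, max_iter=100, seed=0):
    """Lloyd's algorithm: alternate nearest-centroid assignment and centroid update."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)        # k distinct data points as initial centroids
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda c: squared_distance(p, centroids[c]))
                      for p in points]
        if new_labels == labels:             # assignments are stable: converged
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:                      # an empty cluster keeps its old centroid
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centroids

# Hypothetical, well-separated toy points standing in for normalized feature vectors.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
labels, centroids = kmeans(points, k=2)
```

The cluster indices themselves carry no meaning; which cluster corresponds to CVD is decided afterwards from the share of CVD cases in each cluster.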

The model was constructed to classify patients accurately, ideally making it possible to ascertain their disease status (i.e., CVD or non-CVD) with 100% probability. Therefore, after calculating the proportion of CVD in each cluster of the clustering model, the prediction accuracy of each single cluster in the three clustering models was calculated by using the inverse probability principle of Bayesian theorem, and then the overall prediction accuracy of the three clustering models was calculated by using Bayesian theorem. The predictive accuracy of the clustering model was determined by dividing the sum of the size of the bigger group in each cluster by the total number of samples. The specific method to calculate the accuracy by using Bayesian theorem was as follows:

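The predictive probability for each cluster \(n\), written here as \(P_{n}\), by the Bayesian theorem was:

\[P_{n}=\frac{X_{n}^{max}}{X_{n}^{all}}\]

where \({X}_{n}^{max}\) refers to the size of the bigger group (CVD cases or non-CVD cases) in the cluster, and \({X}_{n}^{all}\) refers to the total number of patients in the cluster.

The overall prediction accuracy (model performance) of our model, written here as \(P_{all}\), by Bayesian theorem was:

\[P_{all}=\sum_{n}\frac{X_{n}^{all}}{X_{all}^{all}}\cdot\frac{X_{n}^{max}}{X_{n}^{all}}=\frac{\sum_{n}X_{n}^{max}}{X_{all}^{all}}\]

where \({X}_{all}^{all}\) refers to the number of all subjects in the sample.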

This shows that the predictive accuracy of the clustering model is determined by dividing the sum of the size of the bigger group in each cluster by the total number of samples.
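The rule stated above (majority-group size divided by cluster size for a single cluster, and the sum of majority-group sizes divided by the total sample size for the whole model) can be written out in a few lines. The function names and the per-cluster counts below are hypothetical and chosen only to make the arithmetic easy to follow; they are not values from the study's tables.

```python
def cluster_accuracy(cvd, non_cvd):
    """Share of the larger group (CVD or non-CVD) within one cluster."""
    return max(cvd, non_cvd) / (cvd + non_cvd)

def overall_accuracy(clusters):
    """Sum of the larger group in every cluster divided by the total sample size."""
    total = sum(cvd + non_cvd for cvd, non_cvd in clusters)
    return sum(max(cvd, non_cvd) for cvd, non_cvd in clusters) / total

# Hypothetical (CVD, non-CVD) counts for a 2-cluster model.
clusters = [(850, 150), (140, 860)]

per_cluster = [cluster_accuracy(c, n) for c, n in clusters]   # [0.85, 0.86]
overall = overall_accuracy(clusters)                          # (850 + 860) / 2000 = 0.855
```

Equivalently, the overall figure is the size-weighted average of the per-cluster accuracies, which is why small clusters contribute little to it.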

Model performance

The predictive probability of detecting the existence of CVD for a single cluster was calculated as the number of patients with prevalent CVD in that cluster divided by the total number of patients in the cluster. The predictive probability of detecting prevalent CVD in each cluster was obtained from k = 2, 4, and 8 classifications, respectively. After calculating the proportions of CVD and non-CVD cases in each cluster from k = 2, 4, and 8 classifications, the predictive accuracy of each cluster was calculated by Bayesian theorem. We then calculated the predictive accuracy (performance) of the overall model, which aggregates the predictive accuracy of all single clusters as shown above.

Comparisons of K-means clustering with other ML algorithms

We conducted a comparative experiment with three traditional ML methods to evaluate the performance of our K-means clustering approach. The models included in this comparison were SVM, K-Nearest Neighbor (KNN), and Logistic regression. After establishing the models, we calculated area under the curve (AUC) of the models separately. Finally, we plotted the Receiver Operating Characteristic (ROC) curve. AUC served as the main indicator of model performance.
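Since AUC is the main comparison metric here, a small sketch of how it can be computed may help. The version below uses the rank interpretation of AUC: the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half. The function name `roc_auc`, the labels, and the scores are illustrative assumptions; the paper does not state which software computed its AUC values.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for lab, s in zip(labels, scores) if lab == 1]
    neg = [s for lab, s in zip(labels, scores) if lab == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical true labels (1 = CVD) and classifier scores.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

auc = roc_auc(labels, scores)   # 3 of the 4 positive/negative pairs are ordered correctly: 0.75
```

This pairwise form is quadratic in the number of samples, so it is suited to explanation rather than to a cohort of this size, where a sort-based implementation would be preferable.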

Characteristics of study subjects

Of the 155 894 patients included, we identified 41.64% who had already experienced a CVD outcome (during or before baseline). The remaining patients (90 979) did not experience any CVD outcome. Coronary atherosclerotic heart disease was the most common CVD (61 951 patients), followed by atherosclerosis (1 226 patients) and arrhythmia type of coronary heart disease (892 patients). Table 1 shows the number of patients according to different CVD symptoms.

Predictive probability of each cluster in 2-, 4-, and 8-classification clustering models

K-means clustering was used to classify the patients in the training set, with 2, 4, and 8 chosen as the predetermined number of clusters. As shown in Fig. 2 and Table 3, in the 2-classification clustering model, the predictive probabilities of detecting prevalent CVD in clusters 1 and 2 were 0.8473 and 0.1384, respectively. In the 4-classification clustering model, the predictive probabilities of detecting prevalent CVD in clusters 1, 2, 3 and 4 were 0.4418, 0.1288, 0.8899 and 0, respectively. In the 8-classification clustering model, the predictive probabilities of detecting prevalent CVD in clusters 1, 2, 3, 4, 5, 6, 7 and 8 were 0.0938, 0.6252, 0.8958, 0.4400, 0.3333, 0, 0.4271, and 0.2056, respectively. For each clustering model, the cluster with the highest probability was the one most likely to have prevalent CVD.

The clustering models were further evaluated in the testing set. As shown in Fig. 3 and Table 4, in the 2-classification clustering model, the predictive probabilities of detecting prevalent CVD in clusters 1 and 2 were 0.8518 and 0.1351, respectively. In the 4-classification clustering model, the predictive probabilities of detecting prevalent CVD in clusters 1, 2, 3 and 4 were 0.4480, 0.1261, 0.8906, and 0, respectively. In the 8-classification clustering model, the predictive probabilities of detecting prevalent CVD in clusters 1, 2, 3, 4, 5, 6, 7 and 8 were 0.0916, 0.6287, 0.8943, 1, 1, 0, 0.4065 and 0.2109, respectively.

It should be noted that in the 4- and 8-clustering models, two clusters accounting for the majority of the total samples provided the main information needed to determine whether or not CVD was present, whereas other clusters accounting for a relatively small proportion of the overall samples provided minimal information.

Distribution of CVD and non-CVD cases in each cluster with different predetermined number of clusters in the training set

Distribution of CVD and non-CVD cases in each cluster with different predetermined number of clusters in the testing set

Model performance of 2-, 4-, and 8- classification clustering models

Bayesian theorem was used to assess the 2-, 4-, and 8-classification clustering models’ predictive accuracy as the model performance. The overall predictive accuracy of the 2-, 4-, and 8-classification clustering models in the training set was 0.856, 0.8634, and 0.8506, respectively, while the predictive accuracy of the 2-, 4-, and 8-classification clustering models in the testing set was 0.8598, 0.8659, and 0.8525, respectively (Table 5 ). All values from the training and testing sets were similar and above 0.85, showing that the models performed well in detecting CVD.
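As a rough consistency check, the reported 2-cluster training accuracy can be approximately recovered from the two per-cluster CVD probabilities (0.8473 and 0.1384) and the cohort prevalence. The sketch assumes that the training-set prevalence equals the whole-cohort prevalence (64 916 of 155 894), which a random split should approximately preserve but which the paper does not state; the function name `implied_accuracy` is hypothetical, and the cluster weights are inferred rather than taken from the tables.

```python
def implied_accuracy(p1, p2, prevalence):
    """Infer cluster 1's share of patients from the prevalence, then the overall accuracy."""
    # prevalence = w1 * p1 + (1 - w1) * p2, solved for w1
    w1 = (prevalence - p2) / (p1 - p2)
    # each cluster predicts its majority class, so its accuracy is max(p, 1 - p)
    acc = w1 * max(p1, 1 - p1) + (1 - w1) * max(p2, 1 - p2)
    return w1, acc

w1, acc = implied_accuracy(0.8473, 0.1384, 64916 / 155894)
print(round(w1, 3), round(acc, 3))   # about 0.392 and 0.856
```

Under this assumption, roughly 39% of training patients would fall in the CVD-rich cluster, and the implied accuracy agrees with the reported 0.856 to three decimals.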

Clustering visualization

Because predictive accuracy did not depend on the number of clusters, as shown above, the 2-classification clustering model is the simplest and was therefore considered optimal. PCA was conducted to reduce 19 dimensions (features) down to two dimensions. PCA plots of the samples projected onto the first two principal components in the training and testing sets are shown in Figs. 4 and 5 , respectively. Significant separation was observed for CVD cases and non-CVD cases in both training and testing sets.
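The projection onto the first two principal components can be sketched without external libraries by applying power iteration to the covariance matrix and deflating it once. This is only one possible way to compute PCA and is an assumption here, since the paper does not describe its PCA implementation; the three-feature toy data stands in for the 19 standardized features.

```python
import math

def center(rows):
    means = [sum(col) / len(rows) for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]

def covariance(rows):
    n, d = len(rows), len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) / (n - 1) for j in range(d)]
            for i in range(d)]

def top_component(cov, iters=500):
    """Power iteration: returns (eigenvalue, unit eigenvector) of the dominant direction."""
    d = len(cov)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    cv = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
    return sum(a * b for a, b in zip(v, cv)), v

def pca_2d(rows):
    """Project centered rows onto the first two principal components."""
    x = center(rows)
    cov = covariance(x)
    lam1, v1 = top_component(cov)
    d = len(cov)
    # Deflation removes the first component so the next power iteration finds the second.
    deflated = [[cov[i][j] - lam1 * v1[i] * v1[j] for j in range(d)] for i in range(d)]
    _, v2 = top_component(deflated)
    return [(sum(a * b for a, b in zip(r, v1)), sum(a * b for a, b in zip(r, v2)))
            for r in x]

# Hypothetical toy data: most variance along feature 1, less along feature 2, none along 3.
rows = [(2.0, 0.0, 1.0), (0.0, 1.0, 1.0), (-2.0, 0.0, 1.0), (0.0, -1.0, 1.0)]
proj = pca_2d(rows)
```

Plotting the two coordinates of `proj`, colored by CVD status, would give the kind of two-dimensional view described in this section.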

Principal component analysis (PCA) of the training set. PCA plot with samples plotted in two dimensions using their projections on the first two principal components

Principal component analysis (PCA) of the testing set. PCA plot with samples plotted in two dimensions using their projections on the first two principal components

Performance of other models

The evaluation of models of KNN, SVM and Logistic regression was based on the testing set, and the results are presented in Table 6 . The predictive accuracy for each model was as follows: K-means clustering achieved the highest accuracy of 0.8598, followed by KNN with a predictive accuracy of 0.846, SVM with a predictive accuracy of 0.819, and Logistic regression with a predictive accuracy of 0.7992 (Fig. 6 ).

ROC Curve for KNN, Logistic regression and SVM models

In this study, data retrieved from the EMR were used to construct a CVD detection model using an unsupervised ML algorithm, and the model’s predictive accuracy was subsequently assessed using the Bayesian theorem. Our study confirms the efficacy of unsupervised ML as a new approach for identifying individuals at high risk of having CVD by utilizing routine blood tests conducted during physical examinations or hospitalization for other medical conditions. This can assist healthcare providers in assessing the necessity for additional health examinations or appropriate treatment, thereby facilitating early detection of CVD and reducing unnecessary medical expenses.

Unsupervised clustering algorithms, which do not require labeled input data, have proven to be useful in disease detection, diagnosis and classification [ 26 ]. In a recent work, hierarchical clustering analysis was used to evaluate numerous clinical variables and discovered new clinical phenotypes of atrial fibrillation [ 27 ]. Another study utilized K-means clustering to detect the varied etiology and prognosis of heart failure with preserved ejection fraction [ 28 ]. Our investigation showed that by extracting information from underutilized EMR data, the K-means clustering models surpassed the performance of SVM, KNN and Logistic regression models, with a predictive accuracy of over 85% in both the training and testing sets. Our findings suggest that an unsupervised ML approach may yield novel tools for detecting CVD with high accuracy. Furthermore, since the patient’s data may be obtained from the EMR without the necessity of gathering additional health information in the context of limited medical expenditures, the adoption of this strategy is simple and efficient.

Various CVD guidelines recommend different CVD risk prediction tools. The most commonly used tool is the Framingham risk score, which incorporates age, sex, diabetes, smoking, systemic blood pressure, and body mass index [ 29 ]. The QRISK2 score, another frequently used prediction tool, incorporates many factors such as age, gender, race, blood pressure, diabetes, family history of coronary heart disease, chronic renal disease, blood lipids, rheumatoid arthritis, medication use, weight, smoking, etc. [ 30 ]. ML-based prediction models, by contrast, often incorporate a diverse array of variables. For example, an ML-based model for CVD prediction was developed using a dataset from the UK BioBank, which consisted of 423,604 CVD-free patients; the model was built using 473 variables [ 31 ]. However, due to the lack of a solid pathological basis and the difficulty professionals have in interpreting it, such a model is rarely used in clinical settings. The 19 variables in our selection from EMR data were chosen based on their clinical significance. Specifically, TC, TG, HDL, and LDL are key components of blood lipid profiles. Glucose and GHB are linked to diabetes, whereas creatinine, urea nitrogen, uric acid, and GFR are associated with chronic kidney disease. Mb, Tn, and CK-MB are important in diagnosing coronary heart disease since their levels are typically elevated in those with acute coronary syndrome. The current guidelines incorporate these variables, but do not include D-dimer, fibrinogen, hemoglobin, blood sodium, and blood potassium [ 32 , 33 ]. It has been noted that the coagulation indicators D-dimer and fibrinogen are elevated during thromboembolism. During CVD events, the blood is hypercoagulable as a result of activation of coagulation mechanisms [ 34 ]. Hemoglobin is an indicator of blood viscosity, and an increase in blood viscosity has been linked to CVD events [ 35 ]. Elevated sodium levels have a direct influence on the progression of hypertension, which is considered a notable risk factor for ischemic heart disease, stroke, and other conditions [ 36 ]. According to previous reports, serum potassium levels were associated with CVD events and mortality [ 37 ]. Collectively, we believe that these variables may possess some pathological foundations that contribute to the development of CVD. Therefore, our model may serve as a useful tool in assessing the likelihood of having CVD.

Several limitations should be acknowledged. First, because this was a cross-sectional analysis of input features and prevalent CVD status recorded in EMR, the temporal order of causality could not be determined. Second, because the data came from a single institution, our models should be externally validated. In addition, we focused on variables that are often recorded in EMR; other major CVD risk factors such as BMI and family history of CVD were not incorporated in the analysis because they are not consistently recorded in EMR. However, the prediction accuracy as estimated by Bayesian theorem was deemed satisfactory, and thus the findings should not be severely affected.

In conclusion, this study demonstrates the application of an ML approach that integrates K-means clustering and Bayesian theorem with EMR data to develop an automated model for evaluating the likelihood of having CVD. Additional longitudinal investigations including more characteristics (e.g., comorbidities, medication use, and CVD events) across several institutions are needed to improve the model’s accuracy and facilitate its potential implementation in clinical contexts.

Data availability

The datasets used and analyzed during the current study are available from the corresponding author on reasonable request.

Ruan Y, Guo Y, Zheng Y, et al. Cardiovascular disease (CVD) and associated risk factors among older adults in six low-and middle-income countries: results from SAGE Wave 1. BMC Public Health. 2018;18(1):778. https://doi.org/10.1186/s12889-018-5653-9 .


Summary of China Cardiovascular Health and Diseases Report 2020. Chin Circulation J. 2021;36(06):521–45. https://doi.org/10.3969/j.issn.1000-3614.2021.06.001 .


Dimopoulos A C, Nikolaidou M, Caballero F F, et al. Machine learning methodologies versus cardiovascular risk scores in predicting disease risk. BMC Med Res Methodol. 2018;18(1):179. https://doi.org/10.1186/s12874-018-0644-1 .


Greenland P, Alpert J S, Beller G A, et al. 2010 ACCF/AHA guideline for assessment of cardiovascular risk in asymptomatic adults: a report of the American College of Cardiology Foundation/American Heart Association Task Force on Practice guidelines. Circulation. 2010;122(25):e584–636. https://doi.org/10.1016/j.jacc.2010.09.001 .

Piepoli M F, Hoes A W, Agewalls S, et al. 2016 European guidelines on cardiovascular disease prevention in clinical practice: the Sixth Joint Task Force of the European Society of Cardiology and Other Societies on Cardiovascular Disease Prevention in Clinical Practice (constituted by representatives of 10 societies and by invited experts) developed with the special contribution of the European Association for Cardiovascular Prevention & Rehabilitation (EACPR). Eur Heart J. 2016;37(29):2315–81. https://doi.org/10.1093/eurheartj/ehw106 .

Hippisley-Cox J, Coupland C, Vinogradova Y, et al. Predicting cardiovascular risk in England and Wales: prospective derivation and validation of QRISK2. BMJ. 2008;336(7659):1475–82. https://doi.org/10.1136/bmj.39609.449676.25 .

Hippisley-Cox J, Coupland C, Brindle P. Development and validation of QRISK3 risk prediction algorithms to estimate future risk of cardiovascular disease: prospective cohort study. BMJ. 2017;357:j2099. https://doi.org/10.1136/bmj.j2099 .

Shu S, Ren J. Clinical application of machine learning-based Artificial Intelligence in the diagnosis, prediction, and classification of Cardiovascular diseases. Circ J. 2021;85(9):1416–25. https://doi.org/10.1253/circj.CJ-20-1121 .

Article CAS PubMed Google Scholar

Trayanova N A, Popescu D M, SHADE JK. Machine learning in Arrhythmia and Electrophysiology. Circ Res. 2021;128(4):544–66. https://doi.org/10.1161/CIRCRESAHA.120.317872 .

Ordikhani M, Saniee Abadeh M, Prugger C, et al. An evolutionary machine learning algorithm for cardiovascular disease risk prediction. PLoS ONE. 2022;17(7):e0271723. https://doi.org/10.1371/journal.pone.0271723 .

Article CAS PubMed PubMed Central Google Scholar

Dalmaijer E S, Nord C L, Astle D E. BMC Bioinformatics. 2022;23(1):205. https://doi.org/10.1186/s12859-022-04675-1 . Statistical power for cluster analysis [J].

Cikes M, Sanchez-Martinez S, Claggett B, et al. Machine learning-based phenogrouping in heart failure to identify responders to cardiac resynchronization therapy. Eur J Heart Fail. 2019;21(1):74–85. https://doi.org/10.1002/ejhf.1333 .

Gautam N, Mueller J, Alqaisi O, Gandhi T, Malkawi A, Tarun T, Alturkmani HJ, Zulqarnain MA, Pontone G, Al’Aref SJ. Machine Learning in Cardiovascular Risk Prediction and Precision Preventive approaches. Curr Atheroscler Rep. 2023;25(12):1069–81. https://doi.org/10.1007/s11883-023-01174-3 .

Song H, Koh Y, Rhee T M, et al. Prediction of incident atherosclerotic cardiovascular disease with polygenic risk of metabolic disease: analysis of 3 prospective cohort studies in Korea. Atherosclerosis. 2022;348:16–24. https://doi.org/10.1016/j.atherosclerosis.2022.03.021 .

Klooster C C V, Bhatt D L, Steg P G, et al. Predicting 10-year risk of recurrent cardiovascular events and cardiovascular interventions in patients with established cardiovascular disease: results from UCC-SMART and REACH. Int J Cardiol. 2021;325:140–8. https://doi.org/10.1016/j.ijcard.2020.09.053 .

Lu P, Guo S, Zhang H, et al. Research on improved depth Belief Network-based prediction of Cardiovascular diseases. J Healthc Eng. 2018. https://doi.org/10.1155/2018/8954878 . 2018(8954878.

Li Y, Sperrin M, Ashcroft DM, Van Staa TP. Consistency of variety of machine learning and statistical models in predicting clinical risks of individual patients: longitudinal cohort study using cardiovascular disease as exemplar. BMJ. 2020;371:m3919. https://doi.org/10.1136/bmj.m3919 .

Tang AS, Woldemariam SR, Miramontes S, et al. Harnessing EHR data for health research. Nat Med. 2024;30:1847–55. https://doi.org/10.1038/s41591-024-03074-8 .

Ward A, Sarraju A, Chung S, Li J, Harrington R, Heidenreich P, Palaniappan L, Scheinker D, Rodriguez F. Machine learning and atherosclerotic cardiovascular disease risk prediction in a multi-ethnic population. NPJ Digit Med. 2020;3:125. https://doi.org/10.1038/s41746-020-00331-1 .

Qiu Y, Wang W, Wu C, et al. A risk factor attention-based model for cardiovascular disease prediction. BMC Bioinformatics. 2022;23(Suppl 8):425. https://doi.org/10.1186/s12859-022-04963-w .

Li Q, Campan A, Ren A, Eid WE. Automating and improving cardiovascular disease prediction using machine learning and EMR data features from a regional healthcare system. Int J Med Inf. 2022;163:104786.

Meng H, Ruan J, Yan Z, et al. New Progress in early diagnosis of atherosclerosis. Int J Mol Sci. 2022;23(16):8939. https://doi.org/10.3390/ijms23168939 .

Francula-Zaninovic S, Nola I A. Management of Measurable Variable Cardiovascular Disease’ risk factors. Curr Cardiol Rev. 2018;14(3):153–63. https://doi.org/10.2174/1573403X14666180222102312 .

Ringnér M. What is principal component analysis? Nat Biotechnol. 2008;26:303–4. https://doi.org/10.1038/nbt0308-303 .

Ding C, He X. K-means Clustering via Principal Component Analysis. Proceedings of the 21 st International Conference on Machine Learning, Banff, Canada, 2004.

Frades I, Matthiesen R. Overview on techniques in cluster analysis. Methods Mol Biol. 2010;593:81–107. https://doi.org/10.1007/978-1-60327-194-3_5 .

Inohara T, Shrader P, Pieper K, et al. Association of of Atrial Fibrillation Clinical Phenotypes with treatment patterns and outcomes: a Multicenter Registry Study. JAMA Cardiol. 2018;3(1):54–63. https://doi.org/10.1001/jamacardio.2017.4665 .

Harada D, Asanoi H, Noto T, et al. Different pathophysiology and outcomes of heart failure with preserved ejection Fraction Stratified by K-Means clustering. Front Cardiovasc Med. 2020;7(607760). https://doi.org/10.3389/fcvm.2020.607760 .

Petruzzo M, Reia A, Maniscalco G T, et al. The Framingham cardiovascular risk score and 5-year progression of multiple sclerosis. Eur J Neurol. 2021;28(3):893–900. https://doi.org/10.1111/ene.14608 .

Brunström M, Andersson J, Eliasson M, et al. [SCORE2 - an updated model for cardiovascular risk prediction]. Lakartidningen. 2021;118:21164.

PubMed Google Scholar

Alaa A M, Bolton T, Di Angelantonio E, et al. Cardiovascular disease risk prediction using automated machine learning: a prospective study of 423,604 UK Biobank participants. PLoS ONE. 2019;14(5):e0213653. https://doi.org/10.1371/journal.pone.0213653 .

Virani Ss, Newby L K, Arnold S V, et al. 2023 AHA/ACC/ACCP/ASPC/NLA/PCNA Guideline for the management of patients with chronic coronary disease: a report of the American Heart Association/American College of Cardiology Joint Committee on Clinical Practice guidelines. Circulation. 2023. https://doi.org/10.1161/CIR.0000000000001168 .

Knuuti J, Wijns W. 2019 ESC guidelines for the diagnosis and management of chronic coronary syndromes. Eur Heart J. 2020;41(3):407–77. https://doi.org/10.1093/eurheartj/ehz425 .

LIndahl B. Acute coronary syndrome - the present and future role of biomarkers. Clin Chem Lab Med. 2013;51(9):1699–706. https://doi.org/10.1515/cclm-2013-0074 .

Canaud B, Rodriguez A. Whole-blood viscosity increases significantly in small arteries and capillaries in hemodiafiltration. Does acute hemorheological change trigger cardiovascular risk events in hemodialysis patient?. Hemodial Int. 2010;14(4):433–40. https://doi.org/10.1111/j.1542-4758.2010.00496.x .

Zhou B, Perel P, Mensah G A, et al. Global epidemiology, health burden and effective interventions for elevated blood pressure and hypertension. Nat Rev Cardiol. 2021;18(11):785–802. https://doi.org/10.1038/s41569-021-00559-8 .

Liu S, Zhao D, Wang M, et al. Association of Serum Potassium Levels with Mortality and Cardiovascular events: findings from the Chinese multi-provincial cohort study. J Gen Intern Med. 2022;37(10):2446–53. https://doi.org/10.1007/s11606-021-07111-x .

Download references

Acknowledgements

Not applicable.

This work was supported by the Shanghai Aging and Maternal and Child Health Research Project (No. 2020YJZX0141) and the Clinical Special Project of the Shanghai Municipal Health Commission, China (No. 202040083). The funding bodies played no role in the design of the study; the collection, analysis, and interpretation of data; or the writing of the manuscript.

Author information

Ying Hu and Hai Yan are co-first authors.

Authors and Affiliations

Department of Cardiology, National Clinical Research Center for Interventional Medicine, Shanghai Institute of Cardiovascular Diseases, Zhongshan Hospital, Fudan University, Shanghai, 200032, China

Ying Hu, Chunyu Zhang & Hong Jiang

Shanghai Engineering Research Center of AI Technology for Cardiopulmonary Diseases, Zhongshan Hospital, Fudan University, Shanghai, 200032, China

Ying Hu, Ming Liu & Hong Jiang

Department of General Surgery, Center for Bariatric and Hernia Surgery, Huashan Hospital, Fudan University, Shanghai, 200040, China

Shanghai Xuhui Central Hospital, Zhongshan-Xuhui Hospital, Fudan University, Shanghai, 200031, China

Jing Gao, Lianhong Xie & Lili Wei

Department of Epidemiology, School of Public Health, and Key Laboratory of Public Health Safety of Ministry of Education, Fudan University, Shanghai, 200032, China

Yinging Ding

Department of Health Management Center, Zhongshan Hospital, Fudan University, Shanghai, 200032, China

Contributions

Ying Hu and Hong Jiang contributed to the study conception and design. Material preparation, data collection, and analysis were performed by Ming Liu, Jing Gao, and Lianhong Xie. Data curation was managed by Lili Wei and Chunyu Zhang. The first draft of the manuscript was written by Ying Hu and Hai Yan. Review and editing were completed by Hong Jiang and Yinging Ding. All authors have read and agreed to the published version of the manuscript.

Corresponding authors

Correspondence to Yinging Ding or Hong Jiang.

Ethics declarations

Ethics approval and consent to participate.

This study was performed in accordance with the guidelines of the Declaration of Helsinki. The study design was approved by the Ethics Committee of Shanghai Xuhui Central Hospital (approval no. 2023033), and the institutional review board waived the requirement for informed consent.

Consent for publication

All authors have approved the manuscript for publication.

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article.

Hu, Y., Yan, H., Liu, M. et al. Detecting cardiovascular diseases using unsupervised machine learning clustering based on electronic medical records. BMC Med Res Methodol 24, 309 (2024). https://doi.org/10.1186/s12874-024-02422-z

Received: 18 October 2023

Accepted: 25 November 2024

Published: 19 December 2024

DOI: https://doi.org/10.1186/s12874-024-02422-z

Machine learning
K-means clustering
Bayesian theorem
Cardiovascular diseases

BMC Medical Research Methodology

ISSN: 1471-2288

