How to Write a Research Paper

Writing a research paper is a bit more difficult than writing a standard high school essay. You need to cite sources, use academic data and show scientific examples. Before beginning, you’ll need guidelines for how to write a research paper.

Start the Research Process

Before you begin writing the research paper, you must do your research. It is important that you understand the subject matter, formulate the ideas of your paper, create your thesis statement and learn how to speak about your given topic in an authoritative manner. You’ll be looking through online databases, encyclopedias, almanacs, periodicals, books, newspapers, government publications, reports, guides and scholarly resources. Take notes as you discover new information about your given topic. Also keep track of the references you use so you can build your bibliography later and cite your resources.

Develop Your Thesis Statement

When organizing your research paper, the thesis statement is where you explain to your readers what they can expect, present your claims, answer any questions that you were asked or explain your interpretation of the subject matter you’re researching. Therefore, the thesis statement must be strong and easy to understand. Your thesis statement must also be precise. It should answer the question you were assigned, and there should be an opportunity for your position to be opposed or disputed. The body of your manuscript should support your thesis, and it should be more than a generic fact.

Create an Outline

Many professors require outlines during the research paper writing process. You’ll find that they want outlines set up with a title page, abstract, introduction, research paper body and reference section. The title page is typically made up of the student’s name, the name of the college, the name of the class and the date of the paper. The abstract is a summary of the paper. An introduction typically consists of one or two pages and comments on the subject matter of the research paper. In the body of the research paper, you’ll be breaking it down into materials and methods, results and discussions. Your references are in your bibliography. Use a research paper example to help you with your outline if necessary.

Organize Your Notes

When writing your first draft, you’ll first need to organize your notes. During this process, you’ll decide which references to put in your bibliography and which will work best as in-text citations. You’ll keep working on this as you develop your working drafts and look at more research paper examples to help guide you through the process.

Write Your Final Draft

After you’ve written a first and second draft and received corrections from your professor, it’s time to write your final copy. By now, you should have seen an example of a research paper layout and know how to put your paper together. You’ll have your title page, abstract, introduction, thesis statement, in-text citations, footnotes and bibliography complete. Be sure to check with your professor to confirm whether you’re writing in APA style or using another style guide.


applications of data mining research papers

Computational Intelligence and Neuroscience

Cognitive Computing Paradigms for Medical Big Data Processing and Its Trends

Research and Application of the Data Mining Technology in Economic Intelligence System

With the rapid development of the modern economy, information has become particularly important in the economic field because it determines the decision-making of enterprises. How to quickly dig out information that is beneficial to the enterprise has therefore become a crucial issue. This work applies data mining technology to economic intelligence systems and obtains the data object model of such systems through the integration of information. On this basis, this article analyzes the interrelationships between the objects and uses data mining methods to mine them. Establishing an economic intelligence system involves not only building mathematical models of the economic system but also researching the algorithms applied to them. Applying an algorithm to quickly and accurately extract the required economic intelligence information from the potential information in the database, or to find the best solution, involves the use of association rules and classification prediction methods. Data mining algorithms can thus be used to study applications of economic intelligence systems. This paper designs and develops an economic intelligence information database and implements the economic intelligence system on this basis. Finally, the algorithm is tested on a dataset, and the results show that its classification accuracy is 2.64% higher than that of the ID3 algorithm.

1. Introduction

Following the Internet, data mining has become a new research hotspot, especially high-dimensional, large-scale, distributed data mining, which has broad prospects and very large potential economic value. Within this field, classification prediction technology will assist future smart business activities and provide an important reference for decisions.

Regarding data mining technology and economic intelligence systems, scholars at home and abroad have provided many references. Li and Long studied imaging examination and quantitative detection analysis of gastrointestinal diseases based on data mining [ 1 ]. Zuo analyzed the characteristics of network viruses and designed a computer data mining module. He combined data mining technology with dynamic behavior interception technology to mine hidden information and determine whether a virus is present, and applied this method to network Trojan virus detection [ 2 ]. Buczak and Guven provide a short tutorial description of machine learning (ML) and data mining (DM) methods for network analysis [ 3 ]. Xu et al. look at privacy issues related to data mining from a broader perspective and study various methods that help protect sensitive information. They reviewed the most advanced methods and put forward some preliminary ideas for future research directions [ 4 ]. Yan and Zheng found that even after accounting for data mining, many fundamental signals are important predictors of cross-sectional stock returns. Their method is general, and they also applied it to anomalies based on past returns [ 5 ]. Emoto et al. use data mining analysis of terminal restriction fragment length polymorphism (T-RFLP) to elucidate the gut microbiota profile of patients with coronary artery disease [ 6 ]. Hong et al. proposed a new method to construct a flood susceptibility map of Poyang County, Jiangxi Province, China, by implementing fuzzy weight of evidence and data mining methods [ 7 ]. However, the data in these studies are not comprehensive and the results lack a sufficient basis; thus, they have not gained wide recognition.

The innovation of this research on the realization of economic intelligence systems based on data mining technology lies in combining two data mining methods, association rules and clustering, to mine data information. This combination can not only mine the potential information in the data well but also improve the efficiency of data mining. This research also uses the least mean square algorithm for optimization, which improves the experimental results to a certain extent.

2. Data Mining Technology and Economic Intelligence Systems

2.1. Data Warehouse

2.1.1. The Concept of Data Warehouse

A data warehouse is a subject-oriented, integrated, nonvolatile, and time-variant collection of data used to support the analysis, decision-making, and development of an enterprise or organization. It combines data from different sources. After integration, the data are organized by theme and include historical data, and data stored in the warehouse are usually no longer edited [ 8 , 9 ].

The theme is the abstract concept of integrating, categorizing, analyzing, and using information in high-level organizational information systems [ 10 ]. Logically speaking, this is consistent with the analysis goals related to the subject’s macro-analysis field. Themes benefit from a series of tables in the database. Themes can be stored in a multidimensional database. The division of themes must ensure the independence of each theme. Data warehouse integration is the process of extracting, filtering, cleaning, and merging distributed data to meet the needs of decision analysis, to integrate the data in the database [ 11 ].

2.1.2. The Architecture of the Data Warehouse

Simply put, a data warehouse consists of operable external data sources, one or more databases, and one or more data analysis tools. Therefore, its realization process should include three major steps: collecting various source data, storing and managing data, and obtaining the required information. The architecture of the data warehouse is shown in Figure 1 .

The data processing flow of the data warehouse is as follows:
(1) Take out the data needed for decision-making from any business processing system source [ 12 ].
(2) Clean up and integrate the data sources.
(3) Load and update the data warehouse according to the theme [ 13 ]; the loading step also loads metadata. The basic framework and the description of the metadata management system are shown in Figure 2 .
(4) According to the needs of the decision support system, organize the data and information in various forms [ 14 ].
(5) Provide decision-making, data analysis, processing, and data mining capabilities.
(6) Express the results in flexible and diverse ways.
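
The first three steps of this flow (extract, clean and integrate, load by theme) can be sketched in a few lines. This is only an illustration under stated assumptions: the record fields (`region`, `year`, `gdp`), the two source lists, and the function names are hypothetical and are not taken from the system described in this paper.

```python
from collections import defaultdict

# Hypothetical source records from two business systems (illustrative only).
SOURCE_A = [
    {"region": "East", "year": 2019, "gdp": "120.5"},
    {"region": "east ", "year": 2020, "gdp": "131.0"},
    {"region": "West", "year": 2020, "gdp": None},      # missing value
]
SOURCE_B = [
    {"region": "West", "year": 2019, "gdp": "98.2"},
    {"region": "East", "year": 2019, "gdp": "120.5"},   # duplicate of a SOURCE_A row
]


def extract(*sources):
    """Step 1: pull the rows needed for decision-making from every source."""
    for source in sources:
        yield from source


def clean(rows):
    """Step 2: normalize names, drop incomplete rows, and remove duplicates."""
    seen = set()
    for row in rows:
        if row["gdp"] is None:
            continue
        key = (row["region"].strip().title(), row["year"])
        if key in seen:
            continue
        seen.add(key)
        yield {"region": key[0], "year": key[1], "gdp": float(row["gdp"])}


def load(rows):
    """Step 3: store the cleaned rows grouped by theme (here, by region)."""
    warehouse = defaultdict(list)
    for row in rows:
        warehouse[row["region"]].append((row["year"], row["gdp"]))
    return dict(warehouse)


warehouse = load(clean(extract(SOURCE_A, SOURCE_B)))
```

With the sample rows above, `warehouse` holds two East entries and one West entry; the row with a missing value and the duplicate row are dropped during cleaning.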

In the field of data warehouse, metadata is divided into technical metadata and business metadata according to usage. First, metadata can provide user-based information. For example, metadata that records business description information of data items can help users use data. Secondly, metadata can support the management and maintenance of data in the system. For example, metadata about the storage method of data items can support the system to access data in the most effective way.

2.1.3. OLAP Technology

Online analytical processing (OLAP) was proposed in 1993 by E. F. Codd, the inventor of the relational database model. It introduced the concept of online analysis of data and is effective for multidimensional information processing, and it has developed rapidly in recent years [ 15 ]. OLAP enables analysts, managers, or executives to access information quickly, consistently, and interactively from multiple perspectives. This information is transformed from the original data, can be readily understood by users, and reflects the actual situation of the enterprise. OLAP is thus a type of software technology for achieving a deeper understanding of data. The technical framework diagram of the OLAP engine is shown in Figure 3 .

OLAP can be divided into categories according to different data composition methods: relational OLAP, multidimensional OLAP, and hybrid OLAP [ 16 , 17 ].

The following briefly introduces the core organization models of the OLAP technology:
Relational online analytical processing (ROLAP, Relational OLAP): This method is mainly used for the processing and analysis of relational databases, and the data are usually organized in a snowflake schema [ 18 ].
Multidimensional online analytical processing (MOLAP, Multidimensional OLAP): This method relies on indexing technology; it first preprocesses the relational data, organizes it into multidimensional form, analyzes it and builds indexes, and then performs retrieval [ 19 ].
Front-end online analysis (Desktop OLAP): This data analysis method downloads part of the data from the server to the client and can reorganize the data on the client. It provides flexible and simple information processing [ 20 ].
The difference between online analytical processing and online transaction processing is shown in Table 1 .
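
To make the idea of viewing data "from multiple perspectives" concrete, the following minimal sketch performs two typical OLAP operations, roll-up and slice, on a tiny in-memory fact table. The fact rows, the dimension names, and the helper functions `roll_up` and `slice_cube` are hypothetical and serve only as an illustration; a real OLAP engine would precompute and index such aggregates.

```python
from collections import defaultdict

# Hypothetical fact rows: (region, year, product, sales). Illustrative data only.
FACTS = [
    ("East", 2019, "steel", 10.0),
    ("East", 2020, "steel", 12.0),
    ("East", 2020, "grain", 5.0),
    ("West", 2019, "grain", 7.0),
    ("West", 2020, "steel", 8.0),
]
DIMENSIONS = ("region", "year", "product")


def roll_up(facts, keep):
    """Aggregate the sales measure over every dimension not listed in `keep`."""
    idx = [DIMENSIONS.index(name) for name in keep]
    totals = defaultdict(float)
    for row in facts:
        totals[tuple(row[i] for i in idx)] += row[3]
    return dict(totals)


def slice_cube(facts, dimension, value):
    """Keep only the rows where one dimension has a fixed value."""
    i = DIMENSIONS.index(dimension)
    return [row for row in facts if row[i] == value]


by_region = roll_up(FACTS, keep=("region",))
steel_by_year = roll_up(slice_cube(FACTS, "product", "steel"), keep=("year",))
```

Here `by_region` totals sales per region, and `steel_by_year` first slices the cube on the product dimension and then rolls the remaining rows up to the year level.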

2.2. Technical Basis of Data Mining

2.2.1. Concepts and Characteristics of Data Mining

Data mining is a series of methods and techniques used to extract knowledge from data. The extraction process is very complicated, mainly due to the large-scale, irregular, and noisy characteristics of the data [ 21 ].

The general process of knowledge discovery is as follows: (1) Identify and gradually understand the application areas. (2) Select the dataset to be studied. (3) Data integration. (4) Data cleaning, deduplication, and error removal. (5) Develop models and construct hypotheses. (6) Data mining. (7) Interpret and express the results and display them in a human-readable way. (8) Inspect the results. (9) Manage the discovered knowledge.

Data mining technology has the following characteristics: (1) The amount of data is often very large [ 22 , 23 ]. Data mining must be built on massive data, as small-scale data cannot reflect statistical laws. Knowledge mined from small-scale data is of little value because it is not enough to reflect the actual situation in real life and can be misleading. In the face of massive amounts of data, it is particularly important to reduce the time complexity of the related algorithms, so that genuinely useful knowledge and information can be mined within a reasonable time. (2) Potentially useful [ 24 ]: The result of data mining should be a previously undiscovered rule or pattern that can provide guidance for life and production. Mining results such as “many people hold umbrellas on rainy days” are meaningless. (3) Independence and indivisibility: Data mining is an extremely complex process that cannot be completed in a few simple steps, and its steps must cooperate with each other because none of them can complete the work independently. On the one hand, specific algorithms must be selected to make mining efficient; on the other hand, the operators need solid business skills so that they can analyze and process data according to actual business needs, interpret the mining results reasonably, and apply them correctly in future work.

2.2.2. Methods of Data Mining

Data mining methods can be divided into descriptive analysis and predictive analysis according to their realized functions. In the final analysis, descriptive analysis is a useful preparation for predictive analysis. It fully reflects the overall distribution of the data and can show the inherent characteristics of the relevant data. Correspondingly, predictive analysis is based on descriptive analysis and treats its analysis results from a developmental perspective, thereby generating a prediction of future data. It gives the final decision-maker a data-level reminder. Predictive analysis mainly refers to prediction based on classification or based on statistical regression problems. At the same time, the main representatives of descriptive analysis are association rule mining and cluster analysis methods. Several common methods of classification, clustering, and association rules are introduced.

(1) Classification . Classification is the process of categorizing data into different categories based on a well-defined conceptual description. The naive Bayes method of statistics and the decision tree learning method in machine learning are the different implementations of classification techniques.
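
As a rough illustration of the naive Bayes approach mentioned above, the sketch below trains a categorical naive Bayes classifier with Laplace smoothing in plain Python. The feature values (`growth`, `inflation` levels), the labels, and the function names are hypothetical; this is a minimal textbook-style sketch, not the classifier used in this paper.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(rows, labels):
    """Count class frequencies and per-class feature-value frequencies."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)   # (feature index, label) -> value counts
    values_seen = defaultdict(set)        # feature index -> distinct values
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(i, label)][value] += 1
            values_seen[i].add(value)
    return class_counts, value_counts, values_seen


def predict(model, row):
    """Return the label with the highest log posterior (Laplace smoothing)."""
    class_counts, value_counts, values_seen = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in class_counts.items():
        score = math.log(count / total)
        for i, value in enumerate(row):
            numerator = value_counts[(i, label)][value] + 1
            denominator = count + len(values_seen[i])
            score += math.log(numerator / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical training data: (growth level, inflation level) -> decision.
ROWS = [("high", "low"), ("high", "low"), ("high", "high"),
        ("low", "high"), ("low", "high"), ("low", "low")]
LABELS = ["invest", "invest", "invest", "hold", "hold", "hold"]

model = train_naive_bayes(ROWS, LABELS)
decision = predict(model, ("high", "low"))
```

The "naive" part is the assumption that features are conditionally independent given the class, which is why the per-feature log probabilities are simply added.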

(2) Association Rules . The core purpose of association rule analysis is to discover the interrelated and interdependent relationships that exist in the data. Association rule mining is a data mining method derived from database theory. Specifically, in relational databases, some data items often appear together, which is called a pattern. When a pattern appears frequently in the database, a specific association relationship is considered to exist, which is called an association rule. Current research generally uses support and confidence as the measurement criteria. Scholars have further extended the parameters of association rules from different perspectives, for example by incorporating interest level and other indicators; as a result, many new methods and applications have been proposed, and new developments in association rule mining have appeared.

When specific patterns in the dataset meet the support and confidence thresholds, they become the association rules contained in it. First, the mining results need to reflect the closeness of the connections between the data. Data with weak connections should not appear in the resulting association rules. Therefore, a reasonable minimum support threshold needs to be set. At the same time, we also pay attention to the credibility of the association rules and set a reasonable requirement for it, which is reflected in the minimum confidence threshold. For example, assuming that the minimum support rate is 50% and the minimum confidence rate is 50%, then . Details are shown in Table 2 .
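
The two thresholds can be illustrated with a short sketch that computes support and confidence and keeps only the single-item rules that pass both. The transactions below are hypothetical and are not the contents of Table 2; the function names are likewise illustrative.

```python
from itertools import combinations

# Hypothetical transactions (illustrative only).
TRANSACTIONS = [
    {"A", "B", "C"},
    {"A", "B"},
    {"A", "C"},
    {"B", "C"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions):
    """support(antecedent and consequent together) / support(antecedent)."""
    joint = support(antecedent | consequent, transactions)
    return joint / support(antecedent, transactions)


def mine_pair_rules(transactions, min_support=0.5, min_confidence=0.5):
    """Return rules x -> y between single items that pass both thresholds."""
    items = sorted(set().union(*transactions))
    rules = []
    for x, y in combinations(items, 2):
        for a, c in ((x, y), (y, x)):
            s = support({a, c}, transactions)
            if s < min_support:
                continue   # not frequent enough to be considered
            conf = confidence({a}, {c}, transactions)
            if conf >= min_confidence:
                rules.append((a, c, s, conf))
    return rules


rules = mine_pair_rules(TRANSACTIONS)
```

In this sample, every pair of items appears together in half of the transactions, so each pair has a support of 0.5 and each rule has a confidence of 2/3, and all six directed rules pass the 50% thresholds. Raising `min_support` above 0.5 would remove all of them.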

(3) Clustering . Compared with association rules, cluster analysis is another common method of data mining. It refers to a certain similarity estimation of the data to be analyzed in an unsupervised environment, and the data with higher similarity is combined, which is called clustering. The clustered result set has the characteristics of similarity of the same class and differences between classes, which is very suitable for grasping the distribution of data and its association.
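
As one concrete example of clustering, the sketch below implements plain k-means on two-dimensional points. The choice of k-means, the sample points, and the starting centroids are assumptions made for illustration; the paper does not state which clustering algorithm it uses.

```python
def kmeans(points, centroids, iterations=10):
    """Plain k-means on 2-D points, starting from the given centroids."""
    centroids = list(centroids)
    clusters = []
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for x, y in points:
            distances = [(x - cx) ** 2 + (y - cy) ** 2 for cx, cy in centroids]
            clusters[distances.index(min(distances))].append((x, y))
        # Update step: move each centroid to the mean of its cluster.
        updated = []
        for cluster, old in zip(clusters, centroids):
            if cluster:
                updated.append((sum(p[0] for p in cluster) / len(cluster),
                                sum(p[1] for p in cluster) / len(cluster)))
            else:
                updated.append(old)   # keep an empty cluster's centroid in place
        if updated == centroids:
            break                     # assignments are stable
        centroids = updated
    return centroids, clusters


# Hypothetical points forming two visibly separate groups.
POINTS = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centroids, clusters = kmeans(POINTS, centroids=[(0.0, 0.0), (10.0, 10.0)])
```

No labels are supplied at any point: the grouping emerges only from the distance measure, which is what distinguishes clustering from classification.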

(4) Sequence . Sequence pattern analysis is similar to association analysis, and its purpose is to dig out the connections between data, but the focus of sequence pattern analysis is to analyze the causal relationship between the data before and after.
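
The difference from association analysis is that order matters. A minimal sketch, assuming hypothetical event names and helper functions, computes the support of an ordered pattern as the fraction of sequences in which its items occur in that order. Note that such counts capture temporal order only; they do not by themselves establish causality.

```python
def contains_in_order(sequence, pattern):
    """True if the items of `pattern` occur in `sequence` in the same order."""
    it = iter(sequence)
    # `item in it` advances the iterator, so later items must come later.
    return all(item in it for item in pattern)


def sequence_support(sequences, pattern):
    """Fraction of sequences that contain `pattern` as an ordered subsequence."""
    return sum(contains_in_order(s, pattern) for s in sequences) / len(sequences)


# Hypothetical event sequences (illustrative only).
SEQUENCES = [
    ["rate_cut", "credit_up", "invest_up"],
    ["credit_up", "rate_cut"],
    ["rate_cut", "invest_up"],
    ["invest_up", "rate_cut"],
]

forward = sequence_support(SEQUENCES, ["rate_cut", "invest_up"])
backward = sequence_support(SEQUENCES, ["invest_up", "rate_cut"])
```

An unordered association measure would treat both patterns identically, whereas here the forward pattern is supported by two sequences and the reversed pattern by only one.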

2.2.3. The Architecture of Data Mining

The core technology of DM is artificial intelligence, machine learning, statistics, etc., but a DM system is not a simple combination of multiple technologies, but a complete whole. It also needs the support of other assistive technologies to complete a series of tasks of data collection, preprocessing, data analysis, and result presentation, and finally present the analysis results to the user. According to the functions, the entire data mining system can be roughly divided into a three-level structure, as shown in Figure 4 .

2.2.4. The Steps of Data Mining

The four steps of data mining can be summarized as follows: (1) data selection; (2) data transformation; (3) data mining; and (4) interpretation of the results.

The completion of the data selection process obtains the specific partial data needed for data mining. At this time, it is necessary to further format the selected data to provide an identifiable data input source for the next step of data mining.

After the completion of the two tasks of data selection and data format conversion, the next step is to use various data mining algorithms and integrated tools for the mining process. In the mining process, data warehouse and data mining algorithms are often used in combination. On the one hand, it reduces the calculation of specific statistical values; on the other hand, it reduces the consumption of extra time and space resources generated by data exchange. Although this is a commonly used method, it does not actually restrict the algorithm from using the original data under appropriate conditions. In most cases, this approach is essential. The main advantage of using a data warehouse is that most of the data has been integrated into a suitable format, making it easier for data mining tools to extract high-quality information.

A reasonable interpretation of the mining results is the final step in the process. The preceding steps yield the mining results after analysis and processing. In this step, these results need to be sent to the final decision-maker through the decision support system (DSS). Interpreting the mining results requires not only a reasonable interpretation of the results themselves but also a deeper filtering of the data before it is sent to the decision-making system. If the mining results are unexplainable or unsatisfactory, the entire mining process needs to be repeated until useful results are produced.

In summary, data mining is not only an independent and indivisible process, but the realization of the process is also extremely complicated. Before data can be provided to data mining tools, many steps need to be performed correctly. In addition, we cannot guarantee that the existing data mining tools will not produce meaningless results during the work process. Here, to a large extent, data mining is not a direct analysis operation on the original data, but is based on a data warehouse, and the data warehouse provides a direct source of input data to the data mining tool. At the same time, the DSS tool will make the next step of processing the mining results. In this way, data mining will be combined with DSS to provide a final solution for enterprises to implement data mining strategies. In general, the developers of data mining tools are also the people who preprocess the data. Therefore, a well-designed data mining tool will integrate related tools for data integration and format conversion. It is worth noting that although data mining uses data warehouses to provide processed data as input, in most cases, it is not necessary. Data can be downloaded directly from the operation file to a general file, which contains data that can be used for data mining and analysis.

Data mining technology turns the economic statistics accumulated over a long period of time into the form required by data users. Many characteristics of data mining technology come into play in practice, and they should be exploited so that economic statistics can play their role to the greatest extent and serve the needs of managers.

2.3. Economic Intelligence Systems

Economic data mining is the introduction of data mining methods and techniques in computer technology into economic research, and the integration of data mining and econometric methods. It has interdisciplinary characteristics. The problems caused by information technology often push the technology forward. The large amount of historical data accumulated by the database system provides data support and technical background for data mining and has a wide range of applications.

First, it creates an economic data warehouse and an economic data mining model. It includes a collection of economic indicator data, the establishment of a data warehouse, the cleaning, conversion, loading, and drilling of data, and the selection of economic data mining models. Then, it conducts data analysis through the model, performs multidimensional OLAP analysis and data display, and performs methods such as association, clustering, and abnormal data analysis. By observing measurement results, analyzing economic phenomena, predicting possible situations, and discovering knowledge, it provides a basis for scientific decision-making and obtains valuable information from it. Finally, it is applied to management decision-making and becomes an auxiliary tool for related departments or enterprises. The basic framework of OLAP modeling is shown in Figure 5 .

In the application of economic data mining technology, Enterprise Manager can be used for data warehouse creation, data loading, and drilling. Multidimensional datasets are updated with Analysis Manager, which is used to establish fact tables, dimensions, and granularity. Through the creation of multidimensional datasets, OLAP is realized, and economic data mining models are selected to conduct data mining. The basic framework and description of the OLAP analysis service are shown in Figure 6 .

The development environment of the economic intelligence system is shown in Table 3 .

3. The Least Mean Square Algorithm

The weight vector is defined using the least mean square algorithm (LMS) learning rule:

The formula of the cost function P is as follows:

P takes the partial derivative of each element of , namely,

Applying the chain rule through formula ( 6 ), we get

Differentiate on both sides of formula ( 5 ), and we have

At the same time, formula ( 4 ) is differentiated on both sides of , and there are

At the same time, formula ( 3 ) is differentiated on both sides of , and there are

Finally, formula ( 2 ) is differentiated on both sides of , and there are

Incorporating formula ( 11 ) into formula ( 7 ), we get

Therefore, when n is presented in the network, the LMS learning rule can be written as

Formulate the error signal term for the output layer:

Therefore, formula ( 13 ) can be written as

For mode n , formula ( 13 ) can be written as the form of each vector component of the weight vector :

The error signal is equal to the error:

The relationship between the gradient vector, the cost function, and the dynamic vector factor :

Because , formula ( 18 ) can be transformed into
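
As a rough illustration of the LMS learning rule discussed in this section, the following sketch uses the standard textbook (Widrow-Hoff) form: the error is e = d - y and each weight is updated as w <- w + eta * e * x. This form, the variable names, the learning rate, and the sample data are assumptions made for illustration; they may differ in notation and detail from the numbered formulas referenced above.

```python
def lms_train(samples, targets, learning_rate=0.05, epochs=500):
    """Fit a linear unit y = w . x with the Widrow-Hoff (LMS) update."""
    weights = [0.0] * len(samples[0])
    for _ in range(epochs):
        for x, d in zip(samples, targets):
            y = sum(w * xi for w, xi in zip(weights, x))
            error = d - y                                # e(n) = d(n) - y(n)
            weights = [w + learning_rate * error * xi    # w <- w + eta * e * x
                       for w, xi in zip(weights, x)]
    return weights


# Hypothetical noise-free data generated by d = 2*x1 - 1*x2 + 0.5.
# The last input component is a constant 1 that acts as the bias term.
SAMPLES = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0), (0.0, 1.0, 1.0),
           (1.0, 1.0, 1.0), (2.0, 1.0, 1.0)]
TARGETS = [0.5, 2.5, -0.5, 1.5, 3.5]

weights = lms_train(SAMPLES, TARGETS)
```

Each update moves the weights a small step against the gradient of the squared error for the current sample, so on this noise-free data the weights approach the generating coefficients (2, -1, 0.5). A learning rate that is too large makes the updates diverge, which is why a small value is assumed here.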

Compare the non-incremental and incremental cases, respectively, and test the performance of the incremental decision tree algorithm, the ID3 algorithm, and the Bayesian algorithm. Using this dataset, non-incremental learning is performed by the incremental decision tree algorithm, the ID3 algorithm, and the Bayesian algorithm. The time-consuming situation of the three algorithms for non-incremental learning is shown in Figure 7 .


It can be seen from Figure 7 that the incremental decision tree algorithm is generally more time-consuming in terms of learning time, which is also an inevitable result. The accuracy of the three algorithms for non-incremental learning is shown in Figure 8 .


It can be concluded from Figure 8 that the classification accuracy of the incremental decision tree algorithm is 3.75% higher than that of the ID3 algorithm. Compared with the Bayesian algorithm, this algorithm improves the classification accuracy by 8.64%. Therefore, the classification accuracy of the incremental decision tree algorithm is better than that of the other two algorithms.

Using this dataset, incremental learning is performed by the incremental decision tree algorithm and the ID3 algorithm, respectively. The time-consuming situation of incremental learning for the two algorithms is shown in Figure 9 .


The incremental decision tree algorithm and the ID3 algorithm perform incremental learning, and the accuracy of the two algorithms for incremental learning is shown in Figure 10 .


It can be concluded from Figure 10 that the classification accuracy of the algorithm in this paper is 2.64% higher than that of the ID3 algorithm. On this basis, it can be concluded that the actual classification accuracy of the incremental decision tree algorithm in this paper is better than that of the Bayesian algorithm and the ID3 algorithm, and that it can meet the requirements of incremental learning. The disadvantage is the extra time it requires, which is understandable. In general, it addresses the inadequacy of the decision tree algorithm for incremental learning and achieves the expected goal.
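
For readers unfamiliar with the ID3 baseline used in these comparisons, the sketch below shows its core attribute-selection step: ID3 splits on the attribute with the largest information gain, that is, the largest reduction in label entropy. The sample rows and function names are hypothetical, and the sketch covers only this selection step, not the incremental decision tree algorithm evaluated in this paper.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())


def information_gain(rows, labels, attribute):
    """Entropy reduction obtained by splitting on one attribute index."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attribute], []).append(label)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def best_attribute(rows, labels):
    """ID3 picks the attribute with the largest information gain."""
    return max(range(len(rows[0])),
               key=lambda a: information_gain(rows, labels, a))


# Hypothetical data: attribute 0 determines the label, attribute 1 does not.
ROWS = [("high", "low"), ("high", "high"), ("low", "low"), ("low", "high")]
LABELS = ["yes", "yes", "no", "no"]

chosen = best_attribute(ROWS, LABELS)
```

Because the classic algorithm computes these gains over the whole training set, newly arriving data normally requires rebuilding the tree, which is the limitation that incremental variants aim to address.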

4. Discussion

Cluster analysis is an unsupervised learning method that groups data into clusters based on similarity. It differs from data classification. Classification determines the boundary of a specific category based on existing expert knowledge and then classifies the data according to this boundary. Cluster analysis does not know the definition of the categories in advance. It estimates the degree of similarity between the data based on the similarity measure defined by the clustering algorithm, finds the groups of data with higher similarity, and aggregates them into clusters, i.e., produces the desired clustering results. To obtain useful knowledge from a clustering algorithm, the operator needs an in-depth understanding of the data being analyzed and of the related domain knowledge in order to evaluate the quality of the clustering results more accurately.

Cluster analysis can play its role in clustering at any step in the complete knowledge discovery process and get the corresponding processing results. For example, in the process of data preprocessing, the results of the preprocessing can be easily obtained for data with a relatively simple structure and stored in the data warehouse. For data with a relatively complex structure and relatively closely related data, it is difficult to obtain the expected results through simple analysis and processing, and hence cluster analysis can be used. It obtains the association relationship between the data to grasp the data as a whole and obtain a better preprocessing effect.

Association rules can be used not only to analyze the association patterns between products but also to recommend products to customers and thereby improve cross-selling. The discovery of association rules can be done offline. As the number of commodities increases, the number of rules increases exponentially. However, efficiency can be improved through the decision-maker's choice of support and confidence thresholds and the selection of patterns of interest and algorithms.
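
The exponential growth can be checked directly. The sketch below counts all candidate rules X -> Y with X and Y non-empty and disjoint by brute-force enumeration and compares the result with the known closed form 3^d - 2^(d+1) + 1 for d items. The function names are illustrative.

```python
from itertools import combinations


def count_rules(n_items):
    """Count rules X -> Y with X, Y non-empty and disjoint, by enumeration."""
    items = range(n_items)
    total = 0
    for size in range(2, n_items + 1):
        for itemset in combinations(items, size):
            # Each non-empty proper subset of the itemset can be the antecedent.
            total += 2 ** len(itemset) - 2
    return total


def closed_form(n_items):
    """Known closed form for the same count: 3^d - 2^(d+1) + 1."""
    return 3 ** n_items - 2 ** (n_items + 1) + 1


counts = {d: count_rules(d) for d in (2, 3, 6, 10)}
```

Six items already allow 602 candidate rules and ten items allow 57002, which is why support and confidence thresholds are needed to prune the search space.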

5. Conclusion

In practical applications, the complete data sample set is not available at the outset, and more and more data are generated over time. In other words, the incremental problem is almost certain to be encountered in practice. This article first briefly introduces the research status of data mining and then gives a detailed description of data warehouse and data mining technology, including definitions, characteristics, and architecture. Secondly, this article introduces the design of an economic intelligence system, including the design environment and the specific implementation process of the system. Finally, this paper proposes an incremental decision tree algorithm and tests it on a dataset, and the result meets the expected goal. There are many data mining methods, and this article gives only a preliminary discussion of two of them. In future work, how to develop more optimized and efficient algorithms based on actual needs and how to apply the improved algorithms to economic intelligence systems are issues worthy of further study.

Data Availability

Data sharing is not applicable to this article as no datasets were generated or analyzed during the current study.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Welcome to TOP 10 research articles

Top 10 data mining papers: recommended reading – data mining & knowledge management research

Citation Count: 85

Data mining and its applications for knowledge management: a literature review from 2007 to 2012.

Tipawan Silwattananusarn 1 and Kulthida Tuamsuk 2

1 Ph.D. Student in Information Studies Program, Khon Kaen University, Thailand and 2 Head, Information & Communication Management Program, Khon Kaen University, Thailand

Data mining is one of the most important steps of the knowledge discovery in databases process and is considered a significant subfield of knowledge management. Research in data mining is expected to keep growing in business and in learning organizations over the coming decades. This review paper explores the applications of data mining techniques that have been developed to support the knowledge management process. Journal articles indexed in the ScienceDirect database from 2007 to 2012 are analyzed and classified. The discussion of the findings is divided into four topics: (i) knowledge resource; (ii) knowledge types and/or knowledge datasets; (iii) data mining tasks; and (iv) data mining techniques and applications used in knowledge management. The article first briefly describes the definition of data mining and data mining functionality. Then the knowledge management rationale and the major knowledge management tools integrated in the knowledge management cycle are described. Finally, the applications of data mining techniques in the process of knowledge management are summarized and discussed.

Data mining; Data mining applications; Knowledge management

[1] An, X. & Wang, W. (2010). Knowledge management technologies and applications: A literature review . IEEE, 138-141. doi:10.1109/ICAMS.2010.5553046

[2] Berson, A., Smith, S.J. &Thearling, K. (1999). Building Data Mining Applications for CRM. New York: McGraw-Hill .

[3] Cant, F.J. & Ceballos, H.G. (2010). A multiagent knowledge and information network approach for managing research assets . Expert Systems with Applications, 37(7), 5272-5284.doi:10.1016/j.eswa.2010.01.012

[4] Cheng, H., Lu, Y. & Sheu, C. (2009). An ontology-based business intelligence application in a financial knowledge management system .Expert Systems with Applications, 36, 36143622. Doi:10.1016/j.eswa.2008.02.047

[5] Dalkir, K. (2005). Knowledge Management in Theory and Practice . Boston: Butterworth-Heinemann.

[6] Dawei, J. (2011). The Application of Date Mining in Knowledge Management .2011 International Conference on Management of e-Commerce and e-Government, IEEE Computer Society, 7-9. doi:10.1109/ICMeCG.2011.58

[7] Fayyad, U., Piatetsky-Shapiro, G. & Smyth, P. (1996). From Data Mining to Knowledge Discovery in Databases.AI Magazine, 17(3), 37-54.

[8] Gorunescu, F. (2011). Data Mining: Concepts, Models, and Techniques . India: Springer.

[9] Han, J. &Kamber, M. (2012). Data Mining: Concepts and Techniques . 3rd.ed. Boston: Morgan Kaufmann Publishers.

[10] Hwang, H.G., Chang, I.C., Chen, F.J. & Wu, S.Y. (2008). Investigation of the application of KMS for diseases classifications: A study in a Taiwanese hospital . Expert Systems with Applications, 34(1), 725-733. doi:10.1016/j.eswa.2006.10.018

[11] Lavrac, N., Bohanec, M., Pur, A., Cestnik, B., Debeljak, M. &Kobler, A. (2007).Data mining and visualization for decision support and modeling of public health-care resources.Journal of Biomedical Informatics, 40, 438-447. doi:10.1016/j.jbi.2006.10.003

[12] Li, X., Zhu, Z. & Pan, X. (2010). Knowledge cultivating for intelligent decision making in small & middle businesses .Procedia Computer Science, 1(1), 2479-2488. doi:10.1016/j.procs.2010.04.280

[13] Li, Y., Kramer, M.R., Beulens, A.J.M., Van Der Vorst, J.G.A.J. (2010). A framework for early warning and proactive control systems in food supply chain networks. Computers in Industry, 61, 852862. Doi:101.016/j.compind.2010.07.010

[14] Liao, S.H., Chen, C.M., Wu, C.H. (2008). Mining customer knowledge for product line and brand extension in retailing. Expert Systems with Applications, 34(3), 1763-1776. doi:10.1016/j.eswa.2007.01.036

[15] Liao, S. (2003). Knowledge management technologies and applications-literature review from 1995 to 2002 . Expert Systems with Applications, 25, 155-164. doi:10.1016/S0957-4174(03)00043-5

[16] Liu, D.R. & Lai, C.H. (2011). Mining group-based knowledge flows for sharing task knowledge. Decision Support Systems ,50(2), 370-386. doi:10.1016/j.dss.2010.09.004

[17] Lee, M.R. & Chen, T.T. (2011). Revealing research themes and trends in knowledge management: From 1995 to 2010. Knowledge-Based Systems.doi:10.1016/j.knosys.2011.11.016

[18] McInerney, C.R. & Koenig, M.E. (2011). Knowledge Management (KM) Processes in Organizations: Theoretical Foundations and Practice . USA: Morgan & Claypool Publishers. doi:10.2200/S00323ED1V01Y201012ICR018

[19] McInerney, C. (2002). Knowledge Management and the Dynamic Nature of Knowledge .Journal of the American Society for Information Science and Technology, 53(12), 1009-1018. doi:10.1002/asi.10109

[20] Ngai, E., Xiu, L. &Chau, D. (2009). Application of data mining techniques in customer relationship management: A literature review and classification . Expert Systems with Applications, 36, 2592- 2602. doi:10.1016/j.eswa.2008.02.021

[21] Ruggles, R.L. (ed.). (1997). Knowledge Management Tools. Boston: Butterworth-Heinemann.

[22] Sher, P.J. & Lee, V.C. (2004). Information technology as a facilitator for enhancing dynamic capabilities through knowledge management.Information & Management, 41, 933-945. doi:10.1016/

[23] Tseng, S.M. (2008). The effects of information technology on knowledge management systems .Expert Systems with Applications, 35, 150-160. doi:10.1016/j.eswa.2007.06.011

[24] Ur-Rahman, N. & Harding, J.A. (2012). Textual data mining for industrial knowledge management and text classification: A business oriented approach . Expert Systems with Applications, 39, 4729-4739. doi:10.1016/j.eswa.2011.09.124

[25] Wang, F. & Fan, H. (2008). Investigation on Technology Systems for Knowledge Management.IEEE, 1-4. doi:10.1109/WiCom.2008.2716

[26] Wang, H. & Wang, S. (2008). A knowledge management approach to data mining process for business intelligence. Industrial Management & Data Systems, 108(5), 622-634.

[27] Wu, W., Lee, Y.T., Tseng, M.L. & Chiang, Y.H. (2010). Data mining for exploring hidden patterns between KM and its performance.Knowledge-Based Systems, 23, 397-401. doi:10.1016/j.knosys.2010.01.014

Citation Count: 83

Analysis of heart diseases dataset using neural network approach.

K. Usha Rani

Dept. of Computer Science, Sri Padmavathi Mahila Visvavidyalayam (Women's University), Tirupati – 517502, Andhra Pradesh, India

Classification is one of the important techniques of data mining. Many real-world problems in fields such as business, science, industry and medicine can be solved using a classification approach. Neural networks have emerged as an important tool for classification, and their advantages help classify given data efficiently. In this study, a heart disease dataset is analyzed using a neural network approach. To increase the efficiency of the classification process, a parallel approach is also adopted in the training phase.

Data mining, Classification, Neural Networks, Parallelism, Heart Disease

[1] John Shafer, Rakesh Agarwal, and Manish Mehta, (1996) SPRINT:A scalable parallel classifier for data mining , In Proc. Of the VLDB Conference, Bombay, India..

[2] Sunghwan Sohn and Cihan H. Dagli, (2004) Ensemble of Evolving Neural Networks in classification , Neural Processing Letters 19: 191-203, Kulwer Publishers.

[3] K. Anil Jain, Jianchang Mao and K.M. Mohiuddi, (1996) Artificial Neural Networks: A Tutorial , IEEE Computers, pp.31-44.

[4] George Cybenk,, (1996)Neural Networks in Computational Science and Engineering, IEEE Computational Science and Engineering, pp.36-42

[5] R. Rojas, (1996) Neural Networks: a systematic introduction, Springer-Verlag.

[6] R.P.Lippmann,Pattern classification using neural networks, (1989) IEEE Commun. Mag., pp.4764.

[7] Simon Haykin, (2001) Neural Networks A Comprehensive Foundation , Pearson Education.

[8] B.Widrow, D. E. Rumelhard, and M. A. Lehr, (1994) Neural networks: Applications in industry, business and science, Commun. ACM, vol. 37, pp.93105.

[9] W. G. Baxt, (1990) Use of an artificial neural network for data analysis in clinical decisionmaking: The diagnosis of acute coronary occlusion , Neural Comput., vol. 2, pp. 480489..

[10] Dr. A. Kandaswamy, (1997) Applications of Artificial Neural Networks in Bio Medical Engineering, The Institute of Electronics and Telecommunicatio Engineers, Proceedings of the Zonal Seminar on Neural Networks, Nov 20-21.

[11] A. Kusiak, K.H. Kernstine, J.A. Kern, K A. McLaughlin and T.L. Tseng, (2000) Data mining: Medical and Engineering Case Studies , Proceedings of the Industrial Engineering Research Conference, Cleveland, Ohio, May21-23,pp.1-7.

[12] H. B. Burke, (1994) Artificial neural networks for cancer research: Outcome prediction , Sem. Surg. Oncol., vol. 10, pp. 7379.

[13] H. B. Burke, P. H. Goodman, D. B. Rosen, D. E. Henson, J. N. Weinstein, F. E. Harrell, J. R.Marks, D. P. Winchester, and D. G. Bostwick, (1997) Artificial neural networks improve the accuracy of cancer survival prediction , Cancer, vol. 79, pp. 8578621997.

[14] Siri Krishan Wasan1,Vasudha Bhatnagar2 and Harleen Kaur, (2006) The impact of Data Mining Techniques on Medical Diagnostics, Data Science Journal, Volume 5, 119-126.

[15] Scales, R., & Embrechts, M., (2002) Computational Intelligence Techniques for Medical Diagnostic, Proceedings of Walter Lincoln Hawkins, Graduate Research Conference from the World Wide Web:

[16] S. M. Kamruzzaman , Md. Monirul Islam, (2006) An Algorithm to Extract Rules from Artificial Neural Networks for Medical Diagnosis Problems, International Journal of Information Technology, Vol. 12 No. 8.

[17] Hasan Temurtas, Nejat Yumusak, Feyzullah Temurtas, (2009) A comparative study on diabetes disease diagnosis using neural networks, Expert Systems with Applications: An International Journal , Volume 36 Issue 4. International Journal of Data Mining & Knowledge Management Process (IJDKP) Vol.1, No.5, September 2011 8

[18] D Gil, M Johnsson, JM Garcia Chamizo, (2009) , Application of artificial neural networks in the diagnosis of urological dysfunctions , Expert Systems with Applications Volume 36, Issue 3, Part 2, Pages 5754-5760, Elsevier

[19] R. Dybowski and V. Gant, (2007), Clinical Applications of Artificial Neural Networks , Cambridge University Press.

[20] O. Er, N. Yumusak and F. Temurtas, (2010) “Chest disease diagnosis using artificial neural networks”, Expert Systems with Applications, Vol.37, No.12, pp. 7648-7655.

[21] S. Moein, S. A. Monadjemi and P. Moallem, (2009) “ A Novel Fuzzy-Neural Based Medical Diagnosis System “, International Journal of Biological & Medical Sciences, Vol.4, No.3, pp. 146-150.

Citation Count: 80

Predicting students' performance using ID3 and C4.5 classification algorithms.

Kalpesh Adhatrao, Aditya Gaykar, Amiraj Dhawan, Rohit Jha and Vipul Honrao

Department of Computer Engineering, Fr. C.R.I.T., Navi Mumbai, Maharashtra, India

An educational institution needs approximate prior knowledge of its enrolled students to predict their performance in future academics. This helps it identify promising students and gives it an opportunity to pay attention to, and improve, those who would probably get lower grades. As a solution, we have developed a system that predicts the performance of students from their previous performance using classification techniques from data mining. We have analyzed a data set containing information about students, such as gender, marks scored in the board examinations of classes X and XII, marks and rank in entrance examinations, and first-year results of the previous batch of students. By applying the ID3 (Iterative Dichotomiser 3) and C4.5 classification algorithms to this data, we have predicted the general and individual performance of freshly admitted students in future examinations.

Classification, C4.5, Data Mining, Educational Research, ID3, Predicting Performance
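
To make the two split criteria named in the abstract concrete, the following is a minimal sketch of how ID3's information gain and C4.5's gain ratio could be computed for categorical attributes. The toy records, the attribute names `gender` and `entrance_rank`, and all function names are hypothetical illustrations; they are not the authors' data set or implementation.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def split_by(rows, labels, attr):
    """Group the class labels by the value each row takes for `attr`."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attr], []).append(label)
    return groups


def information_gain(rows, labels, attr):
    """ID3 criterion: entropy reduction obtained by splitting on `attr`."""
    total = len(labels)
    remainder = sum(len(g) / total * entropy(g)
                    for g in split_by(rows, labels, attr).values())
    return entropy(labels) - remainder


def gain_ratio(rows, labels, attr):
    """C4.5 criterion: information gain divided by the split information."""
    split_info = entropy([row[attr] for row in rows])
    if split_info == 0:
        return 0.0  # attribute has a single value, so it cannot split the data
    return information_gain(rows, labels, attr) / split_info


# Hypothetical toy records (assumption, not taken from the paper).
rows = [
    {"gender": "F", "entrance_rank": "high"},
    {"gender": "M", "entrance_rank": "high"},
    {"gender": "F", "entrance_rank": "low"},
    {"gender": "M", "entrance_rank": "low"},
]
labels = ["pass", "pass", "fail", "fail"]

# The attribute with the largest information gain becomes the root split.
best = max(rows[0], key=lambda a: information_gain(rows, labels, a))
```

On this toy data, `entrance_rank` separates the two classes perfectly while `gender` carries no information, so `entrance_rank` would be chosen as the root split under either criterion.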

Citation Count: 51

Diagnosis of diabetes using classification mining techniques.

Aiswarya Iyer, S. Jeyalatha and Ronak Sumbaly

Department of Computer Science, BITS Pilani Dubai, United Arab Emirates

Diabetes has affected over 246 million people worldwide, a majority of them women. According to the WHO report, this number is expected to rise to over 380 million by 2025. The disease has been named the fifth deadliest disease in the United States, with no imminent cure in sight. With the rise of information technology and its continued spread into the medical and healthcare sector, cases of diabetes and their symptoms are well documented. This paper aims to find solutions for diagnosing the disease by analyzing the patterns found in the data through classification analysis, employing Decision Tree and Naive Bayes algorithms. The research hopes to propose a quicker and more efficient technique for diagnosing the disease, leading to timely treatment of patients.

Classification, Data Mining, Decision Tree, Diabetes and Naive Bayes.
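
As an illustration of the Naive Bayes side of the approach, here is a small sketch of a categorical Naive Bayes classifier with add-one (Laplace) smoothing. It assumes discretized features; the feature values (`"high"`/`"normal"` glucose, `"obese"`/`"normal"` BMI), the labels and the function names are all hypothetical and are not taken from the paper, which may well have used numeric attributes and an existing toolkit.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(rows, labels):
    """Count class frequencies and per-class feature-value frequencies."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (class, feature index) -> value counts
    feature_values = defaultdict(set)    # feature index -> distinct values seen
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(label, i)][value] += 1
            feature_values[i].add(value)
    return class_counts, value_counts, feature_values


def predict(model, row):
    """Return the class with the highest log posterior for `row`."""
    class_counts, value_counts, feature_values = model
    total = sum(class_counts.values())
    scores = {}
    for cls, n_cls in class_counts.items():
        log_p = math.log(n_cls / total)  # log prior
        for i, value in enumerate(row):
            # Add-one smoothing keeps unseen values from zeroing the product.
            seen = value_counts[(cls, i)][value]
            log_p += math.log((seen + 1) / (n_cls + len(feature_values[i])))
        scores[cls] = log_p
    return max(scores, key=scores.get)


# Hypothetical discretized records: (glucose level, BMI category).
rows = [
    ("high", "obese"), ("high", "normal"), ("normal", "obese"),
    ("normal", "normal"), ("high", "obese"), ("normal", "normal"),
]
labels = ["pos", "pos", "neg", "neg", "pos", "neg"]

model = train_naive_bayes(rows, labels)
```

Working in log space avoids numeric underflow when many features are multiplied together, and the smoothing term is a common default rather than something the abstract specifies.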

[1] National Diabetes Information Clearinghouse (NDIC),

[2] Global Diabetes Community,

[3] Jiawei Han and Micheline Kamber, “Data Mining Concepts and Techniques, Morgan Kauffman Publishers, 2001

[4] S. Kumari and A. Singh, A Data Mining Approach for the Diagnosis of Diabetes Mellitus , Proceedings of Seventh lnternational Conference on Intelligent Systems and Control, 2013, pp. 373-375

[5] C. M. Velu and K. R. Kashwan, Visual Data Mining Techniques for Classification of Diabetic Patients, 3rd IEEE International Advance Computing Conference (IACC), 2013

[6] Sankaranarayanan.S and Dr Pramananda Perumal.T, Predictive Approach for Diabetes Mellitus Disease through Data Mining Technologies , World Congress on Computing and Communication Technologies, 2014, pp. 231-233

[7] Mostafa Fathi Ganji and Mohammad Saniee Abadeh, Using fuzzy Ant Colony Optimization for Diagnosis of Diabetes Disease, Proceedings of ICEE 2010, May 11-13, 2010

[8] T.Jayalakshmi and Dr.A.Santhakumaran, A Novel Classification Method for Diagnosis of Diabetes Mellitus Using Artificial Neural Networks , International Conference on Data Storage and Data Engineering, 2010, pp. 159-163

[9] Sonu Kumari and Archana Singh, A Data Mining Approach for the Diagnosis of Diabetes Mellitus, Proceedings of71hlnternational Conference on Intelligent Systems and Control (ISCO 2013)

[10] Neeraj Bhargava, Girja Sharma, Ritu Bhargava and Manish Mathuria, Decision Tree Analysis on J48 Algorithm for Data Mining. Proceedings of International Journal of Advanced Research in Computer Science and Software Engineering, Volume 3, Issue 6, June 2013.

[11] Michael Feld, Dr. Michael Kipp, Dr. Alassane Ndiaye and Dr. Dominik Heckmann Weka: Practical machine learning tools and techniques with Java implementations

[12] White, A.P., Liu, W.Z.: Technical note: Bias in information-based measures in decision tree induction . Machine Learning 15(3), 321329 (1994)

Citation Count: 42

A new clustering approach for anomaly intrusion detection.

Ravi Ranjan and G. Sahoo

Department of Information Technology, Birla Institute of Technology, Mesra, Ranchi

Recent advances in technology have made our work easier compared with earlier times. Computer networks are growing day by day, but the security of computers and networks has always been a major concern for organizations, from small to large enterprises. Organizations are aware of the possible threats and attacks and prepare for them, but because of some loopholes attackers are still able to mount attacks. Intrusion detection is one of the major fields of research, and researchers are trying to find new algorithms for detecting intrusions. Clustering techniques from data mining are an interesting area of research for detecting possible intrusions and attacks. This paper presents a new clustering approach for anomaly intrusion detection that uses the K-medoids method of clustering with certain modifications. The proposed algorithm is able to achieve a high detection rate and overcomes the disadvantages of the K-means algorithm.

Clustering, data mining, intrusion detection, network security
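
For readers unfamiliar with K-medoids, the sketch below shows a plain version of the method using a simple alternating assign-and-update heuristic. It is not the modified algorithm proposed in the paper, whose details the abstract does not give; the toy points, the Manhattan distance and the function names are assumptions made for illustration.

```python
import random


def manhattan(a, b):
    """Manhattan (L1) distance between two equal-length tuples."""
    return sum(abs(x - y) for x, y in zip(a, b))


def assign(points, medoids):
    """Map each medoid index to the indices of the points nearest to it."""
    clusters = {m: [] for m in medoids}
    for i, p in enumerate(points):
        nearest = min(medoids, key=lambda m: manhattan(p, points[m]))
        clusters[nearest].append(i)
    return clusters


def k_medoids(points, k, max_iter=100, seed=0):
    """Alternate between assigning points and re-picking each cluster's medoid."""
    rng = random.Random(seed)
    medoids = rng.sample(range(len(points)), k)
    for _ in range(max_iter):
        clusters = assign(points, medoids)
        # New medoid = the member with the smallest total distance to the
        # other members; an empty cluster keeps its current medoid.
        new_medoids = [
            min(members,
                key=lambda c: sum(manhattan(points[c], points[j])
                                  for j in members))
            if members else m
            for m, members in clusters.items()
        ]
        if set(new_medoids) == set(medoids):
            break
        medoids = new_medoids
    return medoids, assign(points, medoids)


# Hypothetical 2-D feature vectors forming two well-separated groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
medoids, clusters = k_medoids(points, k=2)
```

Because each cluster center is an actual data point rather than a mean, a single extreme record pulls the center less than it would in K-means, which is the usual motivation for preferring medoids on noisy traffic data.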

[1] J. Anderson, Computer security threat monitoring and surveillance , 1980.

[2] Dorothy E. Denning, An intrusion-detection model, IEEE Transactions on software engineering, pp. 222232, 1987.

[3] Kemmerer, R., and Vigna, G. Intrusion Detection: A Brief History and Overview. IEEE Security & Privacy, v1 n1, Apr 2002, p27-30.

[4] S. Staniford-Chen, S. Cheung, R. Crawford., M. Dilger, J. Frank, J. Hoagland, K. Levitt, C.Wee, R.Yip, D. Zerkle . GrIDS- A Graph-Based Intrusion Detection system for Large Networks . Proc National Information Systems Security conf, 1996.

[5] M.Jianliang, S.Haikun and B.Ling. The Application on Intrusion Detection based on K- Means Cluster Algorithm . International Forum on Information Technology and Application, 2009.

[6] Yu Guan, Ali A. Ghorbani and Nabil Belacel. Y-means: a clustering method for Intrusion Detection. In Canadian Conference on Electrical and Computer Engineering, pages 14, Montral, Qubec, Canada, May 2003.

[7] Zhou Mingqiang, HuangHui, WangQian, A Graph-based Clustering Algorithm for Anomaly Intrusion Detection In computer science and education (ICCSE), 7th International Conference ,2012.

[8] Chitrakar, R. and Huang Chuanhe, Anomaly detection using Support Vector Machine Classification with K-Medoids clustering In Internet (AH-ICI), 3rd Asian Himalayas International conference, 2012.

[9] Yang Jian, An Improved Intrusion Detection Algorithm Based on DBSCAN, Micro Computer Information, 25,1008-0570(2009)01- 3- 0058-03, 58-60,2009.

[10] Li Xue-yong, Gao Guo- A New Intrusion Detection Method Based on Improved DBSCAN , In Information Engineering (ICIE), WASE International conference, 2010.

[11] Lei Li, De-Zhang, Fang-Cheng Shen, A novel rule-based Intrusion Detection System using data mining , In ICCSIT, IEEE International conference, 2010.

[12] Z. Muda, W. Yassin, M.N. Sulaiman and N.I.Udzir, Intrusion Detection based on K-Means Clustering and OneR Classification In Information Assurance and Security (IAS), 7th International conference, 2011.

[13] Zhengjie Li, Yongzhong Li, Lei Xu, Anomaly intrusion detection method based on K-means clustering algorithm with particle swarm optimization , In ICM, 2011.

[14] Kapil Wankhade, Sadia Patka, Ravindra Thool, An Overview of Intrusion Detection Based on Data Mining Techniques , In Proceedings of 2013 International Conference on Communication Systems and Network Technologies, IEEE, 2013, pp.626-629. International Journal of Data Mining & Knowledge Management Process (IJDKP) Vol.4, No.2, March 2014 38

[15] H. Fatma, L. Mohamed, A two-stage technique to improve intrusion detection systems based on data mining algorithms, In ICMSAO, 2013.

[16] A.M. Chandrasekhar, K. Raghuveer, Intrusion detection technique by using K-means,fuzzy neural network and SVM classifiers , In ICCCI, 2013.

[17] Margaret H. Dunham, Data Mining: Introductory and Advanced Topics ,ISBN: 0130888923, published by Pearson Education, Inc.,2003.

[18] KDD.KDDCup1999Data. , 1999.

Citation Count: 34

Incremental learning: areas and methods – a survey.

Prachi Joshi 1 and Parag Kulkarni 2

1 Assistant Professor, MIT College of Engineering, Pune and 2 Adjunct Professor, College of Engineering, Pune

While the application areas of data mining are growing substantially, it has become extremely necessary for incremental learning methods to move a step ahead. The tremendous growth of unlabeled data has driven a big leap in incremental learning. From BI applications to image classification, from analysis to prediction, every domain needs to learn and update. Incremental learning makes it possible to explore new areas while accumulating knowledge at the same time. In this paper we discuss the areas and methods of incremental learning currently in use and highlight its potential for decision making. The paper essentially gives an overview of current research that will provide students and research scholars with background on the topic.

Incremental, learning, mining, supervised, unsupervised, decision-making
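
The core idea of incremental learning, updating a model one example at a time without revisiting earlier data, can be sketched with a nearest-centroid classifier whose class means are maintained as running averages. This is a generic illustration chosen for brevity, not a method taken from the survey; the class name, method names and sample values are hypothetical.

```python
class IncrementalCentroidClassifier:
    """Nearest-centroid classifier that learns from one example at a time."""

    def __init__(self):
        self.counts = {}     # label -> number of examples seen so far
        self.centroids = {}  # label -> running mean of the feature vectors

    def learn_one(self, x, label):
        """Fold a single example into the model; no past data is stored."""
        n = self.counts.get(label, 0) + 1
        old = self.centroids.get(label, [0.0] * len(x))
        # Running mean update: mean_n = mean_(n-1) + (x - mean_(n-1)) / n
        self.centroids[label] = [m + (xi - m) / n for m, xi in zip(old, x)]
        self.counts[label] = n

    def predict(self, x):
        """Return the label whose centroid is closest in squared distance."""
        return min(
            self.centroids,
            key=lambda c: sum((xi - m) ** 2
                              for xi, m in zip(x, self.centroids[c])),
        )


model = IncrementalCentroidClassifier()
model.learn_one([1.0, 1.0], "a")
model.learn_one([3.0, 3.0], "a")
# A previously unseen class can be added later without retraining.
model.learn_one([10.0, 10.0], "b")
```

Each update costs time proportional to the number of features and memory proportional to the number of classes, which is what makes this style of learner suitable for data that keeps arriving.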

[1] Y. Lui, J. Cai, J. Yin, A. Fu, Clustering text data streams , Journal of Computer Science and Technology, 2008, pp 112-128.

[2] A. Fahim, G. Saake, A. Salem, F. Torky, M. Ramadan, K-means for spherical clusters with large variance in sizes , Journal of World Academy of Science, Engineering and Technology, 2008.

[3] F. Camastra, A. Verri, A novel kernel method for clustering, IEEE Transactions on Pattern Analysis and Machince Intelligence, Vol. 27, no.5, 2005, pp 801-805.

[4] F. Shen, H. Yu, Y. Kamiya, O. Hasegawa, An Online Incremental Semi-Supervised Learning Method , Journal of advanced Computational Intelligence and Intelligent Informatics, Vol. 14, No.6, 2010.

[5] T. Zhang, R. Ramakrishnan, M. Livny, Birch: An efficient data clustering method for very large databases, Proc. ACM SIGMOD Intl.Conference on Management of Data , 1996, pp.103-114.

[6] S. Deelers, S. Auwantanamongkol, Enhancing k-means algorithm with initial cluster centers derived from data partitioning along the data axis with highest variance , International Journal of Electrical and Computer Science, 2007, pp 247-252.

[7] S. Young, A. Arel, T. Karnowski, D. Rose, A Fast and Stable Incremental Clustering Algorithm , Proc. of International Conference on Information Technology New Generations, 2010, pp 204-209.

[8] M. Charikar, C. Chekuri, T. Feder, R. Motwani, Incremental clustering and dynamic information retrival, Proc. of ACM symposium on Theory of Computeion , 1997, pp 626- 635.

[9] K. Hammouda, Incremental document clustering using Cluster similarity histograms , Proc. of IEEE International Conference on Web Intelligence, 2003, pp 597- 601.

[10] X. Su, Y. Lan,R. Wan, Y. Qin, A fast incremental clustering algorithm , Proc. of International Symposium on Information Processing, 2009, pp 175-178.

[11] T. Li, HIREL: An incremental clustering for relational data sets , Proc. of IEEE International Conference on Data Mining, 2008, pp 887 892.

[12] P. Lin, Z. Lin, B. Kuang, P. Huang, A Short Chinese Text Incremental Clustering Algorithm Based on Weighted Semantics and Naive Bayes , Journal of Computational Information Systems, 2012, pp 4257- 4268.

[13] C. Chen, S. Hwang, Y. Oyang, An Incremental hierarchical data clustering method based on gravity theory , Proc. of PAKDD, 2002, pp 237-250.

[14] M. Ester, H. Kriegel, J. Sander, M. Wimmer, X. Xu, Incremental Clustering for Mining in a Data Warehousing Environment , Proc. of Intl. Conference on very large data bases, 1998, pp 323-333.

[15] G. Shaw, Y. Xu, Enhancing an incremental clustering algorithm for web page collections, Proc. of IEEE/ACM/WIC Joint Conference on Web Intelligence and and Intelligent Agent Technology, 2009.

[16] C. Hsu, Y. Huang, Incremental clustering of mixed data based on distance hierarchy , Journal of Expert systems and Applications, 35, 2008, pp 1177 1185.

[17] S. Asharaf, M. Murty, S. Shevade, Rough set based incremental clustering of interval data, Pattern Recognition Letters, Vol.27 (9), 2006, pp 515-519.

[18] Z. Li, Incremental Clustering of trajectories , Computer and Information Science, Springer 2010, pp 32-46.

[19] S. Elnekava, M. Last, O. Maimon, I ncremental clustering of mobile objects , Proc. of IEEE International Conference on Data Engineering, 2007, pp 585-592.

[20] S. Furao, A. Sudo, O. Hasegawa, An online incremental learning pattern -based reasoning system, Journal of Neural Networks, Elsevier, Vol. 23,(1), 2010.pp 135-143.

[21] S. Ferilli, M. Biba, T.Basile, F. Esposito, Incremental Machine learning techniques for document layout understanding , Proc. of IEEE Conference on Pattern Recognition, 2008, pp 1-4.

[22] S. Ozawa, S. Pang, N. Kasabov, Incremental Learning of chunk data for online pattern classification systems, IEEE Transactions on Neural Networks, Vo. 19 (6), 2008, pp 1061-1074.

[23] Z. Chen, L. Huang, Y. Murphey, Incremental learning for text document classification , Proc. of IEEE Conference on Neural Networks, 2007, pp 2592-2597. 51

[24] R. Polikar, L. Upda, S. Upda, V. Honavar, Learn ++: An incremental learning algorithm for supervised neural networks , IEEE Transactions on Systems, Man and Cybernatics, Vol.31 (4), 2001, pp 497-508.

[25] H. He, S. Chen, K. Li, X. Xu, Incremental learning from stream data, IEEE Transactions on Neural Networks , Vol.22(12), 2011, pp 1901-1914.

[26] A. Bouchachia, M. Prosseger, H. Duman, Semi supervised incremental learning, Proc. of IEEE International Conference on Fuzzy Systems, 2010 pp 1-7.

[27] R. Zhang, A. Rudnicky, A new data section principle for semi-supervised incremental learning , Computer Science department, paper 1374, 2006, .

[28] Z. Li, S. Watchsmuch, J. Fritsch, G. Sagerer, Semi-supervised incremental learning of manipulative tasks, Proc. of International Conference on Machine Vision Applications, 2007, pp 73-77.

[29] A. Misra, A. Sowmya, P. Compton, Incremental learning for segmentation in medical images , Proc. of IEEE Conference on Biomedical Imaging, 2006.

[30] P. Kranen, E. Muller, I. Assent, R. Krieder, T. Seidl, Incremental Learning of Medical Data for MultiStep Patient Health Classification, Database technology for life sciences and medicine, 2010.

[31] J. Wu, B. Zhang, X. Hua, J, Zhang, A semi-supervised incremental learning framework for sports video view classification, Proc. of IEEE Conference on Multi-Media Modelling, 2006.

[32] S. Wenzel, W. Forstner, Semi supervised incremental learning of hierarchical appearance models , The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences. Vol.37,2008.

[33] S. Ozawa, S. Toh, S. Abe, S. Pang, N. Kasabov, Incremental Learning for online face recognition , Proc. of IEEE Conference on Neural Networks, Vol. 5, 2005 pp 3174-3179.

[34] Z. Erdem, R. Polikar, F. Gurgen, N. Yumusak, Ensemble of SVMs for Incremental Learning , Multiple Classifier Systems, Springer Verlang,, 2005, pp 246-256.

[35] X. Yang, B. Yuan, W. Liu, Dynamic Weighting ensembles for incremental learning , Proc. of IEEE conference in pattern recognition. 2009, pp 1-5.

[36] R. Elwell, R. Polikar, Incremental Learning of Concept drift in nonstationary environments, IEEE Transactions on Neural Networks, Vol.22 (10), 2011 pp 1517- 1531.

[37] W. Khreich, E. Granger, A. Miri, R. Sabourin, A survey of techniques for incremental learning of HMM parameters , Journal of Information Science, Elsevier, 2012.

[38] O. Buffet, A. Duetch, F. Charpillet, Incremental Reinforcement Learning for designing multi-agent systems , Proc. of ACM International Conference on Autonomous Agents, 2001.

[39] E. Demidova, X. Zhou, W. Nejdl, A probabilistic scheme for keyword-based incremental query construction, IEEE Transactions on Knowledge and Data Engineering, 2012, pp 426-439.

[40] R. Roscher, W. Forestner, B. Waske, I2VM: Incremental import vector machines , Journal of Image and Vision Computing, Elsevier, 2012.

Citation Count: 33

A prototype decision support system for optimizing the effectiveness of elearning in educational institutions.

S. Abu-Naser, A. Al-Masri, Y. Abu Sultan and I. Zaqout

Al Azhar University Gaza, Palestine,

In this paper, a prototype of a Decision Support System (DSS) is proposed to provide the knowledge needed to optimize the newly adopted e-learning education strategy in educational institutions. If an educational institution adopts e-learning as a new strategy, it should undertake a preliminary evaluation to determine the percentage of success and the areas of weakness of this strategy. Done manually, this evaluation would be difficult and would not reveal all pitfall symptoms. The proposed DSS is based on exploration (mining) of knowledge from the large amounts of data generated by the institution's business operations. This knowledge can be used to guide and optimize any new business strategy implemented by the institution. The proposed DSS comprises a database engine, a data mining engine and an artificial intelligence engine. All these engines work together to extract the knowledge necessary to improve the effectiveness of any strategy, including e-learning.

DSS, E-learning, knowledge, Database, Data mining, Artificial Intelligence.

Citation Count: 27

Experimental study of data clustering using k-means and modified algorithms.

M.P.S Bhatia and Deepika Khurana

University of Delhi, New Delhi, India

The k-Means clustering algorithm is an old algorithm that has been intensively researched owing to its simplicity and ease of implementation. Clustering algorithms have broad appeal and usefulness in exploratory data analysis. This paper presents the results of an experimental study of different approaches to k-Means clustering, comparing results on different datasets using the original k-Means and other modified algorithms implemented in MATLAB R2009b. The results are evaluated on performance measures such as number of iterations, number of points misclassified, accuracy, Silhouette validity index and execution time.

Data Mining, Clustering Algorithm, k-Means, Silhouette Validity Index.
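The performance measures named in this abstract can be illustrated with a minimal sketch. It is not the authors' MATLAB implementation: the functions `kmeans`, `silhouette` and `misclassified`, the random initialization, the majority-vote mapping from clusters to classes, and the eight-point toy dataset are all assumptions made for illustration.

```python
import math
import random


def kmeans(points, k, max_iter=100, seed=0):
    """Plain Lloyd-style k-means; returns (labels, centroids, iterations)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # assumed init: k random data points
    labels = None
    for iteration in range(1, max_iter + 1):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:  # no assignment changed, so we have converged
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels, centroids, iteration


def silhouette(points, labels):
    """Mean silhouette value over all points (between -1 and 1)."""
    clusters = sorted(set(labels))
    if len(clusters) < 2:
        return 0.0
    scores = []
    for i, p in enumerate(points):
        own = [q for j, q in enumerate(points) if labels[j] == labels[i] and j != i]
        if not own:
            scores.append(0.0)  # common convention for singleton clusters
            continue
        a = sum(math.dist(p, q) for q in own) / len(own)
        b = min(
            sum(math.dist(p, q) for q, lab in zip(points, labels) if lab == c)
            / labels.count(c)
            for c in clusters if c != labels[i]
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return sum(scores) / len(scores)


def misclassified(labels, truth):
    """Count points that disagree with the majority true class of their cluster."""
    errors = 0
    for c in set(labels):
        members = [t for lab, t in zip(labels, truth) if lab == c]
        majority = max(set(members), key=members.count)
        errors += sum(1 for t in members if t != majority)
    return errors


# Toy data (made up): two well-separated groups of four points each.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),
          (8.0, 8.0), (8.0, 9.0), (9.0, 8.0), (9.0, 9.0)]
truth = [0, 0, 0, 0, 1, 1, 1, 1]

labels, centroids, n_iter = kmeans(points, k=2)
errors = misclassified(labels, truth)
accuracy = 1 - errors / len(points)
score = silhouette(points, labels)
print(n_iter, errors, accuracy, round(score, 3))
```

Execution time, the remaining measure, could be taken by wrapping the `kmeans` call in `time.perf_counter()` readings.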

Citation Count: 26

Data, text and web mining for business intelligence: a survey.

Abdul-Aziz Rashid Al-Azmi

Department of Computer Engineering, Kuwait University, Kuwait

The Information and Communication Technologies revolution brought a digital world with huge amounts of data available. Enterprises use mining technologies to search vast amounts of data for vital insight and knowledge. Mining tools such as data mining, text mining, and web mining are used to find hidden knowledge in large databases or the Internet. Mining tools are automated software tools used to achieve business intelligence by finding hidden relations and predicting future events from vast amounts of data. This uncovered knowledge helps in gaining competitive advantages, better customer relationships, and even fraud detection. In this survey, we describe how these techniques work and how they are implemented. Furthermore, we discuss how business intelligence is achieved using these mining tools. We then look into some case studies of success stories using mining tools. Finally, we present some of the main challenges to the mining technologies that limit their potential.

Citation Count: 21

Applications of data mining techniques in life insurance.

A. B. Devale 1 and R. V. Kulkarni 2

1 Arts, Commerce, Science College, Palus Dist. Sangli, Maharashtra and 2 Shahu Institute of Business Research, Kolhapur, Maharashtra

Knowledge discovery systems in financial organizations have been built and operated mainly to support decision making, using knowledge as a strategic factor. In this paper, we investigate the use of various data mining techniques for knowledge discovery in the insurance business. Existing software is inefficient in showing such data characteristics. We introduce different exhibits for discovering knowledge in the form of association rules, clustering, classification and correlation suitable for the data characteristics. With the proposed data mining techniques, the decision-maker can define the expansion of insurance activities to empower the different forces in the existing life insurance sector.

Insurance, Association rules, Clustering, Classification, Correlation, Data mining.

Data mining in clinical big data: the frequently used databases, steps, and methodological models

Military Medical Research, volume 8, Article number: 44 (2021)

Many high-quality studies have emerged from public databases, such as Surveillance, Epidemiology, and End Results (SEER), National Health and Nutrition Examination Survey (NHANES), The Cancer Genome Atlas (TCGA), and Medical Information Mart for Intensive Care (MIMIC); however, these data are often characterized by a high degree of dimensional heterogeneity, timeliness, scarcity, irregularity, and other characteristics, so their value is not being fully utilized. Data-mining technology has been a frontier field in medical research, as it demonstrates excellent performance in evaluating patient risks and assisting clinical decision-making in building disease-prediction models. Therefore, data mining has unique advantages in clinical big-data research, especially in large-scale medical public databases. This article introduced the main medical public databases and described the steps, tasks, and models of data mining in simple language. Additionally, we described data-mining methods along with their practical applications. The goal of this work was to aid clinical researchers in gaining a clear and intuitive understanding of the application of data-mining technology on clinical big data in order to promote the production of research results that are beneficial to doctors and patients.

With the rapid development of computer software/hardware and internet technology, the amount of data has increased at an amazing speed. “Big data” as an abstract concept currently affects all walks of life [1], and although its importance has been recognized, its definition varies slightly from field to field. In the field of computer science, big data refers to a dataset that cannot be perceived, acquired, managed, processed, or served within a tolerable time by using traditional IT and software and hardware tools. Generally, big data refers to a dataset that exceeds the scope of the simple databases and data-processing architectures used in the early days of computing; it is characterized by high-volume, high-dimensional data that is rapidly updated, and it represents a phenomenon or feature that has emerged in the digital age. Across the medical industry, various types of medical data are generated at a high speed, and trends indicate that applying big data in the medical field helps improve the quality of medical care and optimizes medical processes and management strategies [2, 3]. Currently, this trend is shifting from civilian medicine to military medicine. For example, the United States is exploring the potential to use one of its largest healthcare systems (the Military Healthcare System) to provide healthcare to eligible veterans in order to potentially benefit more than 9 million eligible personnel [4]. Another data-management system has been developed to assess the physical and mental health of active-duty personnel, with this expected to yield significant economic benefits to the military medical system [5]. However, in medical research, the wide variety of clinical data and the differences between several medical concepts in different classification standards result in a high degree of dimensional heterogeneity, timeliness, scarcity, and irregularity in existing clinical data [6, 7]. Furthermore, new data analysis techniques have yet to be popularized in medical research [8]. These reasons hinder the full realization of the value of existing data, and the intensive exploration of the value of clinical data remains a challenging problem.

Computer scientists have made outstanding contributions to the application of big data and introduced the concept of data mining to solve difficulties associated with such applications. Data mining (also known as knowledge discovery in databases) refers to the process of extracting potentially useful information and knowledge hidden in a large amount of incomplete, noisy, fuzzy, and random practical application data [9]. Unlike traditional research methods, several data-mining technologies mine information to discover knowledge based on the premise of unclear assumptions (i.e., they are directly applied without prior research design). The obtained information should have previously unknown, valid, and practical characteristics [9]. Data-mining technology does not aim to replace traditional statistical analysis techniques, but it does seek to extend and expand statistical analysis methodologies. From a practical point of view, machine learning (ML) is the main analytical method in data mining, as it represents a method of training models by using data and then using those models for predicting outcomes. Given the rapid progress of data-mining technology and its excellent performance in other industries and fields, it has introduced new opportunities and prospects to clinical big-data research [10]. Large amounts of high-quality medical data are available to researchers in the form of public databases, which enable more researchers to participate in the process of medical data mining in the hope that the generated results can further guide clinical practice.
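The train-then-predict pattern described above can be sketched in a few lines. This is only an illustration under stated assumptions: the nearest-centroid model, the function names `train` and `predict`, the two features, and every record below are hypothetical and do not come from any database or method discussed in this article.

```python
import math


def train(features, outcomes):
    """Fit a nearest-centroid model: one mean feature vector per outcome class."""
    model = {}
    for label in set(outcomes):
        rows = [f for f, y in zip(features, outcomes) if y == label]
        model[label] = tuple(sum(col) / len(rows) for col in zip(*rows))
    return model


def predict(model, feature):
    """Return the outcome class whose centroid is closest to the feature vector."""
    return min(model, key=lambda label: math.dist(feature, model[label]))


# Made-up training records: (age in decades, systolic pressure / 100) -> risk class.
train_x = [(3.0, 1.10), (4.0, 1.20), (3.5, 1.15),
           (7.0, 1.60), (6.5, 1.70), (7.5, 1.65)]
train_y = ["low", "low", "low", "high", "high", "high"]

model = train(train_x, train_y)          # step 1: train a model from data
prediction = predict(model, (6.8, 1.55))  # step 2: predict an unseen case
print(prediction)
```

Real clinical models would need validation on held-out data and far richer features; the sketch only shows the two-step structure of learning from data and then predicting.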

This article provided a valuable overview to medical researchers interested in studying the application of data mining on clinical big data. To allow a clearer understanding of the application of data-mining technology on clinical big data, the second part of this paper introduced the concept of public databases and summarized those commonly used in medical research. In the third part of the paper, we offered an overview of data mining, including introducing an appropriate model, tasks, and processes, and summarized the specific methods of data mining. In the fourth and fifth parts of this paper, we introduced data-mining algorithms commonly used in clinical practice along with specific cases in order to help clinical researchers clearly and intuitively understand the application of data-mining technology on clinical big data. Finally, we discussed the advantages and disadvantages of data mining in clinical analysis and offered insight into possible future applications.

Overview of common public medical databases

A public database is a data repository used for research and dedicated to housing data related to scientific research on an open platform. Such databases collect and store heterogeneous and multi-dimensional health, medical, and scientific research data in a structured form and are characterized by mass/multi-ownership, complexity, and security. These databases cover a wide range of data, including those related to cancer research, disease burden, nutrition and health, and genetics and the environment. Table 1 summarizes the main public medical databases [ 11 , 12 , 13 , 14 , 15 , 16 , 17 , 18 , 19 , 20 , 21 , 22 , 23 , 24 , 25 , 26 ]. Researchers can apply for access to data based on the scope of the database and the application procedures required to perform relevant medical research.

Data mining: an overview

Data mining is a multidisciplinary field at the intersection of database technology, statistics, ML, and pattern recognition that profits from all these disciplines [ 27 ]. Although this approach is not yet widespread in the field of medical research, several studies have demonstrated the promise of data mining in building disease-prediction models, assessing patient risk, and helping physicians make clinical decisions [ 28 , 29 , 30 , 31 ].

Data-mining models

Data-mining has two kinds of models: descriptive and predictive. Predictive models are used to predict unknown or future values of other variables of interest, whereas descriptive models are often used to find patterns that describe data that can be interpreted by humans [ 32 ].

Data-mining tasks

A model is usually implemented by a task, with the goal of description being to generalize patterns of potential associations in the data. Therefore, using a descriptive model usually results in a few collections with the same or similar attributes. Prediction mainly refers to estimation of the variable value of a specific attribute based on the variable values of other attributes, including classification and regression [ 33 ].

Data-mining methods

After the data-mining model and task have been defined, the data-mining methods required to build the approach are chosen based on the discipline involved. The choice of method depends on whether or not dependent variables (labels) are present in the analysis. Predictions with dependent variables (labels) are generated through supervised learning, which can be performed by the use of linear regression, generalized linear regression, a proportional hazards model (the Cox regression model), a competitive risk model, decision trees, the random forest (RF) algorithm, and support vector machines (SVMs). In contrast, unsupervised learning involves no labels; the learning model instead infers some internal structure of the data. Common unsupervised learning methods include principal component analysis (PCA), association analysis, and clustering analysis.

Data-mining algorithms for clinical big data

Data mining based on clinical big data can produce effective and valuable knowledge, which is essential for accurate clinical decision-making and risk assessment [ 34 ]. Data-mining algorithms enable realization of these goals.

Supervised learning

A concept often mentioned in supervised learning is the partitioning of datasets. To prevent overfitting of a model, a dataset can generally be divided into two or three parts: a training set, validation set, and test set. Ripley [ 35 ] defined these parts as a set of examples used for learning and used to fit the parameters (i.e., weights) of the classifier, a set of examples used to tune the parameters (i.e., architecture) of a classifier, and a set of examples used only to assess the performance (generalization) of a fully-specified classifier, respectively. Briefly, the training set is used to train the model or determine the model parameters, the validation set is used to perform model selection, and the test set is used to verify model performance. In practice, data are generally divided into training and test sets, whereas a separate validation set is used less often. It should be emphasized that the results of the test set do not guarantee model correctness but only show that similar data can obtain similar results using the model. Therefore, the applicability of a model should be analysed in combination with specific problems in the research. Classical statistical methods, such as linear regression, generalized linear regression, and the proportional hazards model, have been widely used in medical research. Notably, most of these classical statistical methods have certain data requirements or assumptions; however, in the face of complicated clinical data, assumptions about data distribution are difficult to make. In contrast, some ML methods (algorithmic models) make no assumptions about the data and cross-validate the results; thus, they are likely to be favoured by clinical researchers [ 36 ]. For these reasons, this chapter focuses on ML methods that do not require assumptions about data distribution and classical statistical methods that are used in specific situations.
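
As a minimal sketch of the partitioning described above, the following Python code shuffles a set of records and divides it into training, validation, and test sets. The function name `split_dataset`, the 60/20/20 proportions, and the integer patient identifiers are illustrative assumptions and are not taken from any cited study.

```python
import random

def split_dataset(records, train_frac=0.6, val_frac=0.2, seed=0):
    """Shuffle records and split them into training, validation, and test sets."""
    if train_frac <= 0 or val_frac < 0 or train_frac + val_frac >= 1:
        raise ValueError("fractions must leave a non-empty test set")
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)      # fixed seed keeps the split reproducible
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)
    train = shuffled[:n_train]                 # used to fit model parameters
    val = shuffled[n_train:n_train + n_val]    # used for model selection
    test = shuffled[n_train + n_val:]          # used only for the final performance check
    return train, val, test

patients = list(range(100))                    # hypothetical patient identifiers
train, val, test = split_dataset(patients)
```

Setting `val_frac=0` gives the two-way training/test split that is more common in practice.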

Decision tree

A decision tree is a basic classification and regression method that generates a result similar to the tree structure of a flowchart, where each tree node represents a test on an attribute, each branch represents the output of an attribute, each leaf node (decision node) represents a class or class distribution, and the topmost part of the tree is the root node [ 37 ]. The decision tree model is called a classification tree when used for classification and a regression tree when used for regression. Studies have demonstrated the utility of the decision tree model in clinical applications. In a study on the prognosis of breast cancer patients, a decision tree model and a classical logistic regression model were constructed, respectively, with the predictive performance of the different models indicating that the decision tree model showed stronger predictive power when using real clinical data [ 38 ]. Similarly, the decision tree model has been applied to other areas of clinical medicine, including diagnosis of kidney stones [ 39 ], predicting the risk of sudden cardiac arrest [ 40 ], and exploration of the risk factors of type II diabetes [ 41 ]. A common feature of these studies is the use of a decision tree model to explore the interaction between variables and classify subjects into homogeneous categories based on their observed characteristics. In fact, because the decision tree accounts for the strong interaction between variables, it is more suitable for use with decision algorithms that follow the same structure [ 42 ]. In the construction of clinical prediction models and exploration of disease risk factors and patient prognosis, the decision tree model might offer more advantages and practical application value than some classical algorithms. Although the decision tree has many advantages, it recursively separates observations into branches to construct a tree; therefore, in terms of data imbalance, the precision of decision tree models needs improvement.
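
To illustrate how a single tree node tests an attribute, the following sketch searches for the threshold on one numeric variable that yields the purest two branches, using Gini impurity as the splitting criterion. Gini impurity is one common criterion and is assumed here; the reviewed studies may have used others. The glucose values and labels are hypothetical toy data.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0 means perfectly pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_split(values, labels):
    """Return (threshold, weighted Gini) for the best test 'value <= threshold'."""
    best = (None, gini(labels))                # no split: impurity of the parent node
    candidates = sorted(set(values))
    for low, high in zip(candidates, candidates[1:]):
        threshold = (low + high) / 2           # midpoint between neighbouring values
        left = [y for x, y in zip(values, labels) if x <= threshold]
        right = [y for x, y in zip(values, labels) if x > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (threshold, score)
    return best

# Hypothetical toy data: fasting glucose (mmol/L) and a diabetes label.
glucose = [4.8, 5.1, 5.4, 5.9, 7.2, 7.8, 8.4, 9.0]
diabetes = [0, 0, 0, 0, 1, 1, 1, 1]
threshold, impurity = best_split(glucose, diabetes)
```

A full tree applies the same search recursively to each branch until the leaves are sufficiently pure or a stopping rule is met.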

The RF method

The RF algorithm was developed as an application of an ensemble-learning method based on a collection of decision trees. The bootstrap method [ 43 ] is used to randomly retrieve sample sets from the training set, with decision trees generated by the bootstrap method constituting a “random forest” and predictions based on this derived from an ensemble average or majority vote. The biggest advantage of the RF method is that the random sampling of predictor variables at each decision tree node decreases the correlation among the trees in the forest, thereby improving the precision of ensemble predictions [ 44 ]. Given that a single decision tree model might encounter the problem of overfitting [ 45 ], the initial application of RF minimizes overfitting in classification and regression and improves predictive accuracy [ 44 ]. Taylor et al. [ 46 ] highlighted the potential of RF in correctly differentiating in-hospital mortality in patients experiencing sepsis after admission to the emergency department. Nowhere in the healthcare system is the need more pressing to find methods to reduce uncertainty than in the fast, chaotic environment of the emergency department. The authors demonstrated that the predictive performance of the RF method was superior to that of traditional emergency medicine methods and the methods enabled evaluation of more clinical variables than traditional modelling methods, which subsequently allowed the discovery of clinical variables not expected to be of predictive value or which otherwise would have been omitted as a rare predictor [ 46 ]. Another study based on the Medical Information Mart for Intensive Care (MIMIC) II database [ 47 ] found that RF had excellent predictive power regarding intensive care unit (ICU) mortality [ 48 ]. These studies showed that the application of RF to big data stored in the hospital healthcare system provided a new data-driven method for predictive analysis in critical care. 
Additionally, random survival forests have recently been developed to analyse survival data, especially right-censored survival data [ 49 , 50 ], which can help researchers conduct survival analyses in clinical oncology and help develop personalized treatment regimens that benefit patients [ 51 ].
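
The two sources of randomness in RF, bootstrap sampling of the training set and random selection of predictor variables, can be sketched as follows. This is a deliberately simplified illustration: each "tree" is a one-split stump, so choosing one random feature per tree stands in for the per-node feature sampling of a real random forest. The function names and the (age, lactate) toy data are hypothetical.

```python
import random
from collections import Counter

def fit_stump(rows, labels, feature):
    """Fit a one-split tree on one feature: choose the threshold with the fewest errors."""
    best = None
    for threshold in sorted({row[feature] for row in rows}):
        left = [y for row, y in zip(rows, labels) if row[feature] <= threshold]
        right = [y for row, y in zip(rows, labels) if row[feature] > threshold]
        left_label = Counter(left).most_common(1)[0][0]
        right_label = Counter(right).most_common(1)[0][0] if right else left_label
        errors = sum(y != left_label for y in left) + sum(y != right_label for y in right)
        if best is None or errors < best[0]:
            best = (errors, feature, threshold, left_label, right_label)
    return best[1:]

def predict_stump(stump, row):
    feature, threshold, left_label, right_label = stump
    return left_label if row[feature] <= threshold else right_label

def fit_forest(rows, labels, n_trees=25, seed=0):
    rng = random.Random(seed)
    n = len(rows)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]   # bootstrap: sample with replacement
        feature = rng.randrange(len(rows[0]))        # random predictor for this tree
        forest.append(fit_stump([rows[i] for i in idx], [labels[i] for i in idx], feature))
    return forest

def predict_forest(forest, row):
    """Majority vote over all trees in the forest."""
    votes = Counter(predict_stump(stump, row) for stump in forest)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: (age in years, lactate in mmol/L) and in-hospital death.
rows = [(34, 1.0), (41, 1.2), (38, 0.9), (45, 1.5), (71, 3.8), (66, 4.2), (79, 5.1), (83, 4.6)]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
forest = fit_forest(rows, labels)
low_risk = predict_forest(forest, (30, 0.8))
high_risk = predict_forest(forest, (90, 6.0))
```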

The SVM

The SVM is a relatively new classification or prediction method developed by Cortes and Vapnik and represents a data-driven approach that does not require assumptions about data distribution [ 52 ]. The core purpose of an SVM is to identify a separation boundary (called a hyperplane) to help classify cases; thus, the advantages of SVMs are obvious when classifying and predicting cases based on high dimensional data or data with a small sample size [ 53 , 54 ].

In a study of drug compliance in patients with heart failure, researchers used an SVM to build a predictive model for patient compliance in order to overcome the problem of a large number of input variables relative to the number of available observations [ 55 ]. Additionally, the mechanisms of certain chronic and complex diseases observed in clinical practice remain unclear, and many risk factors, including gene–gene interactions and gene-environment interactions, must be considered in the research of such diseases [ 55 , 56 ]. SVMs are capable of addressing these issues. Yu et al. [ 54 ] applied an SVM for predicting diabetes onset based on data from the National Health and Nutrition Examination Survey (NHANES). Furthermore, these models have strong discrimination ability, making SVMs a promising classification approach for detecting individuals with chronic and complex diseases. However, a disadvantage of SVMs is that when the number of observation samples is large, the method becomes time- and resource-intensive, which is often highly inefficient.

Competitive risk model

Kaplan–Meier marginal regression and the Cox proportional hazards model are widely used in survival analysis in clinical studies. Classical survival analysis usually considers only one endpoint, such as the impact of patient survival time. However, in clinical medical research, multiple endpoints usually coexist, and these endpoints compete with one another to generate competitive risk data [ 57 ]. In the case of multiple endpoint events, the use of a single endpoint-analysis method can lead to a biased estimation of the probability of endpoint events due to the existence of competitive risks [ 58 ]. The competitive risk model is a classical statistical model based on the hypothesis of data distribution. Its main advantage is its accurate estimation of the cumulative incidence of outcomes for right-censored survival data with multiple endpoints [ 59 ]. In data analysis, the cumulative risk rate is estimated using the cumulative incidence function in single-factor analysis, and Gray’s test is used for between-group comparisons [ 60 ].
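
The cumulative incidence function mentioned above can be estimated nonparametrically. The sketch below assumes the standard estimator in which, at each event time, the probability of being free of any event just beforehand is multiplied by the proportion of at-risk subjects who experience the cause of interest. The event coding (0 = censored, 1 = cancer death, 2 = death from another cause) and the follow-up data are hypothetical.

```python
def cumulative_incidence(times, events, cause):
    """Nonparametric cumulative incidence for one cause in the presence of competing risks.

    times  : follow-up time of each subject
    events : 0 = censored, any other integer = type of event observed
    Returns a list of (time, cumulative incidence) at each time where any event occurs.
    """
    at_risk = len(times)
    survival = 1.0            # probability of no event of any type before the current time
    cif = 0.0
    curve = []
    for t in sorted(set(times)):
        n_cause = sum(1 for ti, e in zip(times, events) if ti == t and e == cause)
        n_any = sum(1 for ti, e in zip(times, events) if ti == t and e != 0)
        n_leaving = sum(1 for ti in times if ti == t)   # events plus censorings at t
        if n_any:
            cif += survival * n_cause / at_risk
            survival *= 1 - n_any / at_risk
            curve.append((t, cif))
        at_risk -= n_leaving
    return curve

# Hypothetical follow-up data (months) for eight patients.
times = [2, 3, 3, 5, 6, 8, 9, 10]
events = [1, 2, 0, 1, 0, 2, 1, 0]
cif_cancer = cumulative_incidence(times, events, cause=1)
cif_other = cumulative_incidence(times, events, cause=2)
```

In this toy dataset the final cumulative incidences for the two causes add up to one minus the overall event-free survival, which is the property that a single-endpoint analysis treating competing events as censored fails to preserve.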

Multifactor analysis uses the Fine-Gray and cause-specific (CS) risk models to explore the cumulative risk rate [ 61 ]. The difference between the Fine-Gray and CS models is that the former is applicable to establishing a clinical prediction model and predicting the risk of a single endpoint of interest [ 62 ], whereas the latter is suitable for answering etiological questions, where the regression coefficient reflects the relative effect of covariates on the increased incidence of the main endpoint in the target event-free risk set [ 63 ]. Currently, in databases with CS records, such as Surveillance, Epidemiology, and End Results (SEER), competitive risk models exhibit good performance in exploring disease-risk factors and prognosis [ 64 ]. A study of prognosis in patients with oesophageal cancer from SEER showed that Cox proportional risk models might misestimate the effects of age and disease location on patient prognosis, whereas competitive risk models provide more accurate estimates of factors affecting patient prognosis [ 65 ]. In another study of the prognosis of penile cancer patients, researchers found that using a competitive risk model was more helpful in developing personalized treatment plans [ 66 ].

Unsupervised learning

In many data-analysis processes, the amount of usable identified data is small, and identifying data is a tedious process [ 67 ]. Unsupervised learning is necessary to judge and categorize data according to similarities, characteristics, and correlations and has three main applications: data clustering, association analysis, and dimensionality reduction. Therefore, the unsupervised learning methods introduced in this section include clustering analysis, association rules, and PCA.

Clustering analysis

The classification algorithm needs to “know” information concerning each category in advance, with all of the data to be classified having corresponding categories. When the above conditions cannot be met, cluster analysis can be applied to solve the problem [ 68 ]. Clustering places similar objects into different categories or subsets through the process of static classification. Consequently, objects in the same subset have similar properties. Many kinds of clustering techniques exist. Here, we introduced the four most commonly used clustering techniques.

Partition clustering

The core idea of this clustering method regards the centre of the data point as the centre of the cluster. The k-means method [ 69 ] is a representative example of this technique. The k-means method takes n observations and an integer, k , and outputs a partition of the n observations into k sets such that each observation belongs to the cluster with the nearest mean [ 70 ]. The k-means method exhibits low time complexity and high computing efficiency but has a poor processing effect on high dimensional data and cannot identify nonspherical clusters.
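
A minimal sketch of the k-means procedure follows, assuming Lloyd's iterative algorithm with Euclidean distance. Taking the first k observations as initial centres is a simplification made here for reproducibility; practical implementations usually use random or k-means++ initialization. The two-dimensional points are hypothetical standardized measurements.

```python
import math

def kmeans(points, k, n_iter=100):
    """Lloyd's algorithm: alternate between assigning points and updating centres."""
    centres = [tuple(p) for p in points[:k]]   # simplification: first k points as centres
    assignment = None
    for _ in range(n_iter):
        # Assignment step: each observation joins the cluster with the nearest mean.
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centres[c])) for p in points
        ]
        if new_assignment == assignment:       # converged: no observation changed cluster
            break
        assignment = new_assignment
        # Update step: each centre becomes the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:                        # keep the old centre if a cluster is empty
                centres[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return assignment, centres

# Hypothetical standardized measurements for six patients.
points = [(1.0, 1.0), (8.0, 8.0), (1.2, 0.8), (0.9, 1.1), (8.2, 7.9), (7.9, 8.3)]
assignment, centres = kmeans(points, k=2)
```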

Hierarchical clustering

The hierarchical clustering algorithm decomposes a dataset hierarchically to facilitate the subsequent clustering [ 71 ]. Common algorithms for hierarchical clustering include BIRCH [ 72 ], CURE [ 73 ], and ROCK [ 74 ]. The algorithm starts by treating every point as a cluster, with clusters grouped according to closeness. When further combinations result in unexpected results under multiple causes or only one cluster remains, the grouping process ends. This method has wide applicability, and the relationship between clusters is easy to detect; however, the time complexity is high [ 75 ].

Clustering according to density

The density algorithm takes areas presenting a high degree of data density and defines these as belonging to the same cluster [ 76 ]. This method aims to find arbitrarily-shaped clusters, with the most representative algorithm being DBSCAN [ 77 ]. In practice, DBSCAN does not need to input the number of clusters to be partitioned and can handle clusters of various shapes; however, the time complexity of the algorithm is high. Furthermore, when data density is irregular, the quality of the clusters decreases; thus, DBSCAN cannot process high dimensional data [ 75 ].
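
The density-based idea can be sketched with a naive implementation of DBSCAN. The parameter names `eps` (neighbourhood radius) and `min_pts` (minimum number of neighbours, including the point itself, for a core point) follow common usage and are assumptions here. The quadratic neighbour search reflects the high time complexity noted above, and the points are hypothetical.

```python
import math

def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or -1 for noise."""
    labels = [None] * len(points)

    def neighbours(i):
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = -1                 # provisionally noise; may become a border point
            continue
        cluster += 1                       # point i is a core point: start a new cluster
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster        # noise reachable from a core point is a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = neighbours(j)
            if len(nbrs) >= min_pts:       # only core points extend the cluster further
                queue.extend(nbrs)
    return labels

# Hypothetical points: two dense groups and one isolated observation.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11), (5, 5)]
labels = dbscan(points, eps=1.5, min_pts=3)
```

Note that the number of clusters is not supplied as an input; it emerges from the density parameters.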

Clustering according to a grid

Neither partition nor hierarchical clustering can identify clusters with nonconvex shapes. Although a density-based algorithm can accomplish this task, the time complexity is high. To address this problem, data-mining researchers proposed grid-based algorithms that change the original data space into a grid structure of a certain size. A representative algorithm is STING, which divides the data space into several square cells according to different resolutions and clusters the data of different structure levels [ 78 ]. The main advantage of this method is its high processing speed, which depends only on the number of units in each dimension of the quantized space.

In clinical studies, subjects tend to be actual patients. Although researchers adopt complex inclusion and exclusion criteria before determining the subjects to be included in the analyses, heterogeneity among different patients cannot be avoided [ 79 , 80 ]. The most common application of cluster analysis in clinical big data is in classifying heterogeneous mixed groups into homogeneous groups according to the characteristics of existing data (i.e., “subgroups” of patients or observed objects are identified) [ 81 , 82 ]. This new information can then be used in the future to develop patient-oriented medical-management strategies. Docampo et al. [ 81 ] used hierarchical clustering to reduce heterogeneity and identify subgroups of clinical fibromyalgia, which aided the evaluation and management of fibromyalgia. Additionally, Guo et al. [ 83 ] used k-means clustering to divide patients with essential hypertension into four subgroups, which revealed that the potential risk of coronary heart disease differed between different subgroups. On the other hand, density- and grid-based clustering algorithms have mostly been used to process large numbers of images generated in basic research and clinical practice, with current studies focused on developing new tools to help clinical research and practices based on these technologies [ 84 , 85 ]. Cluster analysis will continue to have extensive application prospects along with the increasing emphasis on personalized treatment.

Association rules

Association rules discover interesting associations and correlations between item sets in large amounts of data. These rules were first proposed by Agrawal et al. [ 86 ] and applied to analyse customer buying habits to help retailers create sales plans. Data mining based on association rules proceeds in two steps: 1) all frequent item sets in the collection are listed and 2) association rules are generated from those frequent item sets [ 87 ]. Therefore, before association rules can be obtained, sets of frequent items must be calculated using certain algorithms. The Apriori algorithm is based on the a priori principle (every subset of a frequent item set must itself be frequent) and finds all item sets in the database transactions that meet a minimum support threshold or other restrictions [ 88 ]. Other algorithms are mostly variants of the Apriori algorithm [ 64 ]. The Apriori algorithm must scan the entire database each time it counts a new level of candidate item sets; therefore, algorithm performance deteriorates as database size increases [ 89 ], making it potentially unsuitable for analysing large databases. The frequent pattern (FP) growth algorithm was proposed to improve efficiency. After the first scan, the FP algorithm compresses the frequent item sets in the database into an FP tree while retaining the associated information and then mines the conditional pattern bases separately [ 90 ]. Association-rule technology is often used in medical research to identify association rules between disease risk factors (i.e., exploration of the joint effects of disease risk factors and combinations of other risk factors). For example, Li et al. [ 91 ] used the association-rule algorithm to identify the most important stroke risk factor as atrial fibrillation, followed by diabetes and a family history of stroke. Based on the same principle, association rules can also be used to evaluate treatment effects and other aspects. For example, Guo et al. [ 92 ] used the FP algorithm to generate association rules and evaluate individual characteristics and treatment effects of patients with diabetes, thereby reducing the readmission rate of patients with diabetes. Association rules reveal a connection between premises and conclusions; however, the reasonable and reliable application of information can only be achieved through validation by experienced medical professionals and through extensive causal research [ 92 ].
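
The two-step procedure (finding frequent item sets, then generating rules) can be sketched as follows. This is a compact illustration of the Apriori idea using support and confidence thresholds, not the implementation used in the cited studies. The patient records are hypothetical toy data and carry no clinical meaning.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Apriori: grow candidate item sets level by level, keeping those with enough support."""
    n = len(transactions)
    support = {}
    current = {frozenset([item]) for t in transactions for item in t}
    while current:
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        kept = {c: cnt / n for c, cnt in counts.items() if cnt / n >= min_support}
        support.update(kept)
        # Join step: merge frequent k-item sets into (k+1)-item candidates.
        current = {a | b for a in kept for b in kept if len(a | b) == len(a) + 1}
        # Prune step: every k-item subset of a candidate must itself be frequent.
        current = {c for c in current
                   if all(frozenset(s) in kept for s in combinations(c, len(c) - 1))}
    return support

def rules(support, min_confidence):
    """Yield (antecedent, consequent, confidence) for rules meeting the threshold."""
    for itemset, supp in support.items():
        for r in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(itemset, r)):
                confidence = supp / support[antecedent]
                if confidence >= min_confidence:
                    yield antecedent, itemset - antecedent, confidence

# Hypothetical patient records, each a set of recorded conditions.
records = [
    {"atrial_fibrillation", "diabetes", "stroke"},
    {"atrial_fibrillation", "stroke"},
    {"diabetes", "hypertension"},
    {"atrial_fibrillation", "hypertension", "stroke"},
    {"hypertension"},
]
support = frequent_itemsets(records, min_support=0.4)
found = {(tuple(sorted(a)), tuple(sorted(c))) for a, c, _ in rules(support, 0.8)}
```

Each pass of the `while` loop rescans all transactions, which is the cost that the FP-growth approach is designed to avoid.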

PCA

PCA is a widely used data-mining method that aims to reduce data dimensionality in an interpretable way while retaining most of the information present in the data [ 93 , 94 ]. The main purpose of PCA is descriptive, as it requires no assumptions about data distribution and is, therefore, an adaptive and exploratory method. During the process of data analysis, the main steps of PCA include standardization of the original data, calculation of a correlation coefficient matrix, calculation of eigenvalues and eigenvectors, selection of principal components, and calculation of the comprehensive evaluation value. PCA does not often appear as a separate method, as it is often combined with other statistical methods [ 95 ]. In practical clinical studies, the existence of multicollinearity often biases the results of multivariate analysis. A feasible solution is to construct a regression model by PCA, which replaces the original independent variables with each principal component as a new independent variable for regression analysis, with this most commonly seen in the analysis of dietary patterns in nutritional epidemiology [ 96 ]. In a study of socioeconomic status and child-developmental delays, PCA was used to derive a new variable (the household wealth index) from a series of household property reports and incorporate this new variable as the main analytical variable into the logistic regression model [ 97 ]. Additionally, PCA can be combined with cluster analysis. Burgel et al. [ 98 ] used PCA to transform clinical data to address the lack of independence between existing variables used to explore the heterogeneity of different subtypes of chronic obstructive pulmonary disease. Therefore, in the study of subtypes and heterogeneity of clinical diseases, PCA can eliminate noisy variables that can potentially corrupt the cluster structure, thereby increasing the accuracy of the results of clustering analysis [ 98 , 99 ].
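
The first steps listed above (standardization, correlation matrix, and extraction of the leading eigenvalue and eigenvector) can be sketched in plain Python. Power iteration is used here only because it needs no external library; it is an assumption of this sketch and recovers just the first principal component. The dietary variables are hypothetical.

```python
import math

def standardise(columns):
    """Centre each variable and scale it to unit sample variance."""
    result = []
    for col in columns:
        mean = sum(col) / len(col)
        sd = math.sqrt(sum((x - mean) ** 2 for x in col) / (len(col) - 1))
        result.append([(x - mean) / sd for x in col])
    return result

def correlation_matrix(z):
    """Correlation matrix of already standardized variables."""
    n = len(z[0])
    return [[sum(a * b for a, b in zip(zi, zj)) / (n - 1) for zj in z] for zi in z]

def first_component(matrix, n_iter=500):
    """Power iteration: returns (largest eigenvalue, unit-length eigenvector)."""
    v = [1.0] + [0.0] * (len(matrix) - 1)
    for _ in range(n_iter):
        w = [sum(m * x for m, x in zip(row, v)) for row in matrix]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(vi * sum(m * x for m, x in zip(row, v)) for vi, row in zip(v, matrix))
    return eigenvalue, v

# Hypothetical daily intakes for five subjects: fibre (g), potassium (g), added sugar (g).
fibre = [10, 20, 30, 40, 50]
potassium = [1.1, 1.9, 3.2, 3.8, 5.0]
sugar = [90, 85, 60, 55, 30]
corr = correlation_matrix(standardise([fibre, potassium, sugar]))
eigenvalue, loadings = first_component(corr)
explained = eigenvalue / len(corr)      # share of total variance captured by the component
```

The signs of the loadings show which variables move together along the component; in this toy example fibre and potassium load in one direction and sugar in the other.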

The data-mining process and examples of its application using common public databases

Open-access databases have the advantages of large volumes of data, wide data coverage, rich data information, and a cost-efficient method of research, making them beneficial to medical researchers. In this chapter, we introduced the data-mining process and methods and their application in research based on examples of utilizing public databases and data-mining algorithms.

The data-mining process

Figure  1 shows a series of research concepts. The data-mining process is divided into several steps: (1) database selection according to the research purpose; (2) data extraction and integration, including downloading the required data and combining data from multiple sources; (3) data cleaning and transformation, including removal of incorrect data, filling in missing data, generating new variables, converting data format, and ensuring data consistency; (4) data mining, involving extraction of implicit relational patterns through traditional statistics or ML; (5) pattern evaluation, which focuses on the validity parameters and values of the relationship patterns of the extracted data; and (6) assessment of the results, involving translation of the extracted data-relationship model into comprehensible knowledge made available to the public.

Figure 1. The steps of data mining in public medical databases

Examples of data-mining applied using public databases

Establishment of warning models for the early prediction of disease.

A previous study identified sepsis as a major cause of death in ICU patients [ 100 ]. The authors noted that the predictive model developed previously used a limited number of variables, and that model performance required improvement. The data-mining process applied to address these issues was as follows: (1) data selection using the MIMIC III database; (2) extraction and integration of three types of data, including multivariate features (demographic information and clinical biochemical indicators), time series data (temperature, blood pressure, and heart rate), and clinical latent features (various scores related to disease); (3) data cleaning and transformation, including fixing irregular time series measurements, estimating missing values, deleting outliers, and addressing data imbalance; (4) data mining through the use of logistic regression, a decision tree, the RF algorithm, an SVM, and an ensemble algorithm (a combination of multiple classifiers) to establish the prediction model; (5) pattern evaluation using sensitivity, precision, and the area under the receiver operating characteristic curve to evaluate model performance; and (6) evaluation of the results, in this case the potential to predict the prognosis of patients with sepsis and whether the model outperformed current scoring systems.
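
The evaluation measures named in step (5) can be computed directly from predictions, as in the sketch below. The area under the receiver operating characteristic curve is calculated here through its rank interpretation (the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case). The labels, scores, and the 0.5 decision threshold are hypothetical and are not taken from the cited study.

```python
def sensitivity_precision(y_true, y_pred):
    """Sensitivity = TP / (TP + FN); precision = TP / (TP + FP)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fn), tp / (tp + fp)

def auc(y_true, scores):
    """Share of positive/negative pairs ranked correctly (ties count as one half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical outcomes (1 = died in hospital) and predicted risks for eight patients.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.35, 0.4, 0.3, 0.2, 0.1, 0.7]
y_pred = [1 if s >= 0.5 else 0 for s in scores]   # assumed decision threshold of 0.5
sens, prec = sensitivity_precision(y_true, y_pred)
area = auc(y_true, scores)
```

Sensitivity and precision depend on the chosen threshold, whereas the area under the curve summarizes discrimination across all thresholds.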

Exploring prognostic risk factors in cancer patients

Wu et al. [ 101 ] noted that traditional survival-analysis methods often ignored the influence of competitive risk events, such as suicide and car accidents, on outcomes, leading to deviations and misjudgements in estimating the effect of risk factors. They used the SEER database, which offers cause-of-death data for cancer patients, and a competitive risk model to address this problem according to the following process: (1) data were obtained from the SEER database; (2) demography, clinical characteristics, treatment modality, and cause of death of cecum cancer patients were extracted from the database; (3) patient records lacking demographic, clinical, therapeutic, or cause-of-death variables were excluded; (4) Cox regression and two kinds of competitive risk models were applied for survival analysis; (5) the results were compared between the three models; and (6) the results revealed that for survival data with multiple endpoints, the competitive risk model was more favourable.

Derivation of dietary patterns

A study by Martínez Steele et al. [ 102 ] applied PCA for nutritional epidemiological analysis to determine dietary patterns and evaluate the overall nutritional quality of the population based on those patterns. Their process involved the following: (1) data were extracted from the NHANES database covering the years 2009–2010; (2) demographic characteristics and two 24 h dietary recall interviews were obtained; (3) data were weighted and excluded based on subjects not meeting specific criteria; (4) PCA was used to determine dietary patterns in the United States population, and Gaussian regression and restricted cubic splines were used to assess associations between ultra-processed foods and nutritional balance; (5) eigenvalues, scree plots, and the interpretability of the principal components were reviewed to screen and evaluate the results; and (6) the results revealed a negative association between ultra-processed food intake and overall dietary quality. Their findings indicated that a nutritionally balanced eating pattern was characterized by a diet high in fibre, potassium, magnesium, and vitamin C intake along with low sugar and saturated fat consumption.

The use of “big data” has changed multiple aspects of modern life, with its use combined with data-mining methods capable of improving the status quo [ 86 ]. The aim of this study was to aid clinical researchers in understanding the application of data-mining technology on clinical big data and public medical databases to further their research goals in order to benefit clinicians and patients. The examples provided offer insight into the data-mining process applied for the purposes of clinical research. Notably, researchers have raised concerns that big data and data-mining methods were not a perfect fit for adequately replicating actual clinical conditions, with the results potentially capable of misleading doctors and patients [ 86 ]. Therefore, given the rate at which new technologies and trends progress, it is necessary to maintain a positive attitude concerning their potential impact while remaining cautious in examining the results provided by their application.

In the future, the healthcare system will need to utilize increasingly larger volumes of big data with higher dimensionality. The tasks and objectives of data analysis will also have higher demands, including higher degrees of visualization, results with increased accuracy, and stronger real-time performance. As a result, the methods used to mine and process big data will continue to improve. Furthermore, to increase the formality and standardization of data-mining methods, it is possible that a new programming language specifically for this purpose will need to be developed, as well as novel methods capable of addressing unstructured data, such as graphics, audio, and text represented by handwriting. In terms of application, the development of data-management and disease-screening systems for large-scale populations, such as the military, will help determine the best interventions and formulation of auxiliary standards capable of benefitting both cost-efficiency and personnel. Data-mining technology can also be applied to hospital management in order to improve patient satisfaction, detect medical-insurance fraud and abuse, and reduce costs and losses while improving management efficiency. Currently, this technology is being applied for predicting patient disease, with further improvements resulting in the increased accuracy and speed of these predictions. Moreover, it is worth noting that technological development will concomitantly require higher quality data, which will be a prerequisite for accurate application of the technology.

Finally, the ultimate goal of this study was to explain the methods associated with data mining and commonly used to process clinical big data. This review will potentially promote further study and aid doctors and patients.


Abbreviations

BioLINCC: Biologic Specimen and Data Repositories Information Coordinating Center
CHARLS: China Health and Retirement Longitudinal Study
CHNS: China Health and Nutrition Survey
CKB: China Kadoorie Biobank
CS: Cause-specific risk
CTD: Comparative Toxicogenomics Database
eICU-CRD: eICU Collaborative Research Database
FP: Frequent pattern
GBD: Global burden of disease
GEO: Gene expression omnibus
HRS: Health and Retirement Study
ICGC: International Cancer Genome Consortium
MIMIC: Medical Information Mart for Intensive Care
NHANES: National Health and Nutrition Examination Survey
PCA: Principal component analysis
PIC: Paediatric intensive care
RF: Random forest
SEER: Surveillance, epidemiology, and end results
SVM: Support vector machine
TCGA: The Cancer Genome Atlas

Herland M, Khoshgoftaar TM, Wald R. A review of data mining using big data in health informatics. J Big Data. 2014;1(1):1–35.

Article   Google Scholar  

Wang F, Zhang P, Wang X, Hu J. Clinical risk prediction by exploring high-order feature correlations. AMIA Annu Symp Proc. 2014;2014:1170–9.

PubMed   PubMed Central   Google Scholar  

Xu R, Li L, Wang Q. dRiskKB: a large-scale disease-disease risk relationship knowledge base constructed from biomedical text. BMC Bioinform. 2014;15:105. .

Article   CAS   Google Scholar  

Ramachandran S, Erraguntla M, Mayer R, Benjamin P, Editors. Data mining in military health systems-clinical and administrative applications. In: 2007 IEEE international conference on automation science and engineering; 2007. .

Vie LL, Scheier LM, Lester PB, Ho TE, Labarthe DR, Seligman MEP. The US army person-event data environment: a military-civilian big data enterprise. Big Data. 2015;3(2):67–79. .

Article   PubMed   Google Scholar  

Mohan A, Blough DM, Kurc T, Post A, Saltz J. Detection of conflicts and inconsistencies in taxonomy-based authorization policies. IEEE Int Conf Bioinform Biomed. 2012;2011:590–4. .

Luo J, Wu M, Gopukumar D, Zhao Y. Big data application in biomedical research and health care: a literature review. Biomed Inform Insights. 2016;8:1–10. .

Article   CAS   PubMed   PubMed Central   Google Scholar  

Bellazzi R, Zupan B. Predictive data mining in clinical medicine: current issues and guidelines. Int J Med Inform. 2008;77(2):81–97.

Sahu H, Shrma S, Gondhalakar S. A brief overview on data mining survey. Int J Comput Technol Electron Eng. 2011;1(3):114–21.

Google Scholar  

Obermeyer Z, Emanuel EJ. Predicting the future - big data, machine learning, and clinical medicine. N Engl J Med. 2016;375(13):1216–9.

Article   PubMed   PubMed Central   Google Scholar  

Doll KM, Rademaker A, Sosa JA. Practical guide to surgical data sets: surveillance, epidemiology, and end results (SEER) database. JAMA Surg. 2018;153(6):588–9.

Johnson AE, Pollard TJ, Shen L, Lehman LW, Feng M, Ghassemi M, et al. MIMIC-III, a freely accessible critical care database. Sci Data. 2016;3: 160035. .

Ahluwalia N, Dwyer J, Terry A, Moshfegh A, Johnson C. Update on NHANES dietary data: focus on collection, release, analytical considerations, and uses to inform public policy. Adv Nutr. 2016;7(1):121–34.

Vos T, Lim SS, Abbafati C, Abbas KM, Abbasi M, Abbasifard M, et al. Global burden of 369 diseases and injuries in 204 countries and territories, 1990–2019: a systematic analysis for the Global Burden of Disease Study 2019. Lancet. 2020;396(10258):1204–22. .

Palmer LJ. UK Biobank: Bank on it. Lancet. 2007;369(9578):1980–2. .

Cancer Genome Atlas Research Network, Weinstein JN, Collisson EA, Mills GB, Shaw KR, Ozenberger BA, et al. The cancer genome atlas pan-cancer analysis project. Nat Genet. 2013;45(10):1113–20. .

Davis S, Meltzer PS. GEOquery: a bridge between the Gene Expression Omnibus (GEO) and BioConductor. Bioinformatics. 2007;23(14):1846–7.

Article   PubMed   CAS   Google Scholar  

Zhang J, Bajari R, Andric D, Gerthoffert F, Lepsa A, Nahal-Bose H, et al. The international cancer genome consortium data portal. Nat Biotechnol. 2019;37(4):367–9.

Article   CAS   PubMed   Google Scholar  

Chen Z, Chen J, Collins R, Guo Y, Peto R, Wu F, et al. China Kadoorie Biobank of 0.5 million people: survey methods, baseline characteristics and long-term follow-up. Int J Epidemiol. 2011;40(6):1652–66.

Davis AP, Grondin CJ, Johnson RJ, Sciaky D, McMorran R, Wiegers J, et al. The comparative toxicogenomics database: update 2019. Nucleic Acids Res. 2019;47(D1):D948–54. .

Zeng X, Yu G, Lu Y, Tan L, Wu X, Shi S, et al. PIC, a paediatric-specific intensive care database. Sci Data. 2020;7(1):14.

Giffen CA, Carroll LE, Adams JT, Brennan SP, Coady SA, Wagner EL. Providing contemporary access to historical biospecimen collections: development of the NHLBI Biologic Specimen and Data Repository Information Coordinating Center (BioLINCC). Biopreserv Biobank. 2015;13(4):271–9.

Zhang B, Zhai FY, Du SF, Popkin BM. The China Health and Nutrition Survey, 1989–2011. Obes Rev. 2014;15(Suppl 1):2–7. .

Zhao Y, Hu Y, Smith JP, Strauss J, Yang G. Cohort profile: the China Health and Retirement Longitudinal Study (CHARLS). Int J Epidemiol. 2014;43(1):61–8.

Pollard TJ, Johnson AEW, Raffa JD, Celi LA, Mark RG, Badawi O. The eICU collaborative research database, a freely available multi-centre database for critical care research. Sci Data. 2018;5:180178. .

Fisher GG, Ryan LH. Overview of the health and retirement study and introduction to the special issue. Work Aging Retire. 2018;4(1):1–9.

Iavindrasana J, Cohen G, Depeursinge A, Müller H, Meyer R, Geissbuhler A. Clinical data mining: a review. Yearb Med Inform. 2009:121–33.

Zhang Y, Guo SL, Han LN, Li TL. Application and exploration of big data mining in clinical medicine. Chin Med J. 2016;129(6):731–8. .

Ngiam KY, Khor IW. Big data and machine learning algorithms for health-care delivery. Lancet Oncol. 2019;20(5):e262–73.

Huang C, Murugiah K, Mahajan S, Li S-X, Dhruva SS, Haimovich JS, et al. Enhancing the prediction of acute kidney injury risk after percutaneous coronary intervention using machine learning techniques: a retrospective cohort study. PLoS Med. 2018;15(11):e1002703.

Rahimian F, Salimi-Khorshidi G, Payberah AH, Tran J, Ayala Solares R, Raimondi F, et al. Predicting the risk of emergency admission with machine learning: development and validation using linked electronic health records. PLoS Med. 2018;15(11):e1002695.

Kantardzic M. Data Mining: concepts, models, methods, and algorithms. Technometrics. 2003;45(3):277.

Jothi N, Husain W. Data mining in healthcare—a review. Procedia Comput Sci. 2015;72:306–13.

Piatetsky-Shapiro G, Tamayo P. Microarray data mining: facing the challenges. SIGKDD. 2003;5(2):1–5. .

Ripley BD. Pattern recognition and neural networks. Cambridge: Cambridge University Press; 1996.

Book   Google Scholar  

Arlot S, Celisse A. A survey of cross-validation procedures for model selection. Stat Surv. 2010;4:40–79. .

Shouval R, Bondi O, Mishan H, Shimoni A, Unger R, Nagler A. Application of machine learning algorithms for clinical predictive modelling: a data-mining approach in SCT. Bone Marrow Transp. 2014;49(3):332–7.

Momenyan S, Baghestani AR, Momenyan N, Naseri P, Akbari ME. Survival prediction of patients with breast cancer: comparisons of decision tree and logistic regression analysis. Int J Cancer Manag. 2018;11(7):e9176.

Topaloğlu M, Malkoç G. Decision tree application for renal calculi diagnosis. Int J Appl Math Electron Comput. 2016.

Li H, Wu TT, Yang DL, Guo YS, Liu PC, Chen Y, et al. Decision tree model for predicting in-hospital cardiac arrest among patients admitted with acute coronary syndrome. Clin Cardiol. 2019;42(11):1087–93.

Ramezankhani A, Hadavandi E, Pournik O, Shahrabi J, Azizi F, Hadaegh F. Decision tree-based modelling for identification of potential interactions between type 2 diabetes risk factors: a decade follow-up in a Middle East prospective cohort study. BMJ Open. 2016;6(12):e013336.

Carmona-Bayonas A, Jiménez-Fonseca P, Font C, Fenoy F, Otero R, Beato C, et al. Predicting serious complications in patients with cancer and pulmonary embolism using decision tree modelling: the EPIPHANY Index. Br J Cancer. 2017;116(8):994–1001.

Efron B. Bootstrap methods: another look at the jackknife. In: Kotz S, Johnson NL, editors. Breakthroughs in statistics. New York: Springer; 1992. p. 569–93.

Chapter   Google Scholar  

Breima L. Random forests. Mach Learn. 2010;1(45):5–32. .

Franklin J. The elements of statistical learning: data mining, inference and prediction. Math Intell. 2005;27(2):83–5.

Taylor RA, Pare JR, Venkatesh AK, Mowafi H, Melnick ER, Fleischman W, et al. Prediction of in-hospital mortality in emergency department patients with sepsis: a local big data-driven, machine learning approach. Acad Emerg Med. 2016;23(3):269–78.

Lee J, Scott DJ, Villarroel M, Clifford GD, Saeed M, Mark RG. Open-access MIMIC-II database for intensive care research. Annu Int Conf IEEE Eng Med Biol Soc. 2011:8315–8. .

Lee J. Patient-specific predictive modelling using random forests: an observational study for the critically Ill. JMIR Med Inform. 2017;5(1):e3.

Wongvibulsin S, Wu KC, Zeger SL. Clinical risk prediction with random forests for survival, longitudinal, and multivariate (RF-SLAM) data analysis. BMC Med Res Methodol. 2019;20(1):1.

Taylor JMG. Random survival forests. J Thorac Oncol. 2011;6(12):1974–5.

Hu C, Steingrimsson JA. Personalized risk prediction in clinical oncology research: applications and practical issues using survival trees and random forests. J Biopharm Stat. 2018;28(2):333–49.

Dietrich R, Opper M, Sompolinsky H. Statistical mechanics of support vector networks. Phys Rev Lett. 1999;82(14):2975.

Verplancke T, Van Looy S, Benoit D, Vansteelandt S, Depuydt P, De Turck F, et al. Support vector machine versus logistic regression modelling for prediction of hospital mortality in critically ill patients with haematological malignancies. BMC Med Inform Decis Mak. 2008;8:56. .

Yu W, Liu T, Valdez R, Gwinn M, Khoury MJ. Application of support vector machine modelling for prediction of common diseases: the case of diabetes and pre-diabetes. BMC Med Inform Decis Mak. 2010;10:16. .

Son YJ, Kim HG, Kim EH, Choi S, Lee SK. Application of support vector machine for prediction of medication adherence in heart failure patients. Healthc Inform Res. 2010;16(4):253–9.

Schadt EE, Friend SH, Shaywitz DA. A network view of disease and compound screening. Nat Rev Drug Discov. 2009;8(4):286–95.

Austin PC, Lee DS, Fine JP. Introduction to the analysis of survival data in the presence of competing risks. Circulation. 2016;133(6):601–9.

Putter H, Fiocco M, Geskus RB. Tutorial in biostatistics: competing risks and multi-state models. Stat Med. 2007;26(11):2389–430. .

Klein JP. Competing risks. WIREs Comp Stat. 2010;2(3):333–9. .

Haller B, Schmidt G, Ulm K. Applying competing risks regression models: an overview. Lifetime Data Anal. 2013;19(1):33–58. .

Fine JP, Gray RJ. A proportional hazards model for the subdistribution of a competing risk. J Am Stat Assoc. 1999;94(446):496–509.

Koller MT, Raatz H, Steyerberg EW, Wolbers M. Competing risks and the clinical community: irrelevance or ignorance? Stat Med. 2012;31(11–12):1089–97.

Lau B, Cole SR, Gange SJ. Competing risk regression models for epidemiologic data. Am J Epidemiol. 2009;170(2):244–56.

Yang J, Li Y, Liu Q, Li L, Feng A, Wang T, et al. Brief introduction of medical database and data mining technology in big data era. J Evid Based Med. 2020;13(1):57–69.

Yu Z, Yang J, Gao L, Huang Q, Zi H, Li X. A competing risk analysis study of prognosis in patients with esophageal carcinoma 2006–2015 using data from the surveillance, epidemiology, and end results (SEER) database. Med Sci Monit. 2020;26:e918686.

Yang J, Pan Z, He Y, Zhao F, Feng X, Liu Q, et al. Competing-risks model for predicting the prognosis of penile cancer based on the SEER database. Cancer Med. 2019;8(18):7881–9.

Miotto R, Wang F, Wang S, Jiang X, Dudley JT. Deep learning for healthcare: review, opportunities and challenges. Brief Bioinform. 2018;19(6):1236–46.

Alashwal H, El Halaby M, Crouse JJ, Abdalla A, Moustafa AA. The application of unsupervised clustering methods to Alzheimer’s disease. Front Comput Neurosci. 2019;13:31.

Macqueen J. Some methods for classification and analysis of multivariate observations. In: Proceedings of the fifth Berkeley symposium on mathematical statistics and probability, Oakland, CA: University of California Press;1967.

Forgy EW. Cluster analysis of multivariate data: efficiency versus interpretability of classifications. Biometrics. 1965;21:768–9.

Johnson SC. Hierarchical clustering schemes. Psychometrika. 1967;32(3):241–54.

Zhang T, Ramakrishnan R, Livny M. BIRCH: an efficient data clustering method for very large databases. ACM SIGMOD Rec. 1996;25(2):103–14.

Guha S, Rastogi R, Shim K. CURE: an efficient clustering algorithm for large databases. ACM SIGMOD Rec. 1998;27(2):73–84.

Guha S, Rastogi R, Shim K. ROCK: a robust clustering algorithm for categorical attributes. Inf Syst. 2000;25(5):345–66.

Xu D, Tian Y. A comprehensive survey of clustering algorithms. Ann Data Sci. 2015;2(2):165–93.

Kriegel HP, Kröger P, Sander J, Zimek A. Density-based clustering. WIRES Data Min Knowl. 2011;1(3):231–40. .

Ester M, Kriegel HP, Sander J, Xu X, editors. A density-based algorithm for discovering clusters in large spatial databases with noise. In: Proceedings of 2nd international conference on knowledge discovery and data mining Portland, Oregon: AAAI Press; 1996. p. 226–31.

Wang W, Yang J, Muntz RR. STING: a statistical information grid approach to spatial data mining. In: Proceedings of the 23rd international conference on very large data bases, Morgan Kaufmann Publishers Inc.; 1997. p. 186–95.

Iwashyna TJ, Burke JF, Sussman JB, Prescott HC, Hayward RA, Angus DC. Implications of heterogeneity of treatment effect for reporting and analysis of randomized trials in critical care. Am J Respir Crit Care Med. 2015;192(9):1045–51.

Ruan S, Lin H, Huang C, Kuo P, Wu H, Yu C. Exploring the heterogeneity of effects of corticosteroids on acute respiratory distress syndrome: a systematic review and meta-analysis. Crit Care. 2014;18(2):R63.

Docampo E, Collado A, Escaramís G, Carbonell J, Rivera J, Vidal J, et al. Cluster analysis of clinical data identifies fibromyalgia subgroups. PLoS ONE. 2013;8(9):e74873.

Sutherland ER, Goleva E, King TS, Lehman E, Stevens AD, Jackson LP, et al. Cluster analysis of obesity and asthma phenotypes. PLoS ONE. 2012;7(5):e36631.

Guo Q, Lu X, Gao Y, Zhang J, Yan B, Su D, et al. Cluster analysis: a new approach for identification of underlying risk factors for coronary artery disease in essential hypertensive patients. Sci Rep. 2017;7:43965.

Hastings S, Oster S, Langella S, Kurc TM, Pan T, Catalyurek UV, et al. A grid-based image archival and analysis system. J Am Med Inform Assoc. 2005;12(3):286–95.

Celebi ME, Aslandogan YA, Bergstresser PR. Mining biomedical images with density-based clustering. In: International conference on information technology: coding and computing (ITCC’05), vol II. Washington, DC, USA: IEEE; 2005. .

Agrawal R, Imieliński T, Swami A, editors. Mining association rules between sets of items in large databases. In: Proceedings of the ACM SIGMOD conference on management of data. Washington, DC, USA: Association for Computing Machinery; 1993. p. 207–16. .

Sethi A, Mahajan P. Association rule mining: A review. TIJCSA. 2012;1(9):72–83.

Kotsiantis S, Kanellopoulos D. Association rules mining: a recent overview. GESTS Int Trans Comput Sci Eng. 2006;32(1):71–82.

Narvekar M, Syed SF. An optimized algorithm for association rule mining using FP tree. Procedia Computer Sci. 2015;45:101–10.

Verhein F. Frequent pattern growth (FP-growth) algorithm. Sydney: The University of Sydney; 2008. p. 1–16.

Li Q, Zhang Y, Kang H, Xin Y, Shi C. Mining association rules between stroke risk factors based on the Apriori algorithm. Technol Health Care. 2017;25(S1):197–205.

Guo A, Zhang W, Xu S. Exploring the treatment effect in diabetes patients using association rule mining. Int J Inf Pro Manage. 2016;7(3):1–9.

Pearson K. On lines and planes of closest fit to systems of points in space. Lond Edinb Dublin Philos Mag J Sci. 1901;2(11):559–72.

Hotelling H. Analysis of a complex of statistical variables into principal components. J Educ Psychol. 1933;24(6):417.

Jolliffe IT, Cadima J. Principal component analysis: a review and recent developments. Philos Trans A Math Phys Eng Sci. 2016;374(2065):20150202.

Zhang Z, Castelló A. Principal components analysis in clinical studies. Ann Transl Med. 2017;5(17):351.

Apio BRS, Mawa R, Lawoko S, Sharma KN. Socio-economic inequality in stunting among children aged 6–59 months in a Ugandan population based cross-sectional study. Am J Pediatri. 2019;5(3):125–32.

Burgel PR, Paillasseur JL, Caillaud D, Tillie-Leblond I, Chanez P, Escamilla R, et al. Clinical COPD phenotypes: a novel approach using principal component and cluster analyses. Eur Respir J. 2010;36(3):531–9.

Vogt W, Nagel D. Cluster analysis in diagnosis. Clin Chem. 1992;38(2):182–98.

Layeghian Javan S, Sepehri MM, Layeghian Javan M, Khatibi T. An intelligent warning model for early prediction of cardiac arrest in sepsis patients. Comput Methods Programs Biomed. 2019;178:47–58. .

Wu W, Yang J, Li D, Huang Q, Zhao F, Feng X, et al. Competitive risk analysis of prognosis in patients with cecum cancer: a population-based study. Cancer Control. 2021;28:1073274821989316. .

Martínez Steele E, Popkin BM, Swinburn B, Monteiro CA. The share of ultra-processed foods and the overall nutritional quality of diets in the US: evidence from a nationally representative cross-sectional study. Popul Health Metr. 2017;15(1):6.

Download references

This study was supported by the National Social Science Foundation of China (No. 16BGL183).

Author information

Wen-Tao Wu and Yuan-Jie Li contributed equally to this work.

Authors and Affiliations

Department of Clinical Research, The First Affiliated Hospital of Jinan University, Tianhe District, 613 W. Huangpu Avenue, Guangzhou, 510632, Guangdong, China

Wen-Tao Wu, Ao-Zi Feng, Li Li, Tao Huang & Jun Lyu

School of Public Health, Xi’an Jiaotong University Health Science Center, Xi’an, 710061, Shaanxi, China

Department of Human Anatomy, Histology and Embryology, School of Basic Medical Sciences, Xi’an Jiaotong University Health Science Center, Xi’an, 710061, Shaanxi, China

Yuan-Jie Li

Department of Neurology, The First Affiliated Hospital of Jinan University, Tianhe District, 613 W. Huangpu Avenue, Guangzhou, 510632, Guangdong, China


WTW, YJL and JL designed the review. JL, AZF, TH, LL and ADX reviewed and criticized the original paper. All authors read and approved the final manuscript.

Corresponding authors

Correspondence to An-Ding Xu or Jun Lyu.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Competing interests

The authors declare that they have no competing interests.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit . The Creative Commons Public Domain Dedication waiver ( ) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

About this article

Cite this article

Wu, WT., Li, YJ., Feng, AZ. et al. Data mining in clinical big data: the frequently used databases, steps, and methodological models. Military Med Res 8, 44 (2021).

Received: 24 January 2020

Accepted: 03 August 2021

Published: 11 August 2021


Military Medical Research

ISSN: 2054-9369


Procedia CIRP

Data mining definitions and applications for the management of production complexity.

Production complexity has increased considerably in recent years due to increasing customer requirements for individual products. At the same time, continuous digitization has led to the recording of extensive, granular production data. Research claims that using production data in data mining methods can lead to managing production complexity effectively. However, manufacturing companies widely do not use such data mining methods. In order to support manufacturing companies in utilizing data mining, this paper presents both a literature review on definitions of data mining, artificial intelligence and machine learning as well as a categorization of existing approaches of applying data mining to manage production complexity.


  1. Data Mining Tutorial

  2. Data Mining Research Papers

  3. Data mining research paper. What are some good research topics in data mining? 2019-03-04

  4. Practical Applications of Data Mining by Sang C. Suh

  5. Data Mining: Principles, Applications and Emerging Challenges

  6. (PDF) Web Data Mining research: A survey


  1. Data Mining (Spring 2020)

  2. Challenges and Opportunities for Educational Data Mining ! Research Paper review

  3. Lecture 06 Data Preparation in Data Mining

  4. Data mining and warehousing questions paper #shorts #questionpaper

  5. Course Contents

  6. Wow lucky day!


  1. How Do Researchers Collect Data?

    There are various ways for researchers to collect data. It is important that this data come from credible sources, as the validity of the research is determined by where it comes from. Keep reading to learn how researchers go about collecti...

  2. How Do You Make an Acknowledgment in a Research Paper?

    To make an acknowledgement in a research paper, a writer should express thanks by using the full or professional names of the people being thanked and should specify exactly how the people being acknowledged helped.

  3. How to Write a Research Paper

    Writing a research paper is a bit more difficult than a standard high school essay. You need to cite sources, use academic data and show scientific examples. Before beginning, you’ll need guidelines for how to write a research paper.

  4. (PDF) Data mining techniques and applications

    Data mining has proved to be one of the important tools for identifying useful information in very large databases in almost all industries.

  5. (PDF) A Study On Applications Of Data Mining

    This paper gives a gist of how data mining is utilized in different fields.

  6. Research and Application of the Data Mining Technology in

    The application of data mining algorithms can be used to study the application of economic intelligence systems. This paper develops and


  7. Data Mining and Its Applications for Knowledge Management: A Literature Review from 2007 to 2012

  8. Literature review of data mining applications in academic libraries

    Whereas journal articles currently represent the highest level of research, other formats, like books, are confined to gathering and spreading knowledge that is.

  9. Data mining in clinical big data: the frequently used databases

    To allow a clearer understanding of the application of data-mining technology on clinical big data, the second part of this paper introduced

  10. Data Mining in Product Service Systems Design

    Based on the analysis the paper proposes a set of research questions for each

  11. The Survey of Data Mining Applications And Feature Scope

    In this paper we have focused on a variety of techniques, approaches and different areas of the research which are helpful and marked as the important field of

  12. Data Mining Definitions and Applications for the Management of

    this paper presents both a literature review on definitions of data mining

  13. Research and application of data mining algorithm

    In this paper, we mainly study the basic principles and algorithm knowledge of data mining, and apply ridge regression and random forest algorithm models in.

  14. data mining: concepts, background and methods of

    In this paper, based on a broad view ... DATA MINING APPLICATIONS [2].