Data mining, or knowledge discovery in databases, is the nontrivial extraction of implicit, previously unknown, and potentially useful information from data. It relies on statistical methods that identify trends and other relationships in large databases.
Data mining has attracted attention mainly because vast amounts of data are now widely available and need to be turned into useful information and knowledge. The knowledge gained can be used in applications ranging from risk monitoring, business management, production control, and market analysis to engineering and scientific exploration.
In general, three types of data mining techniques are used: association, regression, and classification.
Association analysis is the discovery of association rules showing attribute-value conditions that occur frequently together in a given set of data. Association analysis is widely used to identify the correlation of individual products within shopping carts.
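As a rough illustration of the shopping-cart case, the sketch below computes two measures commonly used to rank association rules: support (how often the items occur together) and confidence (how often the consequent appears when the antecedent does). The carts, item names, thresholds, and function names are all hypothetical, and the search is limited to pairs of items; it is not a full rule-mining algorithm such as Apriori.

```python
from collections import Counter
from itertools import combinations

# Hypothetical shopping carts; item names are invented for illustration.
carts = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
    {"bread", "milk", "butter"},
]

def count(itemset, transactions):
    """Number of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t)

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    return count(itemset, transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Share of transactions with the antecedent that also hold the consequent."""
    both = count(set(antecedent) | set(consequent), transactions)
    return both / count(antecedent, transactions)

def pair_rules(transactions, min_support=0.4, min_confidence=0.7):
    """Return (lhs, rhs, support, confidence) for item pairs passing both thresholds."""
    pair_counts = Counter()
    for t in transactions:
        pair_counts.update(combinations(sorted(t), 2))
    rules = []
    for (a, b), n in pair_counts.items():
        sup = n / len(transactions)
        if sup < min_support:
            continue  # the pair itself is too rare to be interesting
        for lhs, rhs in ((a, b), (b, a)):
            conf = confidence({lhs}, {rhs}, transactions)
            if conf >= min_confidence:
                rules.append((lhs, rhs, sup, conf))
    return sorted(rules)

rules = pair_rules(carts)
for lhs, rhs, sup, conf in rules:
    print(f"{lhs} -> {rhs}  support={sup:.2f}  confidence={conf:.2f}")
```

A rule such as "butter -> bread" passes here because every hypothetical cart containing butter also contains bread; in practice the thresholds would be tuned to the data set.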
Regression analysis creates models that explain dependent variables in terms of independent variables. As an example, a product’s sales performance can be predicted from the product price and the average customer income level.
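The sales example can be sketched as an ordinary least-squares fit with two independent variables. The numbers below are synthetic: they were generated from an assumed linear relation (sales = 500 - 20 * price + 10 * income, with income in thousands) so that the recovered coefficients can be checked. The function names are hypothetical, and a real analysis would normally use a statistics library rather than this hand-written solver.

```python
def fit_linear(rows, targets):
    """Ordinary least squares via the normal equations (X^T X) b = X^T y."""
    X = [[1.0, *row] for row in rows]  # prepend a column of ones for the intercept
    k = len(X[0])
    # Augmented matrix [X^T X | X^T y].
    A = [[sum(x[i] * x[j] for x in X) for j in range(k)]
         + [sum(x[i] * y for x, y in zip(X, targets))]
         for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    # Assumes the predictors are not perfectly collinear (non-singular system).
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]

def predict(coeffs, row):
    """Evaluate intercept + sum(coefficient * value) for one observation."""
    return coeffs[0] + sum(c * v for c, v in zip(coeffs[1:], row))

# Synthetic observations: (price, average customer income in thousands).
observations = [(10, 30), (12, 35), (8, 28), (15, 50), (11, 42), (9, 31)]
sales = [600, 610, 620, 700, 700, 630]

coeffs = fit_linear(observations, sales)
intercept, b_price, b_income = coeffs
print(f"sales = {intercept:.2f} + {b_price:.2f} * price + {b_income:.2f} * income")
print(f"predicted sales at price 10, income 40: {predict(coeffs, (10, 40)):.2f}")
```

Because the synthetic data are exactly linear, the fit recovers the generating coefficients up to rounding error; real sales data would leave residuals that need to be examined.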
Classification and prediction
Classification is the process of designing a set of models to predict the class of objects whose class label is unknown. The derived model may be represented in various forms, such as if-then rules, decision trees, or mathematical formulas.
A decision tree is a flow-chart-like tree structure where each node denotes a test on an attribute value, each branch represents an outcome of the test, and each tree leaf represents a class or class distribution. Decision trees can be converted to classification rules.
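The structure just described, and its conversion to rules, can be sketched with a small hand-built tree. The tree, its attributes ("amount", "foreign"), and the class labels are hypothetical and were not learned from data; the point is only that each root-to-leaf path corresponds to one if-then rule.

```python
# A node is either a leaf (a class label string) or a dict that tests one attribute.
tree = {
    "attribute": "amount",
    "branches": {
        "low": "normal",
        "high": {
            "attribute": "foreign",
            "branches": {"yes": "suspicious", "no": "normal"},
        },
    },
}

def classify(node, record):
    """Follow the branch matching the record's attribute value until a leaf is reached."""
    while isinstance(node, dict):
        node = node["branches"][record[node["attribute"]]]
    return node

def to_rules(node, conditions=()):
    """Return one if-then rule per root-to-leaf path."""
    if not isinstance(node, dict):
        lhs = " AND ".join(f"{attr} = {value}" for attr, value in conditions) or "TRUE"
        return [f"IF {lhs} THEN class = {node}"]
    rules = []
    for value, child in node["branches"].items():
        rules += to_rules(child, conditions + ((node["attribute"], value),))
    return rules

label = classify(tree, {"amount": "high", "foreign": "yes"})
tree_rules = to_rules(tree)
for rule in tree_rules:
    print(rule)
```

The record with a high amount and a foreign origin follows the "high" and then the "yes" branch, which matches the second rule produced by `to_rules`.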
Classification can be used for predicting the class label of data objects. Prediction encompasses the identification of distribution trends based on the available data.
The data mining process consists of an iterative sequence of the following steps:
- Data coherence and cleaning to remove noise and inconsistent data.
- Data integration such that multiple data sources may be combined.
- Data selection where data relevant to the analysis are retrieved.
- Data transformation where data are consolidated into forms appropriate for mining.
- Data mining where pattern recognition and statistical techniques are applied to extract patterns.
- Pattern evaluation to identify interesting patterns representing knowledge.
- Knowledge presentation where visualization techniques are used to present mined knowledge to users.
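The steps above can be sketched as a chain of small functions, one per step. Everything in this example is an assumption made for illustration: the two record sources, the field names, the threshold of 100 for a "high" amount, and the choice of simple frequency counting as the mining step. In practice the sequence is iterative, and earlier steps are revisited once patterns have been evaluated.

```python
from collections import Counter

# Two hypothetical sources with noisy and overlapping records.
store_a = [
    {"id": 1, "region": "north", "amount": "120.0"},
    {"id": 2, "region": "north", "amount": ""},       # missing value
    {"id": 3, "region": "south", "amount": "80.5"},
]
store_b = [
    {"id": 3, "region": "south", "amount": "80.5"},   # duplicate of id 3
    {"id": 4, "region": "South ", "amount": "200"},   # inconsistent spelling
    {"id": 5, "region": "north", "amount": "150"},
]

def clean(records):
    """Cleaning: drop records with missing amounts, normalize region spelling."""
    return [{**r, "region": r["region"].strip().lower()}
            for r in records if r["amount"].strip()]

def integrate(*sources):
    """Integration: combine sources, keeping the first record seen per id."""
    merged = {}
    for source in sources:
        for r in source:
            merged.setdefault(r["id"], r)
    return list(merged.values())

def select(records, fields):
    """Selection: keep only the fields relevant to the analysis."""
    return [{f: r[f] for f in fields} for r in records]

def transform(records):
    """Transformation: parse amounts and bucket them into coarse levels."""
    return [{"region": r["region"],
             "level": "high" if float(r["amount"]) >= 100 else "low"}
            for r in records]

def mine(records):
    """Pattern extraction: count each (region, level) combination."""
    return Counter((r["region"], r["level"]) for r in records)

def evaluate(patterns, min_count=2):
    """Pattern evaluation: keep combinations that occur at least min_count times."""
    return {p: c for p, c in patterns.items() if c >= min_count}

def present(patterns):
    """Presentation: render surviving patterns as plain-text lines."""
    return [f"{region}/{level}: {n}" for (region, level), n in sorted(patterns.items())]

data = integrate(clean(store_a), clean(store_b))
data = transform(select(data, ["region", "amount"]))
report = present(evaluate(mine(data)))
print("\n".join(report))
```

Only the "north/high" combination occurs twice in this toy data, so it is the only pattern that survives evaluation.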
Limits of Data Mining
GIGO (garbage in, garbage out) is almost always mentioned in connection with data mining, because the quality of the knowledge gained depends on the quality of the historical data. Data inconsistencies and the need to reconcile multiple data sources are well-known problems in data management. Data cleaning techniques exist to detect and remove errors and inconsistencies and so improve data quality. However, detecting these inconsistencies is extremely difficult. How can we identify a transaction that is incorrectly labeled as suspicious? Learning from incorrect data leads to inaccurate models.
Another limitation of data mining is that it extracts knowledge only from a specific set of historical data, so answers can be obtained and interpreted only with regard to trends already present in that data. This limits one’s ability to benefit from new trends. Because a decision tree is trained specifically on the historical data set, it does not account for personalization within the tree. Additionally, data mining models (decision trees, rules, clusters) are non-incremental and do not adapt while in production.