Data Mining Technologies. Tools for Analyzing Text Information

Data Mining should be distinguished from "rough" exploratory analysis, which forms the basis of online analytical processing (OLAP): one of the main premises of Data Mining is the search for non-obvious patterns. Data Mining tools can find such patterns on their own and can also independently build hypotheses about relationships. Since formulating a hypothesis about dependencies is the most difficult task, the advantage of Data Mining over other analysis methods is obvious.

Most statistical methods for identifying relationships in data use the concept of sample averaging, which leads to operations on values that do not actually exist, whereas Data Mining operates on real values.

OLAP is more suitable for understanding historical data. Data Mining relies on historical data to answer questions about the future.

Prospects for Data Mining Technology

The potential of Data Mining opens the way to expanding the boundaries of the technology's application. The following directions of development are possible:

  • identifying types of subject areas, with corresponding heuristics whose formalization will facilitate the solution of the Data Mining problems related to those areas;
  • creating formal languages and logical tools with which reasoning can be formalized and whose automation will become a tool for solving Data Mining problems in specific subject areas;
  • creating Data Mining methods capable not only of extracting patterns from data, but also of forming theories based on empirical data;
  • closing the significant gap between the capabilities of Data Mining tools and theoretical advances in the field.

If we consider the future of Data Mining in the short term, it is obvious that the development of this technology is most directed towards business-related areas.

In the short term, Data Mining products may become as commonplace and necessary as e-mail, used, for example, to search for the lowest prices on a certain product or the cheapest tickets.

In the long term, the future of Data Mining is truly exciting - it could be the search by intelligent agents for both new treatments for various diseases and new understanding of the nature of the universe.

However, Data Mining is also fraught with potential danger - after all, an increasing amount of information is becoming available through the World Wide Web, including private information, and more and more knowledge can be extracted from it:

Not long ago, the largest online store, Amazon, found itself at the center of a scandal over a patent it had received, "Methods and systems for helping users when purchasing goods", which is nothing more than another Data Mining product designed to collect personal data about store visitors. The new technique makes it possible to predict future requests based on the facts of purchases, and to draw conclusions about their purpose. Its aim is exactly what was mentioned above: obtaining as much information as possible about clients, including private information (gender, age, preferences, etc.). In this way, data is collected about the private lives of store customers and members of their families, including children. The latter is prohibited by the legislation of many countries: collecting information about minors is possible there only with the permission of their parents.

Research notes that there are both successful solutions using Data Mining and unsuccessful experiences with the technology. The areas in which Data Mining applications are most likely to succeed are those that:

  • require knowledge-based decisions;
  • have a changing environment;
  • have accessible, sufficient and meaningful data;
  • provide high dividends from correct decisions.

Existing approaches to analysis

For quite a long time the discipline of Data Mining was not recognized as a full-fledged independent field of data analysis; it has sometimes been called the "backyard of statistics" (Pregibon, 1997).

To date, several points of view on Data Mining have emerged. Supporters of one of them consider it a mirage that distracts attention from classical data analysis.

Data Mining methods include artificial neural networks, genetic algorithms, evolutionary programming, associative memory and fuzzy logic. Statistical methods are often also counted among Data Mining methods (descriptive analysis, correlation and regression analysis, factor analysis, analysis of variance, component analysis, discriminant analysis, time series analysis). Such methods, however, presuppose some a priori ideas about the analyzed data, which is somewhat at odds with the goals of Data Mining (the discovery of previously unknown, non-trivial and practically useful knowledge).

One of the most important features of Data Mining methods is the visual presentation of the results of calculations, which allows Data Mining tools to be used by people who do not have special mathematical training. At the same time, the use of statistical methods of data analysis requires a good knowledge of probability theory and mathematical statistics.

Introduction

Data Mining methods (or, equivalently, Knowledge Discovery in Databases, abbreviated KDD) lie at the intersection of databases, statistics and artificial intelligence.

Historical excursion

The field of Data Mining began with a workshop held by Gregory Piatetsky-Shapiro in 1989.

Earlier, while working at GTE Labs, Gregory Piatetsky-Shapiro had become interested in the question of whether it is possible to automatically find certain rules to speed up queries to large databases. At that time two terms were proposed: Data Mining ("data mining") and Knowledge Discovery in Databases ("discovery of knowledge in databases").

Formulation of the problem

Initially the task is set as follows:

  • there is a fairly large database;
  • it is assumed that there is some “hidden knowledge” in the database.

It is necessary to develop methods for discovering knowledge hidden in large volumes of initial “raw” data.

What does "hidden knowledge" mean? This must be knowledge:

  • previously unknown - that is, knowledge that should be new (and not confirming some previously obtained information);
  • non-trivial - that is, those that cannot simply be seen (during direct visual analysis of data or when calculating simple statistical characteristics);
  • practically useful - that is, knowledge that is valuable to a researcher or consumer;
  • accessible for interpretation - that is, knowledge that is easy to present in a form that is clear to the user and easy to explain in terms of the subject area.

These requirements largely determine the essence of Data mining methods and the form and ratio in which Data mining technology uses database management systems, statistical analysis methods and artificial intelligence methods.

Data mining and databases

Data mining methods make sense only for fairly large databases. Each specific area of ​​research has its own criterion for the “largeness” of a database.

The development of database technologies first led to the creation of a specialized language: the database query language. For relational databases this is SQL, which provides extensive capabilities for creating, modifying and retrieving stored data. Then the need arose to obtain analytical information (for example, information about the activities of an enterprise over a certain period), and it turned out that traditional relational databases, well suited for maintaining operational records at an enterprise, are poorly suited for analysis. This led, in turn, to the creation of so-called "data warehouses", whose very structure is best suited to comprehensive mathematical analysis.

Data mining and statistics

Data mining methods are based on mathematical methods of data processing, including statistical ones, and in industrial solutions such methods are often included directly in Data mining packages. However, it should be kept in mind that, firstly, researchers often unjustifiably use parametric tests instead of non-parametric ones for the sake of simplicity, and secondly, the results of such analysis are difficult to interpret, which is completely at odds with the goals and objectives of Data mining. Statistical methods are nevertheless used, but their application is limited to certain stages of the study.

Data mining and artificial intelligence

Knowledge obtained by Data mining methods is usually represented in the form of models. Such models include:

  • association rules;
  • decision trees;
  • clusters;
  • mathematical functions.

Methods for constructing such models are usually classed as "artificial intelligence" methods.
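Of the model types listed above, a decision tree is the easiest to illustrate: it is simply a hierarchy of yes/no questions ending in an answer. A minimal hand-built sketch (the attribute names, thresholds and labels are invented for illustration):

```python
def classify(record):
    """A hand-built decision tree: a hierarchy of yes/no questions."""
    if record["income"] >= 50000:           # root question
        if record["late_payments"] == 0:    # second-level question
            return "approve"
        return "review"
    return "decline"

decision = classify({"income": 60000, "late_payments": 0})
```

In practice such trees are not written by hand but induced from classified examples; the induced tree, however, has exactly this form.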

Tasks

Problems solved by Data Mining methods are usually divided into descriptive and predictive.

In descriptive tasks, the most important thing is to provide a visual description of the existing hidden patterns, while in predictive tasks, the foreground is the question of prediction for those cases for which there is no data yet.

Descriptive tasks include:

  • search for association rules or patterns (samples);
  • grouping of objects, cluster analysis;
  • building a regression model.

Predictive tasks include:

  • classification of objects (for predefined classes);
  • regression analysis, time series analysis.

Learning algorithms

Classification problems are characterized by “supervised learning”, in which the construction (training) of a model is carried out using a sample containing input and output vectors.

For clustering and association problems, “unsupervised learning” is used, in which the model is built using a sample in which there is no output parameter. The value of the output parameter (“belongs to a cluster ...”, “is similar to a vector ...”) is selected automatically during the training process.
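Unsupervised learning can be sketched with a minimal k-means clustering of one-dimensional points: the training sample contains no output parameter, and each point's cluster is chosen automatically during training. The data values and the seeding of centroids with the first k points are illustrative assumptions:

```python
def k_means(points, k, iterations=10):
    """A minimal k-means sketch; centroids are seeded with the first k points."""
    centroids = list(points[:k])
    for _ in range(iterations):
        # Assign each point to its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda c: abs(p - centroids[c]))
            clusters[i].append(p)
        # Move each centroid to the mean of its cluster
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

points = [1.0, 1.2, 0.8, 10.0, 10.5, 9.5]
centroids, clusters = k_means(points, 2)
```

The two groups (around 1 and around 10) emerge automatically, which is exactly the sense in which the output parameter "belongs to cluster ..." is selected during training.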

Description reduction problems are characterized by the absence of a separation into input and output vectors. Since the classic works of K. Pearson on principal component analysis, the main attention has been paid to data approximation.
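Pearson's principal-components idea can be sketched in a few lines: for two-dimensional points the first principal axis of the sample covariance matrix has a closed form. The sample points below are invented for illustration:

```python
import math

def principal_axis(points):
    """First principal component of 2-D points (a PCA sketch)."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    var_x = sum((x - mx) ** 2 for x, _ in points) / n
    var_y = sum((y - my) ** 2 for _, y in points) / n
    cov = sum((x - mx) * (y - my) for x, y in points) / n
    # Closed form for the leading eigenvector of a 2x2 covariance matrix
    theta = 0.5 * math.atan2(2 * cov, var_x - var_y)
    return math.cos(theta), math.sin(theta)

pts = [(1, 1.1), (2, 1.9), (3, 3.2), (4, 3.8), (5, 5.1)]
ux, uy = principal_axis(pts)
```

For these nearly collinear points the axis comes out close to the 45-degree diagonal; projecting onto it approximates the data with a single coordinate.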

Stages of training

There is a typical series of stages for solving problems using Data Mining methods:

  1. Hypothesis formation;
  2. Data collection;
  3. Data preparation (filtering);
  4. Model selection;
  5. Selection of the model parameters and of the learning algorithm;
  6. Model training (automatic search for the remaining model parameters);
  7. Analysis of training quality; if it is unsatisfactory, return to step 5 or step 4;
  8. Analysis of the identified patterns; if they are unsatisfactory, return to step 1, 4 or 5.
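Steps 4-7 above can be sketched as a loop: fix a model family, search over its parameter, train, and keep the candidate with the best quality. The one-threshold "model" and the toy data are illustrative assumptions:

```python
def train(threshold, data):
    # The "model" here is a single threshold rule: x >= threshold -> class 1
    return lambda x: 1 if x >= threshold else 0

def accuracy(model, data):
    return sum(model(x) == y for x, y in data) / len(data)

# Steps 2-3: collected and prepared data, as (input, output) pairs
data = [(x, 1 if x >= 5 else 0) for x in range(10)]

# Steps 4-7: model family is fixed; search its parameter until quality is best
best = None
for threshold in range(10):          # step 5: candidate parameter
    model = train(threshold, data)   # step 6: training
    acc = accuracy(model, data)      # step 7: quality analysis
    if best is None or acc > best[1]:
        best = (threshold, acc)
```

A real study would of course assess quality on held-out data rather than the training set; the loop structure is the point here.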

Data preparation

Before Data Mining algorithms can be used, the analyzed data set must be prepared. Since Data Mining can only detect patterns that are present in the data, the source data must, on the one hand, be large enough for those patterns to be present in it and, on the other hand, compact enough for the analysis to take acceptable time. Most often, data warehouses or data marts serve as the source data. Preparation is necessary before clustering or mining multidimensional data.

The cleaned data is reduced to feature sets (or vectors, if the algorithm can only work with fixed-dimension vectors), one feature set per observation. A feature set is formed according to hypotheses about which features of the raw data have high predictive power, balanced against the computing power required to process them. For example, a black-and-white image of a face measuring 100x100 pixels contains 10,000 bits of raw data. It can be converted into a feature vector by detecting the eyes and mouth in the image. As a result, the data volume is reduced from 10,000 bits to a short list of position codes, significantly reducing the amount of data to analyze, and hence the analysis time.
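The same kind of reduction can be illustrated on a smaller scale: a 4x4 binary "image" of 16 raw values collapsed into 4 quadrant counts. The image and the choice of quadrant sums as features are illustrative assumptions:

```python
def quadrant_features(image):
    """Reduce a 4x4 binary image to 4 quadrant counts (a feature-extraction sketch)."""
    feats = []
    for r0 in (0, 2):
        for c0 in (0, 2):
            # Sum the 2x2 block starting at (r0, c0)
            feats.append(sum(image[r][c] for r in (r0, r0 + 1)
                                         for c in (c0, c0 + 1)))
    return feats

image = [
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
features = quadrant_features(image)  # 16 raw values -> 4 features
```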

A number of algorithms can handle missing data that itself has predictive power (for example, a client's lack of purchases of a certain type). The association rules method, for instance, processes not feature vectors but sets of variable dimension.

The choice of objective function will depend on what the purpose of the analysis is; choosing the “right” function is fundamental to successful data mining.

Observations are divided into two categories - training set and test set. The training set is used to “train” the Data Mining algorithm, and the test set is used to check the patterns found.
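A minimal sketch of this division: shuffle the observations and hold out a fraction as the test set. The test fraction and random seed are arbitrary illustrative choices:

```python
import random

def split(observations, test_fraction=0.25, seed=42):
    """Shuffle observations and split them into training and test sets."""
    rows = list(observations)
    random.Random(seed).shuffle(rows)   # seeded for reproducibility
    n_test = int(len(rows) * test_fraction)
    return rows[n_test:], rows[:n_test]

data = [(x, x % 2) for x in range(20)]
train_set, test_set = split(data)
```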

See also

  • Reshetov probabilistic neural network

Literature

  • Paklin N. B., Oreshkov V. I. Business Analytics: From Data to Knowledge (+ CD). St. Petersburg: Piter, 2009. 624 p.
  • Duke V., Samoilenko A. Data Mining: A Training Course (+ CD). St. Petersburg: Piter, 2001. 368 p.
  • Zhuravlev Yu. I., Ryazanov V. V., Senko O. V. Recognition: Mathematical Methods, Software System, Practical Applications. Moscow: Phasis, 2006. 176 p. ISBN 5-7036-0108-8.
  • Zinoviev A. Yu. Visualizing Multidimensional Data. Krasnoyarsk: Krasnoyarsk State Technical University Press, 2000. 180 p.
  • Ian H. Witten, Eibe Frank, Mark A. Hall. Data Mining: Practical Machine Learning Tools and Techniques. 3rd ed. Morgan Kaufmann, 2011. 664 p. ISBN 9780123748560.

Links

  • Data Mining software in the Open Directory Project (dmoz) link directory.

Wikimedia Foundation. 2010.


    What is Data Mining

    The corporate database of any modern enterprise usually contains a set of tables storing records about certain facts or objects (for example, about goods, their sales, customers, accounts). As a rule, each record in such a table describes one specific object or fact. For example, a record in the sales table reflects the fact that such-and-such a product was sold to such-and-such a client by such-and-such a manager at such-and-such a time, and by and large contains nothing beyond this information. However, a large number of such records accumulated over several years can become a source of additional, much more valuable information that cannot be obtained from any single record: information about patterns, trends or interdependencies in the data. Examples include information about how sales of a particular product depend on the day of the week, time of day or season; which categories of customers most often purchase a given product; what proportion of buyers of one specific product also purchase another; and which category of customers most often fails to repay a loan on time.

    This kind of information is usually used in forecasting, strategic planning and risk analysis, and its value to the enterprise is very high. Apparently this is why the process of searching for it came to be called Data Mining (in English, mining means the extraction of minerals, and searching for patterns in a huge set of factual data is indeed akin to this). The term Data Mining denotes not so much a specific technology as the process of searching for correlations, trends, relationships and patterns using various mathematical and statistical algorithms: clustering, subsampling, regression and correlation analysis. The purpose of this search is to present the data in a form that clearly reflects business processes, and to build a model that can predict processes critical for business planning (for example, the dynamics of demand for certain goods or services, or the dependence of their purchase on certain consumer characteristics).

    Note that traditional mathematical statistics, which for a long time remained the main tool for data analysis, and online analytical processing (OLAP) tools, which we have already written about several times (see materials on this topic on our CD), cannot always be successfully applied to such problems. Typically, statistical methods and OLAP are used to test pre-formulated hypotheses. However, it is often the formulation of a hypothesis that turns out to be the most difficult task in business analysis for subsequent decision-making, since not all patterns in the data are obvious at first glance.

    Modern Data Mining technology is based on the concept of patterns that reflect regularities inherent in subsamples of data. The search for patterns is carried out by methods that make no a priori assumptions about these subsamples. While statistical analysis or OLAP typically asks questions like "What is the average number of unpaid invoices among customers of this service?", Data Mining typically answers questions like "Is there a typical category of non-paying customers?". It is the answer to the second question that often provides a more non-trivial approach to marketing policy and to organizing work with clients.

    An important feature of Data Mining is the non-standard and non-obvious nature of the patterns being sought. In other words, Data Mining tools differ from statistical data processing tools and OLAP tools in that instead of checking pre-assumed interdependencies by users, they are able to find such interdependencies independently based on available data and build hypotheses about their nature.

    It should be noted that the use of Data Mining tools does not exclude the use of statistical tools and OLAP tools, since the results of data processing using the latter, as a rule, contribute to a better understanding of the nature of the patterns that should be looked for.

    Source data for Data Mining

    The use of Data Mining is justified if there is a sufficiently large amount of data, ideally contained in a correctly designed data warehouse (in fact, the data warehouses themselves are usually created to solve analysis and forecasting problems associated with decision support). We have also written repeatedly about the principles of building data warehouses; relevant materials can be found on our CD, so we will not dwell on this issue. Let us only recall that the data in the warehouse is a replenished set, common for the entire enterprise and allowing one to restore a picture of its activities at any point in time. Note also that the storage data structure is designed in such a way that queries to it are carried out as efficiently as possible. However, there are Data Mining tools that can search for patterns, correlations and trends not only in data warehouses, but also in OLAP cubes, that is, in sets of pre-processed statistical data.

    Types of patterns identified by Data Mining methods

    According to V.A. Duke, there are five standard types of patterns identified by Data Mining methods:

    Association - a high probability of events being connected with each other (for example, one product is often purchased together with another);

    Sequence - a high probability of a chain of events related in time (for example, within a certain period after the purchase of one product, another will be purchased with a high degree of probability);

    Classification - there are signs that characterize the group to which this or that event or object belongs (usually, based on the analysis of already classified events, certain rules are formulated);

    Clustering is a pattern similar to classification and differs from it in that the groups themselves are not specified - they are identified automatically during data processing;

    Temporal patterns - the presence of patterns in the dynamics of the behavior of certain data (a typical example is seasonal fluctuations in demand for certain goods or services) used for forecasting.
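The first pattern type, association, rests on the notions of support (how often an item set occurs) and confidence (how often the consequent accompanies the antecedent). A minimal sketch over market baskets; the basket data is invented for illustration:

```python
def support(transactions, items):
    """Fraction of transactions containing every item in the set."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Of the transactions containing the antecedent, the share also containing the consequent."""
    joint = set(antecedent) | set(consequent)
    return support(transactions, joint) / support(transactions, antecedent)

baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
conf = confidence(baskets, {"bread"}, {"milk"})  # rule: bread -> milk
```

Here bread appears in 3 of 4 baskets and bread together with milk in 2, so the rule "bread -> milk" holds with confidence 2/3.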

    Data mining methods

    Today there are quite a large number of different data mining methods. Based on the above classification proposed by V.A. Duke, among them we can distinguish:

    Regression, variance and correlation analysis (implemented in most modern statistical packages, in particular in products of SAS Institute, StatSoft, etc.);

    Methods of analysis in a specific subject area, based on empirical models (often used, for example, in inexpensive financial analysis tools);

    Neural network algorithms, based on an analogy with the functioning of nervous tissue: the input parameters are treated as signals that are transformed according to the existing connections between "neurons", and the response of the entire network to the input data is treated as the result of the analysis. The connections are created by so-called training of the network on a large sample containing both the input data and the correct answers;

    Algorithms that select a close analogue of the input data from existing historical data, also called the "nearest neighbor" method;

    Decision trees, a hierarchical structure based on a set of questions requiring a "Yes" or "No" answer; although this method does not always perfectly find existing patterns, it is quite often used in forecasting systems because of the clarity of the answers it produces;

    Cluster models (sometimes also called segmentation models), used to group similar events together based on similar values of several fields in a data set; also very popular in building forecasting systems;

    Restricted search algorithms that calculate frequencies of combinations of simple logical events in subgroups of data;

    Evolutionary programming - search and generation of an algorithm expressing the interdependence of data, based on an initially specified algorithm, modified during the search process; sometimes the search for interdependencies is carried out among certain types of functions (for example, polynomials).
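Of the methods above, "nearest neighbor" is the simplest to sketch: a new observation is given the label of the closest historical record. The feature values and risk labels below are invented for illustration:

```python
def nearest_neighbor(history, query):
    """Return the label of the historical record closest to the query vector."""
    def dist(a, b):
        # Squared Euclidean distance; the square root is not needed for ranking
        return sum((x - y) ** 2 for x, y in zip(a, b))
    features, label = min(history, key=lambda rec: dist(rec[0], query))
    return label

history = [
    ((1.0, 1.0), "low risk"),
    ((8.0, 9.0), "high risk"),
    ((2.0, 1.5), "low risk"),
]
label = nearest_neighbor(history, (7.5, 8.0))
```

Production variants average over the k closest records (k-NN) rather than relying on a single neighbor, which makes the answer less sensitive to noise.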

    More information about these and other Data Mining algorithms, as well as the tools that implement them, can be found in the book "Data Mining: A Training Course" by V.A. Duke and A.P. Samoilenko, published by Piter in 2001. Today it is one of the few books in Russian devoted to this problem.

    Leading manufacturers of Data Mining tools

    Data Mining tools, like most Business Intelligence tools, are traditionally expensive software products - some cost as much as several tens of thousands of dollars. Therefore, until recently the main consumers of this technology were banks, financial and insurance companies and large trading enterprises, and the main tasks requiring Data Mining were considered to be the assessment of credit and insurance risks and the development of marketing policies, tariff plans and other principles of working with clients. In recent years the situation has changed: relatively inexpensive Data Mining tools from several manufacturers have appeared on the software market, making the technology available to small and medium-sized businesses that had not previously considered it.

    Modern Business Intelligence tools include report generators, analytical data processing tools, BI solution development tools (BI Platforms) and the so-called Enterprise BI Suites - enterprise-scale data analysis and processing tools that allow you to carry out a set of actions related to data analysis and reporting, and often include an integrated set of BI tools and BI application development tools. The latter, as a rule, contain reporting tools, OLAP tools, and often Data Mining tools.

    According to Gartner Group analysts, the leaders in the market for enterprise-scale data analysis and processing tools are Business Objects, Cognos, Information Builders, and Microsoft and Oracle also claim leadership (Fig. 1). As for the development tools for BI solutions, the main contenders for leadership in this area are Microsoft and SAS Institute (Fig. 2).

    Note that Microsoft's Business Intelligence tools are relatively inexpensive products available to a wide range of companies. That is why we are going to look at some practical aspects of using Data Mining using the example of this company’s products in the subsequent parts of this article.

    Literature:

    1. Duke V.A. Data Mining - data mining. - http://www.olap.ru/basic/dm2.asp.

    2. Duke V.A., Samoilenko A.P. Data Mining: training course. - St. Petersburg: Peter, 2001.

    3. B. de Ville. Microsoft Data Mining. Digital Press, 2001.