Data Mining Technologies. Tools for Analyzing Text Information

Data Mining should be distinguished from "rough" exploratory analysis, which forms the basis of online analytical processing (OLAP): one of the main premises of Data Mining is the search for non-obvious patterns. Data Mining tools can find such patterns on their own and can also independently build hypotheses about relationships. Since formulating a hypothesis about dependencies is the most difficult task, the advantage of Data Mining over other analysis methods is obvious.

Most statistical methods for identifying relationships in data use the concept of sample averaging, which leads to operations on values that do not actually exist, whereas Data Mining operates on real values.

OLAP is more suitable for understanding historical data. Data Mining relies on historical data to answer questions about the future.

Prospects for Data Mining Technology

The potential of Data Mining opens the way to expanding the boundaries of the technology's application. The following directions of development are possible:

  • identifying types of subject areas, with corresponding heuristics whose formalization will facilitate the solution of the Data Mining problems related to those areas;
  • creating formal languages and logical tools with which reasoning can be formalized and whose automation will become a tool for solving Data Mining problems in specific subject areas;
  • creating Data Mining methods capable not only of extracting patterns from data, but also of forming theories based on empirical data;
  • closing the significant gap between the capabilities of Data Mining tools and theoretical advances in the field.

If we consider the future of Data Mining in the short term, it is obvious that the development of this technology is most directed towards business-related areas.

In the short term, Data Mining products may become as commonplace and necessary as e-mail, used, for example, to search for the lowest prices on a certain product or the cheapest tickets.

In the long term, the future of Data Mining is truly exciting - it could be the search by intelligent agents for both new treatments for various diseases and new understanding of the nature of the universe.

However, Data Mining is also fraught with potential danger - after all, an increasing amount of information is becoming available through the World Wide Web, including private information, and more and more knowledge can be extracted from it:

Not long ago, the largest online store, Amazon, found itself at the center of a scandal over a patent it had received, "Methods and systems for helping users when purchasing goods", which is nothing more than another Data Mining product designed to collect personal data about store visitors. The new technique makes it possible to predict future requests based on the facts of purchases, and to draw conclusions about their purpose. Its aim is exactly what was mentioned above: obtaining as much information as possible about clients, including private information (gender, age, preferences, etc.). In this way, data is collected about the private lives of store customers and members of their families, including children. The latter is prohibited by the legislation of many countries: collecting information about minors is possible there only with the permission of their parents.

Research notes that there are both successful solutions using Data Mining and unsuccessful experiences with the technology. The areas in which Data Mining applications are most likely to succeed are those that:

  • require knowledge-based decisions;
  • have a changing environment;
  • have accessible, sufficient and meaningful data;
  • provide high dividends from correct decisions.

Existing approaches to analysis

For quite a long time the discipline of Data Mining was not recognized as a full-fledged independent field of data analysis; it has sometimes been called the "backyard of statistics" (Pregibon, 1997).

To date, several points of view on Data Mining have emerged. Supporters of one of them consider it a mirage that distracts attention from classical data analysis.

Data Mining methods include artificial neural networks, genetic algorithms, evolutionary programming, associative memory and fuzzy logic. Statistical methods are often also counted among Data Mining methods (descriptive analysis, correlation and regression analysis, factor analysis, analysis of variance, component analysis, discriminant analysis, time series analysis). Such methods, however, presuppose some a priori ideas about the analyzed data, which is somewhat at odds with the goals of Data Mining (the discovery of previously unknown, non-trivial and practically useful knowledge).

One of the most important features of Data Mining methods is the visual presentation of the results of calculations, which allows Data Mining tools to be used by people who do not have special mathematical training. At the same time, the use of statistical methods of data analysis requires a good knowledge of probability theory and mathematical statistics.

Introduction

Data Mining methods (or, equivalently, Knowledge Discovery in Databases, abbreviated KDD) lie at the intersection of databases, statistics and artificial intelligence.

Historical excursion

The field of Data Mining began with a workshop held by Gregory Piatetsky-Shapiro in 1989.

Earlier, while working at GTE Labs, Gregory Piatetsky-Shapiro had become interested in the question of whether it is possible to automatically find certain rules to speed up queries to large databases. At that time two terms were proposed: Data Mining ("data mining") and Knowledge Discovery in Databases ("discovery of knowledge in databases").

Formulation of the problem

Initially the task is set as follows:

  • there is a fairly large database;
  • it is assumed that there is some “hidden knowledge” in the database.

It is necessary to develop methods for discovering knowledge hidden in large volumes of initial “raw” data.

What does "hidden knowledge" mean? This must be knowledge:

  • previously unknown - that is, knowledge that should be new (and not confirming some previously obtained information);
  • non-trivial - that is, those that cannot simply be seen (during direct visual analysis of data or when calculating simple statistical characteristics);
  • practically useful - that is, knowledge that is valuable to a researcher or consumer;
  • accessible for interpretation - that is, knowledge that is easy to present in a form that is clear to the user and easy to explain in terms of the subject area.

These requirements largely determine the essence of Data mining methods and the form and ratio in which Data mining technology uses database management systems, statistical analysis methods and artificial intelligence methods.

Data mining and databases

Data mining methods make sense only for fairly large databases. Each specific area of ​​research has its own criterion for the “largeness” of a database.

The development of database technologies first led to the creation of a specialized language: the database query language. For relational databases this is SQL, which provides extensive capabilities for creating, modifying and retrieving stored data. Then the need arose to obtain analytical information (for example, information about the activities of an enterprise over a certain period), and it turned out that traditional relational databases, well suited for maintaining operational records at an enterprise, are poorly suited for analysis. This led, in turn, to the creation of so-called "data warehouses", whose very structure is best suited to comprehensive mathematical analysis.

Data mining and statistics

Data mining methods are based on mathematical methods of data processing, including statistical ones, and in industrial solutions such methods are often included directly in Data mining packages. However, it should be kept in mind that, firstly, researchers often unjustifiably use parametric tests instead of non-parametric ones for the sake of simplicity, and secondly, the results of such analysis are difficult to interpret, which is completely at odds with the goals and objectives of Data mining. Statistical methods are nevertheless used, but their application is limited to certain stages of the study.

Data mining and artificial intelligence

Knowledge obtained by Data mining methods is usually represented in the form of models. Such models include:

  • association rules;
  • decision trees;
  • clusters;
  • mathematical functions.

Methods for constructing such models are usually classed as "artificial intelligence" methods.
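Of the model types listed above, a decision tree is the easiest to illustrate: it is simply a hierarchy of yes/no questions ending in an answer. A minimal hand-built sketch (the attribute names, thresholds and labels are invented for illustration):

```python
def classify(record):
    """A hand-built decision tree: a hierarchy of yes/no questions."""
    if record["income"] >= 50000:           # root question
        if record["late_payments"] == 0:    # second-level question
            return "approve"
        return "review"
    return "decline"

decision = classify({"income": 60000, "late_payments": 0})
```

In practice such trees are not written by hand but induced from classified examples; the induced tree, however, has exactly this form.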

Tasks

Problems solved by Data Mining methods are usually divided into descriptive and predictive.

In descriptive tasks, the most important thing is to provide a visual description of the existing hidden patterns, while in predictive tasks, the foreground is the question of prediction for those cases for which there is no data yet.

Descriptive tasks include:

  • search for association rules or patterns (samples);
  • grouping of objects, cluster analysis;
  • building a regression model.

Predictive tasks include:

  • classification of objects (for predefined classes);
  • regression analysis, time series analysis.

Learning algorithms

Classification problems are characterized by “supervised learning”, in which the construction (training) of a model is carried out using a sample containing input and output vectors.

For clustering and association problems, “unsupervised learning” is used, in which the model is built using a sample in which there is no output parameter. The value of the output parameter (“belongs to a cluster ...”, “is similar to a vector ...”) is selected automatically during the training process.
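Unsupervised learning can be sketched with a minimal k-means clustering of one-dimensional points: the training sample contains no output parameter, and each point's cluster is chosen automatically during training. The data values and the seeding of centroids with the first k points are illustrative assumptions:

```python
def k_means(points, k, iterations=10):
    """A minimal k-means sketch; centroids are seeded with the first k points."""
    centroids = list(points[:k])
    for _ in range(iterations):
        # Assign each point to its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda c: abs(p - centroids[c]))
            clusters[i].append(p)
        # Move each centroid to the mean of its cluster
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

points = [1.0, 1.2, 0.8, 10.0, 10.5, 9.5]
centroids, clusters = k_means(points, 2)
```

The two groups (around 1 and around 10) emerge automatically, which is exactly the sense in which the output parameter "belongs to cluster ..." is selected during training.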

Description reduction problems are characterized by the absence of a separation into input and output vectors. Since the classic works of K. Pearson on principal component analysis, the main attention has been paid to data approximation.
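Pearson's principal-components idea can be sketched in a few lines: for two-dimensional points the first principal axis of the sample covariance matrix has a closed form. The sample points below are invented for illustration:

```python
import math

def principal_axis(points):
    """First principal component of 2-D points (a PCA sketch)."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    var_x = sum((x - mx) ** 2 for x, _ in points) / n
    var_y = sum((y - my) ** 2 for _, y in points) / n
    cov = sum((x - mx) * (y - my) for x, y in points) / n
    # Closed form for the leading eigenvector of a 2x2 covariance matrix
    theta = 0.5 * math.atan2(2 * cov, var_x - var_y)
    return math.cos(theta), math.sin(theta)

pts = [(1, 1.1), (2, 1.9), (3, 3.2), (4, 3.8), (5, 5.1)]
ux, uy = principal_axis(pts)
```

For these nearly collinear points the axis comes out close to the 45-degree diagonal; projecting onto it approximates the data with a single coordinate.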

Stages of training

There is a typical series of stages for solving problems using Data Mining methods:

  1. Hypothesis formation;
  2. Data collection;
  3. Data preparation (filtering);
  4. Model selection;
  5. Selection of the model parameters and of the learning algorithm;
  6. Model training (automatic search for the remaining model parameters);
  7. Analysis of training quality; if it is unsatisfactory, return to step 5 or step 4;
  8. Analysis of the identified patterns; if they are unsatisfactory, return to step 1, 4 or 5.
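Steps 4-7 above can be sketched as a loop: fix a model family, search over its parameter, train, and keep the candidate with the best quality. The one-threshold "model" and the toy data are illustrative assumptions:

```python
def train(threshold, data):
    # The "model" here is a single threshold rule: x >= threshold -> class 1
    return lambda x: 1 if x >= threshold else 0

def accuracy(model, data):
    return sum(model(x) == y for x, y in data) / len(data)

# Steps 2-3: collected and prepared data, as (input, output) pairs
data = [(x, 1 if x >= 5 else 0) for x in range(10)]

# Steps 4-7: model family is fixed; search its parameter until quality is best
best = None
for threshold in range(10):          # step 5: candidate parameter
    model = train(threshold, data)   # step 6: training
    acc = accuracy(model, data)      # step 7: quality analysis
    if best is None or acc > best[1]:
        best = (threshold, acc)
```

A real study would of course assess quality on held-out data rather than the training set; the loop structure is the point here.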

Data preparation

Before Data Mining algorithms can be used, the analyzed data set must be prepared. Since Data Mining can only detect patterns that are present in the data, the source data must, on the one hand, be large enough for those patterns to be present in it and, on the other hand, compact enough for the analysis to take acceptable time. Most often, data warehouses or data marts serve as the source data. Preparation is necessary before clustering or mining multidimensional data.

The cleaned data is reduced to feature sets (or vectors, if the algorithm can only work with fixed-dimension vectors), one feature set per observation. A feature set is formed according to hypotheses about which features of the raw data have high predictive power, balanced against the computing power required to process them. For example, a black-and-white image of a face measuring 100x100 pixels contains 10,000 bits of raw data. It can be converted into a feature vector by detecting the eyes and mouth in the image. As a result, the data volume is reduced from 10,000 bits to a short list of position codes, significantly reducing the amount of data to analyze, and hence the analysis time.
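The same kind of reduction can be illustrated on a smaller scale: a 4x4 binary "image" of 16 raw values collapsed into 4 quadrant counts. The image and the choice of quadrant sums as features are illustrative assumptions:

```python
def quadrant_features(image):
    """Reduce a 4x4 binary image to 4 quadrant counts (a feature-extraction sketch)."""
    feats = []
    for r0 in (0, 2):
        for c0 in (0, 2):
            # Sum the 2x2 block starting at (r0, c0)
            feats.append(sum(image[r][c] for r in (r0, r0 + 1)
                                         for c in (c0, c0 + 1)))
    return feats

image = [
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
features = quadrant_features(image)  # 16 raw values -> 4 features
```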

A number of algorithms can handle missing data that itself has predictive power (for example, a client's lack of purchases of a certain type). The association rules method, for instance, processes not feature vectors but sets of variable dimension.

The choice of objective function will depend on what the purpose of the analysis is; choosing the “right” function is fundamental to successful data mining.

Observations are divided into two categories - training set and test set. The training set is used to “train” the Data Mining algorithm, and the test set is used to check the patterns found.
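A minimal sketch of this division: shuffle the observations and hold out a fraction as the test set. The test fraction and random seed are arbitrary illustrative choices:

```python
import random

def split(observations, test_fraction=0.25, seed=42):
    """Shuffle observations and split them into training and test sets."""
    rows = list(observations)
    random.Random(seed).shuffle(rows)   # seeded for reproducibility
    n_test = int(len(rows) * test_fraction)
    return rows[n_test:], rows[:n_test]

data = [(x, x % 2) for x in range(20)]
train_set, test_set = split(data)
```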

See also

  • Reshetov probabilistic neural network

Literature

  • Paklin N. B., Oreshkov V. I. Business Analytics: From Data to Knowledge (+ CD). St. Petersburg: Piter, 2009. 624 p.
  • Duke V., Samoilenko A. Data Mining: A Training Course (+ CD). St. Petersburg: Piter, 2001. 368 p.
  • Zhuravlev Yu. I., Ryazanov V. V., Senko O. V. Recognition: Mathematical Methods, Software System, Practical Applications. Moscow: Phasis, 2006. 176 p. ISBN 5-7036-0108-8.
  • Zinoviev A. Yu. Visualizing Multidimensional Data. Krasnoyarsk: Krasnoyarsk State Technical University Press, 2000. 180 p.
  • Ian H. Witten, Eibe Frank, Mark A. Hall. Data Mining: Practical Machine Learning Tools and Techniques. 3rd ed. Morgan Kaufmann, 2011. 664 p. ISBN 9780123748560.

Links

  • Data Mining software in the Open Directory Project (dmoz) link directory.

Wikimedia Foundation. 2010.


    What is Data Mining

    The corporate database of any modern enterprise usually contains a set of tables storing records about certain facts or objects (for example, about goods, their sales, customers, accounts). As a rule, each record in such a table describes one specific object or fact. For example, a record in the sales table reflects the fact that such-and-such a product was sold to such-and-such a client by such-and-such a manager at such-and-such a time, and by and large contains nothing beyond this information. However, a large number of such records accumulated over several years can become a source of additional, much more valuable information that cannot be obtained from any single record: information about patterns, trends or interdependencies in the data. Examples include information about how sales of a particular product depend on the day of the week, time of day or season; which categories of customers most often purchase a given product; what proportion of buyers of one specific product also purchase another; and which category of customers most often fails to repay a loan on time.

    This kind of information is usually used in forecasting, strategic planning and risk analysis, and its value to the enterprise is very high. Apparently this is why the process of searching for it came to be called Data Mining (in English, mining means the extraction of minerals, and searching for patterns in a huge set of factual data is indeed akin to this). The term Data Mining denotes not so much a specific technology as the process of searching for correlations, trends, relationships and patterns using various mathematical and statistical algorithms: clustering, subsampling, regression and correlation analysis. The purpose of this search is to present the data in a form that clearly reflects business processes, and to build a model that can predict processes critical for business planning (for example, the dynamics of demand for certain goods or services, or the dependence of their purchase on certain consumer characteristics).

    Note that traditional mathematical statistics, which for a long time remained the main tool for data analysis, and online analytical processing (OLAP) tools, which we have already written about several times (see materials on this topic on our CD), cannot always be successfully applied to such problems. Typically, statistical methods and OLAP are used to test pre-formulated hypotheses. However, it is often the formulation of a hypothesis that turns out to be the most difficult task in business analysis for subsequent decision-making, since not all patterns in the data are obvious at first glance.

    Modern Data Mining technology is based on the concept of patterns that reflect regularities inherent in subsamples of data. The search for patterns is carried out by methods that make no a priori assumptions about these subsamples. While statistical analysis or OLAP typically asks questions like "What is the average number of unpaid invoices among customers of this service?", Data Mining typically answers questions like "Is there a typical category of non-paying customers?". It is the answer to the second question that often provides a more non-trivial approach to marketing policy and to organizing work with clients.

    An important feature of Data Mining is the non-standard and non-obvious nature of the patterns being sought. In other words, Data Mining tools differ from statistical data processing tools and OLAP tools in that instead of checking pre-assumed interdependencies by users, they are able to find such interdependencies independently based on available data and build hypotheses about their nature.

    It should be noted that the use of Data Mining tools does not exclude the use of statistical tools and OLAP tools, since the results of data processing using the latter, as a rule, contribute to a better understanding of the nature of the patterns that should be looked for.

    Source data for Data Mining

    The use of Data Mining is justified if there is a sufficiently large amount of data, ideally contained in a correctly designed data warehouse (in fact, the data warehouses themselves are usually created to solve analysis and forecasting problems associated with decision support). We have also written repeatedly about the principles of building data warehouses; relevant materials can be found on our CD, so we will not dwell on this issue. Let us only recall that the data in the warehouse is a replenished set, common for the entire enterprise and allowing one to restore a picture of its activities at any point in time. Note also that the storage data structure is designed in such a way that queries to it are carried out as efficiently as possible. However, there are Data Mining tools that can search for patterns, correlations and trends not only in data warehouses, but also in OLAP cubes, that is, in sets of pre-processed statistical data.

    Types of patterns identified by Data Mining methods

    According to V.A. Duke, there are five standard types of patterns identified by Data Mining methods:

    Association - a high probability of events being connected with each other (for example, one product is often purchased together with another);

    Sequence - a high probability of a chain of events related in time (for example, within a certain period after the purchase of one product, another will be purchased with a high degree of probability);

    Classification - there are signs that characterize the group to which this or that event or object belongs (usually, based on the analysis of already classified events, certain rules are formulated);

    Clustering is a pattern similar to classification and differs from it in that the groups themselves are not specified - they are identified automatically during data processing;

    Temporal patterns - the presence of patterns in the dynamics of the behavior of certain data (a typical example is seasonal fluctuations in demand for certain goods or services) used for forecasting.
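The first pattern type, association, rests on the notions of support (how often an item set occurs) and confidence (how often the consequent accompanies the antecedent). A minimal sketch over market baskets; the basket data is invented for illustration:

```python
def support(transactions, items):
    """Fraction of transactions containing every item in the set."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Of the transactions containing the antecedent, the share also containing the consequent."""
    joint = set(antecedent) | set(consequent)
    return support(transactions, joint) / support(transactions, antecedent)

baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
conf = confidence(baskets, {"bread"}, {"milk"})  # rule: bread -> milk
```

Here bread appears in 3 of 4 baskets and bread together with milk in 2, so the rule "bread -> milk" holds with confidence 2/3.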

    Data mining methods

    Today there are quite a large number of different data mining methods. Based on the above classification proposed by V.A. Duke, among them we can distinguish:

    Regression, variance and correlation analysis (implemented in most modern statistical packages, in particular in products of SAS Institute, StatSoft, etc.);

    Methods of analysis in a specific subject area, based on empirical models (often used, for example, in inexpensive financial analysis tools);

    Neural network algorithms, based on an analogy with the functioning of nervous tissue: the input parameters are treated as signals that are transformed according to the existing connections between "neurons", and the response of the entire network to the input data is treated as the result of the analysis. The connections are created by so-called training of the network on a large sample containing both the input data and the correct answers;

    Algorithms that select a close analogue of the input data from existing historical data, also called the "nearest neighbor" method;

    Decision trees, a hierarchical structure based on a set of questions requiring a "Yes" or "No" answer; although this method does not always perfectly find existing patterns, it is quite often used in forecasting systems because of the clarity of the answers it produces;

    Cluster models (sometimes also called segmentation models), used to group similar events together based on similar values of several fields in a data set; also very popular in building forecasting systems;

    Restricted search algorithms that calculate frequencies of combinations of simple logical events in subgroups of data;

    Evolutionary programming - search and generation of an algorithm expressing the interdependence of data, based on an initially specified algorithm, modified during the search process; sometimes the search for interdependencies is carried out among certain types of functions (for example, polynomials).
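Of the methods above, "nearest neighbor" is the simplest to sketch: a new observation is given the label of the closest historical record. The feature values and risk labels below are invented for illustration:

```python
def nearest_neighbor(history, query):
    """Return the label of the historical record closest to the query vector."""
    def dist(a, b):
        # Squared Euclidean distance; the square root is not needed for ranking
        return sum((x - y) ** 2 for x, y in zip(a, b))
    features, label = min(history, key=lambda rec: dist(rec[0], query))
    return label

history = [
    ((1.0, 1.0), "low risk"),
    ((8.0, 9.0), "high risk"),
    ((2.0, 1.5), "low risk"),
]
label = nearest_neighbor(history, (7.5, 8.0))
```

Production variants average over the k closest records (k-NN) rather than relying on a single neighbor, which makes the answer less sensitive to noise.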

    More information about these and other Data Mining algorithms, as well as the tools that implement them, can be found in the book "Data Mining: A Training Course" by V.A. Duke and A.P. Samoilenko, published by Piter in 2001. Today it is one of the few books in Russian devoted to this problem.

    Leading manufacturers of Data Mining tools

    Data Mining tools, like most Business Intelligence tools, are traditionally expensive software products - some cost as much as several tens of thousands of dollars. Therefore, until recently the main consumers of this technology were banks, financial and insurance companies and large trading enterprises, and the main tasks requiring Data Mining were considered to be the assessment of credit and insurance risks and the development of marketing policies, tariff plans and other principles of working with clients. In recent years the situation has changed: relatively inexpensive Data Mining tools from several manufacturers have appeared on the software market, making the technology available to small and medium-sized businesses that had not previously considered it.

    Modern Business Intelligence tools include report generators, analytical data processing tools, BI solution development tools (BI Platforms) and the so-called Enterprise BI Suites - enterprise-scale data analysis and processing tools that allow you to carry out a set of actions related to data analysis and reporting, and often include an integrated set of BI tools and BI application development tools. The latter, as a rule, contain reporting tools, OLAP tools, and often Data Mining tools.

    According to Gartner Group analysts, the leaders in the market for enterprise-scale data analysis and processing tools are Business Objects, Cognos, Information Builders, and Microsoft and Oracle also claim leadership (Fig. 1). As for the development tools for BI solutions, the main contenders for leadership in this area are Microsoft and SAS Institute (Fig. 2).

    Note that Microsoft's Business Intelligence tools are relatively inexpensive products available to a wide range of companies. That is why we are going to look at some practical aspects of using Data Mining using the example of this company’s products in the subsequent parts of this article.

    Literature:

    1. Duke V.A. Data Mining - data mining. - http://www.olap.ru/basic/dm2.asp.

    2. Duke V.A., Samoilenko A.P. Data Mining: training course. - St. Petersburg: Peter, 2001.

    3. B. de Ville. Microsoft Data Mining. Digital Press, 2001.