A review on data mining tasks and tools

Артиков, Музаффар Эгамберганович; Курбанова, Онахон Улугбековна

INTRODUCTION

Data mining is a powerful new technology with great potential to help companies focus on the most important information in the data they have collected about the behavior of their customers and potential customers. It discovers information within the data that queries and reports can't effectively reveal.

Generally, data mining is the process of analyzing data from different perspectives and summarizing it into useful information that can be used to increase revenue, cut costs, or both. Data mining software is one of a number of analytical tools for analyzing data. It allows users to analyze data from many different dimensions or angles, categorize it, and summarize the relationships identified. Technically, data mining is the process of finding correlations or patterns among dozens of fields in large relational databases [1]. Data mining consists of more than collecting and managing data; it also includes analysis and prediction. People often make mistakes when analyzing data or when trying to establish relationships between multiple features, which makes it difficult for them to find solutions to certain problems. Machine learning can often be successfully applied to these problems, improving the efficiency of systems and the designs of machines. Machine learning (ML) has several applications, the most significant of which is data mining [2].

DATA MINING TASKS

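Data mining involves six common classes of tasks [3]: anomaly detection, association rule learning, clustering, classification, regression, and summarization.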

Anomaly detection

Anomaly detection is the process of finding patterns in a dataset whose behavior is not normal or expected. These unexpected behaviors are also termed anomalies or outliers. An anomaly cannot always be categorized as an attack; it may simply be surprising behavior that was not previously known, and it may or may not be harmful. Anomaly detection provides significant and critical information in various applications, for example in detecting credit card theft or identity theft. When data has to be analyzed in order to find relationships or to predict known or unknown data, data mining techniques are used. These include clustering, classification, and machine learning techniques [3].

Anomaly detection can be used to solve problems like the following:

 A law enforcement agency compiles data about illegal activities, but nothing about legitimate activities. How can suspicious activity be flagged? The law enforcement data is all of one class. There are no counter-examples.

 Insurance Risk Modeling: An insurance agency processes millions of insurance claims, knowing that a very small number are fraudulent. How can the fraudulent claims be identified? The claims data contains very few counter-examples. They are outliers. Fraudulent claims are rare but very costly.

 Targeted Marketing — Given demographic data about a set of customers, identify customer purchasing behaviour that is significantly different from the norm. Response is typically rare but can be profitable.

 Health care fraud, expense report fraud, and tax compliance.

 Web mining (less than 3% of all people visiting Amazon.com make a purchase).

 Churn Analysis. Churn is typically rare but quite costly.

 Hardware Fault Detection.

 Disease detection.

 Network intrusion detection. The number of intrusions on a network is typically a very small fraction of the total network traffic.

 Credit card fraud detection. Millions of regular transactions are stored, while only a very small percentage corresponds to fraud.

 Medical diagnostics. When classifying the pixels in mammogram images, cancerous pixels represent only a very small fraction of the entire image [4].
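As a minimal sketch of the idea shared by the use cases above, the following Python snippet flags values that lie far from the bulk of the data using the interquartile range (IQR) rule. The choice of this rule, the multiplier 1.5, the function name `iqr_outliers`, and the transaction amounts are all illustrative assumptions; the cited sources do not prescribe a particular method.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartile cut points
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical transaction amounts; one is far larger than the rest.
amounts = [12.5, 9.9, 14.2, 11.0, 10.4, 13.1, 950.0, 12.0, 9.5, 11.7]
suspicious = iqr_outliers(amounts)
print(suspicious)
```

A rule like this needs no labeled counter-examples, which matches the situation described above where fraudulent or faulty records are too rare to learn from directly.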

Association rule learning

Association rule learning is a method for discovering interesting relations between variables in large databases. It is intended to identify strong rules discovered in databases using some measures of interestingness. Based on the concept of strong rules, Rakesh Agrawal et al. [6] introduced association rules for discovering regularities between products in large-scale transaction data recorded by point-of-sale (POS) systems in supermarkets. For example, the rule {onions, potatoes} ⇒ {burger} found in the sales data of a supermarket would indicate that if a customer buys onions and potatoes together, they are likely to also buy hamburger meat. Such information can be used as the basis for decisions about marketing activities such as promotional pricing or product placement. In addition to the above example from market basket analysis, association rules are employed today in many application areas, including Web usage mining, intrusion detection, continuous production, and bioinformatics. In contrast with sequence mining, association rule learning typically does not consider the order of items either within a transaction or across transactions [7].
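Two widely used measures of interestingness for a rule X ⇒ Y are support (the fraction of transactions containing both X and Y) and confidence (the fraction of transactions containing X that also contain Y). The sketch below computes both for the onions and potatoes rule. The basket data and the helper names `support` and `confidence` are hypothetical and serve only to illustrate the definitions.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Estimate of P(consequent | antecedent) over the transactions."""
    both = support(transactions, set(antecedent) | set(consequent))
    return both / support(transactions, antecedent)

# Hypothetical point-of-sale baskets.
baskets = [
    {"onions", "potatoes", "burger"},
    {"onions", "potatoes", "burger", "milk"},
    {"onions", "potatoes"},
    {"milk", "bread"},
    {"potatoes", "burger"},
]

rule_support = support(baskets, {"onions", "potatoes", "burger"})
rule_confidence = confidence(baskets, {"onions", "potatoes"}, {"burger"})
print(rule_support, rule_confidence)
```

In this toy data, two of the five baskets contain all three items, and two of the three baskets with onions and potatoes also contain a burger. A rule is usually called strong when both values exceed thresholds chosen by the analyst.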

Clustering

Clustering is the task of discovering groups and structures in the data that are in some way "similar", without using known structures in the data. It groups a particular set of objects based on their characteristics, aggregating them according to their similarities. In data mining, this methodology partitions the data using a specific join algorithm, the one most suitable for the desired analysis.

Clustering analysis can require that an object either strictly belongs to a cluster or is not part of it; this type of grouping is called hard partitioning. Soft partitioning, on the other hand, states that every object belongs to a cluster to a certain degree. More specific divisions are also possible, such as allowing objects to belong to multiple clusters, forcing an object to participate in only one cluster, or constructing hierarchical trees of group relationships [8].
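Hard partitioning can be illustrated with k-means, one common algorithm of this kind. It is used here as an assumption for the sake of example; the text above does not single out a specific algorithm. The sketch works on one-dimensional data, and the function name `kmeans_1d`, the starting centers, and the measurements are hypothetical.

```python
def kmeans_1d(points, centers, iterations=10):
    """Plain k-means on 1-D data: every point ends up in exactly one cluster."""
    centers = list(centers)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster.
        centers = [
            sum(c) / len(c) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return centers, clusters

# Hypothetical 1-D measurements forming two visible groups.
data = [1.0, 1.2, 0.8, 8.0, 8.3, 7.9]
centers, clusters = kmeans_1d(data, centers=[0.0, 10.0])
print(centers, clusters)
```

Each point is placed in exactly one list, which is what makes the partition hard. A soft variant would instead store, for every point, a degree of membership in each cluster.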

Classification

Classification is a data mining (machine learning) technique used to predict group membership for data instances. Two forms of data analysis can be used to extract models that describe important classes or to predict future data trends:

 Classification

 Prediction

Classification models predict categorical class labels, whereas prediction models predict continuous-valued functions. For example, we can build a classification model to categorize bank loan applications as either safe or risky, or a prediction model to predict the expenditures in dollars of potential customers on computer equipment given their income and occupation. Popular classification techniques include decision trees and neural networks.
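The loan example can be sketched with the simplest possible decision tree, a single split on one numeric feature (often called a decision stump). Everything specific here is an assumption made for illustration: the feature (annual income), the training data, the fixed direction of the rule (higher values mean "safe"), and the names `fit_stump` and `predict`.

```python
def fit_stump(samples):
    """Learn a one-level decision tree (a "stump") on a single numeric feature.

    `samples` is a list of (value, label) pairs with labels "safe" / "risky".
    Returns the threshold t for the rule: value >= t -> "safe", else "risky".
    """
    values = sorted({v for v, _ in samples})
    # Candidate thresholds: midpoints between consecutive distinct values.
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]

    def errors(t):
        # Number of training samples the rule with threshold t gets wrong.
        return sum(1 for v, label in samples
                   if ("safe" if v >= t else "risky") != label)

    return min(candidates, key=errors)

def predict(threshold, value):
    """Apply the learned rule to a new value."""
    return "safe" if value >= threshold else "risky"

# Hypothetical training data: (annual income in thousands, loan outcome).
loans = [(18, "risky"), (22, "risky"), (30, "risky"),
         (45, "safe"), (52, "safe"), (75, "safe")]
threshold = fit_stump(loans)
print(threshold, predict(threshold, 60), predict(threshold, 20))
```

A full decision tree repeats this kind of split recursively on the resulting subsets and considers several features, but the output is the same kind of object: a categorical class label.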

Regression

Regression is a data mining (machine learning) technique used to fit an equation to a dataset. The simplest form of regression, linear regression, uses the formula of a straight line (y = mx + b) and determines the appropriate values for m and b to predict the value of y based upon a given value of x. Advanced techniques, such as multiple regression, allow the use of more than one input variable and allow for the fitting of more complex models, such as a quadratic equation.
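One common way to determine m and b is ordinary least squares, which picks the line minimizing the sum of squared vertical distances to the data points. The sketch below applies the closed-form solution; the use of least squares, the function name `fit_line`, and the data are assumptions chosen for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least-squares estimates of m and b in y = m*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: covariance of x and y divided by the variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    # Intercept: the fitted line passes through the point of means.
    b = mean_y - m * mean_x
    return m, b

# Hypothetical data lying exactly on y = 2x + 1.
xs = [0, 1, 2, 3, 4]
ys = [1, 3, 5, 7, 9]
m, b = fit_line(xs, ys)
print(m, b)         # fitted slope and intercept
print(m * 5 + b)    # predicted y for x = 5
```

With noisy data the points no longer fall on the line, and the same formulas return the best-fitting compromise.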

Summarization

Summarization means providing a more compact representation of the data set, including visualization and report generation. Data visualization is a general term that describes any effort to help people understand the significance of data by placing it in a visual context. Patterns, trends and correlations that might go undetected in text-based data can be exposed and recognized more easily with data visualization software.

Today's data visualization tools go beyond the standard charts and graphs used in Excel spreadsheets, displaying data in more sophisticated ways such as infographics, dials and gauges, geographic maps, sparklines, heat maps, and detailed bar, pie and fever charts. The images may include interactive capabilities, enabling users to manipulate them or drill into the data for querying and analysis. Indicators designed to alert users when data has been updated or predefined conditions occur can also be included.

Most business intelligence software vendors embed data visualization tools into their products, either developing the visualization technology themselves or sourcing it from companies that specialize in visualization.
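Summarization in its most basic form, a compact representation plus a simple visual, can be sketched with the Python standard library alone. The statistics chosen, the text-based bar chart, the names `summarize` and `text_histogram`, and the data are illustrative assumptions and do not represent any particular visualization product.

```python
import statistics
from collections import Counter

def summarize(values):
    """Reduce a numeric column to a handful of descriptive statistics."""
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
    }

def text_histogram(labels):
    """Return one line per category, with a bar of '#' for its frequency."""
    counts = Counter(labels)
    return [f"{name:<8}{'#' * n} ({n})" for name, n in sorted(counts.items())]

# Hypothetical order values and product categories.
order_values = [20, 35, 50, 15, 80]
categories = ["books", "toys", "books", "games", "books", "toys"]

summary = summarize(order_values)
print(summary)
print("\n".join(text_histogram(categories)))
```

Dedicated visualization tools replace the text bars with interactive charts, but the underlying step is the same: aggregate first, then display.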

DATA MINING TOOLS

There are many tools to solve data mining problems. In this paper, we will consider a few of them:

RapidMiner

Written in the Java programming language, RapidMiner offers advanced analytics through template-based frameworks. It is offered as a service rather than as a piece of local software, and it holds the top position on the list of data mining tools.

In addition to data mining, RapidMiner also provides functionality like data preprocessing and visualization, predictive analytics and statistical modeling, evaluation, and deployment. What makes it even more powerful is that it provides learning schemes, models and algorithms from WEKA and R scripts.

WEKA

The original non-Java version of WEKA was developed primarily for analyzing data from the agricultural domain. The Java-based version is very sophisticated and is used in many different applications, including visualization and algorithms for data analysis and predictive modeling. It is free under the GNU General Public License, which is a big plus compared to RapidMiner, because users can customize it however they please.

WEKA supports several standard data mining tasks, including data preprocessing, clustering, classification, regression, visualization and feature selection.

WEKA would be more powerful with the addition of sequence modeling, which currently is not included.

R-Programming

R is a free programming language and software environment for statistical computing and graphics. The R language is widely used among data miners for developing statistical software and for data analysis. Ease of use and extensibility have raised R's popularity substantially in recent years.

Besides data mining, it provides statistical and graphical techniques, including linear and nonlinear modeling, classical statistical tests, time-series analysis, classification, clustering, and others.

Orange

Orange is a powerful, open-source, Python-based tool for both novices and experts.

It also has components for machine learning and add-ons for bioinformatics and text mining, and it is packed with features for data analytics.

KNIME

Data preprocessing has three main components: extraction, transformation and loading. KNIME does all three. It provides a graphical user interface for assembling nodes for data processing. It is an open-source data analytics, reporting and integration platform. KNIME also integrates various components for machine learning and data mining through its modular data pipelining concept, and it has attracted attention in business intelligence and financial data analysis.

Written in Java and based on Eclipse, KNIME is easy to extend with plugins. Additional functionality can be added on the go. Plenty of data integration modules are already included in the core version [9].

Scikit-learn

Scikit-learn is a free software machine learning library for the Python programming language. It features various classification, regression and clustering algorithms including support vector machines, random forests, gradient boosting, k-means and DBSCAN, and is designed to interoperate with the Python numerical and scientific libraries NumPy and SciPy.

CONCLUSION

This paper has presented different data mining tasks and tools for solving them. As the amount of data grows in all areas, data mining methods make it easier to extract useful knowledge from it. In addition, the tools described above help us apply data mining techniques in various areas.

References:

1. Maninder Singh. A Review On Data Mining Algorithms // Computer Science and Information Technology Research. 2014. No. 2. pp. 8–14.
2. Raj Kumar, Dr. Rajesh Verma. Classification Algorithms for Data Mining: A Survey // International Journal of Innovations in Engineering and Technology. 2012. No. 1. pp. 7–14.
3. Fayyad, Usama; Piatetsky-Shapiro, Gregory; Smyth, Padhraic. From Data Mining to Knowledge Discovery in Databases // AAAI. 1996. No. 1. pp. 37–54.
4. Data Mining - (Anomaly|outlier) Detection // Gerardnico. URL: http://gerardnico.com/wiki/data_mining/anomaly_detection (accessed: 28.04.2016).
5. Piatetsky-Shapiro, Gregory. Discovery, analysis, and presentation of strong rules // Knowledge Discovery in Databases. Cambridge: AAAI/MIT Press, 1991.
6. Agrawal, R.; Imieliński, T.; Swami, A. Mining association rules between sets of items in large databases // ACM SIGMOD International Conference on Management of Data (SIGMOD '93). 1993. p. 207.
7. Association rule learning // Wikipedia. URL: https://en.wikipedia.org/wiki/Association_rule_learning (accessed: 28.04.2016).
8. Clustering in Data Mining // Data on focus. URL: http://www.dataonfocus.com/clustering-in-data-mining/ (accessed: 28.04.2016).
9. Six of the Best Open Source Data Mining Tools // The New Stack. URL: http://thenewstack.io/six-of-the-best-open-source-data-mining-tools/ (accessed: 2.05.2016).
