Web mining association rules

Хужамуратов Бекмурод Хужамурот угли

This article considers the concepts of web mining. Association rules are presented through their basic concepts and illustrated with an example.

Mining information on the World Wide Web is a huge application area. Search engine companies examine the hyperlinks in web pages to come up with a measure of “prestige” for each web page and website. Dictionaries define prestige as “high standing achieved through success or influence.” A metric called PageRank, introduced by the founders of Google and used in various guises by other search engine developers too, attempts to measure the standing of a web page. The more pages that link to your website, the higher its prestige. And prestige is greater if the pages that link in have high prestige themselves. The definition sounds circular, but it can be made to work. Search engines use PageRank (among other things) to sort web pages into order before displaying the result of your search.
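The circular definition can be resolved by iterating: start with equal prestige for every page and repeatedly redistribute it along the links until the values settle. The following is a minimal sketch of that idea, not the algorithm as deployed by any search engine. The damping factor of 0.85, the handling of pages without outgoing links, and the four-page toy graph are assumptions made for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Approximate PageRank by power iteration.

    links maps each page to the list of pages it links to.
    damping=0.85 is an assumed, commonly quoted value.
    """
    pages = sorted(set(links) | {q for targets in links.values() for q in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # start with equal prestige
    for _ in range(iterations):
        # every page keeps a small baseline share
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for p in pages:
            targets = links.get(p, [])
            if targets:
                # a page passes its prestige on to the pages it links to
                share = damping * rank[p] / len(targets)
                for q in targets:
                    new_rank[q] += share
            else:
                # assumption: a page without outgoing links spreads its rank evenly
                share = damping * rank[p] / n
                for q in pages:
                    new_rank[q] += share
        rank = new_rank
    return rank


# Hypothetical toy web: A, B and D all link to C; C links back to A.
toy_web = {"A": ["C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(toy_web)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In this toy graph C ends up with the highest score because three pages link to it, and A comes second because its single incoming link comes from the high-prestige page C.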

Another way in which search engines tackle the problem of how to rank web pages is to use machine learning based on a training set of example queries, documents that contain the terms in each query, and human judgments about how relevant the documents are to that query. A learning algorithm then analyzes this training data and comes up with a way to predict the relevance judgment for any document and query. For each document, a set of feature values is calculated that depend on the query term: for example, whether it occurs in the title tag, whether it occurs in the document’s URL, how often it occurs in the document itself, and how often it appears in the anchor text of hyperlinks that point to this document. For multiterm queries, features include how often two different terms appear close together in the document, and so on. There are many possible features: typical algorithms for learning ranks use hundreds or thousands of them.
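The four single-term features named above can be computed as in the following sketch. The function names, the simple lowercase tokenizer, and the sample page are assumptions made for illustration; a real ranking system would use many more features and a far more careful text analysis.

```python
import re


def tokenize(text):
    """Split text into lowercase alphanumeric tokens (a deliberately crude tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def query_features(term, title, url, body, anchor_texts):
    """Compute query-dependent features of one document for one query term."""
    term = term.lower()
    return {
        "in_title": term in tokenize(title),
        "in_url": term in tokenize(url),
        "body_count": tokenize(body).count(term),
        "anchor_count": sum(tokenize(a).count(term) for a in anchor_texts),
    }


# Hypothetical document and incoming anchor texts.
features = query_features(
    term="mining",
    title="Web Mining Basics",
    url="http://example.com/web-mining",
    body="Web mining applies data mining to the Web. Mining logs is one case.",
    anchor_texts=["intro to web mining", "click here"],
)
print(features)
```

Feature dictionaries of this kind, paired with the human relevance judgments, would form the training examples handed to the learning algorithm.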

Search engines mine the content of the Web. They also mine the content of your queries (the terms you search for) to select advertisements that you might be interested in. They have a strong incentive to do this accurately, because they only get paid by advertisers when users click on their links. Search engine companies mine your very clicks, because knowledge of which results you click on can be used to improve the search next time. Online booksellers mine the purchasing database to come up with recommendations such as “users who bought this book also bought these ones”; again they have a strong incentive to present you with compelling, personalized choices. Movie sites recommend movies based on your previous choices and other people’s choices: they win if they make recommendations that keep customers coming back to their website.
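A recommendation of the “also bought” kind can be approximated by counting which items co-occur with a given item in past orders. The sketch below is only an illustration of that counting idea; the function name, the order data, and the item identifiers are hypothetical, and real recommenders are considerably more elaborate.

```python
from collections import Counter


def also_bought(orders, item, top_n=3):
    """Return the items most often found in the same order as `item`."""
    counts = Counter()
    for order in orders:
        if item in order:
            # count every other item in an order that contains `item`
            counts.update(set(order) - {item})
    return counts.most_common(top_n)


# Hypothetical purchase history: each order is a set of item identifiers.
orders = [
    {"book_a", "book_b"},
    {"book_a", "book_b", "book_c"},
    {"book_a", "book_c"},
    {"book_b", "book_d"},
    {"book_a", "book_b"},
]
print(also_bought(orders, "book_a"))
```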

And then there are social networks and other personal data. We live in the age of self-revelation: people share their innermost thoughts in blogs and tweets, their photographs, their music and movie tastes, their opinions of books, software, gadgets, and hotels, their social life. They may believe they are doing this anonymously, or pseudonymously, but often they are incorrect. There is huge commercial interest in making money by mining the Web.

Association rules are an important class of regularities in data. Mining of association rules is a fundamental data mining task. It is perhaps the most important model invented and extensively studied by the database and data mining community. Its objective is to find all co-occurrence relationships, called associations, among data items. Since it was first introduced in 1993 by Agrawal et al., it has attracted a great deal of attention. Many efficient algorithms, extensions and applications have been reported.

The classic application of association rule mining is market basket analysis, which aims to discover how items purchased by customers in a supermarket (or a store) are associated. An example association rule is

Cheese → Beer [support = 10 %, confidence = 80 %].

The rule says that 10 % of customers buy Cheese and Beer together, and those who buy Cheese also buy Beer 80 % of the time. Support and confidence are two measures of rule strength, which are defined below. This mining model is in fact very general and can be used in many applications. For example, in the context of the Web and text documents, it can be used to find word co-occurrence relationships and Web usage patterns.

Association rule mining, however, does not consider the sequence in which the items are purchased. Sequential pattern mining takes care of that. An example of a sequential pattern is “5 % of customers buy a bed first, then a mattress and then pillows”. The items are not purchased at the same time, but one after another. Such patterns are useful in Web usage mining for analyzing clickstreams in server logs. They are also useful for finding language or linguistic patterns in natural language texts.
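The difference from an association rule is that order matters. The sketch below checks whether a pattern occurs in a customer's purchase history in the given order (with other purchases allowed in between) and reports the fraction of customers for whom it does. It is a simplification: each step of the pattern is a single item, and the purchase histories are hypothetical.

```python
def contains_in_order(sequence, pattern):
    """True if the items of `pattern` occur in `sequence` in the same order (gaps allowed)."""
    remaining = iter(sequence)
    # `item in remaining` advances the iterator, so each match must come after the previous one
    return all(item in remaining for item in pattern)


def sequence_support(sequences, pattern):
    """Fraction of sequences that contain `pattern` in order."""
    matches = sum(contains_in_order(s, pattern) for s in sequences)
    return matches / len(sequences)


# Hypothetical purchase histories, one list per customer, in time order.
histories = [
    ["bed", "lamp", "mattress", "pillows"],
    ["mattress", "bed", "pillows"],
    ["bed", "mattress"],
    ["bed", "mattress", "rug", "pillows"],
]
print(sequence_support(histories, ["bed", "mattress", "pillows"]))  # 0.5
```

The second customer bought all three items but not in the required order, so that history does not count toward the pattern.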

Basic Concepts of Association Rules

The problem of mining association rules can be stated as follows: Let I = {i₁, i₂, …, i_m} be a set of items. Let T = (t₁, t₂, …, t_n) be a set of transactions (the database), where each transaction t_i is a set of items such that t_i ⊆ I. An association rule is an implication of the form
X → Y, where X ⊂ I, Y ⊂ I, and X ∩ Y = ∅.

X (or Y) is a set of items, called an itemset.

Example 1: We want to analyze how the items sold in a supermarket are related to one another. I is the set of all items sold in the supermarket. A transaction is simply a set of items purchased in a basket by a customer. For example, a transaction may be:

{Beef, Chicken, Cheese},

which means that a customer purchased three items in a basket, Beef, Chicken, and Cheese. An association rule may be:

Beef, Chicken → Cheese,

where {Beef, Chicken} is X and {Cheese} is Y. For simplicity, brackets “{” and “}” are usually omitted in transactions and rules.

A transaction t_i ∈ T is said to contain an itemset X if X is a subset of t_i (we also say that the itemset X covers t_i). The support count of X in T (denoted by X.count) is the number of transactions in T that contain X. The strength of a rule is measured by its support and confidence.
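The support count translates directly into a subset test over the transactions. The following sketch assumes each transaction is stored as a Python set; the function name and the seven toy transactions are made up for illustration.

```python
def support_count(transactions, itemset):
    """Number of transactions that contain every item of `itemset` (X.count)."""
    itemset = frozenset(itemset)
    return sum(1 for t in transactions if itemset <= t)  # `<=` is the subset test


# Hypothetical supermarket transactions.
T = [
    {"Beef", "Chicken", "Milk"},
    {"Beef", "Cheese"},
    {"Cheese", "Boots"},
    {"Beef", "Chicken", "Cheese"},
    {"Beef", "Chicken", "Clothes", "Cheese", "Milk"},
    {"Chicken", "Clothes", "Milk"},
    {"Chicken", "Milk", "Clothes"},
]
print(support_count(T, {"Beef", "Chicken"}))
print(support_count(T, {"Cheese"}))
```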

Support: The support of a rule, X → Y, is the percentage of transactions in T that contain X ∪ Y, and can be seen as an estimate of the probability, Pr(X ∪ Y). The rule support thus determines how frequently the rule is applicable in the transaction set T. Let n be the number of transactions in T. The support of the rule X → Y is computed as follows:

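support = (X ∪ Y).count / n

Confidence: The confidence of a rule, X → Y, is the percentage of transactions in T containing X that also contain Y. It can be seen as an estimate of the conditional probability, Pr(Y | X). It is computed as follows:

confidence = (X ∪ Y).count / X.count

Confidence thus determines the predictability of the rule. If the confidence of a rule is too low, one cannot reliably infer or predict Y from X. A rule with low predictability is of limited use.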
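As a worked illustration, the sketch below takes support as the fraction of all transactions that contain both X and Y, and confidence as the fraction of transactions containing X that also contain Y. The 80 toy transactions are constructed (not taken from real data) so that the rule Cheese → Beer comes out at the support of 10 % and confidence of 80 % quoted earlier. The function name is an assumption.

```python
def rule_metrics(transactions, X, Y):
    """Return (support, confidence) of the rule X -> Y over a list of set-valued transactions."""
    X, Y = frozenset(X), frozenset(Y)
    n = len(transactions)
    both = sum(1 for t in transactions if (X | Y) <= t)   # (X ∪ Y).count
    x_count = sum(1 for t in transactions if X <= t)      # X.count
    support = both / n
    confidence = both / x_count if x_count else 0.0       # undefined when X never occurs
    return support, confidence


# Constructed data: 8 baskets with Cheese and Beer, 2 with Cheese only, 70 with neither.
baskets = (
    [{"Cheese", "Beer"}] * 8
    + [{"Cheese"}] * 2
    + [{"Bread"}] * 70
)
support, confidence = rule_metrics(baskets, {"Cheese"}, {"Beer"})
print(support, confidence)  # 0.1 0.8
```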

Objective: Given a transaction set T, the problem of mining association rules is to discover all association rules in T that have support and confidence greater than or equal to the user-specified minimum support and minimum confidence.
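The objective can be met on very small data by brute force: enumerate every itemset, keep those whose support reaches the minimum, and split each into every possible X → Y. The sketch below does exactly that. It is a naive enumeration for illustration only, not one of the efficient algorithms from the literature, and the function name, thresholds, and four toy transactions are assumptions.

```python
from itertools import combinations


def mine_rules(transactions, minsup, minconf):
    """Return all rules (X, Y, support, confidence) meeting both thresholds, by exhaustive search."""
    n = len(transactions)
    items = sorted(set().union(*transactions))

    def count(itemset):
        return sum(1 for t in transactions if itemset <= t)

    rules = []
    for size in range(2, len(items) + 1):
        for combo in combinations(items, size):
            itemset = frozenset(combo)
            both = count(itemset)
            if both == 0 or both / n < minsup:
                continue  # the itemset X ∪ Y is not frequent enough
            # try every non-empty proper subset of the itemset as the left-hand side X
            for k in range(1, size):
                for lhs in combinations(combo, k):
                    X = frozenset(lhs)
                    conf = both / count(X)
                    if conf >= minconf:
                        Y = itemset - X
                        rules.append((tuple(sorted(X)), tuple(sorted(Y)), both / n, conf))
    return rules


# Hypothetical transactions and thresholds.
T = [
    {"Beef", "Chicken", "Cheese"},
    {"Beef", "Cheese"},
    {"Beef", "Chicken"},
    {"Chicken", "Milk"},
]
rules = mine_rules(T, minsup=0.5, minconf=0.6)
for X, Y, sup, conf in rules:
    print(", ".join(X), "->", ", ".join(Y), f"[support = {sup:.0%}, confidence = {conf:.0%}]")
```

The number of candidate itemsets grows exponentially with the number of items, which is why practical miners prune the search rather than enumerate everything as done here.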


Молодой учёный
