Using text mining for text prediction

Хужамуратов, Бекмурод Хужамурот

Introduction to Text Mining

Text mining (also called text data mining or text analytics) is, at its simplest, a method for drawing out content based on meaning and context from a large body (or bodies) of text. Put another way, it is a method for extracting structured information from unstructured text. Text mining tools are, for example, how many spam filters detect unwanted emails in your inbox, and how companies can anticipate, rather than simply react to, their customers’ needs by sifting through masses of seemingly unrelated data and discovering meaningful relationships. Text mining also has significant potential for academic application and, at least in its basic form, has the benefit of being relatively straightforward to master.

Main part

The words prediction and forecast conjure up images of momentous decisions and complex processes fraught with inaccuracies. From a statistical perspective, it’s a straightforward problem that has a solution. Of course, the solution may not always be very good. The problem presents itself as in Fig. 1. Given a sample of examples of past experience, we project to new examples. If the future is similar to the past, we may have an opportunity to make accurate predictions. An example of such a situation is where one tries to predict the future share price of a company based on historical records of the company’s share price and other measures of its performance.

Figure 1. Predicting the future based on the past

Making a prediction requires more than a lookup of past experience. Even if we effectively characterize these experiences in a consistent way, the test of success is on new examples. For prediction, a pattern must be found in past experience that will hold in the future, leading to accurate results on new, unseen examples. If a new example presents itself in a form that is radically different from prior experience, learning from past experience will prove inadequate. Machine learning and statistical methods do not learn from basic principles. They have no ability to reason and reach new conclusions for new situations. Still, perhaps surprisingly, many prediction problems can be solved by finding patterns in prior experience. If samples can be obtained and organized in the right format, finding patterns is almost effortless, even in very high-dimensional feature spaces.

The classical prediction problem for text is called text categorization. Here, a set of categories is predefined and the objective is to assign a category or topic to a new document. For example, we can collect newswire articles and describe a set of topics such as financial or sports stories. When news arrives, the words are examined, and articles are assigned topics from a fixed list of possible topics.

However, characterizing all prediction from text as text categorization is too narrow. Prediction from text can be just as ambitious as prediction for numerical data mining. In statistical terms, prediction has a very specific characterization, and it need not deal with just topic assignment to documents. Prediction for text follows the classical lines of all numerical classification problems, and we can use a traditional model of data that applies to any sampling application where the answers are presented as true or false.

The prediction problem for text is generally a classification problem. In our spreadsheet model of data, we have the usual rows and columns, where a row is an example and a column is an attribute to be measured. For classification we have an additional column, a label identifying the correct answer. For text, the answer is something that is true or false. For example, the labels for a stock price prediction problem may be a 1 for a stock price that goes up and a 0 for unchanged or down.

Tables 1 and 2 are abstract templates of spreadsheets that might be composed for prediction. Table 1 is the classical text categorization application, where the goal is to filter spam e-mail from valid e-mail. Table 2 illustrates that document classification might be explicitly predictive. The examples are words found in news stories about companies, and the labels are whether the stock price rose in some time period following the article.

Table 1

Abstract spreadsheet for spam prediction

…	unsubscribe	…	enlargement	…	Ink	…	Spam
…	Yes	…	Yes	…	Yes	…	True
…	No	…	No	…	No	…	False
…	…	…	…	…	…	…	…

Table 2

Abstract spreadsheet for predicting stock price

…	profits	…	Increased	…	Earnings	…	Stock-price
…	Yes	…	Yes	…	Yes	…	1
…	Yes	…	No	…	Yes	…	0
…	…	…	…	…	…	…	…
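
The spreadsheet model behind Tables 1 and 2 can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the two documents, their labels, and the helper names `tokenize` and `build_spreadsheet` are hypothetical and invented for the example, and only the three column words are taken from Table 1.

```python
import re

def tokenize(text):
    """Lowercase a document and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def build_spreadsheet(documents, vocabulary):
    """Return one row per document: 1 if the vocabulary word occurs, else 0."""
    rows = []
    for text in documents:
        words = set(tokenize(text))
        rows.append([1 if word in words else 0 for word in vocabulary])
    return rows

# Column words follow Table 1; the documents and labels are made up.
vocabulary = ["unsubscribe", "enlargement", "ink"]
documents = [
    "Cheap ink! Click unsubscribe to stop enlargement offers.",
    "Minutes from the project meeting are attached.",
]
labels = [True, False]  # the additional label column: spam or not spam

rows = build_spreadsheet(documents, vocabulary)
for row, label in zip(rows, labels):
    print(row, label)
# prints:
# [1, 1, 1] True
# [0, 0, 0] False
```

Each printed row corresponds to one spreadsheet row: the word columns hold the Yes/No attributes as 1/0, and the final value is the label that a learning method would be trained to predict.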

So far, we have not shied away from describing text as unstructured data that can be converted into structured data, where classical machine-learning methods can be applied. There remain many nuances in the recipe that do not alter this worldview but can make our trip to obtaining good results more direct. Let’s look at predictive methods from the perspective of text and our experience in choosing the best route for their application [1].

Text Mining Classification

Text can be analyzed using a variety of approaches, and the choice depends greatly on the type of insight that is required. Text classification, also known as text categorization, is a vital element of text mining where natural language texts are automatically assigned to predefined groups or categories (Tan, Wang & Lee, 2002) [2]. Category assignment is based on the content of the text and helps in retrieving relevant information.

A well-known example of text classification is categorizing on sentiment (Pang & Lee, 2008; Pak & Paroubek, 2010) [3, 4]. A broad approach assigns a positive or negative class to a text, depending on its content. A more specific approach can include different gradations of negative and positive or it can incorporate a range of different emotions.

Before a text mining technique can make any predictions, it needs to train its model. A corpus containing a large number of texts with classes already assigned is provided. From this corpus, the learning algorithm can induce rules, as seen in Table 3, based on recurring patterns in the text.

Table 3

Induced Rules

If Sentence includes ‘bank’ and ‘water’ then ‘bank’ = ground bounding waters

If Sentence includes ‘bank’ and ‘money’ then ‘bank’ = financial institution

If Sentence includes ‘bank’ and preposition = ‘on’ then ‘bank’ = ground bounding waters

If Sentence includes ‘bank’ and preposition = ‘into’ then ‘bank’ = financial institution
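
To show how rules of this kind would be applied once induced, the four rules of Table 3 can be written as a small lookup in Python. This is a sketch under assumptions: the rule order, the `None` fallback when no rule fires, and the function name `sense_of_bank` are choices made for the illustration, and the preposition test is simplified to checking whether the word occurs anywhere in the sentence.

```python
import re

# Each rule pairs a cue word that must co-occur with "bank" with the sense
# it selects. The cues and senses mirror Table 3; trying the rules in this
# order is an assumption of the sketch.
RULES = [
    ("water", "ground bounding waters"),
    ("money", "financial institution"),
    ("on", "ground bounding waters"),
    ("into", "financial institution"),
]

def sense_of_bank(sentence):
    """Return the sense of 'bank' chosen by the first matching rule, or None."""
    words = re.findall(r"[a-z]+", sentence.lower())
    if "bank" not in words:
        return None
    for cue, sense in RULES:
        if cue in words:
            return sense
    return None  # no rule fires: the sentence stays unclassified

print(sense_of_bank("She paid the money into the bank"))
# prints: financial institution
print(sense_of_bank("We sat on the bank of the river"))
# prints: ground bounding waters
```

A real learning algorithm would derive such rules from the labeled corpus rather than have them written by hand; the sketch only covers the application step.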

Generally, a second corpus is used to test the performance of the algorithm. This test corpus also contains pre-assigned classes, but they are used only to check the classes assigned by the algorithm. When the classification algorithm shows sufficient performance, it can be used to classify texts without any pre-assigned classes. On such texts performance can no longer be measured, and it is therefore highly important that the training and test corpora are sufficiently representative.
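
The comparison between pre-assigned and predicted classes on a test corpus can be sketched as a simple accuracy calculation. Everything specific here is an assumption made for illustration: the four test texts and their labels are invented, and `classify` is a hypothetical stand-in for a trained classifier, not a method from the cited literature.

```python
def accuracy(predicted, actual):
    """Fraction of test examples whose predicted class equals the pre-assigned one."""
    if not actual or len(predicted) != len(actual):
        raise ValueError("need two non-empty label lists of equal length")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)

def classify(text):
    """Hypothetical stand-in classifier: flags any text containing 'unsubscribe'."""
    return "unsubscribe" in text.lower()

# Invented test corpus: (text, pre-assigned class), where True means spam.
test_corpus = [
    ("Click unsubscribe to stop these offers", True),
    ("Lunch at noon tomorrow?", False),
    ("You have won a prize", True),
    ("Unsubscribe link for the team newsletter", False),
]

predicted = [classify(text) for text, _ in test_corpus]
actual = [label for _, label in test_corpus]
score = accuracy(predicted, actual)
print(score)
# prints: 0.5
```

The pre-assigned labels are never shown to the classifier; they serve only as the reference against which its output is scored.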

Text Mining Regression

Assigning texts to a class can be highly useful, since it separates large collections of texts into relevant groups. However, in some situations it is more suitable to predict a specific value rather than a broad class. An example is the research by Ghani and Simmons (2004), who propose a system that uses text mining to forecast a specific value: the end-price of an online auction [5]. This type of prediction can be done using regression. Although Ghani and Simmons found that their regression results were not sufficient, other datasets may show more potential for specific value prediction.

Methods that solve regression problems analyze the relationship between the attribute (independent) variables and the numerical target (dependent) variable (van Wijk, 2008). This can be done using a variety of techniques that predict a continuous numerical value. Regression is suitable for solving complex prediction problems and can handle both linear and nonlinear relationships (Fayyad, Piatetsky-Shapiro & Smyth, 1996; van Wijk, 2008) [6].
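
As a minimal sketch of the linear case, the relationship between one text-derived attribute and a numerical target can be fitted by ordinary least squares. This is not the method of Ghani and Simmons; the data are invented, and the attribute (how often a hypothetical keyword appears in an auction listing) is an assumption chosen only to keep the example small.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one independent variable: y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("the independent variable must take at least two values")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Invented data: keyword count per listing (independent variable)
# and the listing's end-price (dependent variable).
keyword_counts = [0, 1, 2, 3]
end_prices = [10.0, 12.0, 14.0, 16.0]

intercept, slope = fit_line(keyword_counts, end_prices)
predicted_price = intercept + slope * 4  # prediction for an unseen listing
print(intercept, slope, predicted_price)
# prints: 10.0 2.0 18.0
```

Unlike a classifier, the fitted model returns a continuous value, so any error in the estimated slope or intercept translates directly into an error in the predicted price.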

In general, regression tasks have a wider range of possible outcomes compared to classification using broad classes. Compared with classification, predictions for regression are “more sensitive to inaccurate probability estimates” (Frank, Trigg, Holmes & Witten, 1999) [7]. This means that predictions that rest on certain statistical assumptions have a higher chance of failure because of the wider range of potential predictions. However, when predicting specific values is needed, regression can lead to a more useful and distinct system.

References:

Weiss, S. M., Indurkhya, N., Zhang, T. (2010). Fundamentals of Predictive Text Mining, 50-51.
Tan, C. M., Wang, Y. F., Lee, C. D. (2002). The use of bigrams to enhance text categorization, 1-14, 26.
Pak, A., & Paroubek, P. (2010). Twitter as a Corpus for Sentiment Analysis and Opinion Mining. In LREC.
Pang, B., & Lee, L. (2008). Opinion mining and sentiment analysis. Foundations and trends in information retrieval, 2(1-2), 1-15.
Ghani, R., & Simmons, H. (2004). Predicting the end-price of online auctions. In Proceedings of the International Workshop on Data Mining and Adaptive Modelling Methods for Economics and Management, 1-11.
Fayyad, U., Piatetsky-Shapiro, G., Smyth, P. (1996). Knowledge Discovery and Data Mining: Towards a Unifying Framework, 39, 44-46.
Frank, E., Trigg, L., Holmes, G., Witten, I. H. (1999). Naïve Bayes for Regression, 1-20.
