TEXT BASED DOCUMENT SIMILARITY MEASURE
Date
2013
Authors
Publisher
Suleyman Demirel University
Abstract
Do you have a shortage of data? Not very likely. A consequence of the pervasive use of computers is that most data originate in digital form. If we trade a stock or write a book or buy a product online, these events are recorded electronically. Since so many paper transactions are now in paperless digital form, lots of “big” data are available for further analysis. The concept of data mining, finding valuable patterns in data, is an obvious response to the collection and storage of large volumes of data. Data mining is no longer an emerging technology awaiting further development. Although its application is far from universal, the techniques of data mining are highly developed and, for some forms of analysis, are entering a mature phase.
We would like to say “Give us data and we will find the patterns.” Unfortunately, data-mining methods expect a highly structured format for data, necessitating extensive data preparation. Either we have to transform the original data, or the data are supplied in a highly structured format. Data-mining methods learn from samples of past experience. If we speak to specialists in predictive data mining, their data will be in numerical form. These people are the “numbers guys.” The “text miners” do not expect an orderly series of numbers. They are happy to look at collections of documents, where the contents are readable and their meaning is obvious. This is our first distinction between data and text mining: numbers versus text. That doesn’t mean that these are two distinct concepts. Both are based on samples of past examples. The composition of the examples is very different, yet many of the learning methods are similar. That’s because the text will be processed and transformed into a numerical representation.
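As an illustration of how text can be turned into a numerical representation and then compared, the following is a minimal sketch, not the method used in this work. It assumes a simple bag-of-words model (raw term counts) and cosine similarity as the measure; the function names `term_counts` and `cosine_similarity` and the sample documents are hypothetical and chosen only for this example.

```python
import math
import re
from collections import Counter


def term_counts(text):
    """Lowercase the text, extract alphanumeric tokens, and count each one."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-count vectors."""
    # Only terms present in both documents contribute to the dot product.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical sample documents, for illustration only.
doc_a = "Data mining finds valuable patterns in data."
doc_b = "Text mining finds patterns in collections of documents."
doc_c = "We trade a stock online."

vec_a, vec_b, vec_c = map(term_counts, (doc_a, doc_b, doc_c))
print(cosine_similarity(vec_a, vec_b))  # shares several terms with doc_a
print(cosine_similarity(vec_a, vec_c))  # shares no terms with doc_a
```

Raw counts are the simplest choice here; a weighting scheme such as TF-IDF could be substituted for `term_counts` without changing the similarity function.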
Keywords
Text document, data analysis methods
Citation
TEXT BASED DOCUMENT SIMILARITY MEASURE, Shnibekov Zhasulan, 2013