Refers to a subset of data mining concerned with extracting information from text by recognizing patterns. The goal of text mining is often to classify a given document automatically into one of a number of categories, and to improve classification performance as more data becomes available, which makes it an example of machine learning. Email spam filters are one example of this type of text mining.
Questions tagged [text-mining]
641 questions
7
votes
1 answer
Combining n-grams
In text mining, if we've computed n-gram counts for, say, $n = 1, \ldots, 4$, is there a principled way to combine them, other than just concatenating the tf-idf matrices for each one? (equivalent to an unweighted sum of kernels if we were to construct…
tdc
- 7,569
4
votes
1 answer
Statistically fuzzy version of a checksum for file text signature
Background:
Often I end up downloading the same PDF article twice because I do not remember that I've already downloaded it. One workaround is to maintain an index of checksums (say MD5 etc.) based on the plaintext conversion of a PDF and if a match is…
curious_cat
- 1,091
3
votes
2 answers
How can I analyze my incoming email?
I would like to analyze the email I receive in my Gmail inbox in order to systematically come up with effective Gmail filters for the most common types of email.
I am prepared to manually curate and classify a very large volume of emails (perhaps…
Superbest
- 253
3
votes
0 answers
Text Rank Algorithm to find Keywords
I was trying to implement the TextRank algorithm described in:
http://acl.ldc.upenn.edu/acl2004/emnlp/pdf/Mihalcea.pdf
It seems simple to implement in Python, but I could not reproduce the exact result that the authors got in the paper for the…
snow_leopard
- 345
3
votes
4 answers
Automatic keyword generation evaluation
I have a simple text analyzer which generates keywords for a given input text. Until now I have been evaluating it manually, i.e., manually selecting keywords of a text and comparing them against the ones generated by the analyzer.
Is there…
snow_leopard
- 345
3
votes
2 answers
Data transposition from 'clustered rows' into columns
I am facing a difficult challenge, given my very low skills at text mining… Basically, I have a list of approx. 200 individuals described in a plain text file following a simple structure:
N: (name)
Y: (year of birth)
S: (sibling)
N: (name)
Y:…
Fr.
- 1,453
2
votes
1 answer
Continually updating naive Bayes classifier
I am attempting to use a Naive Bayes classifier to classify text. To accomplish this I have created an Excel sheet with a binary distribution for three variables. The workbook can be found here. Assuming that my math is correct, my questions…
AMAS
- 59
2
votes
0 answers
R or Python Equivalent to SAS Text Miner
I am working on a project where a former member of the team used SAS Text Miner (https://support.sas.com/documentation/onlinedoc/txtminer/14.1/tmref.pdf) to complete some text mining. Unfortunately, this member of the team has moved on and no one…
benso8
- 305
2
votes
1 answer
When generating a corpus for latent Dirichlet allocation, does word frequency within documents matter?
I have my set of documents and have extracted the unique words from them, including a count of the number of times each word appears in the document. But it would seem from the documentation on the Python library I'm using, that word count within…
Rich Armstrong
- 121
2
votes
2 answers
How to calculate burstiness?
I would like some advice on how to calculate burstiness. I am working with a set of text data in which the daily frequency of every term in newspapers is recorded over 2 weeks, e.g. "apple" during the iPhone 4S release will be day1 = 10,…
drhanlau
- 169
2
votes
1 answer
How to use TF-IDF for feature selection in text classification?
I am somewhat confused about TF-IDF. I am planning to use TF-IDF to create a better word dictionary to be used in a Naive Bayes classifier. I am calculating the TF-IDF of all words in each class to find the importance of a given word in…
pankaj jha
- 75
1
vote
0 answers
An up-to-date keyword set on global news
My question is not strictly bound to the topic of text mining, but maybe you can help.
I am hunting for a keyword set that meets the following criteria:
- contains only English words/n-grams or named entities
- manual (tagged by human) tags of…
user25151
- 11
1
vote
0 answers
Address & email text matching to identify households
I am a beginner in text matching and indexing algorithms, so I need some ideas on what approach or algorithms to use to identify potential household members when I have their addresses and emails. The criterion is looking for similar…
Pb89
- 325
1
vote
2 answers
Topic extraction and scoring for text data
This question is related to text analytics.
I have text files which contain customer feedback for a chain of retail stores. My objective is to extract 5 main topics or entities from that data and score each store based on those entities and…
mathkid
- 121
1
vote
1 answer
What is text distance in data mining?
I need to write a report on visualization of multidimensional data, map distance, and text distance.
I found material on the other two but cannot find anything about text distance.
Is it related to data visualization?
Ashu
- 11