Questions tagged [text-mining]

Refers to a subset of data mining concerned with extracting information from data in the form of text by recognizing patterns. The goal of text mining is often to classify a given document into one of a number of categories automatically, and to improve this performance dynamically, making it an example of machine learning. One example of this type of text mining is the spam filter used for email.

641 questions
7
votes
1 answer

Combining n-grams

In text mining, if we've computed n-gram counts for, say, $n = 1, \ldots, 4$, is there a principled way to combine them, other than just concatenating the tf-idf matrices for each one? (equivalent to an unweighted sum of kernels if we were to construct…
tdc
  • 7,569
4
votes
1 answer

Statistically fuzzy version of a checksum for file text signature

Background: I often end up downloading the same PDF article twice because I do not remember that I've already downloaded it. One workaround is to maintain an index of checksums (say MD5, etc.) based on the plaintext conversion of a PDF and if a match is…
curious_cat
  • 1,091
3
votes
2 answers

How can I analyze my incoming email?

I would like to analyze the email I receive in my Gmail inbox in order to systematically come up with effective Gmail filters for the most common types of email. I am prepared to manually curate and classify a very large volume of emails (perhaps…
Superbest
  • 253
  • 3
  • 11
3
votes
0 answers

Text Rank Algorithm to find Keywords

I was trying to implement the text rank algorithm described in the paper (http://acl.ldc.upenn.edu/acl2004/emnlp/pdf/Mihalcea.pdf). It seems simple to implement in Python, but I could not reproduce the exact result the authors report in the paper for the…
3
votes
4 answers

Automatic keyword generation evaluation

I have a simple text analyzer which generates keywords for a given input text. Until now I have been evaluating it manually, i.e., manually selecting keywords of a text and comparing them against the ones generated by the analyzer. Is there…
3
votes
2 answers

Data transposition from 'clustered rows' into columns

I am facing a difficult challenge, given my very low skills at text mining… Basically, I have a list of approx. 200 individuals described in a plain text file following a simple structure: N: (name) Y: (year of birth) S: (sibling) N: (name) Y:…
Fr.
  • 1,453
2
votes
1 answer

Continually updating naive Bayes classifier

I am attempting to use a Naive Bayes classifier to classify text. To accomplish this I have created an Excel sheet with a binary distribution for three variables. The workbook can be found here. Assuming that my math is correct, my questions…
AMAS
  • 59
2
votes
0 answers

R or Python Equivalent to SAS Text Miner

I am working on a project where a former member of the team used SAS Text Miner (https://support.sas.com/documentation/onlinedoc/txtminer/14.1/tmref.pdf) to complete some text mining. Unfortunately, this member of the team has moved on and no one…
benso8
  • 305
2
votes
1 answer

When generating a corpus for latent Dirichlet allocation, does word frequency within documents matter?

I have my set of documents and have extracted the unique words from them, including a count of the number of times each word appears in the document. But it would seem from the documentation of the Python library I'm using that word count within…
2
votes
2 answers

How to calculate burstiness?

I would like some advice on how to calculate burstiness. I am working with a set of text data in which the frequency of every term in newspapers is calculated for 2 weeks, e.g. "apple" during the iPhone 4S release will be day1 = 10,…
drhanlau
  • 169
2
votes
1 answer

How to use TF-IDF for feature selection in text classification?

I have a small confusion regarding TF-IDF. I am planning to use TF-IDF to create a better word dictionary for use in a Naive Bayes classifier. I am calculating the TF-IDF of all words in each class to find the importance of a given word in…
1
vote
0 answers

An up-to-date keyword set on global news

My question is not strictly bound to the topic of text mining, but maybe you can help. I am hunting for a keyword set that meets the following criteria: - contains only English words/n-grams or named entities - manual (tagged by human) tags of…
1
vote
0 answers

Address & email text matching to identify Households

I am a beginner in text matching and indexing-related algorithms, so I need some ideas on what my approach/algorithms should be to identify potential household members when I have their addresses and emails. The criterion is looking for similar…
Pb89
  • 325
1
vote
2 answers

Topic extracting and scoring for text data

This question is related to text analytics. I have text files which contain customer feedback for a chain of retail stores. My objective is to extract 5 main topics or entities from that data and score each store based on those entities and…
mathkid
  • 121
1
vote
1 answer

What is text distance in data mining

I need to write a report on visualization of multidimensional data, map and text distance. I found content related to the other two but have no clue about text distance. Is it related to data visualization?
Ashu
  • 11