Most Popular

1500 questions
15
votes
3 answers

Is feature selection necessary?

I would like to run a machine learning model such as random forest, gradient boosting, or SVM on my dataset. There are more than 200 predictor variables in my dataset and my target is a binary variable. Do I need to run feature selection…
LUSAQX
  • 783
  • 2
  • 10
  • 24
15
votes
4 answers

Alternatives to TF-IDF and Cosine Similarity when comparing documents of differing formats

I've been working on a small, personal project which takes a user's job skills and suggests the most ideal career for them based on those skills. I use a database of job listings to achieve this. At the moment, the code works as follows: 1) Process…
Richard Knoche
  • 151
  • 1
  • 1
  • 3
15
votes
4 answers

How to compare the performance of feature selection methods?

There are several feature selection / variable selection approaches (see for example Guyon & Elisseeff, 2003; Liu et al., 2010): filter methods (e.g., correlation-based, entropy-based, random forest importance based), wrapper methods (e.g.,…
hopfk
  • 341
  • 2
  • 10
15
votes
2 answers

Can overfitting occur even with validation loss still dropping?

I have a convolutional + LSTM model in Keras, similar to this (ref 1), that I am using for a Kaggle contest. Architecture is shown below. I have trained it on my labeled set of 11000 samples (two classes, initial prevalence is ~9:1, so I upsampled…
DeusXMachina
  • 263
  • 1
  • 2
  • 6
15
votes
3 answers

Predict the best time of call

I have a dataset including a set of customers in different cities of California, time of calling for each customer, and the status of call (True if customer answers the call and False if customer does not answer). I have to find an appropriate time…
Hamid Mahdavian
  • 159
  • 1
  • 3
15
votes
3 answers

How to choose a classifier after cross-validation?

When we do k-fold cross-validation, should we just use the classifier that has the highest test accuracy? What is generally the best approach to getting a classifier from cross-validation?
Armon Safai
  • 419
  • 1
  • 6
  • 12
15
votes
4 answers

Performance measure: Why is it called recall and sensitivity?

Precision is the fraction of retrieved instances that are relevant, while recall (also known as sensitivity) is the fraction of relevant instances that are retrieved. I know their meaning, but I don't know why it is called recall. I am not a…
Ahmad
  • 407
  • 6
  • 15
15
votes
5 answers

How can I ensure anonymity with queries to small datasets?

I'm building a service that will contain personal data relating to real people. Initially the dataset will be quite small, and as such it may be possible to identify individuals if the search parameters are narrowed sufficiently. An example of a…
mal
  • 253
  • 1
  • 6
15
votes
2 answers

Are there any APIs for crawling paper abstracts?

If I have a very long list of paper names, how could I get the abstracts of these papers from the internet or a database? The paper names are like "Assessment of Utility in Web Mining for the Domain of Public Health". Does anyone know of any API that can…
Alex Gao
  • 253
  • 2
  • 5
15
votes
2 answers

Does scikit-learn use regularization by default?

I just fitted a logistic curve to some fake data. I made the data essentially a step function. data = -------------++++++++++++++ But when I look at the fitted curve, the slope is very small. The function that best minimizes the cost function,…
sebastianspiegel
  • 891
  • 4
  • 11
  • 16
15
votes
1 answer

Can closer points be considered more similar in t-SNE visualization?

I understand from Hinton's paper that t-SNE does a good job of keeping local similarities and a decent job of preserving global structure (clusterization). However, I'm not clear if points appearing closer in a 2D t-SNE visualization can be assumed…
Javierfdr
  • 1,490
  • 13
  • 14
15
votes
1 answer

What is the difference between a (dynamic) Bayes network and a HMM?

I have read that HMMs, particle filters and Kalman filters are special cases of dynamic Bayes networks. However, I only know HMMs and I don't see how they differ from dynamic Bayes networks. Could somebody please explain? It would be nice if your…
Martin Thoma
  • 18,880
  • 35
  • 95
  • 169
14
votes
3 answers

How can I dynamically distinguish between categorical data and numerical data?

I know someone who is working on a project that involves ingesting files of data without regard to the columns or data types. The task is to take a file with any number of columns and various data types and output summary statistics on the numerical…
Poisson Fish
  • 243
  • 3
  • 6
14
votes
2 answers

Sentiment data for Emoji

For experimenting we'd like to use the Emoji embedded in many Tweets as a ground truth/training data for simple quantitative sentiment analysis. Tweets are usually too unstructured for NLP to work well. Anyway, there are 722 Emoji in Unicode 6.0,…
Erich Schubert
  • 341
  • 3
  • 8
14
votes
4 answers

How can word2vec be used to identify unseen words and relate them to already trained data?

I was working on a gensim word2vec model and found it really interesting. I am interested in finding out how an unknown/unseen word, when checked against the model, can retrieve similar terms from the trained model. Is this possible? Can word2vec be…
gaurus
  • 341
  • 1
  • 2
  • 5