Highest Voted Questions - Data Science Stack Exchange

15

votes

3 answers

Is feature selection necessary?

I would like to run a machine learning model such as random forest, gradient boosting, or SVM on my dataset. There are more than 200 predictor variables in my dataset and my target is a binary variable. Do I need to run feature selection…

asked Jan 04 '17 at 08:46

LUSAQX

783
2
10
24

15

votes

4 answers

Alternatives to TF-IDF and Cosine Similarity when comparing documents of differing formats

I've been working on a small, personal project which takes a user's job skills and suggests the most ideal career for them based on those skills. I use a database of job listings to achieve this. At the moment, the code works as follows: 1) Process…

asked Jan 02 '17 at 20:41

Richard Knoche

151
1
1
3

15

votes

4 answers

How to compare the performance of feature selection methods?

There are several feature selection / variable selection approaches (see for example Guyon & Elisseeff, 2003; Liu et al., 2010): filter methods (e.g., correlation-based, entropy-based, random forest importance based), wrapper methods (e.g.,…

asked Dec 06 '16 at 13:31

hopfk

341
2
10

15

votes

2 answers

Can overfitting occur even with validation loss still dropping?

I have a convolutional + LSTM model in Keras, similar to this (ref 1), that I am using for a Kaggle contest. Architecture is shown below. I have trained it on my labeled set of 11000 samples (two classes, initial prevalence is ~9:1, so I upsampled…

asked Nov 20 '16 at 13:43

DeusXMachina

263
1
2
6

15

votes

3 answers

Predict the best time of call

I have a dataset including a set of customers in different cities of California, time of calling for each customer, and the status of call (True if customer answers the call and False if customer does not answer). I have to find an appropriate time…

asked Sep 21 '16 at 08:08

Hamid Mahdavian

159
1
3

15

votes

3 answers

How to choose a classifier after cross-validation?

When we do k-fold cross validation, should we just use the classifier that has the highest test accuracy? What is generally the best approach in getting a classifier from cross validation?

asked Sep 13 '16 at 03:23

Armon Safai

419
1
6
12

15

votes

4 answers

Performance measure: Why is it called recall and sensitivity?

Precision is the fraction of retrieved instances that are relevant, while recall (also known as sensitivity) is the fraction of relevant instances that are retrieved. I know their meaning, but I don't know why it is called recall. I am not a…

asked Jun 17 '16 at 19:19

Ahmad

407
6
15
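
The two definitions quoted in the question above (precision as the fraction of retrieved instances that are relevant, recall as the fraction of relevant instances that are retrieved) can be illustrated with a minimal sketch. The function name `precision_recall` and the document ids are hypothetical, chosen only for illustration, and returning 0.0 for an empty set is an assumption rather than a standard.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for two collections of item ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    # Items that are both retrieved and relevant.
    hits = len(retrieved & relevant)
    # Assumption: report 0.0 when a denominator would be zero.
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical data: 4 documents retrieved, 3 of them relevant,
# out of 6 relevant documents overall.
retrieved = ["d1", "d2", "d3", "d4"]
relevant = ["d1", "d2", "d3", "d5", "d6", "d7"]

precision, recall = precision_recall(retrieved, relevant)
print(precision, recall)  # 0.75 0.5
```

Here precision is 3/4 because three of the four retrieved documents are relevant, and recall is 3/6 because only three of the six relevant documents were retrieved.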

15

votes

5 answers

How can I ensure anonymity with queries to small datasets?

I'm building a service that will contain personal data relating to real people. Initially the dataset will be quite small, and as such it may be possible to identify individuals if the search parameters are narrowed sufficiently. An example of a…

asked Mar 23 '23 at 08:57

mal

253
1
6

15

votes

2 answers

Is there any APIs for crawling abstract of paper?

If I have a very long list of paper names, how could I get the abstracts of these papers from the internet or any database? The paper names are like "Assessment of Utility in Web Mining for the Domain of Public Health". Does anyone know any API that can…

asked May 17 '14 at 08:45

Alex Gao

253
2
5

15

votes

2 answers

Does scikit-learn use regularization by default?

I just fitted a logistic curve to some fake data. I made the data essentially a step function. data = -------------++++++++++++++ But when I look at the fitted curve, the slope is very small. The function that best minimizes the cost function,…

asked Mar 21 '16 at 06:51

sebastianspiegel

891
4
11
16

15

votes

1 answer

Can closer points be considered more similar in T-SNE visualization?

I understand from Hinton's paper that T-SNE does a good job in keeping local similarities and a decent job in preserving global structure (clusterization). However I'm not clear if points appearing closer in a 2D t-sne visualization can be assumed…

asked Mar 20 '16 at 16:11

Javierfdr

1,490
13
14

15

votes

1 answer

What is the difference between a (dynamic) Bayes network and a HMM?

I have read that HMMs, Particle Filters and Kalman filters are special cases of dynamic Bayes networks. However, I only know HMMs and I don't see the difference to dynamic Bayes networks. Could somebody please explain? It would be nice if your…

asked Jan 27 '16 at 19:58

Martin Thoma

18,880
35
95
169

14

votes

3 answers

How can I dynamically distinguish between categorical data and numerical data?

I know someone who is working on a project that involves ingesting files of data without regard to the columns or data types. The task is to take a file with any number of columns and various data types and output summary statistics on the numerical…

asked Jan 21 '16 at 20:15

Poisson Fish

243
3
6

14

votes

2 answers

Sentiment data for Emoji

For experimenting we'd like to use the Emoji embedded in many Tweets as a ground truth/training data for simple quantitative sentiment analysis. Tweets are usually too unstructured for NLP to work well. Anyway, there are 722 Emoji in Unicode 6.0,…

asked Aug 12 '14 at 07:57

Erich Schubert

341
3
8

14

votes

4 answers

How word2vec can be used to identify unseen words and relate them to already trained data

I was working on a word2vec gensim model and found it really interesting. I am interested in finding how an unknown/unseen word, when checked with the model, will be able to get similar terms from the trained model. Is this possible? Can word2vec be…

asked Dec 26 '15 at 03:47

gaurus

341
1
2
5
