Highest Voted Questions - Data Science Stack Exchange

11

votes

2 answers

Implementing Complementary Naive Bayes in Python?

Problem: I have tried using Naive Bayes on a labeled data set of crime data but got really poor results (7% accuracy). Naive Bayes runs much faster than the other algorithms I've been using, so I wanted to try finding out why the score was so…

asked Jun 21 '15 at 08:22

grasshopper

213
1
5

11

votes

4 answers

How do you create an optimized walk list given longitude and latitude coordinates?

I am working on a political campaign where dozens of volunteers will be conducting door-knocking promotions over the next few weeks. Given a list with names, addresses and long/lat coordinates, what algorithms can be used to create an optimized walk…

algorithms

asked Jun 26 '14 at 14:39

McGovernTheory

219
1
4
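
One simple starting point for the walk-list question above is a greedy nearest-neighbor ordering over great-circle distances. The sketch below is only an illustration under stated assumptions: stops are `(lat, lon)` tuples in degrees, straight-line distance stands in for real walking distance along streets, and the names `haversine_km` and `nearest_neighbor_walk` are hypothetical. The greedy heuristic does not guarantee the shortest route.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an approximation


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def nearest_neighbor_walk(stops, start=0):
    """Return a visiting order (list of indices) built greedily from `start`."""
    if not stops:
        return []
    unvisited = set(range(len(stops)))
    unvisited.remove(start)
    order = [start]
    while unvisited:
        here = stops[order[-1]]
        # Always walk to the closest stop that has not been visited yet.
        nxt = min(unvisited, key=lambda i: haversine_km(here, stops[i]))
        order.append(nxt)
        unvisited.remove(nxt)
    return order


if __name__ == "__main__":
    # Hypothetical addresses laid out along one street (same longitude).
    stops = [(40.000, -75.0), (40.003, -75.0), (40.001, -75.0), (40.002, -75.0)]
    print(nearest_neighbor_walk(stops))
```

With dozens of volunteers, the addresses would first have to be split into one group per volunteer (for example by clustering the coordinates) before ordering each group; that step is not shown here.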

11

votes

3 answers

Which is faster: PostgreSQL vs MongoDB on large JSON datasets?

I have a large dataset with 9m JSON objects at ~300 bytes each. They are posts from a link aggregator: basically links (a URL, title and author id) and comments (text and author ID) + metadata. They could very well be relational records in a table,…

asked Jun 03 '15 at 20:29

blue-dino

383
2
3
11

11

votes

3 answers

What are R's memory constraints?

In reviewing “Applied Predictive Modeling”, a reviewer states: One critique I have of statistical learning (SL) pedagogy is the absence of computation performance considerations in the evaluation of different modeling techniques. With its…

asked May 14 '14 at 17:48

blunders

1,932
2
15
19

11

votes

2 answers

Delete/drop only the rows in which all values are NaN in pandas

I have a DataFrame; I need to drop the rows in which all the values are NaN.

ID   Age  Gender
601  21   M
501  NaN  F
NaN  NaN  NaN

The resulting data frame should look like:

Id   Age  Gender
601  21   M
501  …

asked Sep 09 '19 at 09:33

Harshith

283
2
5
16

11

votes

3 answers

How to encode a class with 24,000 categories?

I'm currently working on a logistic regression model for genomics. One of the input fields I want to include as a covariate is genes. There are around 24,000 known genes. There are many features with this level of variability in computational…

asked Sep 03 '19 at 00:52

Kermit

529
5
17

11

votes

5 answers

LinkedIn web scraping

I recently discovered a new R package for connecting to the LinkedIn API. Unfortunately the LinkedIn API seems pretty limited to begin with; for example, you can only get basic data on companies, and this is detached from data on individuals. I'd…

asked May 13 '15 at 21:01

christopherlovell

480
1
5
18

11

votes

3 answers

Why does Gradient Boosting regression predict negative values when there are no negative y-values in my training set?

As I increase the number of trees in scikit-learn's GradientBoostingRegressor, I get more negative predictions, even though there are no negative values in my training or testing set. I have about 10 features, most of which are binary. Some of the…

asked Jun 24 '14 at 19:43

user2592989

219
1
2
6

11

votes

1 answer

Why could my DDQN get significantly worse after beating the game repeatedly?

I've been trying to train a DDQN to play OpenAI Gym's CartPole-v1, but found that although it starts off well and starts getting full score (500) repeatedly (at around 600 episodes in the pic below), it then seems to go off the rails and do worse…

asked Jul 20 '19 at 08:06

Danny Tuppeny

213
2
7

11

votes

4 answers

Feature Extraction Technique - Summarizing a Sequence of Data

I am often building a model (classification or regression) where I have some predictor variables that are sequences, and I have been trying to find technique recommendations for summarizing them in the best way possible for inclusion as predictors in…

asked Jun 23 '14 at 23:20

B_Miner

702
1
7
20

11

votes

2 answers

What is the difference between an Embedding Layer and an Autoencoder?

I'm reading about Embedding layers, especially applied to NLP and word2vec, and they seem nothing more than an application of Autoencoders for dimensionality reduction. Are they different? If so, what are the differences between them?

asked Jun 21 '19 at 15:52

Leevo

6,225
3
16
52

11

votes

4 answers

How can Time Series Analysis be done with Categorical Variables?

Most of the time series analysis tutorials and textbooks I've read, be they for univariate or multivariate time series data, deal with continuous numerical variables. I currently have a problem at hand that deals with multivariate time…

asked Jun 20 '19 at 09:06

Brian Yen

111
1
1
5

11

votes

2 answers

Amplifying a Locality Sensitive Hash

I'm trying to build a cosine locality sensitive hash so I can find candidate similar pairs of items without having to compare every possible pair. I have it basically working, but most of the pairs in my data seem to have cosine similarity in the…

machine-learning

asked Jan 30 '15 at 11:08

Philip Pearl

251
1
5
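
For the locality-sensitive hashing question above, the usual way to amplify a hash family is the AND-then-OR (banding) construction. The sketch below only computes the resulting candidate probability; it assumes the hash is the random-hyperplane (sign) hash for cosine similarity, where a single bit agrees with probability 1 - θ/π for vectors at angle θ. The function names and the `rows`/`bands` parameters are illustrative, not taken from the question.

```python
import math


def bit_agreement_probability(cosine_similarity):
    """Probability that one random-hyperplane bit agrees for two vectors."""
    # Clamp to guard against rounding slightly outside [-1, 1].
    c = max(-1.0, min(1.0, cosine_similarity))
    return 1.0 - math.acos(c) / math.pi


def candidate_probability(cosine_similarity, rows, bands):
    """Probability that a pair becomes a candidate under AND-then-OR banding.

    A pair is a candidate if, in at least one of `bands` bands, all `rows`
    bits of that band agree: 1 - (1 - p**rows)**bands.
    """
    p = bit_agreement_probability(cosine_similarity)
    return 1.0 - (1.0 - p ** rows) ** bands


if __name__ == "__main__":
    # Compare how sharply different (rows, bands) settings separate pairs.
    for sim in (0.5, 0.7, 0.9, 0.99):
        print(sim, round(candidate_probability(sim, rows=16, bands=8), 4))
```

Increasing `rows` pushes the curve toward rejecting dissimilar pairs, while increasing `bands` recovers similar pairs that a single band would miss, so the two are tuned together against the similarity range of the data.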

11

votes

1 answer

What is a fractionally-strided convolution layer?

In the paper Generating High-Quality Crowd Density Maps using Contextual Pyramid CNNs, Section 3.4 says: Since, the aim of this work is to estimate high-resolution and high-quality density maps, F-CNN is constructed using a set of …

asked Apr 15 '19 at 03:26

Haha TTpro

243
1
2
7

11

votes

3 answers

Inverse Relationship Between Precision and Recall

I did some searching to learn about precision and recall, and I saw some graphs that show an inverse relationship between the two, so I started thinking about it to clarify the subject. Does the inverse relationship always hold? Suppose I have a…

asked Apr 11 '19 at 11:48

tkarahan

422
5
14
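
A small worked example can help with the precision/recall question above. The sketch below uses made-up scores and labels (pure assumptions for illustration) and sweeps the decision threshold. Lowering the threshold can never reduce recall, but precision may move in either direction, so the inverse relationship is a tendency and not a rule. Returning a precision of 1.0 when nothing is predicted positive is a convention chosen here.

```python
def precision_recall(scores, labels, threshold):
    """Precision and recall when predicting positive for score >= threshold."""
    pairs = list(zip(scores, labels))
    tp = sum(1 for s, y in pairs if s >= threshold and y == 1)
    fp = sum(1 for s, y in pairs if s >= threshold and y == 0)
    fn = sum(1 for s, y in pairs if s < threshold and y == 1)
    # Convention (assumption): precision is 1.0 with no positive predictions.
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical classifier scores and true labels (1 = positive).
scores = [0.95, 0.85, 0.70, 0.60, 0.40, 0.30]
labels = [1, 1, 0, 1, 0, 0]

if __name__ == "__main__":
    for threshold in (0.9, 0.65, 0.5, 0.2):
        p, r = precision_recall(scores, labels, threshold)
        print(f"threshold={threshold:.2f} precision={p:.3f} recall={r:.3f}")
```

In this toy data, moving the threshold from 0.65 to 0.5 raises both precision (2/3 to 3/4) and recall (2/3 to 1), while moving it further to 0.2 lowers precision to 1/2 with recall unchanged.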
