Highest Voted Questions - Data Science Stack Exchange

11

votes

3 answers

Is the direction of edges in a Bayes Network irrelevant?

Today, in a lecture it was claimed that the direction of edges in a Bayes network doesn't really matter. They don't have to represent causality. It is obvious that you cannot switch any single edge in a Bayes network. For example, let $G = (V, E)$…

bayesian-networks

asked Feb 02 '16 at 09:05

Martin Thoma

18,880
35
95
169

10

votes

4 answers

Why might several types of models give almost identical results?

I've been analyzing a data set of ~400k records and 9 variables. The dependent variable is binary. I've fitted a logistic regression, a regression tree, a random forest, and a gradient boosted tree. All of them give virtually identical goodness of fit…

asked Aug 18 '14 at 14:56

JenSCDC

317
1
10

10

votes

1 answer

Convergence in Hartigan-Wong k-means method and other algorithms

I have been trying to understand the different k-means clustering algorithms, mainly those that are implemented in the stats package of the R language. I understand Lloyd's algorithm and MacQueen's online algorithm. The way I understand them is as…

asked Jan 19 '16 at 20:59

Sid

101
1
5

10

votes

1 answer

Why is Reconstruction in Autoencoders Using the Same Activation Function as Forward Activation, and not the Inverse?

Suppose you have an input layer with $n$ neurons and the first hidden layer has $m$ neurons, with typically $m < n$. Then you compute the activation $a_j$ of the $j$-th neuron in the hidden layer by $a_j = f\left(\sum\limits_{i=1..n} w_{i,j}…

asked Jan 12 '16 at 23:39

Manfred Eppe

101
1
4

10

votes

2 answers

What is the difference between BERT and RoBERTa?

I want to understand the difference between BERT and RoBERTa. I saw the article below. https://towardsdatascience.com/bert-roberta-distilbert-xlnet-which-one-to-use-3d5ab82ba5f8 It mentions that RoBERTa was trained on 10x more data but I don't…

asked Jul 01 '21 at 11:02

Noman Tanveer

203
1
1
7

10

votes

3 answers

Building a machine learning model to predict crop yields based on environmental data

I have a dataset containing data on temperature, precipitation and soybean yields for a farm for 10 years (2005 - 2014). I would like to predict yields for 2015 based on this data. Please note that the dataset has DAILY values for temperature and…

asked Jan 04 '16 at 00:17

user308827

271
3
11

10

votes

2 answers

Software Testing for Data Science in R

I often use Nose, Tox or Unittest when testing my Python code, especially when it has to be integrated with other modules or other pieces of code. However, I've now found myself using R more than Python for ML modelling and development. I…

asked Jan 01 '16 at 00:30

wacax

3,390
4
23
45

10

votes

1 answer

What's the default scorer in scikit-learn's GridSearchCV?

Even if I don't define the scoring parameter, it scores and picks a best estimator, but the documentation says the default value for scoring is "None". So what is it using to score when I don't define a metric or list of metrics?

asked May 10 '21 at 15:01

TwoPointNo

101
1
3

10

votes

2 answers

Are there studies which examine dropout vs other regularizations?

Are there any papers published which show differences of the regularization methods for neural networks, preferably on different domains (or at least different datasets)? I am asking because I currently have the feeling that most people seem to use…

asked Dec 03 '15 at 21:30

Martin Thoma

18,880
35
95
169

10

votes

3 answers

How do various statistical techniques (regression, PCA, etc) scale with sample size and dimension?

Is there a known general table of statistical techniques that explains how they scale with sample size and dimension? For example, a friend of mine told me the other day that the computation time of simply quick-sorting one-dimensional data of size n…

asked Aug 05 '14 at 18:36

Bridgeburners

229
1
7

10

votes

1 answer

Server log analysis using machine learning

I was assigned the task of analyzing the server logs of our application, which contain exception logs, database logs, event logs, etc. I am new to machine learning; we use Spark with Elasticsearch and Spark's MLlib (or PredictionIO). An example of the…

asked Nov 27 '15 at 18:11

elric

111
1
1
3

10

votes

2 answers

Which database to use for storing machine learning data?

I am currently storing my training data in HDF5 files, and I want my team and me to switch to a database for two main reasons: the data is not used only by me, and the different datasets are stored in different folders at different paths, etc., so I…

machine-learning

asked Feb 02 '21 at 15:05

ava_punksmash

241
1
5

10

votes

1 answer

User-product positive (click data) available. How to generate negative (no-click data)?

It's very common in recommender systems that we have user-product data with a label such as "click". In order to learn the model, I need click and no-click data. The simplest approach to generating it is to take user-product pairs which are not found in click…

asked Nov 17 '15 at 16:10

p.paliwal

131
5

10

votes

1 answer

Text-Classification-Problem: Is Word2Vec/NN the best approach?

I am looking to design a system that, given a paragraph of text, will be able to categorize it and identify the context. It is trained with user-generated text paragraphs (like comments/questions/answers). Each item in the training set will be tagged…

asked Nov 04 '15 at 07:34

Shankar

101
3

10

votes

1 answer

Transforming AutoEncoders

I've just read Geoff Hinton's paper on transforming autoencoders (Hinton, Krizhevsky and Wang: Transforming Auto-encoders. In Artificial Neural Networks and Machine Learning, 2011) and would quite like to play around with something like this. But…

asked Oct 30 '15 at 15:59

Daniel Slater

256
1
8
