Highest Voted Questions - Data Science Stack Exchange

11

votes

3 answers

When does the cache expire for an RDD in PySpark?

We use .cache() on an RDD to persistently cache a dataset. My concern is: when will this cache expire? dt = sc.parallelize([2, 3, 4, 5, 6]) dt.cache()

asked May 10 '16 at 12:38

krishna Prasad

1,147
1
14
23

11

votes

1 answer

How to determine if a character sequence is an English word or noise

What kind of features would you try to extract from a list of words in order to predict whether each one is an existing word or just a mess of characters? There is a description of the task that I found there. You have to write a program that can answer whether a given…

asked Apr 28 '16 at 17:20

vimi

155
1
7

11

votes

2 answers

Have 100% of the images in ImageNet been proven to belong to their annotated class?

Is it proven that all 15M images were manually classified correctly and there are no mistakes or randomly selected responses collected?

asked Sep 05 '22 at 18:36

ivan866

210
2
7

11

votes

6 answers

Python: Handling imbalanced classes in Python machine learning

I have a dataset for which I am trying to predict target variables.

Col1 Col2 Col3 Col4 Col5
1 2 23 11 1
2 22 12 14 1
22 11 43 38 3
14 22 25 19 3
12…

asked Apr 25 '16 at 07:26

SRS

1,065
5
11
22

11

votes

3 answers

How to group identical values and count their frequency in Python?

Newbie to analytics with Python so please be gentle :-) I couldn't find the answer to this question - apologies if it is already answered elsewhere in a different format. I have a dataset of transaction data for a retail outlet. Variables along with…

asked Apr 21 '16 at 18:49

new_analyst

123
1
1
7
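For the question above on grouping identical values and counting their frequency, a minimal sketch using only the standard library might look like this. The transaction values are made up for illustration, since the excerpt does not show the original dataset.

```python
from collections import Counter

# Hypothetical transaction categories; not taken from the question's data.
transactions = ["tea", "coffee", "tea", "milk", "coffee", "tea"]

# Counter groups identical values and counts how often each one occurs.
frequency = Counter(transactions)

# most_common() yields (value, count) pairs sorted by descending count.
for value, count in frequency.most_common():
    print(value, count)
```

If the data already lives in a pandas DataFrame, `Series.value_counts()` gives the same kind of summary for a single column.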

11

votes

1 answer

Decision tree, how to understand or calculate the probability/confidence of prediction result

For example, a drug prediction problem using a decision tree. I trained the decision tree model and would like to predict using new data. For example: patient, Attr1, Attr2, Attr3, .., Label 002 90.0 8.0 98.0 ... ? ===> predict drug…

decision-trees

asked Apr 12 '16 at 13:45

GoingMyWay

233
1
2
9

11

votes

2 answers

How do "intent recognisers" work?

Amazon's Alexa, Nuance's Mix and Facebook's Wit.ai all use a similar system to specify how to convert a text command into an intent - i.e. something a computer would understand. I'm not sure what the "official" name for this is but I call it "intent…

asked Apr 05 '16 at 09:03

Timmmm

231
1
2
5

11

votes

2 answers

Neural net for server monitoring

I'm looking at pybrain for taking server monitor alarms and determining the root cause of a problem. I'm happy with training it using supervised learning and curating the training data sets. The data is structured something like this: Server Type A…

asked Sep 10 '14 at 14:50

Matt Williamson

211
1
2

11

votes

1 answer

HOW TO: Deep Neural Network weight initialization

Given a difficult learning task (e.g. high dimensionality, inherent data complexity), deep neural networks become hard to train. To ease many of the problems one might: normalize and handpick quality data; choose a different training algorithm (e.g…

asked Mar 28 '16 at 12:17

Joonatan Samuel

219
2
5

11

votes

2 answers

Dismissing features based on correlation with target variable

Is it valid to dismiss features based on their Pearson correlation values with the target variable in a classification problem? Say, for instance, I have a dataset with the following format, where the target variable takes 1 or 0: >>> dt.head() ID …

asked Mar 12 '16 at 15:21

MedAli

275
1
2
10
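As a small illustration for the question above on dismissing features by Pearson correlation, the sketch below shows one reason for caution: Pearson correlation only measures linear association, so a feature can have a correlation near zero with a binary target and still determine it. The `pearson` helper and the toy data are assumptions made for this example, not part of the question.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Toy data: the target is 1 exactly when |feature| >= 2, so the feature
# fully determines the target, but the relationship is not linear.
feature = [-2, -1, 0, 1, 2]
target = [1, 0, 0, 0, 1]

r = pearson(feature, target)
print(f"Pearson r = {r:.3f}")
```

A correlation filter would drop this feature even though a simple threshold on its absolute value classifies every row correctly.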

11

votes

4 answers

Which one first: algorithm benchmarking, feature selection, parameter tuning?

When trying to do e.g. a classification, my approach currently is to (1) try out various algorithms first and benchmark them, (2) perform feature selection on the best algorithm from 1 above, (3) tune the parameters using the selected features and…

asked Mar 06 '16 at 17:57

Ricky

269
3
10

11

votes

2 answers

Is the Transformer decoder an autoregressive model?

I have been trying to find an answer to these questions, but I only find conflicting information. Is the transformer as a whole autoregressive or not? And what about the decoder? I understand that the decoder during inference proceeds…

asked Nov 15 '21 at 18:36

Dametime

223
1
2
4

11

votes

2 answers

Why use bootstrapping?

The wiki page for bootstrapping says that you use it in the case where the underlying distribution is unknown. Why is bootstrapping, or sampling with replacement, better than just calculating the variance and other properties from the data directly?

statistics

asked Feb 26 '16 at 06:04

sebastianspiegel

891
4
11
16
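To make the bootstrapping question above concrete, here is a minimal sketch of sampling with replacement to estimate the standard error of a statistic. It uses the median, a statistic with no simple textbook variance formula, which is one situation where resampling is commonly used. The sample values, the function name `bootstrap_standard_error`, and its parameters are all hypothetical.

```python
import random
import statistics

def bootstrap_standard_error(data, statistic, n_resamples=2000, seed=0):
    """Estimate the standard error of `statistic` by resampling with replacement."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(data)
    estimates = []
    for _ in range(n_resamples):
        # Draw n values from the data with replacement, then recompute the statistic.
        resample = rng.choices(data, k=n)
        estimates.append(statistic(resample))
    # The spread of the resampled statistics approximates its standard error.
    return statistics.stdev(estimates)

# Hypothetical observations, for illustration only.
sample = [2.1, 3.4, 1.9, 5.6, 4.2, 3.8, 2.7, 6.1, 3.3, 4.9]

se_median = bootstrap_standard_error(sample, statistics.median)
print(f"bootstrap standard error of the median: {se_median:.3f}")
```

The same function accepts any callable as `statistic`, so it can be reused for a trimmed mean or a quantile without deriving a separate formula for each.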

11

votes

2 answers

Tools for automatic anomaly detection on a SQL table?

I have a large SQL table that is essentially a log. The data is pretty complex and I'm trying to find some way to identify anomalies without me understanding all the data. I've found lots of tools for Anomaly Detection but most of them require a…

asked Feb 12 '16 at 17:52

THE JOATMON

211
2
4

11

votes

1 answer

Knn distance plot for determining eps of DBSCAN

I would like to use the knn distance plot to figure out which eps value I should choose for the DBSCAN algorithm. Based on this page: The idea is to calculate the average of the distances of every point to its k nearest neighbors. …

asked Feb 09 '16 at 16:29

Marc Lamberti

327
1
3
8
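The idea described in the excerpt above, averaging each point's distances to its k nearest neighbors, can be sketched in plain Python as follows. The function name `k_distances` and the toy points are assumptions for illustration. Sorting the values gives the curve that is usually plotted, and a sharp bend in that curve is typically read as a candidate for eps.

```python
import math

def k_distances(points, k):
    """Mean distance from each point to its k nearest neighbors, sorted ascending."""
    result = []
    for i, p in enumerate(points):
        # Distances from p to every other point, nearest first.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        result.append(sum(dists[:k]) / k)
    return sorted(result)

# Toy data: four points on a unit square plus one far-away outlier.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (10.0, 10.0)]

curve = k_distances(points, k=2)
for rank, value in enumerate(curve):
    print(rank, round(value, 3))
```

This brute-force version compares every pair of points, so it is only meant for small datasets; a spatial index would be the usual choice at scale.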
