Most Popular

1500 questions
11 votes, 3 answers

When does the cache expire for an RDD in PySpark?

We use .cache() on an RDD to persistently cache a dataset. My concern is: when will this cache expire?
dt = sc.parallelize([2, 3, 4, 5, 6])
dt.cache()
krishna Prasad
11 votes, 1 answer

How to determine if a character sequence is an English word or noise

What kind of features would you try to extract from a list of words to predict whether each one is an existing word or just a mess of characters? There is a description of the task that I found there. You have to write a program that can answer whether a given…
vimi
11 votes, 2 answers

Have 100% of the images in ImageNet been proven to belong to their annotated class?

Is it proven that all 15M images were manually classified correctly and that no mistakes or randomly selected responses were collected?
ivan866
11 votes, 6 answers

Python: Handling imbalanced classes in Python machine learning

I have a dataset for which I am trying to predict target variables.
Col1 Col2 Col3 Col4 Col5
1    2    23   11   1
2    22   12   14   1
22   11   43   38   3
14   22   25   19   3
12…
SRS
11 votes, 3 answers

How to group identical values and count their frequency in Python?

Newbie to analytics with Python so please be gentle :-) I couldn't find the answer to this question - apologies if it is already answered elsewhere in a different format. I have a dataset of transaction data for a retail outlet. Variables along with…
new_analyst
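For the question above on grouping identical values, here is a minimal sketch using only the standard library. The `transactions` list and its item names are invented for illustration; the asker's actual variables are cut off in the excerpt, so treat this as one possible approach rather than the accepted answer.

```python
from collections import Counter

# Hypothetical transaction items; the real dataset is not shown in the excerpt.
transactions = ["milk", "bread", "milk", "eggs", "bread", "milk"]

# Counter groups identical values and tallies how often each one occurs.
frequency = Counter(transactions)

# most_common() returns (value, count) pairs sorted by descending count.
top_items = frequency.most_common()
print(top_items)
```

If the data already lives in a pandas DataFrame, `value_counts()` on the relevant column gives the same kind of tally.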
11 votes, 1 answer

Decision tree: how to understand or calculate the probability/confidence of a prediction result

For example, a drug prediction problem using a decision tree. I trained the decision tree model and would like to predict using new data. For example:
patient, Attr1, Attr2, Attr3, .., Label
002 90.0 8.0 98.0 ... ? ===> predict drug…
GoingMyWay
11 votes, 2 answers

How do "intent recognisers" work?

Amazon's Alexa, Nuance's Mix and Facebook's Wit.ai all use a similar system to specify how to convert a text command into an intent - i.e. something a computer would understand. I'm not sure what the "official" name for this is but I call it "intent…
Timmmm
11 votes, 2 answers

Neural net for server monitoring

I'm looking at pybrain for taking server monitor alarms and determining the root cause of a problem. I'm happy with training it using supervised learning and curating the training data sets. The data is structured something like this: Server Type A…
Matt Williamson
11 votes, 1 answer

HOW TO: Deep Neural Network weight initialization

Given a difficult learning task (e.g. high dimensionality, inherent data complexity), deep neural networks become hard to train. To ease many of the problems one might: normalize and handpick quality data; choose a different training algorithm (e.g…
Joonatan Samuel
11 votes, 2 answers

Dismissing features based on correlation with target variable

Is it valid to dismiss features based on their Pearson correlation values with the target variable in a classification problem? Say, for instance, I have a dataset with the following format, where the target variable takes 1 or 0: >>> dt.head() ID …
MedAli
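One point relevant to the correlation question above: Pearson correlation measures only linear association, so a feature can have a correlation of about zero with a binary target and still determine it completely. The sketch below illustrates this with invented data (`x`, `target`, and the helper `pearson` are hypothetical names, not taken from the asker's dataset).

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    var_x = sum((a - mean_x) ** 2 for a in xs)
    var_y = sum((b - mean_y) ** 2 for b in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical feature and a 0/1 target defined as: target = 1 when |x| >= 2.
x = [-3.0, -2.0, -1.0, 1.0, 2.0, 3.0]
target = [1, 1, 0, 0, 1, 1]

# The raw feature is symmetric around zero, so its linear correlation vanishes.
r_raw = pearson(x, target)

# A simple nonlinear transform of the same feature correlates strongly.
abs_x = [abs(v) for v in x]
r_abs = pearson(abs_x, target)

print(f"raw feature: r = {r_raw:.3f}")
print(f"|x| feature: r = {r_abs:.3f}")
```

So a low Pearson value on its own says little about whether a nonlinear model could still use the feature.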
11 votes, 4 answers

Which one first: algorithm benchmarking, feature selection, parameter tuning?

When trying to do e.g. a classification, my approach currently is to:
1. try out various algorithms first and benchmark them
2. perform feature selection on the best algorithm from 1 above
3. tune the parameters using the selected features and…
Ricky
11 votes, 2 answers

Is the Transformer decoder an autoregressive model?

I have been trying to find an answer to these questions, but I only find conflicting information. Is the transformer as a whole autoregressive or not? And what about the decoder? I understand that the decoder during inference proceeds…
Dametime
11 votes, 2 answers

Why use bootstrapping?

The wiki page for bootstrapping says that you use it in the case where the underlying distribution is unknown. Why is bootstrapping, or sampling with replacement, better than just calculating the variance and other properties from the data directly?
sebastianspiegel
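To make the bootstrapping question above concrete, here is a minimal sketch of sampling with replacement, assuming we want the standard error of the sample median (a statistic with no simple closed-form variance). The `sample` values, the helper name `bootstrap_std_error`, and the resample count are all hypothetical choices for illustration.

```python
import random
import statistics

def bootstrap_std_error(data, statistic, n_resamples=2000, seed=0):
    """Estimate the standard error of `statistic` by resampling `data` with replacement."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(data)
    estimates = []
    for _ in range(n_resamples):
        # Draw n values from the data with replacement (one bootstrap resample).
        resample = rng.choices(data, k=n)
        estimates.append(statistic(resample))
    # The spread of the resampled statistics approximates its sampling variability.
    return statistics.stdev(estimates)

# Hypothetical measurements.
sample = [2.1, 3.4, 1.9, 5.6, 4.2, 3.3, 2.8, 7.1, 3.9, 4.4]
se_median = bootstrap_std_error(sample, statistics.median)
print(f"bootstrap standard error of the median: {se_median:.3f}")
```

For the mean, the usual formula (standard deviation divided by the square root of n) is available directly; the resampling approach becomes more useful for statistics like the median where such a formula is not at hand.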
11 votes, 2 answers

Tools for automatic anomaly detection on a SQL table?

I have a large SQL table that is essentially a log. The data is pretty complex and I'm trying to find some way to identify anomalies without me understanding all the data. I've found lots of tools for Anomaly Detection but most of them require a…
THE JOATMON
11 votes, 1 answer

Knn distance plot for determining eps of DBSCAN

I would like to use the knn distance plot to figure out which eps value I should choose for the DBSCAN algorithm. Based on this page: The idea is to calculate the average of the distances of every point to its k nearest neighbors. …
Marc Lamberti
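The idea quoted in the DBSCAN question above (averaging each point's distances to its k nearest neighbors) can be sketched with the standard library alone. The `points` data and the helper name `mean_knn_distances` are made up for illustration, and the brute-force loop is only suitable for small datasets.

```python
import math

def mean_knn_distances(points, k):
    """For each point, average the distances to its k nearest neighbors; return them sorted."""
    averages = []
    for i, p in enumerate(points):
        # Distances from p to every other point, nearest first.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        averages.append(sum(dists[:k]) / k)
    # Sorting ascending gives the curve that a k-distance plot would show.
    return sorted(averages)

# Hypothetical data: a tight square of four points plus one far-away outlier.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (10.0, 10.0)]
k_dist = mean_knn_distances(points, k=2)
print(k_dist)
```

Plotting `k_dist` against its index and looking for the sharp bend (here, the jump at the outlier) is the usual way such a curve is read when picking a candidate eps.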