Questions tagged [scikit-learn]

A machine-learning library for Python. Use this tag for any on-topic question that (a) involves scikit-learn either as a critical part of the question or expected answer, and (b) is not just about how to use scikit-learn.

scikit-learn is a machine-learning library for Python that provides simple and efficient tools for data analysis and data mining. It is accessible to everybody and reusable in various contexts. It is built on NumPy, SciPy, and matplotlib. The project is open source and commercially usable (BSD license).

1802 questions
10
votes
2 answers

Classification algorithms that return confidence?

Given a machine learning model built on top of scikit-learn, how can I classify new instances but then choose only those with the highest confidence? How do we define confidence in machine learning and how to generate it (if not generated…
user2295350
  • 421
  • 4
  • 10
  • 19
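
One common way to read "confidence" in scikit-learn is the highest class probability returned by `predict_proba`. The sketch below is only an illustration under assumptions that are not part of the question: the data is synthetic, the classifier is `LogisticRegression`, and the `threshold` of 0.9 is a hypothetical cut-off. Probabilities from some classifiers may be poorly calibrated, so treat this as a starting point rather than a definitive recipe.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the asker's data set (assumption).
X, y = make_classification(n_samples=300, n_features=6, random_state=0)
X_train, X_new, y_train, _ = train_test_split(
    X, y, test_size=0.3, random_state=0
)

clf = LogisticRegression().fit(X_train, y_train)

# One row per new instance, one column per class; each row sums to 1.
proba = clf.predict_proba(X_new)

# Treat the largest class probability as the "confidence" of the prediction.
confidence = proba.max(axis=1)
predicted = clf.classes_[proba.argmax(axis=1)]

# Keep only the predictions at or above a hypothetical confidence threshold.
threshold = 0.9
keep = confidence >= threshold
confident_predictions = predicted[keep]
```

Raising `threshold` keeps fewer but, if the probabilities are well calibrated, more reliable predictions.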
5
votes
1 answer

In scikit-learn, should classifiers be reinstated after every fold?

In scikit-learn, I don't see any classifier "unfit" or "unlearn" method similar to the untrain method of the classifiers in pyMVPA http://www.pymvpa.org/generated/mvpa2.clfs.svm.LinearCSVMC.html#mvpa2.clfs.svm.LinearCSVMC When I was using pyMVPA,…
Andy S
  • 392
  • 1
  • 9
4
votes
2 answers

Question about the Scikit-learn "SVM-Anova: SVM with univariate feature selection" example

Can anyone explain to me why in the Scikit-learn "SVM-Anova: SVM with univariate feature selection" example http://scikit-learn.org/stable/auto_examples/svm/plot_svm_anova.html when we use all features (100 percentile), the model has much better…
4
votes
0 answers

Add external features to sklearn pipeline

I'm using sklearn's Pipeline and GridSearchCV to apply grid search on a text data classification problem as follows: text_clf = Pipeline([('vect', CountVectorizer()), ('tfidf', TfidfTransformer()), ('clf',…
Eitan
  • 131
3
votes
2 answers

How to find the optimal number of clusters for spectral clustering using similarity matrix in scikit learn

What is the best way of finding out the optimal number of clusters, given that I just have a similarity matrix? Is it possible to do it all in Scikit-learn without any extra implementation? My suggestion? PCA it! Please correct, Thank you!
3
votes
0 answers

Can I optimize for specificity using GridSearchCV in sklearn?

I am developing a fraud detection system for online purchases using random forests. My first concern was to optimize for recall, as the dataset was originally unbalanced (98% no fraud events and 2% fraud events). After resampling my dataset and…
Manuel Q
  • 153
  • 8
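
Specificity is the recall of the negative class, so one possible approach is to wrap `recall_score` with `pos_label=0` in `make_scorer` and pass it as `scoring`. The sketch below assumes binary labels 0 (no fraud) and 1 (fraud), synthetic data, and a hypothetical, deliberately tiny parameter grid; none of these come from the question.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import make_scorer, recall_score
from sklearn.model_selection import GridSearchCV

# Synthetic imbalanced data: roughly 90% class 0, 10% class 1 (assumption).
X, y = make_classification(n_samples=400, weights=[0.9, 0.1], random_state=0)

# Specificity = recall computed with the negative class (label 0) as "positive".
specificity = make_scorer(recall_score, pos_label=0)

grid = GridSearchCV(
    RandomForestClassifier(n_estimators=25, random_state=0),
    param_grid={"max_depth": [2, 4]},  # hypothetical grid for illustration
    scoring=specificity,
    cv=3,
)
grid.fit(X, y)

# Mean cross-validated specificity of the best parameter setting.
best_specificity = grid.best_score_
```

Optimizing specificity alone can favor a model that predicts the negative class almost everywhere, so it is usually worth checking recall on the positive class as well.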
2
votes
1 answer

I can't explain this precision score

I am printing out the precision score and confusion matrix using sklearn. print("Confusion matrix:") print(confusion_matrix(test_y, predict_y)) print("Precision:", precision_score(test_y, predict_y)) The output is: Confusion matrix: [[910 16] […
RajV
  • 123
2
votes
1 answer

Is it possible to apply transform operations only to the training set while cross-validating with cross_val_score and pipelines?

I am trying to include steps in a pipeline that transform the data, e.g. balancing the dataset. This pipeline is intended to be used with cross_val_score. However, I want to apply some transformations only over the training fold, since it may have no…
jias
  • 21
2
votes
2 answers

Is it legitimate to modify the classification of a scikit-learn random forest classifier by changing its default threshold?

I am using a random forest binary classifier (in sklearn) in Python to detect anomalous events with an extremely unbalanced class dataset (1% are positive and 99% are negative). My recall score for the positive class is generally above 4%, not very…
1
vote
1 answer

sklearn.metrics.r2_score vs sklearn.LinearRegression.score

I'm using sklearn to calculate the coefficient of determination between X (true age) and Y (predicted age). But I'm getting two different values for two different methods, which to the best of my understanding should be identical. Here is the data…
reas0n
  • 123
1
vote
0 answers

What is the RFE feature ranking based method of scikit-learn?

I have used scikit-learn's RFE and RFECV to select the best features (https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFE.html). It gives the ranking of the features. Can anyone explain the ranking based method that is…
rk__
  • 11
1
vote
1 answer

Why does the last estimator of the Sklearn only get fit and not transformed?

Here is the documentation for the pipeline constructor from Sklearn website: Sequentially apply a list of transforms and a final estimator. Intermediate steps of the pipeline must be ‘transforms’, that is, they must implement fit and transform…
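
The quoted documentation can be illustrated with a small sketch: on `fit`, each intermediate step is fit and then used to transform the data, while the final estimator is only fit on the transformed result. The data set and the two steps chosen here (`StandardScaler`, `LogisticRegression`) are assumptions for illustration and do not come from the question.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data for illustration (assumption).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# The intermediate step implements fit and transform; the last step only needs fit.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression()),
])
pipe.fit(X, y)

# Manual equivalent: fit and transform with the scaler, then only fit the classifier.
scaler = StandardScaler().fit(X)
clf = LogisticRegression().fit(scaler.transform(X), y)

# Both routes should give the same predictions on the training data.
same = bool((pipe.predict(X) == clf.predict(scaler.transform(X))).all())
```

The final estimator has nothing downstream to feed, which is one way to see why the pipeline does not call `transform` on it during `fit`.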
1
vote
2 answers

How are the scores computed with SelectKBest (sklearn)

I was expecting the scores_ provided by SelectKBest() to be the result of the score_func (e.g. the value of the F statistics itself when score_func = f_classif, or the chi2 statistics when score_func = chi2). But it is apparently not the case. I…
1
vote
1 answer

Generating y label in an sklearn pipeline?

I have a simple use case - I want to, as part of an sklearn pipeline, generate y labels based on X features (e.g. for predicting some signal in X, n timesteps in the future). I want it to be a pipeline step as I want the label based on the…
Raven
  • 131
1
vote
1 answer

Regarding pre-processing function StandardScaler in scikit-learn library. How to save the scaler variable for predicting new data

Please help me. I am trying to give maximum information related to my query. I have a query regarding the pre-processing function in the scikit-learn library. My data set is divided into 3 parts: train, test, holdout. I am using the StandardScaler function for…
Shivam Soni
  • 13
  • 1
  • 3
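
A minimal sketch of one way to persist a fitted scaler, assuming `joblib` is available (it is installed alongside scikit-learn). The arrays, the file name `scaler.joblib`, and the temporary directory are hypothetical stand-ins for the asker's train and holdout data and storage location.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.preprocessing import StandardScaler

# Random arrays standing in for the train and holdout splits (assumption).
rng = np.random.default_rng(0)
X_train = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
X_holdout = rng.normal(loc=5.0, scale=2.0, size=(20, 3))

# Fit the scaler on the training data only.
scaler = StandardScaler().fit(X_train)

# Save the fitted scaler to disk and load it back, as one would before prediction.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "scaler.joblib")  # hypothetical file name
    joblib.dump(scaler, path)
    loaded_scaler = joblib.load(path)

# Apply the training-set mean and variance to the new data; do not refit.
X_holdout_scaled = loaded_scaler.transform(X_holdout)
```

The loaded object should only be used with `transform` on test, holdout, or new data, so that all splits share the statistics learned from the training set.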