I came across this interesting r/statistics post here:

As of the time of writing (Feb 2023) the post is 5 years old, but it is disturbing and makes me think twice about blindly using sklearn. The top-voted comment points to an interesting email thread which led to the deprecation of a sklearn bootstrap method which, as it turns out, was not in fact a proper bootstrap. The devs removed the method because they were afraid people would use it without checking the code (which you should not have to do, IMO), assuming they were getting a bootstrap. As sklearn is community built, as opposed to R, which is often built by academic statisticians who will cite articles describing how to implement the code properly, R may not be faster but it is more trustworthy.

Personally I place a lot of importance on the rigor of a model, and I don't think one should rush to RFs and neural nets without trying more interpretable, rigorous methods first. That post is from several years ago, and I note that Python now has the statsmodels module, which has really nice documentation, but in playing with it I found there were cases where R had an implementation of something that Python simply lacked. It's hard to do power analysis in Python, for instance; you have to build the code from scratch.

Angus Campbell

1 Answer

Is it still true that sklearn cannot be trusted for statistics?

Sklearn has insisted it is a machine learning library and not a statistics library. There have been a few cases where devs have plainly said so (and even pondered why users would want an unpenalized version of logistic regression).

To measure sklearn against R for statistics is to judge a fish by its ability to climb a tree.

As sklearn is community built as opposed to R which is often built by academic statisticians who will cite articles describing how to implement the code properly

LOL if anything, this is an argument against R. Many (not all) academics are very poor software engineers with poor practices. While the example of the bootstrap in sklearn is notable, they have good software engineering practices from where I stand. I'd trust a project with 100 developers and good unit tests more than I trust one person with no software engineering best practices to speak of.
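For reference, a textbook nonparametric bootstrap draws n observations with replacement from the n original observations and recomputes the statistic on each resample. Below is a minimal sketch of that idea with a percentile interval, using only the Python standard library. The function name `bootstrap_percentile_ci` and its parameters are illustrative assumptions, not any library's API, and this is not a reconstruction of what the deprecated sklearn class did.

```python
import random
import statistics


def bootstrap_percentile_ci(data, stat=statistics.mean, n_resamples=2000,
                            level=0.95, seed=0):
    """Percentile bootstrap interval for `stat` (hypothetical helper).

    Each resample has the same size as `data` and is drawn WITH
    replacement, which is the defining step of the ordinary bootstrap.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(data)
    estimates = sorted(
        stat(rng.choices(data, k=n))  # n draws with replacement
        for _ in range(n_resamples)
    )
    alpha = 1.0 - level
    # Indices of the lower and upper empirical quantiles, clamped to range.
    lo_idx = max(0, round((alpha / 2) * n_resamples))
    hi_idx = min(n_resamples - 1, round((1 - alpha / 2) * n_resamples) - 1)
    return estimates[lo_idx], estimates[hi_idx]


if __name__ == "__main__":
    sample = list(range(1, 21))
    print(bootstrap_percentile_ci(sample))
```

A unit test for a routine like this can check simple invariants (the interval brackets the sample statistic, and the same seed reproduces the same interval), which is the kind of check that catches a resampler that is not doing what its name says.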

  • I mean the documentation in R is generally much better than in sklearn. And having worked with lots of CS majors, I find they don't really understand stats and just blithely assume they can copy-paste data science scripts, or they don't benchmark non-parametric methods against more interpretable methods which lend themselves better to making inferences. – Angus Campbell Feb 06 '23 at 23:05
  • @AngusCampbell That is different than failing to ensure your code does what you think it does. Many academics are guilty of that. – Demetri Pananos Feb 06 '23 at 23:06
  • Fair, but can you point to an instance in R similar to the deprecation issue mentioned? – Angus Campbell Feb 06 '23 at 23:06
  • https://scikit-learn-general.narkive.com/CcCsycWg/bootstrap-depracation-warning – Angus Campbell Feb 06 '23 at 23:07
  • @AngusCampbell I don't have one off the top of my head, but a single deprecation does not a bad library make. – Demetri Pananos Feb 06 '23 at 23:08
  • But there are others: https://stats.stackexchange.com/questions/8025/what-are-correct-values-for-precision-and-recall-when-the-denominators-equal-0

    The way I see it, Python is better for implementing models, but for actual analysis I will probably lean towards R. I already do, and I've found it challenging to do things like power analysis in Python or find methods that compute significance in more edge cases. And when I dig into the documentation for R libraries it is almost invariably better, and yes, the code often does what the author says it does.

    – Angus Campbell Feb 06 '23 at 23:10
  • I mean really a data scientist should know both IMO. – Angus Campbell Feb 06 '23 at 23:14
  • @AngusCampbell Here is an issue I found where confidence intervals always returned 95% https://github.com/tidymodels/rsample/pull/179. You seem insistent that R is superior for stats -- which no one has disputed -- and I find all arguments about languages to be rehashes of stuff that has been said before, so I think the issue was right to be closed. – Demetri Pananos Feb 06 '23 at 23:16
  • I'm not disputing the closure. Stack has a policy on opinion-based questions, so it's fair to close, though I didn't intend it to be that way when I posted it. But I wanted to foster discussion and you seemed to have an opinion, so I wanted to hear it. I appreciate the sentiment you espouse. – Angus Campbell Feb 06 '23 at 23:29
  • R versus Python was rehashed on various social platforms and blog posts many years ago. It's not a discussion for Stack Exchange sites. Asking for specific differences across packages and their reasons is still on-topic (for example, handling of collinearity, perfect separation and other edge cases). – Josef Feb 07 '23 at 16:04
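On the power analysis point raised in the question and comments: statsmodels does ship a `statsmodels.stats.power` module (for example `TTestIndPower`), though its coverage may be narrower than what R packages offer. For a sense of what a from-scratch calculation looks like, here is a minimal sketch using only the standard library. It assumes a two-sided, two-sample comparison with equal group sizes and uses the normal approximation, so it will come out slightly below an exact t-based answer. The function names are illustrative, not taken from any library.

```python
from math import ceil, sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal, mean 0 and sd 1


def sample_size_two_sample(effect_size, alpha=0.05, power=0.80):
    """Per-group n for a two-sided two-sample z-test (normal approximation).

    `effect_size` is the standardized mean difference (Cohen's d).
    Hypothetical helper for illustration only.
    """
    z_alpha = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    z_power = _STD_NORMAL.inv_cdf(power)
    return ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)


def approx_power(effect_size, n_per_group, alpha=0.05):
    """Approximate power for the same design.

    Ignores the negligible probability of rejecting in the wrong tail.
    """
    z_alpha = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    return _STD_NORMAL.cdf(effect_size * sqrt(n_per_group / 2) - z_alpha)


if __name__ == "__main__":
    n = sample_size_two_sample(0.5)
    print(n, round(approx_power(0.5, n), 3))
```

Anything beyond this simple design (unequal groups, ANOVA, mixed models) needs either a dedicated package or simulation, which is where the gap described above tends to show up.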