Note: I chose to ask this here instead of on stats.stackexchange.com because it is about software and workflow tools rather than any particular statistical method. I felt that people more intimately familiar with the actual software packages would be able to help more; I'm specifically trying to avoid the answer I usually get from academics, which is to just use R or Matlab and make the grad students figure out how to get it working on large data.
I'm about to start a large project that involves a lot of data mining (mostly through SQL), a lot of quick-and-dirty basic statistics (general linear models, covariance estimation, etc.), a significant chunk of more advanced methods (Bayesian stuff, advanced samplers, non-parametrics), a strong need to scale things up with multiprocessing, and the need to generate good plots.
Currently, I am pretty good with Python and the associated scientific tools (NumPy, the scikits, matplotlib, and even PyCUDA / MPI for multiprocessing; I've never dealt with SQL before, though). However, I often find that the methods I need are relatively slow in pure Python and don't scale well once the data sets get large. I only know a little C/C++, and not much at all about Boost.Python or Cython.
I know that a lot of statisticians use R, but I have also heard that R is just a tiny step up from something like Matlab, which is way too slow and encumbered with oddly specified built-in functions.
My question is: what is a good workflow / suite of tools for doing this kind of statistical work? What should I consider when I want to take Python code I've written and make it faster or better, either by moving it into a different language or by wrapping C++ libraries for use in Python? Is Boost.Python the kind of thing that will let me implement advanced mathematical algorithms in C++ and then call them from Python? Is that a sensible approach for statistical work, or does Boost.Python offer too little in the way of statistical functionality?
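To be concrete, here is a rough sketch of the split I have in mind. Everything named fast_sampler is a hypothetical extension module I would build with Boost.Python (its C++ side is omitted), and the update rule is just a toy stand-in:

```python
import numpy as np
# import fast_sampler  # hypothetical extension module compiled with Boost.Python

def run_chain_python(data, n_iter=10000):
    """Prototype: a pure Python/NumPy loop -- easy to write, slow on large data."""
    draws = np.empty(n_iter)
    state = float(np.mean(data))
    for i in range(n_iter):
        # Toy update step standing in for a real sampler iteration.
        state = 0.9 * state + 0.1 * np.random.randn()
        draws[i] = state
    return draws

# The compiled version would be called the same way from ordinary Python code:
# draws = fast_sampler.run_chain(data, n_iter=10000)
```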
I've also seen RPy2, which lets you access virtually all of R from within Python. Is this fast enough to use on large data?
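For example, I'm picturing something along these lines (a minimal sketch, assuming rpy2 and R are installed; the data and model are just placeholders):

```python
import numpy as np
import rpy2.robjects as robjects
from rpy2.robjects import FloatVector

# Toy data generated on the Python side.
x = np.random.randn(1000)
y = 2.0 * x + np.random.randn(1000)

# Push the data into R's global environment and fit a linear model in R.
robjects.globalenv["x"] = FloatVector(x)
robjects.globalenv["y"] = FloatVector(y)
fit = robjects.r("lm(y ~ x)")

# Pull the coefficient table back out of the R summary object.
coefs = robjects.r["summary"](fit).rx2("coefficients")
print(coefs)
```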
Any other suggestions about statistical workflow would be great!