Note: I chose to ask this here instead of on stats.stackexchange.com because it is about software and workflow tools rather than any particular statistical method. I felt that people more intimately familiar with the actual software packages would be able to help more; I'm specifically trying to avoid the answer I usually get from academics, which is to just use R or Matlab and make the grad students figure out how to get it working on large data.
I'm about to start a large project that involves a lot of data mining (mostly through SQL), a lot of quick-and-dirty basic statistics (general linear models, covariance estimation, etc.), a significant chunk of more advanced methods (Bayesian stuff, advanced samplers, non-parametrics), a strong need to scale things up with multiprocessing, and the need to generate good plots.
Currently, I am pretty good with Python and the associated scientific tools (NumPy, the scikits, matplotlib, and even PyCUDA / MPI for multiprocessing; I've never dealt with SQL before, though). However, I often find that the methods I need are relatively slow in pure Python and don't scale well once the data sets get large. I only know a little C/C++, and not much at all about Boost.Python or Cython.
I know that a lot of statisticians use R, but I have also heard that R is just a tiny step up from something like Matlab, which is way too slow and encumbered with oddly specified built-in functions.
My question is: what is a good workflow / suite of tools for doing this kind of statistical work? What should I consider when I want to take Python code I've written and make it faster or better, either by moving it into a different language or by wrapping C++ libraries for use in Python? Is Boost.Python the kind of thing that will let me implement advanced mathematical algorithms in C++ and then call them from Python? Is that a sensible approach for statistical work, or does Boost.Python offer too little in the way of statistical functionality?
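To be concrete, here is a rough sketch of the split I have in mind. Everything named fast_sampler is a hypothetical extension module I would build with Boost.Python (its C++ side is omitted), and the update rule is just a toy stand-in:

```python
import numpy as np
# import fast_sampler  # hypothetical extension module compiled with Boost.Python

def run_chain_python(data, n_iter=10000):
    """Prototype: a pure Python/NumPy loop -- easy to write, slow on large data."""
    draws = np.empty(n_iter)
    state = float(np.mean(data))
    for i in range(n_iter):
        # Toy update step standing in for a real sampler iteration.
        state = 0.9 * state + 0.1 * np.random.randn()
        draws[i] = state
    return draws

# The compiled version would be called the same way from ordinary Python code:
# draws = fast_sampler.run_chain(data, n_iter=10000)
```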
I've also seen RPy2, which lets you access virtually all of R from within Python. Is this fast enough to use on large data?
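For example, I'm picturing something along these lines (a minimal sketch, assuming rpy2 and R are installed; the data and model are just placeholders):

```python
import numpy as np
import rpy2.robjects as robjects
from rpy2.robjects import FloatVector

# Toy data generated on the Python side.
x = np.random.randn(1000)
y = 2.0 * x + np.random.randn(1000)

# Push the data into R's global environment and fit a linear model in R.
robjects.globalenv["x"] = FloatVector(x)
robjects.globalenv["y"] = FloatVector(y)
fit = robjects.r("lm(y ~ x)")

# Pull the coefficient table back out of the R summary object.
coefs = robjects.r["summary"](fit).rx2("coefficients")
print(coefs)
```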
Any other suggestions about statistical workflow would be great!