75

I know the title sounds a little extreme, but I wonder whether R is being phased out by a lot of quant desks at sell-side banks as well as hedge funds in favor of Python. I get the impression that, with improvements in pandas, NumPy, and other Python packages, Python's functionality is improving drastically for meaningfully mining data and modeling time series. I have also seen quite impressive Python implementations that parallelize code and fan computations out to several servers/machines. I know some packages in R are capable of that too, but I just sense that the current momentum favors Python.

I need to make a decision regarding the architecture of a subset of my modeling framework myself and need some input on what the current sentiment among other quants is.

I also have to admit that my initial reservations about Python's performance are mostly outdated: some of the packages make heavy use of C implementations under the hood, and I have seen implementations that clearly outperform even efficiently written, compiled OOP-language code.
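To make the "C under the hood" point concrete, here is a minimal, illustrative timing sketch (array size and function names are arbitrary, and numbers vary by machine): the vectorized NumPy call dispatches the loop to compiled code and typically beats the interpreted version by a large factor.

import timeit
import numpy as np

x = np.random.randn(10**6)

# Pure-Python loop: every element access goes through the interpreter.
def sum_squares(values):
    total = 0.0
    for v in values:
        total += v * v
    return total

# The same reduction, pushed down into NumPy's compiled C routines.
print(timeit.timeit(lambda: sum_squares(x), number=10))       # slow
print(timeit.timeit(lambda: float(np.dot(x, x)), number=10))  # fast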

Can you please comment on what you are using? I am not asking for opinions on whether one is better or worse for the tasks below, but specifically why you use R or Python and whether you even place them in the same category for accomplishing, among others, the following tasks:

  • acquire, store, maintain, read, clean time series
  • perform basic statistics on time series, advanced statistical models such as multivariate regression analyses,...
  • performing mathematical computations (Fourier transforms, PDE solvers, PCA, ...)
  • visualization of data (static and dynamic)
  • pricing derivatives (application of pricing models such as interest rate models)
  • interconnectivity (with Excel, servers, UI, ...)
  • (Added Jan 2016): Ability to design, implement, and train deep learning networks.

EDIT I thought the following link might add more value, though it's slightly dated [2013] (for some obscure reason that discussion was also closed...): https://softwareengineering.stackexchange.com/questions/181342/r-vs-python-for-data-analysis

You can also search for several posts on the r-bloggers website that address computational efficiency between R and Python packages. As was addressed in some of the answers, one aspect is data pruning: the preparation and setup of input data. Another part of the equation is the computational efficiency of actually performing statistical and mathematical computations.

Update (Jan 2016)

I wanted to provide an update to this question now that AI/deep-learning networks are very actively pursued at banks and hedge funds. I have spent a good amount of time delving into deep learning, running experiments, and working with libraries such as Theano, Torch, and Caffe. What stood out from my own work and conversations with others was that a lot of those libraries are used via Python and that most researchers in this space do not use R in this particular field. This still constitutes a small part of the quant work performed in financial services, but I wanted to point it out as it directly touches on the question I asked, and I added this aspect of quant research to reflect current trends.
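To give a flavor of what "used via Python" means for these libraries, here is a minimal Theano sketch (a toy symbolic function, not a trading model; it assumes Theano is installed):

import numpy as np
import theano
import theano.tensor as T

# Build a symbolic expression; Theano compiles it to optimized
# (optionally GPU-backed) code behind this Python front end.
x = T.dmatrix('x')
sigmoid = theano.function([x], T.nnet.sigmoid(x))

print(sigmoid(np.array([[0.0, 1.0], [-1.0, 2.0]])))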

Matt Wolf
  • I am not sure, but definitely there are some advantages for Python in regards to the development of packages in some areas. – Barnaby May 19 '15 at 08:26
  • 17
    You are a highly respected member of this community but I am getting a worse and worse feeling about this question. One of the examples of questions that we don't want on this site is "What programming language should I use?" (quant.stackexchange.com/help/on-topic). When you look at the discussions in the comments you can see why: They are getting more and more contentious - and you seem to have made up your mind anyway. I think if somebody with less rep had asked this question it would have got closed right away. I think best would be to close this question. Do you see my point? – vonjd May 21 '15 at 15:35
  • @vonjd, I have not made up my mind, else I would not have asked. And we should be fair in acknowledging that some on this site have a very strong vested interest in leaning towards R because they derive a portion or all of their living from writing R code, hence their rather strong wording. I defend the question because the question and hopefully the answers are imho very relevant to those working at quant desks, or potentially to those who pour many tens if not hundreds of thousands into projects. – Matt Wolf May 21 '15 at 15:41
  • But I am of course entirely open to let the community vote to have the question closed if most think it is not relevant nor targeted enough (though I listed very specific use cases that I am interested in)... – Matt Wolf May 21 '15 at 15:42
  • By the way, is there a way to vote or suggest allowing certain questions that may currently not fit the desired format? I find questions like "which language is recommended for xyz" or "is abc-regression better suited to tackle xyz than bcd-regression" very important and useful for those who work in this field. At least a lot more useful than many questions that are kept open of the type "where can I download free tick data" or "does yahoo finance backward adjust dividend splits"... – Matt Wolf May 21 '15 at 15:50
  • Fair enough. You could raise this on meta when you think that the rules of this site should be changed. – vonjd May 21 '15 at 15:56
  • 2
    @vonjd, I raised this on meta, thanks for suggesting this: http://meta.quant.stackexchange.com/questions/1452/willingness-to-consider-a-revision-to-the-current-question-format-guidelines – Matt Wolf May 22 '15 at 05:04
  • Upvoted on meta. – vonjd May 22 '15 at 05:51
  • I noticed that there has been a relative fury of down/up votes on answers to this particular question. While I think there is value in a referendum on the subject, I would encourage more people to share their thoughts in the comments and new answers especially those with experience using both languages. – rhaskett May 27 '15 at 16:02
  • I noticed neither a fury nor downvotes. And I fully agree with your suggestion. What really discourages me from participating more actively on this site again is pressure to conform to strict "rules" and guidelines. Humans are not bits and bytes, nor does efficient and intelligent learning involve black-and-white Q&A formats. As this question demonstrates, the format itself is already questioned, because some seem to feel incredibly uncomfortable going out of their "rules-based" comfort zone. I also like to see more healthy debate and sharing... – Matt Wolf May 28 '15 at 00:08
  • Many people put a lot of effort into this, so I would be interested whether the answers helped you to arrive at a conclusion? – vonjd Jun 07 '15 at 20:05
  • 2
    @vonjd, no I have not yet made a decision. But I am much better informed thanks to some of the answers and my spending more time with packages such as data.table and Rcpp. It does not change my impression of bits and pieces being "glued together" in R in order to run more performant computations (Rcpp is in effect a bridge to run compiled C++ code, and data.table is a highly indexed data structure which should not be compared with solutions that make no use of indexing). My main concern at this point is that I will end up with code bases in multiple languages to achieve ... – Matt Wolf Jun 08 '15 at 03:45
  • 4
    ...performance that matches or exceeds what can be done purely in Python. For example, any statistical or numerical techniques that cannot be vectorized require me to essentially maintain a C++ code base to beat code operations in Python. Something similar applies to visualizations: most dynamic visualizations, or visuals that allow me to pan/zoom or otherwise manipulate rendering at run time, require knowledge of .js and/or D3.js. Python, on the other hand, allows me to more easily interface with existing visualization libraries I already use. But as said, I have not yet come to a final conclusion – Matt Wolf Jun 08 '15 at 03:51
  • Did you see this, it might be interesting for you: http://blog.dominodatalab.com/comparing-python-and-r-for-data-science/ and http://blog.datacamp.com/r-or-python-for-data-analysis/ – vonjd Jun 09 '15 at 17:01
  • Thanks, vonjd, I took a quick look but am frankly not a big fan of generalized comparison reviews because it does not address specific needs (for obvious reasons). – Matt Wolf Jun 10 '15 at 00:00
  • It's not far enough along in the development cycle for your needs, but keep an eye out for julia in the future. I've played around with it a bit myself and it has the potential to replace/complement both R and Python for this kind of technical work. – Colin T Bowers Aug 13 '15 at 05:41
  • @MattWolf Perhaps your Jan 2016 update would be better as a separate question. E.g. "What libraries/packages would you recommend to do deep learning in quant finance applications?" (That leaves it language-neutral, which may or may not be a good idea...) – Darren Cook Feb 03 '16 at 19:01
  • @DarrenCook, while I agree that this site should encourage much more exposure to deep learning in quant finance I believe the addition (Jan 2016 update) is very relevant to this question. Deep learning is perhaps the area at banks, hedge funds, and at private equity that sees the most incremental investment in terms of funding and talent hiring. I do think that it is an area that clearly favors Python over R and I would love to hear from other practitioners about their take. – Matt Wolf Feb 04 '16 at 06:27
  • @MattWolf OK. I'm just saying it is better to start a new question than update one that already has answers, including an accepted answer. – Darren Cook Feb 04 '16 at 08:25

8 Answers

50

My deal is HFT, so what I care about is:

  1. reading/loading data from a file or DB into memory quickly
  2. performing very efficient data-munging operations (group, transform)
  3. visualizing the data easily

I think it is pretty clear that 3. goes to R: base graphics, ggplot2, and others allow you to plot anything from scratch with little effort.

About 1. and 2., I am amazed, reading the previous posts, to see that people advocate for Python based on pandas and that no one cites data.table. data.table is a fantastic package that allows blazing-fast grouping/transforming of tables with tens of millions of rows. From this benchmark you can see that data.table is multiple times faster than pandas and much more stable (pandas tends to crash on massive tables).

Example

R) library(data.table)
R) DT = data.table(x=rnorm(2e7),y=rnorm(2e7),z=sample(letters,2e7,replace=T))
R) tables()
     NAME       NROW NCOL  MB COLS  KEY
[1,] DT   20,000,000    3 458 x,y,z    
Total: 458MB
R) system.time(DT[,.(sum(x),mean(y)),.(z)])
   user  system elapsed 
  0.226   0.037   0.264 

R) setkey(DT,z)
R) system.time(DT[,.(sum(x),mean(y)),.(z)])
  user  system elapsed 
  0.118   0.022   0.140 
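
For comparison, a rough pandas equivalent of the aggregation above might look like this (a sketch that mirrors the R example; no timing claim is made here, since results depend heavily on versions and hardware):

import string
import numpy as np
import pandas as pd

n = 2 * 10**7
df = pd.DataFrame({'x': np.random.randn(n),
                   'y': np.random.randn(n),
                   'z': np.random.choice(list(string.ascii_lowercase), n)})

# Same query as the data.table example: sum of x and mean of y by group z.
result = df.groupby('z').agg({'x': 'sum', 'y': 'mean'})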

Then there is speed: as I work in HFT, neither R nor Python can be used in production. But the Rcpp package allows you to write efficient C++ code and integrate it into R trivially (literally by adding 2 lines). I doubt R is fading, given the number of new packages created every day and the momentum the language has...

EDIT 2018-07

A few years later, I am amazed by how the R ecosystem has evolved. For in-memory computation you get unmatched tools: fst for blazing-fast binary read/write, and fork or cluster parallelism in one-liners. C++ integration is incredibly easy with Rcpp. You get interactive graphics with classics like plotly, and crazy features like ggplotly (which simply makes your ggplot2 interactive). Having tried Python with pandas, I honestly do not understand how there could even be a match: the syntax is clunky and the performance is poor. I must be too used to R, I guess. Another thing that is really missing in Python is literate programming; nothing comes close to rmarkdown (the best I could find in Python was Jupyter, but it does not even come close). With all the fuss surrounding the R-vs-Python language war, I realize that the vast majority of people are simply uninformed: they do not know what data.table is, that it has nothing to do with a data.frame, or that R fully supports TensorFlow and Keras... To conclude, I think both tools can do everything, and it seems that the Python language just has very good PR...

statquant
  • 2
    Hmm, I guess I need to disagree with you here regarding visualizations. R packages are still light-years behind in efficient and especially dynamic visualization. Every first-year IT student can chart a time series from scratch. What people want and need is visualization of millions of data points that a charting app can down-sample, fast zooming and panning, and handling of annotations. I have not seen anything in R that comes even remotely close. – Matt Wolf May 28 '15 at 20:07
  • 1
    Secondly, data tables in R are very, very slow. Throw a few million time-series data points at them and data frames fall to their knees. The only thing I have seen that was fast was an implementation that used memory mapping. But one could argue this is just an interface R uses... as soon as you actually grab the data and run R functions over it, things become very slow. Caveat here: I have not looked at any new developments over the past 8 months in R space. If there is anything new, I would be happy to be pointed to it. – Matt Wolf May 28 '15 at 20:10
  • And when you talk about something crashing, the problem lies with improperly providing the required input format. The same can happen in OOP languages, Python, and R. – Matt Wolf May 28 '15 at 20:12
  • Every now and then there will be stuff that you will not be able to do with data.table, which will cause you to use a regular R data.frame and curse yourself for not having started off with pandas instead. For instance, data.table can fail to read many CSV files that both a regular R data.frame and pandas can read easily. Or try working with data that has 600 columns and writing loops over columns. In data.frame or pandas you can loop with for i in x.columns: and do something like x.loc[:,i] = ..., but in data.table you might need 600 lines, one for each column – uday May 28 '15 at 20:15
  • @uday: about loops... that's also false, look at .SDcols this is faster and easier than a loop – statquant May 28 '15 at 20:20
  • @statquant, nope, it is not. It frequently breaks down for mixed text-and-number CSVs, and you will get tired of writing on SO, where R experts will remind you frequently to give reproducible test cases even if you mention that you are reading a 2 GB CSV file, etc. – uday May 28 '15 at 20:20
  • @statquant, that is a pretty bold claim you make. I am happy to whip up a few test batteries when I find time in the next couple days but being a kdb user myself I find that pretty hard to believe. I will report back with some numbers. Thanks for your answer and for sharing your insight. My intent to move away from kdb by the way is the precise reason that caused my asking this question. – Matt Wolf May 28 '15 at 20:20
  • @MattWolf what claim ? do you refer to this benchmark: https://github.com/Rdatatable/data.table/wiki/Benchmarks-:-Grouping – statquant May 28 '15 at 20:22
  • 1
    @statquant, another issue with data.table is that its founders and followers are somewhat too protective (although it is similar in the case of pandas), and at any sign of mentioning issues with data.table you can expect your post on SO to get rapidly downvoted. At least in the case of pandas, (1) the source code is available on GitHub (you can do an easy search on the web without downloading) versus downloading the source code from CRAN, and (2) you can easily override pandas to customize your own subclass of pandas DataFrames (I use two such specialized subclasses for my work). – uday May 28 '15 at 20:26
  • That grouping and transforming 20 million rows takes less than 1 second, as well as your stating that the speed reaches kdb performance benchmarks... – Matt Wolf May 28 '15 at 20:26
  • @MattWolf ok, I can help with this if you fancy it; you can ping me at username At outlook D.T cOm. But your initial question was R vs Python... not kdb :). Actually, I just tried on my personal laptop to sum a column of 2e7 random Gaussian numbers and average another (independent) one by group (23 of them); it took <200 ms (the table was 500 MB in memory) – statquant May 28 '15 at 20:36
  • Statquant, I appreciate your offer and will contact you. I only mentioned kdb as you brought it up. I am definitely interested in gaining more insight into the data.table package you mentioned because performance in R was a deal breaker for me so far. – Matt Wolf May 28 '15 at 20:36
  • @statquant, :-) given how you are persistent on data.table, I will try it for purely numeric data and test out if it is faster than pandas. thanks for sharing the link. That .SDcols option might not have been there two years ago when I was trying to use data.table extensively (or may be I overlooked it). – uday May 28 '15 at 21:47
  • @statquant, I played a bit with data.table, and while it does indeed seem to significantly improve grouping and table transformations, my original concern is not addressed. For computational efficiency, the organization of input data is only one part of the equation. The main resource consumption will be taken up by the respective statistical and mathematical computations, and that is where I am not (yet) sold that R comes close to Python's stats and math libraries in terms of computational efficiency. – Matt Wolf May 29 '15 at 07:40
  • @Matt Wolf I'll be honest, I am not sure you really have a clear idea of what you want. What do you do that requires a lot of computational power? If that's linear algebra, then you have RcppArmadillo or other packages that I know blow numpy away... Can you be clearer, please? – statquant May 29 '15 at 19:42
  • I thought I was very specific about my requirements in the question I originally asked. Just because some of the requirements involve large data quantities and others do not should not be confused with my not knowing what I want. I am looking for a development and analytical-testing architectural change that needs to cover both the analysis and visualization of vast amounts of time-series data and options order-book data on one end of the spectrum – Matt Wolf May 29 '15 at 20:43
  • as well as pricing derivatives via Monte Carlo, PCA, or more mathematically involved PDE solvers on the other end of the spectrum. I get the point that indexed data tables allow fast access to chunks of data, but this only serves as the starting point of any analytical or numerical exercise... – Matt Wolf May 29 '15 at 20:49
  • What I need to better understand is the computational efficiency of the actual statistical and numerical procedures. Your data tables can be as fast as they want, but if the actual visualization of time series in R falls to its knees when you throw a million or so data points at it, then you have your bottleneck right there. The same goes for MC pricing. Is that clearer? – Matt Wolf May 29 '15 at 20:49
  • Ok... I think those 3 things are numerically extremely different: the first should take advantage of massive parallelization, the second of very efficient linear algebra, and the third of GPU computing. Each would require a post of its own. I doubt R will allow you to obtain cutting-edge implementations in all those fields, but neither would Python... As far as data visualisation is concerned, have you looked at http://www.amazon.co.uk/Graphics-Large-Datasets-Visualizing-Statistics/dp/0387329064 ? I think we should continue this discussion on chat/email as this gets a bit off-topic. – statquant May 29 '15 at 21:09
  • Agree fully that each requires different approaches and poses different requirements in general. At the end of the day, my team and I still need to get our work done in our framework of choice. For visualization, for example, we use a C# front end that we equipped with massive parallelization capabilities, customizability, and the ability to make use of hardware-based technologies. For parallel and async computing we also interface with different technologies, which is precisely why we look for a framework that boasts strong capabilities in interfacing with other components. – Matt Wolf May 30 '15 at 05:29
  • And hence we look for ways to migrate part of our design and pricing framework to either R or Python. I have tremendously benefited from this discussion already, and you make lots of very good and, above all, informed points. Thanks a lot for adding so much value. – Matt Wolf May 30 '15 at 05:32
  • OK, last thing then: I did not realize that you were OK with spending substantial time on heavy development. If that's the case, even if I cannot help you as I have not done it myself, I would still go the R route. All bleeding-edge numerical procedures provide a C++ API, and Rcpp provides the easiest way to leverage this. Look at the R task views for a lot of references. As far as graphics are concerned, D3 is also coming... – statquant May 30 '15 at 08:18
  • I am spending some time with Rcpp this weekend. Thanks for the pointer. – Matt Wolf May 30 '15 at 10:57
  • About plots: R is really good at plotting 2D plots once and not touching them. It's meh at 3D plots and has virtually no support for mouse interaction. Everything else I think is spot on. – eddi Jun 22 '15 at 22:47
  • @MattWolf I'd like to know your thoughts now that you had the time to research and compare. – statquant Aug 07 '15 at 18:25
  • 1
    @statquant, I cannot comment on what we decided in the end to pursue, but I can comment on why we decided not to choose R. The biggest reason was the time-consuming procedures to hook up R code with our existing research and trading framework. While everything can be interconnected some way or the other, connectivity often proved quirky and cumbersome. To name just two examples: to improve R's performance on large data sets, one can use packages such as data.table and Rcpp, among others, but then the question begged why use R in the first place. Another example is visualizations: None of the... – Matt Wolf Aug 08 '15 at 13:27
  • 1
    ...R packages provided the visualization capabilities that we demanded. Even D3 is not performant enough. We decided to stay with C# for our front ends that encapsulate visualizations. We hardly use any of the R packages for algorithms and statistical/mathematical computations, hence we are entirely independent of those. In the end we wondered what R is particularly good at in comparison with other choices, and we believe R is a great all-round tool that performs reasonably well but does not come out on top in any of the categories. Hope this explains our decision-making process a bit. – Matt Wolf Aug 08 '15 at 13:30
  • 1
    @MattWolf regarding dense time series visualisation (interactive and performant, including downsampling) check out dygraphs and its R port – Daniel Krizian Jan 08 '16 at 13:18
  • @DanielKrizian, probably the most versatile and best .js library I have seen for visualization purposes. Thanks for pointing out this "gem". I have not tested whether it holds up to its claim to manage "millions of datapoints" but it certainly looks a lot better than anything I have seen out there in .js space. If it has a good down-sampling algorithm then it might be even more interesting. Thanks a lot. – Matt Wolf Jan 09 '16 at 04:36
  • @DanielKrizian I ++ MattWolf this looks simple and effective. – statquant Jan 09 '16 at 08:35
  • @MattWolf: can you tell us what R package you tried? What do you think of plotly? – statquant Jan 09 '16 at 08:35
  • @statquant, have not seen this package before; it seems pretty new (post-2013?). In all honesty, I stopped following the emergence of new R packages late last year (though, to give credit, plotly seems to offer an API for Python too) because I decided to go the Python route for all future research purposes. – Matt Wolf Jan 09 '16 at 08:52
  • @MattWolf fair enough – statquant Jan 09 '16 at 08:53
  • 1
    @MattWolf it seems you have wrongly taken data.frame for data.table. A data.frame is not a data.table. data.table is a separate package; it speeds up data.frame with C-written methods. data.table is currently the fastest open-source tool for aggregation and joins. If you have any benchmarks that contradict that (and are not driven by a bug at the time they were made), please share them, as I have been looking for any for a long time. Be aware that data.table doesn't need indexes to be fast. – jangorecki Mar 08 '16 at 19:49
  • @jangorecki, I believe you are addressing a different user? When I used the term "datatable" once (I think in my first comment on this answer), it was meant in the generic sense; I was not addressing the particular R package. – Matt Wolf Mar 11 '16 at 10:02
32

Instead of wild guesses about R's/Python's future in the community, here are some facts:

The following query on the StackExchange Data Explorer counts the number of questions that have <r> or <python> tags. If you scroll down on one of the three webpages linked below, you can see a graph with the data on a monthly basis. You can easily run this query against other sites' databases as well (just go to "Switch sites" right below the query).

stats http://data.stackexchange.com/stats/query/350129/r-versus-python-tags#graph

stack http://data.stackexchange.com/stackoverflow/query/350129/r-versus-python-tags#graph

quant http://data.stackexchange.com/quant/query/350129/r-versus-python-tags#graph

The results:

  • In absolute terms, R has more hits for both stats.stackexchange.com and quant.stackexchange.com (the latter having very few data points). Python has more hits for stackoverflow.com.

  • In relative terms, the gap between R and Python is closing on stackoverflow.com (the ratio is approximately 1 to 3 at the moment). The ratio between R and Python tags on stats.stackexchange.com has been more or less stable since mid/end 2013 (roughly a factor of 10 or a little above).

I really do think that the tag statistics in the StackExchange universe are a good indicator of the current interest in a particular programming language - probably even more so of its future popularity.

All-in-all, I am confident that the present data makes a strong case against Matt Wolf's hypothesis that "R might be obsolete in 3-4 years". ;)


Update: It has now been 6 months since my initial answer. We still have to wait another 2.5-3.5 years to see definitively whether R has become obsolete. :) In the meantime, a quick addition prompted by Matt Wolf's comment. Here are variations of the above queries that give you the tag ratios (that's what I was referring to in the second point of my answer). All ratios are Python tags divided by R tags.

stats

http://data.stackexchange.com/stats/query/421036/r-versus-python-tags-quotient-py-r#graph

I do not see a clear trend here. The Py/R ratio is around 0.07 (there was a spike to 0.095 in November, though). Since mid-2013 the ratio has varied between 0.04 and 0.11, so I would call it relatively stable.

SO

http://data.stackexchange.com/stackoverflow/query/421032/r-versus-python-tags-quotient-py-r#graph

There was indeed a short-term trend in favor of Python since Jul '15 (the Py/R ratio went from 3.1 to 3.5). So the statement that "R is closing the gap wrt the Py/R ratio" could be called obsolete at the moment.

quant

http://data.stackexchange.com/quant/query/421042/r-versus-python-tags-quotient-py-r#graph

Still very noisy. Python did seem to catch up a little bit over the last few months. But it is hard to tell with that little data.

cryo111
  • 12
    Three cheers and an upvote for bringing empirics into a decade full of conjectures. – Dirk Eddelbuettel Aug 15 '15 at 19:34
  • @cryo111, I am not sure I would call this stable; tag spreads have continuously increased since the starting date of your query. Also, StackExchange does not seem to be the main platform of exchange for Python power users. I did not mention it in my original post, but a majority of research work in AI (ML and especially deep-learning networks) is performed with Python as the wrapper; all major tools such as Theano or Torch provide Python but no R libraries. So I hold on to my hypothesis, which I feel is even strengthened today vs. 7 months ago. – Matt Wolf Jan 08 '16 at 10:41
  • 2
    @MattWolf In my above post, I was referring to the tag ratio. Maybe I was unclear about that. I will add 3 more queries for the ratios. Wrt these, things seem relatively stable (apart from quant.stackexchange, where volume is low and therefore very noisy). What is the main platform for Python users? I am not a Python programmer, so I honestly don't know. But I agree with you that R probably won't overtake Python in terms of absolute user numbers [which I have never claimed :)]. – cryo111 Jan 08 '16 at 12:04
  • @cryo111, not sure we are looking at the same data (and I have not run time-series queries over StackExchange data myself, just looked at simple tag counts). Python currently stands at 517,027 tags counted, while R tags amount to 119,847. Not sure how you can get to a Py/R quotient ever being smaller than 1... (source: https://api.stackexchange.com/docs/tags#order=desc&sort=popular&filter=default&site=stackoverflow&run=true). Now if the most recent monthly counts strongly point in the opposite direction... – Matt Wolf Jan 09 '16 at 04:48
  • ...as you suggest, then would this not make you think hard about why that is the case (if even remotely true)? I think most everyone would agree that Python, even in pure quant space, currently sports stronger momentum than R. I suggest your Py/R tag-count ratio of 0.06 is picking up data from a very biased data set (Cross Validated) rather than looking at the overall StackExchange dataset. Please correct me if I read your query wrong. I am not into language wars, but numbers are numbers and facts are facts, an approach you brought up. – Matt Wolf Jan 09 '16 at 04:56
  • 1
    @MattWolf No, you are not looking at the same data. Your numbers are sums over all tags (since the beginning of the SE universe), whereas my numbers are per-month aggregations. I chose a time-series representation because I wanted to see the trend. The Py/R quotient smaller than 1 comes from http://stats.stackexchange.com. I have also included SO and quant SE, as these seem to be the most interesting for quants. I don't know of any other SE sites that might be relevant. But if you know other sites, you can easily switch the above queries to one of them. – cryo111 Jan 10 '16 at 14:10
  • Your queries imho are a textbook case of selection bias. All they express is that there are more R vs. Python questions on the stats and quant sites. I do not buy the claim that past tag counts are a reliable indication of where the road takes us, else I would not have asked the question but counted tags. Perhaps a lot of the R count on the quant site represents Eddelbuettel and friends maintaining rapport with users of their R packages? Maybe it's just that not a whole lot of NumPy/... developers hang out here? – Matt Wolf Jan 10 '16 at 18:03
  • Or perhaps this site has completely missed the fact that most banks and hedge funds are by now fully entrenched in AI research (which disregards R and heavily uses Python)? There could be a thousand reasons for the tag count on the quant site. In conclusion, supported by the fact that one of the hottest topics in quant and technology space in general (AI and deep learning) fully embraces Python, I easily stand by my prediction. – Matt Wolf Jan 10 '16 at 18:04
26

This is interesting because I see another trend: Matlab is being replaced by R, but I guess this is another story.

I use R for my academic (I am also teaching this stuff) as well as my consulting work (I am mainly working in the $\mathbb{P}$ area, with some excursions into $\mathbb{Q}$). I tried Python but it didn't work for me. I think the main reasons I will stick with R are:

  • especially in the area of statistics and analytics, there is such a huge number of high-quality packages, sometimes with very recent methods, that it is unrivalled by any other language out there
  • for me, R has the right mixture of low-level capabilities for, e.g., (re-)organizing data and high-level commands (e.g., even k-means in the core package)
  • the speed is OK for me because I am not working in HFT, and there are many possibilities for speeding up code (vectorization, parallelization, good connectivity with C, and so forth)
  • the community is really very much into the kind of stuff I am interested in, whereas with Python it is really everybody and his dog doing all kinds of stuff I am not interested in... I guess this is also about the mindset of how to approach some problems, I don't know.

I think in general one should focus: I wouldn't try to build a webpage or a game with R, but when it comes to statistics and analytics I think Python is no real competitor, and I would strongly recommend R as your future setup.

Edit
I also wrote a blog post with additional points about why R is better suited for data science than Python: http://blog.ephorie.de/why-r-for-data-science-and-not-python

vonjd
  • I agree that the available packages pertaining to stats, math, and financial math are quite numerous in R, though the current rate of new packages targeting the above areas seems to be a lot higher in Python than in R these days. I got the impression that R might be obsolete in 3-4 years, given so much that is done or ported over to Python right now, and that is what caused me to ask this question: to gauge whether others share those observations. Thanks for your input on this. – Matt Wolf May 19 '15 at 15:18
  • 18
    I got the impression that R might be obsolete in 3-4 years. I take the other side on that bet. I actually watch what packages get added every day and I don't see this as stagnating at all. – Dirk Eddelbuettel May 20 '15 at 03:06
  • I'd take the other side of that bet too, Matt. These things take a long time. The last time I checked, many of our academic brethren were still enamored with Fortran. In all seriousness though, R is alive and growing, just maybe not the best for the broader use case you describe above. – rhaskett May 20 '15 at 21:21
  • 1
    I stand by my own estimate, but I did not intend to flame or cause discontent. Sorry if that number above rattled some cages. I just hear that a lot of new quant projects get started in Python rather than R, which got me thinking and caused me to ask this question. R has the strength of an existing library repository, but the growth momentum seems to be on the Python side. – Matt Wolf May 21 '15 at 04:03
  • @DirkEddelbuettel, most of those are version updates, plus I understand and respect that you are taking the other side of that bet (though I never offered a bet but voiced an impression). You are heavily invested in R, and therefore I get why you have a different impression. Would be nice if you could write up a short answer stating what you use R for and why you think it is a better tool for you than Python. – Matt Wolf May 21 '15 at 04:07
  • 3
    I'm more-or-less with @vonjd on this. I've used a bunch of languages, but I'm most productive for statistics and analytics in R. Python is a great language and numpy and pandas are a fantastic combination. However, the community for R is just so much better. I once spent several hours over a few days to get ipopt to work on Python, and then it just worked without any real effort with R. I also don't like the conda package manager because I can't get it to work behind a corporate firewall. – John May 21 '15 at 15:32
  • @John, thank you for sharing, very interesting and valuable information. – Matt Wolf May 21 '15 at 15:44
  • thought to share a link about Man AHL (a firm that I know well as an outsider): http://www.computerweekly.com/news/2240238397/Case-study-Hedge-fund-AHL-Man-Group-uses-MongoDB-to-feed-quants-with-data . The article talks about MongoDB, but somewhere in the middle it talks about how they consolidated their code to Python – uday May 24 '15 at 19:02
23

I've used both R and Python with pandas in professional quantitative finance work on both large- and small-scale projects. I would strongly recommend Python with pandas over R for most new projects in the field, especially in time series analysis.

While I don't dispute vonjd in that you will find more libraries in R with algorithms on the bleeding edge of statistical research, the libraries in Python are very robust and fleshed out in that area. Also, I find in my work and the work of my colleagues that we are grabbing libraries from electrical engineering, computer vision, big data and more. People in these fields mostly have libraries in Python, not R.

However, the main advantage of Python over R in this field is workflow. The workflow with R tended to be that you used Perl/Python for data cleaning, preparation, and database work because R was too slow and awkward for large, complicated datasets (though this is getting better). You then built the statistical model in R, taking advantage of its libraries. Afterwards, the R model was rewritten in C for speed, control, interfacing, parallelization, and error handling in production.

Python can handle this full workflow start to finish. All the interconnectivity steps surrounding the main research projects are much more robust, and a lot of time is saved in development when using the same language throughout. Also, with pandas, even the core research portion and data handling are now easier and cleaner, in my opinion.
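As a sketch of that single-language workflow (file and column names are hypothetical placeholders, not a real dataset):

import pandas as pd
import statsmodels.api as sm

# 1. Acquire/clean: load a hypothetical CSV of daily series.
data = pd.read_csv('returns.csv', parse_dates=['date'], index_col='date')
data = data.dropna()

# 2. Research: a multivariate regression with statsmodels.
X = sm.add_constant(data[['factor1', 'factor2']])
model = sm.OLS(data['asset'], X).fit()
print(model.summary())

# 3. The same code base can then be hardened for production
#    instead of being rewritten in another language.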

In general, if you are focusing only on advanced statistics/data-mining time series research, then R and Python with pandas are interchangeable, at least for now. However, it sounds from your question like you are also worried about interconnectivity and architecture, and for that Python is far superior.

Edit for 2018: It's amazing how much easier it is to get into data munging in Python these days compared to when I first wrote this. Try Anaconda if you would like to check out Python/pandas without any fuss.

rhaskett
  • 1
    yes, that is another trend I am seeing: for time series analysis a lot of academic courses nowadays seem to have switched from R to Python as the teaching and demonstration tool. I am not generalizing, but a lot of students with Master's degrees I recently interviewed seem to have a much better grasp of Python than of R. But one thing that makes me not yet want to fully embrace Python is: what libraries exactly are out there that assist in time series analysis, derivatives pricing, modeling, and applying machine-learning techniques, aside from the generalized pandas, SciPy, ... packages? – Matt Wolf May 21 '15 at 04:11
  • 2
    On statistics, Python just doesn't have as much developed as R does on those fronts, but statsmodels is my go-to. One option is to just call R functions from Python (RPy, but apparently not well tested on Windows). On derivatives, pyql seems to be more developed than the R version. On machine learning, scikit-learn. One other benefit of Python is that the IPython Notebook is better than RStudio. However, the Jupyter project looks interesting (and should work with R). – John May 21 '15 at 15:40
  • @rhaskett, regarding interconnectivity, I think it is extremely important to be able to efficiently interface with other modules, hardware, and software applications. I believe it to be a myth that most who seriously perform data analyses do not need interconnectivity. In that respect I find Python a lot more capable, and it provides more efficient means to, for example, fan out computations to other hardware instances. – Matt Wolf May 24 '15 at 07:27
  • In reference to @statquant's answer, the OP didn't mention HFT, but I completely agree that for HFT you pretty much always end up in C. I know of some who feel the need to write their own C for an added boost over Rcpp, but your mileage may and almost certainly will vary. – rhaskett Oct 11 '15 at 06:10
13

For data analysis, particularly for large data-analysis projects, pretty much most of the top quant hedge funds and a lot of the banks are using Python (over R) for a couple of reasons, though many still keep bits and pieces of R for specific packages or functions (I work at a bank and interface with quite a few quant hedge funds on data analysis):

  1. Earlier, Python 2 used to have a lot of backward-compatibility issues, but Python 3 is more stable between versions. Even pandas has been very stable between versions since 0.13. No one wants to use a language for which they have to revisit and rewrite significant code sometime in the future.

  2. People needed the same code to run on both Linux and Windows. Installing and compiling packages in Python can be a super pain, whether on Linux or Windows. A lot of people did not want to start any new project in Python 2, since sometime in the future one would need to move to Python 3, and they stuck with R for quite a while. Also, for a while Python 3 was available only with the WinPython distro, and WinPython used to work only on Windows. Anaconda, the leading Python distro for Linux (and Mac), came out with Python 3 support sometime in 2014, which then caused a huge migration.

Advantages of Python (vs R):

(i) Raw speed is the biggest motive (allowing you to do way more statistical data analysis in the same time)

(ii) Pandas can read csv files very fast (one of the reasons why many folks moved from Matlab to R at some point)

(iii) Cython is more flexible than Rcpp (at least in my experience)

(iv) You can organize code files neatly into logical directories and classes within files (classes in R feel like an afterthought), and the project looks much better

(v) As of 2015, PyCharm is a significantly better IDE than RStudio (although RStudio is better than Spyder). Tools matter

Disadvantages of Python (vs R):

(i) The big issue with pandas used to be that it didn't have its own binary data format. R's RData format is a huge edge. PyData's HDF5-based storage is not easily compressible, throws errors every now and then, and for big data it was a hindrance. Pickle and other formats just didn't cut it. After years of Python-vs-R exploration, most ended up writing their own custom binary data format (to store pandas DataFrames) or using significant modifications of PostgreSQL for big-data storage.
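For reference, the pandas/PyTables HDF5 round-trip being discussed looks roughly like this (paths and options are illustrative, PyTables must be installed, and compression behavior depends on the versions involved):

import numpy as np
import pandas as pd

df = pd.DataFrame({'price': np.random.randn(1000)})

# Write the frame to an HDF5 store with blosc compression, then read it back.
df.to_hdf('store.h5', 'prices', mode='w', complevel=9, complib='blosc')
same = pd.read_hdf('store.h5', 'prices')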

Statistical packages are generally great with both languages.

I have projects in R that took 4 hours to run every day (overnight). Now, in Python, they take a total of 20 minutes (with much less use of Cython code than of Rcpp code in R). That's the speed difference for you.

To answer your question:

  • acquire, store, maintain, read, clean time series: Python is better

  • perform basic statistics on time series, advanced statistical models such as multivariate regression analyses, etc.: both Python and R

  • performing mathematical computations (Fourier transforms, PDE solvers, PCA): both Python and R

  • visualization of data (static and dynamic): both Python and R

  • pricing derivatives (application of pricing models such as interest rate models): both Python and R

  • interconnectivity (with Excel, servers, UI): Python is better

uday
  • Thank you for sharing your experience and providing pros and cons; I appreciate the balanced thought-sharing. Though regarding backward compatibility, does Python 3.x not break backward compatibility? And in terms of mathematical and statistical features of packages, R clearly still has the lead here, imho. At the same time, however, I do not see much value in 90%+ of R packages, because they target a very specific statistical approach and the implementation is not modularized and not extensible, so the functionality remains very limited, almost to the degree of single-time usage. – Matt Wolf May 24 '15 at 07:30
  • 1
    If you migrate from R to Python 3.x, you have less to worry about regarding backward compatibility; but if you migrate from R to Python 2.x, you will have to worry about backward compatibility if you later decide to switch from Python 2.x to Python 3.x – uday May 24 '15 at 18:51
  • 1
    also, regarding R packages or Python packages, a lot of real-world stuff involves modifications that standard packages can't handle. For example, say you want to create a tradable ICA or PCA from tradable time series (e.g., time series of stock or futures prices); you might want a liquidity-weighted ICA or PCA to avoid the top ICA or PCA factors loading up on some penny-cap stocks, etc. So you will end up looking at the source code of ICA or PCA in either R or Python and rewriting your own code – uday May 24 '15 at 18:57
  • what I meant was backward compatibility within the Python stack. Is it true that using Python 3.x prevents me from using packages that target 2.x? – Matt Wolf May 28 '15 at 00:59
  • 1
    packages that were originally written for 2.x and aren't made compatible with both 2.x and 3.x will likely not work without errors in 3.x. But the list of such packages is really very small; the vast majority of packages, like numpy, pandas, etc., work well with both 2.x and 3.x. – uday May 28 '15 at 18:59
6

For the tasks listed, both Python and R perform very well. There are some packages in Python that are not in R, and vice versa. My solution for this is simply to call R from Python, which allows for the best of both worlds.
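One common bridge for this is rpy2; the answer does not name a specific bridge, so this is just a minimal sketch of one way to do it (assuming R and the rpy2 package are installed):

import rpy2.robjects as robjects
from rpy2.robjects.packages import importr

# Load an R package and call R functions directly from Python.
stats = importr('stats')          # R's stats package
r_rnorm = robjects.r['rnorm']     # look up an R function by name
draws = r_rnorm(5)                # returns an R vector
print(list(draws))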

It is also important to note I do not write any R code other than calling an R library from Python.

Calling Python from R, on the other hand, does not work equally well across all major OSes.

pyCthon
6

I am also in the high-frequency / medium-frequency field.

I have received mixed views regarding the use of R and its prevalence in the field (specifically HFT). Speaking with someone who works in the equity options industry at a relatively small proprietary firm in San Francisco, I was told, "R is a legacy language".

However, speaking with someone who formerly led an HFT team at Goldman Sachs, I was told it is still the best language for time series analysis and statistics, especially for latency-sensitive projects. As for libraries, the following were mentioned:

  1. quantmod
  2. caret
  3. zoo
  4. xts
  5. highfrequency (tools for high-frequency data analysis)
  6. The popular open-source QuantLib library also has an R interface, RQuantLib.

And to reiterate what other answers to this question have said: given how heavily dependent the HFT field is on speed, R cannot be integrated into production HFT systems. However, the Rcpp package is a popular tool that makes integration with an HFT system both practical and easy.

I would not say R is dying, but it also does not have a monopoly on data analysis in quantitative finance in general. Python and Matlab are of great use in this field as well (I seem to be in the minority in my use of Matlab, but it is great).

Theodore
  • Thanks for sharing your experience. My prediction was not that R would be dying but that by 2019 Python would have far outpaced R's utility in quant space. I think it is safe to conclude this has already occurred. I occasionally see the odd R port attempting to link R up to ML or deep learning, but the overwhelming majority of development happens in Python space, whether in quant, ML, deep learning, or HFT; I am strictly speaking of research and development, not deployment. – Matt Wolf Jun 12 '18 at 04:15
  • 1
    @Matt There are none so blind as those who will not see. Deep learning is well supported by Python; as to HFT (for as much as it makes sense to relate HFT and such tools), I do not know of any serious firm that uses Python. – statquant Oct 24 '18 at 18:21
  • 1
    @statquant, perhaps then opening one's own eyes might be the perfect recommendation. If you hear of any recent interviewees who got hired into quant roles at hedge funds, HFT firms, or banks and were not asked about, and most likely tested on, Python skills, then do let me know. And DL is an area strongly pursued by all the entities listed above, where Python is by far the language of choice. I do appreciate that you are an R power user, all the blessings to you, but averages do not work by extrapolating from oneself. – Matt Wolf Oct 25 '18 at 23:49
2

The major advantage of Python (with pandas) over R is that Python supports OOP (object-oriented programming). It makes sense to organize a large code base using a hierarchy of classes. Python also supports polymorphism, so we can use well-known design patterns (e.g., Strategy, Observer, etc.) in our code.
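As a minimal sketch of the kind of design this answer has in mind (a hypothetical pricing Strategy with placeholder models, not production code):

from abc import ABC, abstractmethod

class PricingStrategy(ABC):
    @abstractmethod
    def price(self, spot):
        ...

class TreePricer(PricingStrategy):
    def price(self, spot):
        return spot * 0.10  # placeholder for a real lattice model

class MonteCarloPricer(PricingStrategy):
    def price(self, spot):
        return spot * 0.11  # placeholder for a real simulation

def run(pricer, spot):
    # Polymorphism: callers depend on the interface, not a concrete class.
    return pricer.price(spot)

print(run(MonteCarloPricer(), 100.0))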

wsw
  • 1
    First, R is also object-oriented; second, the problem I have with Python is that you never know when to use a function or a method, see e.g.: https://stackoverflow.com/q/8108688/468305 – vonjd Nov 06 '18 at 15:05
  • Stating that "R is also object-oriented" is similar to claiming that Perl is also object-oriented =) – wsw Nov 12 '22 at 18:50
  • Oh dear, obviously you know neither :-( Google is your friend ;-) – vonjd Nov 12 '22 at 22:34
  • Just because R lets one write lame S3 and S4 classes does not mean that R is an OO language. Take a look at the R code on GitHub: how much of that code uses OOP versus the user just calling a bunch of (non-member) functions? – wsw Nov 19 '22 at 04:09