
It seems as though most languages have some number of scientific computing libraries available.

  • Python has Scipy
  • Rust has SciRust
  • C++ has several including ViennaCL and Armadillo
  • Java has Java Numerics and Colt, as well as several others

Not to mention languages like R and Julia designed explicitly for scientific computing.

With so many options, how do you choose the best language for a task? And which languages will be the most performant? Python and R seem to have the most traction in the space, but logically a compiled language seems like the better choice. Will anything ever outperform Fortran? Compiled languages also tend to have GPU acceleration, while interpreted languages like R and Python don't. What should I take into account when choosing a language, and which languages provide the best balance of utility and performance? Also, are there any languages with significant scientific computing resources that I've missed?

ragingSloth
  • There is no question here. If you need to do basic research on programming languages, you are better off reading Wikipedia than waiting for someone to pop up here and push his hobby-horse. – Dirk Eddelbuettel Jun 16 '14 at 20:05
  • @DirkEddelbuettel Very good point. Thought it was better to try producing content than refining it at this point in the Beta, but I don't know a huge amount about SE betas. Was that a good move on my part or not? – indico Jun 16 '14 at 20:11
  • Look at these numbers. – Emre Jun 16 '14 at 20:20
  • @DirkEddelbuettel you're not wrong, but my hope was to foster a discussion about the useful characteristics and tools associated with various languages. The language you use is an important tool in data science, so my thinking was that people could discuss the tools they preferred and their objective benefits here, as a resource for those looking to attempt similar work. – ragingSloth Jun 16 '14 at 21:08
  • This area is subject to such rapid change that it's probably best to strictly limit the scope of this question to "what qualities make for the best tool" rather than "which tools exist and what are their qualities" - even as a community wiki, the latter would require an intimidating amount of continued maintenance in order to be net useful – Air Jun 16 '14 at 21:54
  • You need to learn either R or Python properly and also be able to understand the other. R has an unbeatable number of stats packages and nothing will ever be able to compare to that. – Simd Jun 17 '14 at 07:48
  • @Lembik I appreciate the evangelism, but there isn't a single piece of statistical functionality that R has that Python doesn't. There are, on the other hand, a large number of scientific and machine learning functions that R lacks, which are present in python. Good opencv bindings are a great example. – indico Jun 17 '14 at 14:45
  • @indico I am not sure if I understand your point. There are literally thousands of packages in CRAN which don't exist for python. See http://cran.r-project.org/web/packages/available_packages_by_name.html . – Simd Jun 17 '14 at 14:54
  • @Lembik packages yes, but I am referring to functionality. The vast majority of the statistical functionality present in those R packages is contained simply within scipy – indico Jun 17 '14 at 14:56
  • @indico This simply isn't true I am afraid. People who do PhDs in stats, for example, still routinely write an R package as part of that. These very very rarely get implemented in scipy. How many new stats functions does scipy add every year? Try just picking a few of the R packages at random to see what I mean. – Simd Jun 17 '14 at 14:58
  • @Lembik I don't really want to argue on this one. Could you please provide an example? scipy adds a huge number of statistical functions every year; I'm not really keeping track, but I would guess a few hundred each year. – indico Jun 17 '14 at 15:00
  • @indico Try http://cran.r-project.org/web/packages/overlap/index.html which is just the first one I happened to pick at random. But really, I have personally known many statisticians who have written R packages. Not one of them has yet written a python one. To broaden the conversation a little, http://www.kdnuggets.com/2013/08/languages-for-analytics-data-mining-data-science.html is interesting. – Simd Jun 17 '14 at 15:08
  • @Lembik and my rebuttal in the form of a blog post: http://jakevdp.github.io/blog/2013/12/01/kernel-density-estimation/ – indico Jun 17 '14 at 15:11
  • @indico Oh I read that before. If you are the author of that blog I greatly admire it. However it is in no way a rebuttal to the fact that there are thousands of R packages written by statisticians which have no python equivalent :) – Simd Jun 17 '14 at 15:13
  • @Lembik I'm not trying to argue about a direct 1-to-1 correlation here. I'm trying to argue that in terms of raw functionality, here referring to kernel density estimation, there is nothing fundamentally missing from python. And if we're going down the missing-functionality line, there are an entire order of magnitude more python packages, plus bindings to amazing academic projects such as opencv, which do not exist in R. – indico Jun 17 '14 at 15:15
  • @indico You are right in terms of non stats packages. My comment was only meant to refer to statistics. You may also be right about kernel density estimation although it looked like the overlap package had more in it than your blog post referred to. It would perhaps be interesting for someone to go through the thousands of R packages and see what is missing from python but I still claim that the fact is that PhD students from the best stats departments write R packages for their work and put them on CRAN. So if you have an interest in cutting edge stats you need to use those packages. – Simd Jun 17 '14 at 15:21
  • @indico Just picking some more at random which have the word "kernel" in them. See http://cran.r-project.org/web/packages/bark/index.html , http://cran.r-project.org/web/packages/bbefkr/index.html , http://cran.r-project.org/web/packages/bpkde/index.html , http://cran.r-project.org/web/packages/DBKGrad/DBKGrad.pdf – Simd Jun 17 '14 at 15:25

3 Answers


This is a pretty massive question, so this isn't intended to be a full answer, but hopefully it can help inform general practice around determining the best tool for the job in data science. Generally, I have a relatively short list of qualifications I look for in any tool in this space. In no particular order, they are:

  • Performance: This basically boils down to how quickly the language does matrix multiplication, as that is more or less the most important task in data science (see the benchmark sketch after this list).
  • Scalability: At least for me personally, this comes down to ease of building a distributed system. This is somewhere where languages like Julia really shine.
  • Community: With any language, you're really looking for an active community that can help you when you get stuck using whichever tool you're using. This is where python pulls very far ahead of most other languages.
  • Flexibility: Nothing is worse than being limited by the language you use. It doesn't happen very often, but trying to represent graph structures in Haskell is a notorious pain, and Julia has a lot of code-architecture pains as a result of being such a young language.
  • Ease of Use: If you want to use something in a larger environment, you want setup to be straightforward and automatable. Nothing is worse than having to set up a finicky build on half a dozen machines.
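To ground the performance point, here's a minimal sketch of the kind of matrix-multiplication micro-benchmark I have in mind, in python with numpy. The matrix size and timing method are arbitrary choices for illustration, not a rigorous benchmark:

```python
# Minimal matmul micro-benchmark (illustrative, not rigorous).
# numpy hands the multiply off to an optimized BLAS, which is why an
# interpreted language can be competitive on exactly this workload.
import time
import numpy as np

n = 2000
a = np.random.rand(n, n)
b = np.random.rand(n, n)

start = time.perf_counter()
c = np.dot(a, b)  # BLAS dgemm under the hood
elapsed = time.perf_counter() - start

print("%dx%d matmul took %.3fs" % (n, n, elapsed))
```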

There are a ton of articles out there about performance and scalability, but in general you're going to be looking at a performance differential of maybe 5-10x between languages, which may or may not matter depending on your specific application. As far as GPU acceleration goes, cudamat is a really seamless way of getting it working with python, and the cuda library in general has made GPU acceleration far more accessible than it used to be.
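For what it's worth, getting started with cudamat looks something like the sketch below. This follows cudamat's documented interface as I remember it; exact names may differ between versions, so treat it as an assumption to verify rather than a definitive recipe:

```python
# Hedged sketch of GPU matrix multiplication via cudamat.
# API names (cublas_init, CUDAMatrix, dot, asarray) follow the cudamat
# README; verify against the version you actually install.
import numpy as np
import cudamat as cm

cm.cublas_init()  # set up the CUDA/cuBLAS context

a = cm.CUDAMatrix(np.random.rand(1000, 1000))  # copy host -> GPU
b = cm.CUDAMatrix(np.random.rand(1000, 1000))

c = cm.dot(a, b)      # multiply on the GPU
result = c.asarray()  # copy the product back into a numpy array

cm.cublas_shutdown()
```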

The two primary metrics I use for both community and flexibility are the language's package manager and the volume of language questions on a site like SO. If there are a large number of high-quality questions and answers, it's a good sign that the community is active. The number of packages, and the general activity on those packages, can also be a good proxy for this metric.
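As a rough illustration of the SO metric, you can pull per-tag question counts from the Stack Exchange API. The endpoint and response shape below are my reading of the v2.x API docs, not something this answer depends on, so double-check them before relying on this:

```python
# Rough community-size proxy: number of SO questions per language tag.
# The /tags/{tag}/info endpoint and response shape are assumptions
# based on the Stack Exchange API v2.x docs.
import requests

def so_tag_count(tag):
    resp = requests.get(
        "https://api.stackexchange.com/2.3/tags/%s/info" % tag,
        params={"site": "stackoverflow"},
    )
    resp.raise_for_status()
    return resp.json()["items"][0]["count"]

for lang in ("python", "r", "julia"):
    print(lang, so_tag_count(lang))
```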

As far as ease of use goes, I am a firm believer that the only way to actually know is to set it up yourself. There's a lot of superstition around a lot of Data Science tools, specifically things like databases and distributed computing architectures, but there's no way to really know whether something is easy or hard to set up and deploy without just building it yourself.

indico
  • To add to this answer: in terms of scalability, Scala and Go are worth mentioning. – Marc Claesen Jun 20 '14 at 07:17
  • I would add clarity and brevity (related to syntax and language architecture, but not only those). Being able to write fast and read without pain makes a huge difference (as programmer time is more expensive than machine time). – Piotr Migdal Jun 23 '14 at 11:14

The best language depends on what you want to do. First remark: don't limit yourself to one language. Learning a new language is always a good thing, but at some point you will need to choose. Facilities offered by the language itself are an obvious thing to take into account, but in my opinion the following are more important:

  • available libraries: do you have to implement everything from scratch, or can you reuse existing stuff? Note that these libraries need not be written in the language you are considering, as long as you can interface with them easily. Working in a language without library access won't help you get things done.
  • number of experts: if you want external developers or to start working in a team, you have to consider how many people actually know the language. As an extreme example: if you decide to work in Brainfuck because you happen to like it, know that you will likely work alone. Many surveys exist that can help assess the popularity of languages, including the number of questions per language on SO.
  • toolchain: do you have access to good debuggers, profilers, documentation tools and (if you're into that) IDEs?

I am aware that most of my points favor established languages. This is from a 'get-things-done' perspective.

That said, I personally believe it is far better to become proficient in both a low-level language and a high-level language:

  • low level: C++, C, Fortran, ... These let you implement the hot spots your profiler identifies, and only when you need to, because developing in these languages is typically slower (though this is subject to debate). They remain king of the hill in terms of raw performance and are likely to stay on top for a long time (see the glue sketch after this list).
  • high level: Python, R, Clojure, ... to 'glue' stuff together and do the non-performance-critical work (preprocessing, data handling, ...). I find this important simply because it is much easier to do rapid development and prototyping in these languages.
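As a concrete, if contrived, illustration of that split, here is a minimal sketch of gluing a low-level hot spot into a high-level language: a hypothetical C function called from python via ctypes. The file and symbol names are invented for the example:

```python
# Sketch of the low-level/high-level split: a hot loop written in C,
# called from python via ctypes. The C side (hypothetical dotsum.c),
# compiled with:  cc -O2 -shared -fPIC -o libdotsum.so dotsum.c
#
#   double dot(const double *x, const double *y, int n) {
#       double s = 0.0;
#       for (int i = 0; i < n; i++) s += x[i] * y[i];
#       return s;
#   }
import ctypes

lib = ctypes.CDLL("./libdotsum.so")
lib.dot.restype = ctypes.c_double
lib.dot.argtypes = [ctypes.POINTER(ctypes.c_double),
                    ctypes.POINTER(ctypes.c_double),
                    ctypes.c_int]

n = 5
xs = (ctypes.c_double * n)(*range(n))  # 0.0, 1.0, 2.0, 3.0, 4.0
ys = (ctypes.c_double * n)(*range(n))
print(lib.dot(xs, ys, n))              # 0 + 1 + 4 + 9 + 16 = 30.0
```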
Marc Claesen

First you need to decide what you want to do, then look for the right tool for that task.

A very general approach is to use R for first versions, to see if your approach is correct. It lacks a little in speed, but has such powerful commands and add-on libraries that you can try almost anything with it: http://www.r-project.org/

The second idea: if you want to understand the algorithms behind the libraries, you might want to take a look at Numerical Recipes. The code is available for different languages and is free to use for learning; if you want to use it in commercial products, you need to purchase a licence: http://en.wikipedia.org/wiki/Numerical_Recipes

Most of the time performance will not be the issue; finding the right algorithms and the right parameters for them will be. So it is important to have a fast scripting language instead of a monster program that first needs to compile for 10 minutes before calculating two numbers and printing the result.

And a big plus of using R is that it has built-in functions or libraries for almost any kind of diagram you might need to visualize your data.

Once you have a working version, it is fairly easy to port it to any other language you think is more performant.

Armin