9

This may be too broad of a question with heavy opinions, but I really am finding it hard to seek information about running various algorithms using SQL Server Analysis Service Data Mining projects versus using R. This is mainly because all the data science guys I work with don't have any idea about SSAS because no one seems to use it. :)

The Database Guy

Before I start, let me clarify. I am a database guy and not a data scientist. I work with people who are data scientist who mainly use R. I assist these guys with creating large data sets where they can analyze and crunch data.

My objective here is to leverage a tool that came with SQL Server that no one is really leveraging because no one seems to have a clue about how it works in comparison to other methods and tools such as R, SAS, SSPS and so forth in my camp.

SSAS

I have never really used SQL Server Analysis Services (SSAS) outside of creating OLAP cubes. Those who know SSAS, you can also perform data mining tasks on cubes or directly on the data in SQL Server.

SSAS Data Mining comes with a range of algorithm types:

  • Classification algorithms predict one or more discrete variables, based on the other attributes in the dataset.
  • Regression algorithms predict one or more continuous variables, such as profit or loss, based on other attributes in the dataset.
  • Segmentation algorithms divide data into groups, or clusters, of items that have similar properties.
  • Association algorithms find correlations between different attributes in a dataset. The most common application of this kind of algorithm is for creating association rules, which can be used in a market basket analysis.
  • Sequence analysis algorithms summarize frequent sequences or episodes in data, such as a Web path flow.

Predicting Discrete Columns

With these different algorithm options, I can start making general predictions from the data such as finding out simply who is going to buy a bike based on a predictable column, Bike Buyers, against an input column, Age. The histogram shows that the age of a person helps distinguish whether that person will purchase a bicycle.

enter image description here

Predicting Continuous Columns

When the Microsoft Decision Trees algorithm builds a tree based on a continuous predictable column, each node contains a regression formula. A split occurs at a point of non-linearity in the regression formula. For example, consider the following diagram.

enter image description here

Comparison

With some of that said, it seems I can run a range of algorithms on the data and also have various functions available to me in SSAS to run against the data. It also seems I can develop my own algorithms in Visual Studio and deploy them to SSAS (if I'm not mistaken).

So, what am I missing here in regards to languages and tools from R? Is it just that they have more flexibility to deploy and edit complex algorithms versus SSAS etc?

Fastidious
  • 213
  • 2
  • 7

1 Answers1

2

In my opinion, it seems that SSAS makes more sense for someone who:

  • has significantly invested in Microsoft's technology stack and platform;
  • prefer point-and-click interface (GUI) to command line;
  • focus on data warehousing (OLAP cubes, etc.);
  • has limited needs in terms of statistical methods and algorithms variety;
  • has limited needs in cross-language integration;
  • doesn't care much about openness, cross-platform integration and vendor lock-in.

You can find useful this blog post by Sami Badawi. However, note that the post is not recent, so some information might be outdated. Plus, the post contains an initial review, which might be not very accurate or comprehensive. If you're thinking about data science, while considering staying within Microsoft ecosystem, I suggest you to take a look at Microsoft's own machine learning platform Azure ML. This blog post presents a brief comparison of (early) Azure ML and SSAS.

Aleksandr Blekh
  • 6,518
  • 4
  • 28
  • 54
  • Thanks. Most of the data I'm analyzing is in multi-dimensional form. NoSQL solutions and other cloud based solutions are not really an option for me. Thus, it doesn't make sense to look towards Azure. However, I'll check out the blog post. Unfortunately, SSAS hasn't really improved all that much over the years and I am on 2008 R2. – Fastidious Mar 27 '15 at 11:45
  • By the way, can you add more insight on the limited statistical methods and algorithm variety? I assumed you could develop your own methods and algorithms with SSAS just like you could with R? Or do you mean the availability of those packages is not as common as the open source community? – Fastidious Mar 27 '15 at 11:51
  • @Fastidious: You're welcome. Re: your first comment (I'll address the second one in the next comment). I don't understand your rationale behind rejecting Azure ML. While I'm not a big fan of Microsoft solutions, but for those who are tied to that technology stack, Azure ML seems like a sensible (while, for some, still might not be the best) option. Azure ML has direct ties with Azure, which, being a general cloud platform, supports whatever you throw at it. Wrap whatever environment and tools you use into (or install them separately on) a virtual machine and launch it. It is not cheap, though. – Aleksandr Blekh Mar 27 '15 at 11:55
  • @Fastidious: Absolutely. The richness of R ecosystem in terms of community and packages is unparalleled. Even considering recent Microsoft's acquisition of R-focused company Revolution Analytics, it is not going to change the overall situation (while it might improve some MS offerings, add integrations, etc.). R's foothold is way too solid. So, in theory, you can develop your data science projects on SSAS, but you'll experience, at least, two main obstacles: 1) limited variety of packages; 2) limited options for language and platform interoperability. If your needs are modest, it might be OK. – Aleksandr Blekh Mar 27 '15 at 12:07
  • The rational is because of strict policies of passing data outside of our network via API's and of course the fact the data is terabytes of data or gigabytes of data per day. So, storage costs, bandwidth, processing power and so forth are deep considerations depending on the costs. Then of course there is API development costs too. I can't just jump to a cloud solution at the drop of a hat hah. :) – Fastidious Mar 27 '15 at 12:27
  • @Fastidious: I see. Well, this situation requires a detailed and careful analysis of all factors involved, which hopefully will produce an optimal solution. Good luck! – Aleksandr Blekh Mar 27 '15 at 13:20
  • 1
    No worries. We do use R and feed data to it via SQL Server. I just wanted to see if we can still leverage everything in SQL too. – Fastidious Mar 27 '15 at 17:55
  • 1
    To whomever suggested removing my first bullet point: I have rejected your edit suggestion. Firstly, Microsoft doesn't own R, which is an open source project. The fact that MS acquired Revolution Analytics (RA) and integrated R, a subset of R huge ecosystem and RA's product into MS stack doesn't make R equally "comfortable" environment versus SSAS. Secondly, in any case, SSAS and other relevant MS' products are still more integrated with/into MS ecosystem for the near-term, at least. This is my opinion and I stand by it. As such, I'd like my answer to remain as is. Thank you for understanding. – Aleksandr Blekh Mar 16 '17 at 08:59