The design of the table is this:
id | val | category
---+-----+---------
a1 | 10  | A
a1 | 30  | B
a1 | 20  | C
a2 |  5  | A
a2 |  7  | B
a2 |  2  | C
a3 | 50  | C
a3 | 60  | B
a3 | 90  | A
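For reference, this is roughly how the table is defined; it's only a sketch, and the name my_table and the column types are placeholders for the real ones (in the query below I just write "table"):

CREATE TABLE my_table (
  id       STRING,
  val      INT,
  category STRING
);
-- populated with the nine rows shown above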
The query is this:
SELECT max(val), id, category FROM table GROUP BY id;
I am sure this query works on relational databases like MySQL (I tested it on MySQL), Oracle, MS SQL Server, etc. So why does it not work on Spark?
Am I right to say that "Spark has some limitations when using groupBy"? I tested the same table design on MySQL and it works perfectly, but on Spark it gives me an error. Here is the error:
org.apache.spark.sql.AnalysisException: expression 'category' is neither present in the group by, nor is it an aggregate function. Add to group by or wrap in first() (or first_value) if you don't care which value you get.;
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:38)
After looking into this error, it seems Spark is basically suggesting the first() or first_value() function as a workaround. I tried it, but I didn't get the expected output, or at least I am not 100% sure the output is right.
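Here is roughly the variant I tried based on that hint (the table name here is just a placeholder):

SELECT id, max(val), first(category)
FROM table
GROUP BY id;

This runs, but as far as I can tell first(category) just returns some category for each id, and I can't tell whether it is the one from the row that actually holds the max(val).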
- Is it because Spark is non-relational?
- Can I assume that this specific use of groupBy doesn't work on other non-relational databases either?
- Can somebody think of a workaround, or a better alternative approach? (The closest I have come up with is the join sketch after this list.)
- I did some research, and somewhere it was said that "Spark versions above 2.0 will not have such problems". I am on Spark 1.6; is it really true that Spark 2.0 won't have this issue?
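For reference, the join-based alternative I mentioned above is this; it's only a sketch, assuming that what I actually want is the full row holding the maximum val for each id (table and alias names are placeholders):

SELECT t.id, t.val, t.category
FROM table t
JOIN (
  -- per-id maximum, then join back to pick the matching row
  SELECT id, max(val) AS max_val
  FROM table
  GROUP BY id
) m ON t.id = m.id AND t.val = m.max_val;

I haven't verified this thoroughly, so I'm not sure whether it is the right or idiomatic way to do this in Spark.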
Please correct me if I am wrong somewhere. Thanks a lot in advance!