The design of the table is this:
id | val | category
---+-----+---------
a1 | 10  | A
a1 | 30  | B
a1 | 20  | C
a2 |  5  | A
a2 |  7  | B
a2 |  2  | C
a3 | 50  | C
a3 | 60  | B
a3 | 90  | A
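For reference, this is roughly how the table is defined; it's only a sketch, and the name my_table and the column types are placeholders for the real ones (in the query below I just write "table"):

CREATE TABLE my_table (
  id       STRING,
  val      INT,
  category STRING
);
-- populated with the nine rows shown above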
The query is this:
SELECT max(val), id, category FROM table GROUP BY id;
I am sure this query works on relational databases like MySQL (I tested it on MySQL), Oracle, MS SQL Server, etc. So why does it not work on Spark?
Am I right to say that "Spark has some limitations when using groupBy"? I tested the same table design on MySQL and it works perfectly, but on Spark it gives me an error. Here is the error:
org.apache.spark.sql.AnalysisException: expression 'category' is neither present in the group by, nor is it an aggregate function. Add to group by or wrap in first() (or first_value) if you don't care which value you get.;
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:38)
After looking into this error, it seems Spark is basically suggesting the first() or first_value() function as a workaround. I tried it, but I didn't get the expected output, or at least I am not 100% sure the output is right.
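Here is roughly the variant I tried based on that hint (the table name here is just a placeholder):

SELECT id, max(val), first(category)
FROM table
GROUP BY id;

This runs, but as far as I can tell first(category) just returns some category for each id, and I can't tell whether it is the one from the row that actually holds the max(val).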
- Is it because Spark is non-relational?
- Can I assume that this specific use of groupBy doesn't work on other non-relational databases either?
- Can somebody think of a workaround, or a better alternative approach? (The closest I have come up with is the join sketch after this list.)
- I did some research, and somewhere it was said that "Spark versions above 2.0 will not have such problems". I am on Spark 1.6; is it really true that Spark 2.0 won't have this issue?
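For reference, the join-based alternative I mentioned above is this; it's only a sketch, assuming that what I actually want is the full row holding the maximum val for each id (table and alias names are placeholders):

SELECT t.id, t.val, t.category
FROM table t
JOIN (
  -- per-id maximum, then join back to pick the matching row
  SELECT id, max(val) AS max_val
  FROM table
  GROUP BY id
) m ON t.id = m.id AND t.val = m.max_val;

I haven't verified this thoroughly, so I'm not sure whether it is the right or idiomatic way to do this in Spark.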
Please correct me if I am wrong somewhere. Thanks a lot in advance!