9

I want to validate whether a Spark SQL query is syntactically correct without actually running the query on the cluster.

The actual use case is that I am developing a user interface that lets a user enter a Spark SQL query, and I need to verify whether the query is syntactically correct. Ideally, after parsing the query I would also like to offer recommendations with respect to Spark best practices.

Def_Os
Manish Mehra
  • How about executing `val df = spark.sql(query)`? It will be parsed, but the query will not be executed. If there is an exception, then the syntax is most probably wrong. – MaxU - stop genocide of UA Oct 27 '17 at 12:51

4 Answers

6

SparkSqlParser

Spark SQL uses SparkSqlParser as the parser for Spark SQL expressions.

You can access SparkSqlParser using SparkSession (and SessionState) as follows:

val spark: SparkSession = ...
val parser = spark.sessionState.sqlParser

scala> parser.parseExpression("select * from table")
res1: org.apache.spark.sql.catalyst.expressions.Expression = ('select * 'from) AS table#0

TIP: Enable INFO logging level for org.apache.spark.sql.execution.SparkSqlParser logger to see what happens inside.
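
If you need a purely syntactic check of a whole statement rather than a single expression, the same parser also exposes parsePlan, which builds an unresolved logical plan without looking anything up in the catalog. A minimal sketch, assuming a Spark 2.x/3.x SparkSession (isSyntacticallyValid is just an illustrative helper name):

import scala.util.Try

val parser = spark.sessionState.sqlParser

// parsePlan parses a complete SQL statement into an unresolved LogicalPlan.
// It throws ParseException on bad syntax but never checks that tables or columns exist.
def isSyntacticallyValid(sql: String): Boolean =
  Try(parser.parsePlan(sql)).isSuccess

isSyntacticallyValid("select * from some_table where x > 1")  // true, even if some_table does not exist
isSyntacticallyValid("hello world")                           // false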

SparkSession.sql Method

That alone won't give you the most bullet-proof shield against incorrect SQL expressions, though, and I think the sql method is a better fit.

sql(sqlText: String): DataFrame

Executes a SQL query using Spark, returning the result as a DataFrame. The dialect that is used for SQL parsing can be configured with 'spark.sql.dialect'.

See both in action below.

scala> parser.parseExpression("hello world")
res5: org.apache.spark.sql.catalyst.expressions.Expression = 'hello AS world#2

scala> spark.sql("hello world")
org.apache.spark.sql.catalyst.parser.ParseException:
mismatched input 'hello' expecting {'(', 'SELECT', 'FROM', 'ADD', 'DESC', 'WITH', 'VALUES', 'CREATE', 'TABLE', 'INSERT', 'DELETE', 'DESCRIBE', 'EXPLAIN', 'SHOW', 'USE', 'DROP', 'ALTER', 'MAP', 'SET', 'RESET', 'START', 'COMMIT', 'ROLLBACK', 'REDUCE', 'REFRESH', 'CLEAR', 'CACHE', 'UNCACHE', 'DFS', 'TRUNCATE', 'ANALYZE', 'LIST', 'REVOKE', 'GRANT', 'LOCK', 'UNLOCK', 'MSCK', 'EXPORT', 'IMPORT', 'LOAD'}(line 1, pos 0)

== SQL ==
hello world
^^^

  at org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:217)
  at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:114)
  at org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:48)
  at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:68)
  at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:638)
  ... 49 elided
Jacek Laskowski
  • So it is not possible to validate a query without executing it? – Ayoub Omari Mar 12 '20 at 09:25
  • It is. See that part about SparkSqlParser. A more sophisticated validation would require logical validation (i.e. checking whether tables, columns and types match). I think it's possible using QueryExecution and a few logical operators. What validation do you think about? – Jacek Laskowski Mar 12 '20 at 10:25
  • For my case, I am thinking especially about the syntactical validation of the query (which I guess happens in an early stage of Catalyst), and also checking whether query selected columns exist in the view (which I think happens in a subsequent stage) – Ayoub Omari Mar 12 '20 at 10:39
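
Following up on the comment thread above: a minimal sketch of how the two validation levels could be separated without running any job, assuming Spark 2.x/3.x APIs (validationLevel is just an illustrative helper name). spark.sql parses the statement and then analyzes it through its QueryExecution, resolving tables, columns and types, but nothing is executed until an action is called.

import org.apache.spark.sql.AnalysisException
import org.apache.spark.sql.catalyst.parser.ParseException

def validationLevel(sql: String): String =
  try {
    // Forces analysis (table/column/type resolution) via QueryExecution; no Spark job is triggered.
    spark.sql(sql).queryExecution.analyzed
    "syntactically and logically valid"
  } catch {
    case _: ParseException    => "syntax error"
    case _: AnalysisException => "parses fine, but tables/columns do not resolve"
  }
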
0

Following @JacekLaskowski's answer, I found that SparkSqlParser gave me all sorts of errors that were not really syntax errors.

I therefore agree with him and suggest simply throwing it into SparkSession.sql, which works fine. This is what my method looks like:

  import org.apache.spark.sql.{AnalysisException, SparkSession}
  import org.apache.spark.sql.catalyst.parser.ParseException

  /**
   * Validates a Spark SQL statement by running it through SparkSession.sql
   * and checking that no syntax-related exception is thrown.
   */
  def validate(sqlStatement: String): Unit = {
    val spark = SparkSession.builder
      .master("local")
      .getOrCreate()
    try {
      spark.sql(sqlStatement)
    } catch {
      case ex: ParseException => throw new MyCustomException("Invalid Spark SQL", ex)
      case _: AnalysisException => // Syntax was correct; only analysis (e.g. a missing table) failed
    }
  }
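
A possible usage sketch of the method above (the query strings are just illustrative; MyCustomException is the wrapper used in the method):

  validate("select id, name from some_view where id > 10") // returns normally: parses fine; an unresolved view only raises AnalysisException, which is swallowed
  validate("selecct * from some_view")                     // throws MyCustomException wrapping the ParseException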
Def_Os
0

For PySpark, you can use the following:

from pyspark.sql import SparkSession
from pyspark.sql.utils import ParseException


spark_session = SparkSession \
    .builder \
    .master("local[1]") \
    .appName('test') \
    .getOrCreate()

query_string = "select * from my_table"  # query to validate (example placeholder)

try:
    parser = spark_session._jsparkSession.sessionState().sqlParser()
    parser.parseExpression(query_string)
    print('Validation successful')
except ParseException as e:
    # do something with the parse error, e.g. report it
    print('Validation failed: {}'.format(e))
Giorgos Myrianthous
0

I found a few issues using SparkSqlParser, and it didn't support my use case. For instance, the valid Spark SQL statement below threw an exception:

val parser = spark.sessionState.sqlParser
val query = "select * from a.b where datamonth = '202203' limit 10"
parser.parseExpression(query)

Exception

org.apache.spark.sql.catalyst.parser.ParseException:
mismatched input '.' expecting {<EOF>, '-'}(line 1, pos 15)

== SQL ==
select * from a.b where datamonth = '202203' limit 10
---------------^^^
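
As an aside, the error above appears to come from parseExpression being intended for single expressions rather than complete statements; parsePlan on the same parser accepts the full query (a hedged observation based on the parser interface shown in the first answer), though it still would not catch the non-existent table, which the explain-based approach below does.

parser.parsePlan("select * from a.b where datamonth = '202203' limit 10")  // parses, but does not verify that a.b exists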

Solution

So the solution below, based on the explain output (explainString), helped me with both syntactic and logical correctness.

import org.apache.spark.sql.execution.ExplainMode
import scala.util.{Failure, Success, Try}

// Builds the query plan (simple explain) without running the query;
// any parse or analysis error surfaces as a Failure.
def validateQuery(query: String): String =
  Try(spark.sql(query).queryExecution.explainString(ExplainMode.fromString("simple"))) match {
    case Success(plan) => "SUCCESS" + plan
    case Failure(ex)   => "FAILURE => " + ex.getMessage
  }

Test 1 - Valid query

validQuery: String = select * from bi.booking where datamonth = '202203' limit 10

"SUCCESS== Physical Plan ==
CollectLimit 10
+- FileScan parquet bi.booking[booking_id#1337613396L,page_type_id#13397L,platform_id#13398L,page_type_platform#13399,... 5 more fields] Batched: false, DataFilters: [], Format: Parquet, Loc...

Test 2 - Syntactically invalid query

invalidQuery: String = select * from bi.booking datamonth = '202203' limit 10

"FAILURE =>
mismatched input '=' expecting {<EOF>, ';'}(line 1, pos 47)

== SQL ==
select * from bi.booking datamonth = '202203' limit 10
-------------------------------^^^
"

Test 3 - Query with non existing table

noTableQuery: String = select * from a.b where datamonth = '202203' limit 10

"FAILURE => Table or view not found: a.b; line 1 pos 14;
'GlobalLimit 10
+- 'LocalLimit 10
   +- 'Project [*]
      +- 'Filter ('datamonth = 202203)
         +- 'UnresolvedRelation [a, b], [], false
Kondasamy Jayaraman