-1

I have a DataFrame that contains 8M rows. One column, EMAIL, contains email addresses, and I have to check that:

  1. Email value must be of the form _@_._
  2. Email value can only contain alphanumeric characters along with -_@.
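
The two requirements above can be expressed as a single regular expression. The following is a minimal sketch, not a full email validator. It assumes that "_@_._" means at least one allowed character on each side of the "@" and of the final dot, and that "alphanumeric" means ASCII letters and digits. The names EMAIL_PATTERN, is_valid_email and flag_valid_emails are hypothetical and chosen only for illustration.

```python
import re

# Assumed reading of the requirements: one or more allowed characters,
# an "@", one or more allowed characters, a ".", one or more allowed characters.
# Allowed characters are ASCII letters, digits, "-", "_" and ".".
EMAIL_PATTERN = r"^[A-Za-z0-9._-]+@[A-Za-z0-9._-]+\.[A-Za-z0-9._-]+$"

_compiled = re.compile(EMAIL_PATTERN)


def is_valid_email(value):
    """Return True if value matches EMAIL_PATTERN; None counts as invalid."""
    return value is not None and _compiled.fullmatch(value) is not None


def flag_valid_emails(df, column="EMAIL"):
    """Add a boolean is_valid column to a Spark DataFrame using rlike (no UDF)."""
    # Imported lazily so the pure-Python check above works without pyspark.
    from pyspark.sql.functions import coalesce, col, lit

    # rlike yields null for a null input, so coalesce maps that case to False.
    return df.withColumn(
        "is_valid", coalesce(col(column).rlike(EMAIL_PATTERN), lit(False))
    )


print(is_valid_email("aswin.raja@gm.com"))
print(is_valid_email("aswinraja.com"))
```

The pure-Python function is handy for trying the pattern on a few sample values before running it over all 8M rows. Spark evaluates rlike with Java's regex engine, so edge cases such as a trailing newline may behave differently from Python's re module.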


blackbishop
aamirmalik124

3 Answers

2

There is a Python library designed for this: validate_email.

The snippet below wraps it in a UDF to validate each email address.

from validate_email import validate_email
from pyspark.sql.types import BooleanType
from pyspark.sql.functions import udf

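# Wrap the library call in a UDF; null values are treated as invalid
valid_email = udf(lambda x: x is not None and bool(validate_email(x)), BooleanType())

# 'email' is the column name shown in the output below
emailvalidation.withColumn('is_valid', valid_email('email')).show()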

+--------------------+--------+
|               email|is_valid|
+--------------------+--------+
|aswin.raja@gm.com   |    true|
|                abc |   false|
+--------------------+--------+

Another way is to use a regular expression, as in the snippet below.

import re

# Raw string so the backslashes reach the regex engine unchanged
regex = r'^\w+([\.-]?\w+)*@\w+([\.-]?\w+)*(\.\w{2,3})+$'

def check(email):
    # re.search returns a match object (truthy) or None
    if re.search(regex, email):
        print("Valid Email")
    else:
        print("Invalid Email")


if __name__ == '__main__':
    email = "aswin.raja@gm.com"
    check(email)
    email = "aswinraja.com"
    check(email)


Output:

Valid Email
Invalid Email
AswinRajaram
1

You can use the code below to validate the email addresses in your table.

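from pyspark.sql.functions import col, udf
from pyspark.sql.types import BooleanType
import re

# Compile once instead of on every call; raw string keeps the backslashes intact
EMAIL_REGEX = re.compile(r'^\w+([\.-]?\w+)*@\w+([\.-]?\w+)*(\.\w{2,3})+$', re.IGNORECASE)

def regex_search(string):
    # Null column values arrive as None, which re would reject with a TypeError
    if string is None:
        return False
    return EMAIL_REGEX.search(string) is not None

validateEmail_udf = udf(regex_search, BooleanType())
df = df.withColumn("is_valid", validateEmail_udf(col("email")))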
abhaykagalkar
The code runs fine, but when I display the resulting DataFrame it shows the error "Job aborted due to stage failure: Task 5 in stage 357.0 failed 4 times, most recent failure: Lost task 5.3 in stage 357.0 (TID 16904, 10.139.64.4, executor 29): org.apache.spark.api.python.PythonException: Traceback (most recent call last):" Can you please help? – aamirmalik124 Feb 05 '20 at 11:27
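
The traceback quoted in the comment is cut off, so the actual cause is not known. One common cause of a PythonException inside a regex UDF is a null value in the column: Spark hands it to Python as None, and the re functions raise a TypeError for it. The failure only appears on display because Spark evaluates the UDF lazily. Below is a minimal pure-Python sketch, assuming that nulls are the cause here; unsafe_regex_search and safe_regex_search are hypothetical names.

```python
import re

EMAIL_REGEX = re.compile(
    r"^\w+([\.-]?\w+)*@\w+([\.-]?\w+)*(\.\w{2,3})+$", re.IGNORECASE
)


def unsafe_regex_search(value):
    # Mirrors a UDF body that has no null check.
    return bool(re.search(EMAIL_REGEX, value))


def safe_regex_search(value):
    # Null-safe variant: treat None (a null cell) as an invalid address.
    if value is None:
        return False
    return EMAIL_REGEX.search(value) is not None


try:
    unsafe_regex_search(None)
except TypeError as exc:
    print("unsafe version failed:", type(exc).__name__)

print(safe_regex_search(None))
print(safe_regex_search("aswin.raja@gm.com"))
```

If the column does contain nulls, registering the null-safe function as the UDF, or filtering nulls out before applying it, should avoid this particular error.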
1

No need for a UDF here; just use the rlike function:

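from pyspark.sql.functions import col, lit, when

# not really the regex to validate emails but this handles your requirement
# note: it requires a dot before the @, so "aswinraja@ad.com" is flagged invalid
regex = r"""^[\w\d-_\.]+\.[\w\d-_\.]+@[\w\d]+\.[\w\d]+$"""

df.withColumn("flag", when(col("email").rlike(regex), lit("valid")).otherwise(lit("invalid")))\
  .show()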

Gives:

+---+-----+----+-----------------+-------+
|ids|first|last|            email|   flag|
+---+-----+----+-----------------+-------+
|  1|   aa| zxc|aswin.raja@gm.com|  valid|
|  2|   bb| asd|aswin.raja@gm.com|  valid|
|  3|   cc| qwe| aswinraja@ad.com|invalid|
|  4|   dd| qwe|aswin.raja@gm.com|  valid|
|  5|   ee| qws| aswinraja@ad.com|invalid|
+---+-----+----+-----------------+-------+

For a complete regex to validate email addresses, check this post.

Community
blackbishop