Check whether email column contains @ and . using pyspark

Question

I have a DataFrame that contains 8M rows. One column, named EMAIL, contains email addresses, and I have to check the following:

Email value must be of the form _@_._
Email value can only contain alphanumeric characters along with -_@.

AswinRajaram · Accepted Answer · 2020-02-05T10:27:21.087

2

In fact, there is a Python library designed for it, validate_email.

You can use the below code snippet to validate the email id.

from validate_email import validate_email
from pyspark.sql.types import BooleanType
from pyspark.sql.functions import udf

valid_email = udf(lambda x: validate_email(x), BooleanType())

# the column name must match the DataFrame; the sample output below uses 'email'
emailvalidation.withColumn('is_valid', valid_email('email')).show()

+--------------------+--------+
|               email|is_valid|
+--------------------+--------+
|aswin.raja@gm.com   |    true|
|                abc |   false|
+--------------------+--------+

Another way is to use regular expressions. You can use the below code snippet.

import re

# Raw string so the backslashes reach the regex engine unchanged
regex = r'^\w+([\.-]?\w+)*@\w+([\.-]?\w+)*(\.\w{2,3})+$'

def check(email):
    # re.search returns a match object, or None when nothing matches
    if re.search(regex, email):
        print("Valid Email")
    else:
        print("Invalid Email")

if __name__ == '__main__':
    email = "aswin.raja@gm.com"
    check(email)
    email = "aswinraja.com"
    check(email)

Output:

Valid Email
Invalid Email

edited Feb 05 '20 at 10:27

answered Feb 05 '20 at 09:06

AswinRajaram


Hi, it's giving me the error "No module named 'validate_email_address'" when I try `from validate_email_address import validate_email`. – aamirmalik124 Feb 05 '20 at 10:07
Hi @aamirmalik124, it is `validate_email` and you need to pip install the module first. – AswinRajaram Feb 05 '20 at 10:11
Is there any other way to do the same? I am using Databricks and won't be able to install any Python libraries. – aamirmalik124 Feb 05 '20 at 10:15
@aamirmalik124 You can always use regular expressions to validate email ids. I will edit my answer. – AswinRajaram Feb 05 '20 at 10:23
Yes, please edit. – aamirmalik124 Feb 05 '20 at 10:25
@aamirmalik124 I have updated it. Please check if it helps. Instead of the single value, you can pass the DataFrame variable which contains the email. – AswinRajaram Feb 05 '20 at 10:28
I have added a screenshot to the question showing exactly what I want. How can I add a DataFrame column here? – aamirmalik124 Feb 05 '20 at 11:25

score 1 · Answer 2 · answered Feb 05 '20 at 10:51


You can use the below code to validate the Email Id in your table.

from pyspark.sql.functions import col, udf
from pyspark.sql.types import BooleanType
import re

def regex_search(string):
    # Raw string so the backslashes reach the regex engine unchanged
    regex = r'^\w+([\.-]?\w+)*@\w+([\.-]?\w+)*(\.\w{2,3})+$'
    # True when the pattern matches, ignoring case
    return re.search(regex, string, re.IGNORECASE) is not None

validateEmail_udf = udf(regex_search, BooleanType())
df = df.withColumn("is_valid", validateEmail_udf(col("email")))


abhaykagalkar


The code runs fine, but when I display the resulting DataFrame it shows the error "Job aborted due to stage failure: Task 5 in stage 357.0 failed 4 times, most recent failure: Lost task 5.3 in stage 357.0 (TID 16904, 10.139.64.4, executor 29): org.apache.spark.api.python.PythonException: Traceback (most recent call last):". Can you please help? – aamirmalik124 Feb 05 '20 at 11:27
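
One possible cause, which the thread does not confirm: if the email column contains nulls, Spark passes None to the Python function, and re.search raises a TypeError inside the executor, which would surface as a PythonException like the one quoted. Below is a minimal plain-Python sketch of a null-safe variant. The names EMAIL_REGEX and regex_search_safe are hypothetical, the pattern is the one from the answer, and treating null as invalid is an assumption.

```python
import re

# Pattern copied from the answer, written as a raw string.
EMAIL_REGEX = r'^\w+([\.-]?\w+)*@\w+([\.-]?\w+)*(\.\w{2,3})+$'

def regex_search_safe(value):
    """Return True only for non-null strings that match EMAIL_REGEX.

    Hypothetical helper: mapping None to False is an assumption.
    """
    if value is None:
        return False
    return re.search(EMAIL_REGEX, value, re.IGNORECASE) is not None

# Without the guard, a null value makes re.search raise TypeError.
try:
    re.search(EMAIL_REGEX, None)
    raised = False
except TypeError:
    raised = True

print(raised)                                  # True
print(regex_search_safe(None))                 # False
print(regex_search_safe("aswin.raja@gm.com"))  # True
print(regex_search_safe("aswinraja.com"))      # False
```

If nulls do turn out to be the problem, the helper can be wrapped the same way as in the answer, with udf(regex_search_safe, BooleanType()).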

score 1 · Answer 3 · edited Jun 20 '20 at 09:12

No need for a UDF here; just use the rlike function:

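from pyspark.sql.functions import col, lit, when

# not really the regex to validate emails but this handles your requirement
# (it requires a "." before the @, so aswinraja@ad.com is flagged invalid)
regex = r"""^[\w\d-_\.]+\.[\w\d-_\.]+@[\w\d]+\.[\w\d]+$"""

df.withColumn("flag", when(col("email").rlike(regex), lit("valid")).otherwise(lit("invalid")))\
  .show()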

Gives:

+---+-----+----+-----------------+-------+
|ids|first|last|            email|   flag|
+---+-----+----+-----------------+-------+
|  1|   aa| zxc|aswin.raja@gm.com|  valid|
|  2|   bb| asd|aswin.raja@gm.com|  valid|
|  3|   cc| qwe| aswinraja@ad.com|invalid|
|  4|   dd| qwe|aswin.raja@gm.com|  valid|
|  5|   ee| qws| aswinraja@ad.com|invalid|
+---+-----+----+-----------------+-------+

For a complete regex to validate email addresses, check this post.
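
For the two rules stated in the question (the form _@_._ and only alphanumeric characters plus -_@.), a narrower pattern can be sketched as follows. This is one reading of the requirement, not a full email validator, and STRICT_EMAIL and is_valid_email are hypothetical names. It is checked here with Python's re module so it runs without a cluster; the same pattern string should be usable as col("email").rlike(STRICT_EMAIL), though that was not tested here.

```python
import re

# Hypothetical pattern for the question's rules:
#  - exactly one '@', with at least one '.' somewhere after it
#  - only letters, digits, '-', '_' and '.' on either side of the '@'
STRICT_EMAIL = r'^[A-Za-z0-9._-]+@[A-Za-z0-9_-]+(\.[A-Za-z0-9_-]+)+$'

def is_valid_email(value):
    """Return True if value is a string that satisfies STRICT_EMAIL."""
    return isinstance(value, str) and re.search(STRICT_EMAIL, value) is not None

samples = [
    "aswin.raja@gm.com",
    "aswinraja@ad.com",
    "aswinraja.com",   # no '@'
    "a b@c.com",       # space is not an allowed character
    "a@@b.com",        # more than one '@'
]
for s in samples:
    print(s, is_valid_email(s))
```

Unlike the pattern above, this one does not require a "." before the @, so aswinraja@ad.com passes.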

Can you please add a regex to validate emails? – aamirmalik124 Feb 06 '20 at 05:50
