I have a data Frame that contains 8M data. There is one column name EMAIL contains Email address I have to check:
- Email value must be of the form
_@_._ - Email value can only contain alphanumeric characters along with
-_@.
I have a data Frame that contains 8M data. There is one column name EMAIL contains Email address I have to check:
_@_._-_@. In fact, there is a Python library designed for it, validate_email.
You can use the below code snippet to validate the email id.
from validate_email import validate_email
from pyspark.sql.types import BooleanType
from pyspark.sql.functions import udf
valid_email = udf(lambda x: validate_email(x), BooleanType())
emailvalidation.withColumn('is_valid', valid_email('EmailAddress')).show()
+--------------------+--------+
| email|is_valid|
+--------------------+--------+
|aswin.raja@gm.com | true|
| abc | false|
+--------------------+--------+
Another way is to use regular expressions. You can use the below code snippet.
import re
regex = '^\w+([\.-]?\w+)*@\w+([\.-]?\w+)*(\.\w{2,3})+$'
def check(email):
if(re.search(regex,email)):
print("Valid Email")
else:
print("Invalid Email")
if __name__ == '__main__' :
email = "aswin.raja@gm.com"
check(email)
email = "aswinraja.com"
check(email)
+--------+
|Valid |
|Invalid |
+--------+
You can use the below code to validate the Email Id in your table.
from pyspark.sql.functions import udf
from pyspark.sql.types import BooleanType
import re
def regex_search(string):
regex = '^\w+([\.-]?\w+)*@\w+([\.-]?\w+)*(\.\w{2,3})+$'
if re.search(regex, string, re.IGNORECASE):
return True
return False
validateEmail_udf = udf(regex_search, BooleanType())
df = df.withColumn("is_valid",validateEmail_udf(col("email")))
No need for UDF here, just use rlike function :
# not really the regex to validate emails but this handles your requirement
r = """^[\w\d-_\.]+\.[\w\d-_\.]+@[\w\d]+\.[\w\d]+$"""
df.withColumn("flag", when(col("email").rlike(regex), lit("valid")).otherwise(lit("invalid")))\
.show()
Gives:
+---+-----+----+-----------------+-------+
|ids|first|last| email| flag|
+---+-----+----+-----------------+-------+
| 1| aa| zxc|aswin.raja@gm.com| valid|
| 2| bb| asd|aswin.raja@gm.com| valid|
| 3| cc| qwe| aswinraja@ad.com|invalid|
| 4| dd| qwe|aswin.raja@gm.com| valid|
| 5| ee| qws| aswinraja@ad.com|invalid|
+---+-----+----+-----------------+-------+
For complete regex to validate email address check this post