
I want to use PySpark to efficiently remove emoji (e.g., :-)) from 1 billion records. How can I achieve this with PySpark?

smci
william007
  • Do you mean emoji or emoticons? Those are two different things. – Ranoiaetep Jun 27 '20 at 06:45
  • Also, you should probably create a [mcve](https://stackoverflow.com/help/minimal-reproducible-example); see the references [here](https://stackoverflow.com/questions/48427185/how-to-make-good-reproducible-apache-spark-examples). – anky Jun 27 '20 at 07:41
  • This topic is super interesting, but your question is way too broad, hence off-topic for SO. To make it on-topic, can you add example data and example code? Do you a) have a list of all the emojis you might encounter, b) want a pretrained model that has a decent list, or c) want to learn them? (I've been working on this exact task recently, and I can tell you a) is manual, b) is seriously fallible, and c) is hard but doable.) – smci Jun 28 '20 at 20:57

1 Answer


Use PySpark's `regexp_replace` function.
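A minimal sketch of how `regexp_replace` might be applied here. The Unicode ranges and the emoticon pattern below are illustrative assumptions and are not exhaustive, and the helper names `strip_emoji` and `strip_emoji_local` are hypothetical. Because the question mixes emoji and ASCII emoticons such as `:-)`, the sketch handles both with separate patterns.

```python
import re

# Assumed, non-exhaustive emoji ranges; extend them to match your data.
EMOJI_PATTERN = (
    "["
    "\U0001F1E6-\U0001F1FF"  # regional indicator symbols (flags)
    "\U0001F300-\U0001FAFF"  # pictographs, emoticon block, transport, etc.
    "\u2600-\u27BF"          # miscellaneous symbols and dingbats
    "\uFE0F\u200D"           # variation selector-16 and zero-width joiner
    "]+"
)

# Assumed pattern for a few common ASCII emoticons such as :-) ;) =D
# It can also hit ordinary text (e.g. "ID:Dave"), so tune it before use.
EMOTICON_PATTERN = r"[:;=]-?[)(DPp]"


def strip_emoji(df, column):
    """Return df with emoji and emoticons removed from a string column."""
    # Imported lazily so the patterns can be checked without Spark installed.
    from pyspark.sql import functions as F

    cleaned = F.regexp_replace(F.col(column), EMOJI_PATTERN, "")
    cleaned = F.regexp_replace(cleaned, EMOTICON_PATTERN, "")
    return df.withColumn(column, cleaned)


def strip_emoji_local(text):
    """Plain-Python equivalent for trying the patterns on sample strings."""
    text = re.sub(EMOJI_PATTERN, "", text)
    return re.sub(EMOTICON_PATTERN, "", text)
```

With a DataFrame `df` that has a string column named, say, `text` (a hypothetical name), the call would be `df = strip_emoji(df, "text")`. Since `regexp_replace` is a built-in column expression, it is evaluated by Spark itself rather than in a Python UDF, which tends to matter at the scale of a billion rows.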

Hossein Torabi