7

I am writing a map method using

RDD.map(lambda line: my_method(line))

and, based on a particular condition in my_method (let's say the line starts with 'a'), I want to either return a particular value or ignore the item altogether.

For now, I return -1 when the condition is not met, and later use a separate

RDD.filter() call to remove all the -1 entries.

Is there a better way to ignore these items, such as returning null from my_method?

London guy

3 Answers

13

In a case like this, flatMap is your friend:

  1. Adjust my_method so it returns either a single-element list or an empty list (or create a wrapper, as described in "What is the equivalent to scala.util.Try in pyspark?")

    def my_method(line):
        return [line.lower()] if line.startswith("a") else []
    
  2. flatMap

    rdd = sc.parallelize(["aDSd", "CDd", "aCVED"])
    
    rdd.flatMap(lambda line: my_method(line)).collect()
    ## ['adsd', 'acved']
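
The wrapper option mentioned in step 1 could look roughly like the sketch below. The decorator name none_to_empty is hypothetical (it is not part of PySpark), and the flatMap step is imitated locally with itertools.chain so the snippet runs without a Spark context. It assumes my_method signals "skip this item" by returning None.

```python
from functools import wraps
from itertools import chain


def none_to_empty(func):
    """Hypothetical helper: turn a None result into [] and any other result into [result]."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        result = func(*args, **kwargs)
        return [] if result is None else [result]
    return wrapper


@none_to_empty
def my_method(line):
    # Return a value for lines to keep; fall through (implicit None) for lines to drop.
    if line.startswith("a"):
        return line.lower()


# Local stand-in for flatMap, used only so this sketch runs without Spark.
# With an RDD the equivalent call would be: rdd.flatMap(my_method).collect()
lines = ["aDSd", "CDd", "aCVED"]
kept = list(chain.from_iterable(my_method(line) for line in lines))
print(kept)
```

This way my_method itself can return None for unwanted items, and only the wrapper knows about the list convention that flatMap needs.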
    
Community
zero323
2

If you want to ignore the items based on some condition, then why not use filter by itself? Why use a map? If you want to transform it, you can use map on the output from filter.
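
A minimal sketch of that idea, assuming the condition and the transformation can be split into two functions (keep and transform are illustrative names, not from the question). Python's built-in filter and map stand in for the RDD methods here so the snippet runs without a Spark context.

```python
def keep(line):
    # The condition from the question: keep only lines starting with 'a'.
    return line.startswith("a")


def transform(line):
    # Whatever value my_method would have returned for a kept line.
    return line.lower()


lines = ["aDSd", "CDd", "aCVED"]

# Local stand-in for the RDD pipeline.
# With an RDD the equivalent would be: rdd.filter(keep).map(transform).collect()
result = list(map(transform, filter(keep, lines)))
print(result)
```

Filtering first means the transformation only runs on items that are actually kept, and no sentinel value such as -1 is needed.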

Dmitry Rubanovich
-1

filter is a transformation method. It is a high-cost operation because it creates a new RDD.

keeptalk