
I am building an RDD from a text file. Some of the lines do not conform to the format I am expecting, in which case I use the marker -1.

def myParser(line):
    try:
        # do something
    except:
        return (-1, -1), -1

lines = sc.textFile('path_to_file')
pairs = lines.map(myParser)

Is it possible to remove the lines with the -1 marker? If not, what would be a workaround?


1 Answer


The cleanest solution I can think of is to discard malformed lines using a flatMap:

def myParser(line):
    try:
        # do something
        return [result] # where result is the value you want to return
    except Exception:  # a bare except would also swallow KeyboardInterrupt and SystemExit
        return []

sc.textFile('path_to_file').flatMap(myParser)

See also: "What is the equivalent to scala.util.Try in pyspark?"
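To make the flatMap idea concrete, here is a minimal sketch. The line format ("a,b,value") and the name parse_line are assumptions for illustration only, since the question does not show the real parsing logic. The sample is processed locally with itertools.chain as a stand-in for flatMap, so the snippet runs without a Spark context.

```python
from itertools import chain


def parse_line(line):
    """Parse a hypothetical 'a,b,value' line into ((a, b), value).

    Returns a one-element list on success and an empty list on failure,
    so that flatMap keeps good records and drops malformed ones.
    """
    try:
        a, b, value = line.split(",")
        return [((int(a), int(b)), float(value))]
    except ValueError:
        # Wrong number of fields, or a field that is not numeric.
        return []


sample = ["1,2,3.5", "garbage", "4,5,6.0", "7,8"]

# Local stand-in for flatMap: concatenate the per-line lists.
parsed = list(chain.from_iterable(parse_line(line) for line in sample))
print(parsed)

# With Spark, the same function would be passed to flatMap:
# pairs = sc.textFile('path_to_file').flatMap(parse_line)
```

Because malformed lines produce an empty list, no sentinel value ever enters the RDD, and no later filtering step is needed.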

You can also filter after the map:

pairs = lines.map(myParser).filter(lambda x: x != ((-1, -1), -1))
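A sketch of the filter variant follows, again assuming a hypothetical "a,b,value" line format; the names SENTINEL, parse_or_sentinel and is_valid are illustrative and not from the original post. It runs on a plain list so it can be tried without Spark.

```python
SENTINEL = ((-1, -1), -1)


def parse_or_sentinel(line):
    """Parse 'a,b,value' into ((a, b), value), or return SENTINEL on failure."""
    try:
        a, b, value = line.split(",")
        return (int(a), int(b)), float(value)
    except ValueError:
        return SENTINEL


def is_valid(record):
    """True for every record except the sentinel marker."""
    return record != SENTINEL


sample = ["1,2,3.5", "garbage", "4,5,6.0"]
kept = [r for r in map(parse_or_sentinel, sample) if is_valid(r)]
print(kept)

# With Spark, the equivalent pipeline would be:
# pairs = lines.map(parse_or_sentinel).filter(is_valid)
```

One caveat with this approach: a well-formed line that happens to parse to the sentinel value (here "-1,-1,-1") is dropped as well, so the marker should be a value that cannot occur in valid data.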