
I wrote a crawler to fetch information from a Q&A website. Since not every field is present on every page, I used multiple try-except blocks to handle the missing ones.

def answerContentExtractor( loginSession, questionLinkQueue , answerContentList) :
    while True:
        URL = questionLinkQueue.get()
        try:
            response   = loginSession.get(URL,timeout = MAX_WAIT_TIME)
            raw_data   = response.text

            #These fields must exist, or something went wrong...
            questionId = re.findall(REGEX,raw_data)[0]
            answerId   = re.findall(REGEX,raw_data)[0]
            title      = re.findall(REGEX,raw_data)[0]

        except requests.exceptions.Timeout ,IndexError:
            print >> sys.stderr, URL + " extraction error..."
            questionLinkQueue.task_done()
            continue

        try:
            questionInfo = re.findall(REGEX,raw_data)[0]
        except IndexError:
            questionInfo = ""

        try:
            answerContent = re.findall(REGEX,raw_data)[0]
        except IndexError:
            answerContent = ""

        result = {
                  'questionId'   : questionId,
                  'answerId'     : answerId,
                  'title'        : title,
                  'questionInfo' : questionInfo,
                  'answerContent': answerContent
                  }

        answerContentList.append(result)
        questionLinkQueue.task_done()

This code intermittently raises the following exception at runtime:

UnboundLocalError: local variable 'IndexError' referenced before assignment

The line number in the traceback points to the second except IndexError:

Thanks, everyone, for your suggestions. I would love to give each of you the marks you deserve; too bad I can only mark one answer as correct.

Paul Liang
  • Typos; I typed it by hand to strip some unneeded lines. Already edited. – Paul Liang Feb 21 '14 at 06:43
  • Related: [multiple exceptions in one line (except block)](http://stackoverflow.com/questions/6470428/catch-multiple-exceptions-in-one-line-except-block?rq=1) – thefourtheye Feb 21 '14 at 06:50

3 Answers


I think the problem is this line:

except requests.exceptions.Timeout ,IndexError

This is equivalent to:

except requests.exceptions.Timeout  as IndexError:

So you're binding the caught requests.exceptions.Timeout exception to the name IndexError, and IndexError itself is never caught by that clause. The rebinding can be reproduced with this code:

try:
    true
except NameError, IndexError:
    print IndexError
    #name 'true' is not defined

To catch multiple exceptions use a tuple:

except (requests.exceptions.Timeout, IndexError):
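
For illustration, here is a minimal sketch of the tuple form in Python 3 syntax. The helper name first_match, its default parameter, and the sample strings are hypothetical and not part of the question's crawler; the sketch only shows one handler covering two exception types.

```python
import re

def first_match(pattern, text, default=""):
    """Return the first regex match in text, or default if there is none.

    Hypothetical helper, not taken from the original crawler.
    """
    try:
        return re.findall(pattern, text)[0]
    except (IndexError, TypeError):
        # IndexError: findall returned an empty list (no match).
        # TypeError: text was not a string, for example None.
        return default

print(first_match(r"\d+", "id=42 and 7"))
print(first_match(r"\d+", "no digits here"))
```

Because the two exception classes are wrapped in parentheses, neither name is rebound, so a later except IndexError: in the same function still refers to the built-in.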

The UnboundLocalError occurs because the assignment in that except clause makes IndexError a local variable of your function, so any attempt to read it before it has actually been assigned raises UnboundLocalError.

>>> 'IndexError' in answerContentExtractor.func_code.co_varnames
True

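So if the handler except requests.exceptions.Timeout ,IndexError: never runs (that is, no Timeout is raised), the local name IndexError is never assigned, and the later except IndexError: raises UnboundLocalError as soon as Python has to evaluate it. That also explains why the error is intermittent: it only appears when one of the optional fields is missing. Sample code to reproduce the error: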

def func():
    try:
        print
    except NameError, IndexError:
        pass
    try:
        [][1]
    except IndexError:
        pass
func()
#UnboundLocalError: local variable 'IndexError' referenced before assignment
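
The reproduction above is Python 2 only, since the comma form is a syntax error in Python 3. As a sketch, the same shadowing can be shown in Python 3 by writing the rebinding explicitly with as. The function name shadowed is hypothetical, and co_varnames is read through __code__, the Python 3 spelling of func_code.

```python
def shadowed():
    try:
        pass  # nothing is raised, so the handler below never runs
    except NameError as IndexError:  # makes IndexError a local name
        pass
    try:
        [][1]  # raises the built-in IndexError
    except IndexError:  # looks up the unassigned local, not the built-in
        pass

caught = None
try:
    shadowed()
except UnboundLocalError as exc:
    caught = exc
    print(type(exc).__name__)

# The compiler has recorded IndexError as a local of shadowed().
print('IndexError' in shadowed.__code__.co_varnames)
```

The exact wording of the error message varies between Python versions, but the exception type is UnboundLocalError in each case.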
Ashwini Chaudhary

When you say

except requests.exceptions.Timeout ,IndexError:

Python will catch only requests.exceptions.Timeout, and the caught exception object will be bound to the name IndexError. It should have been something like this:

except (requests.exceptions.Timeout ,IndexError) as e:
thefourtheye
except requests.exceptions.Timeout ,IndexError:

means the same as except requests.exceptions.Timeout as IndexError:

You should use

except (requests.exceptions.Timeout, IndexError):

instead

Kimvais