I'm fetching a URL from Schema.org. Its content type is "text/html".

Sometimes read() works as expected and returns b'<!DOCTYPE html> ....'

Sometimes read() returns something else: b'\x1f\x8b\x08\x00\x00\x00\x00 ...'

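from urllib.error import URLError
from urllib.request import urlopen

# This snippet runs inside a function, hence the bare return
try:
    with urlopen("http://schema.org/docs/releases.html") as f:
        txt = f.read()
except URLError:
    return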

I've tried solving this with txt = f.read().decode("utf-8").encode(), but this sometimes results in an error: UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte

The obvious workaround is to test whether the first byte is 0x1f (the byte the unexpected responses start with) and handle that case separately.

My question is: Is this a bug or something else?

Edit: Related question. Apparently, I'm sometimes getting a gzipped stream.

In the end I solved this by adding the following code, as proposed here:

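from zlib import decompress, MAX_WBITS

if txt[:2] == b"\x1f\x8b":  # gzip magic number
    txt = decompress(txt, 16 + MAX_WBITS)  # 16 + MAX_WBITS tells zlib to expect a gzip header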

The question remains: why does this sometimes return plain text/html and other times a gzipped body?

GUI Junkie
  • I can only reproduce the case where you receive b'<!DOCTYPE html> ...'. Seems to me like the other response you receive has something to do with your internet connection failing one way or another. – Jaap Versteegh Aug 25 '15 at 11:16
  • @SlashV I get this maybe once every 5 times – GUI Junkie Aug 25 '15 at 11:17
  • I ran the code 200 times... – Jaap Versteegh Aug 25 '15 at 11:20
  • @SlashV I'm using PyCharm, I dunno – GUI Junkie Aug 25 '15 at 11:21
  • @SlashV Added screenshot – GUI Junkie Aug 25 '15 at 11:26
  • I see your edit now. When you receive a zipped stream, you'll obviously have to unzip it first. You can probably avoid getting a zipped response by adding an "Accept" header. – Jaap Versteegh Aug 25 '15 at 11:31
  • @SlashV Cool, I'll do that. I've seen a number of questions related to zipped streams on StackOverflow. Should I delete this Q? – GUI Junkie Aug 25 '15 at 11:33
  • @SlashV Shouldn't that rather be [`Accept-Encoding`](https://tools.ietf.org/html/rfc2616#page-102)? – dhke Aug 25 '15 at 11:41
  • @dhke yup. In particular I feel that `Accept-Encoding: identity` should help – Jaap Versteegh Aug 25 '15 at 12:00
  • @SlashV Yeah. `urlopen()` doesn't specify `Accept-Encoding` at all, which the server previously [MAY](https://tools.ietf.org/html/rfc2616#section-14.3) interpret as `Accept-Encoding: *`. This has changed with [RFC7231](https://tools.ietf.org/html/rfc7231#section-5.3.4). From the question history, I'd really consider this a wiki-answer case. – dhke Aug 25 '15 at 12:06
  • I cannot get gzipped data from this URL **even if** I specify `Accept-Encoding: gzip;q=1` or `Accept-Encoding: gzip, deflate`, so it seems to me this server has some rules of its own. – Jaap Versteegh Aug 25 '15 at 15:05

2 Answers

You are indeed receiving a gzipped response. You should be able to avoid it by explicitly asking for the uncompressed ("identity") encoding:

from urllib import request
from urllib.error import URLError

try:
    req = request.Request("http://schema.org/docs/releases.html")
    # Ask the server for an uncompressed ("identity") body
    req.add_header('Accept-Encoding', 'identity;q=1')
    with request.urlopen(req) as f:
        txt = f.read()
except URLError:
    return
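
Since the comments suggest this server may follow rules of its own, a defensive fallback can still be useful. The following is only a minimal sketch, not part of the original answer: decode_body is a hypothetical helper name, and it assumes the server labels a compressed body with a Content-Encoding response header.

```python
import gzip
import zlib
from typing import Optional


def decode_body(raw: bytes, content_encoding: Optional[str]) -> bytes:
    """Undo the content-coding named in a Content-Encoding header value.

    decode_body is a hypothetical helper, not part of urllib.
    """
    # A missing header means the body was sent as-is ("identity").
    coding = (content_encoding or "identity").strip().lower()
    if coding in ("gzip", "x-gzip"):
        return gzip.decompress(raw)
    if coding == "deflate":
        # Per the HTTP spec, "deflate" means zlib-wrapped data.
        return zlib.decompress(raw)
    if coding == "identity":
        return raw
    raise ValueError(f"unsupported content-coding: {coding!r}")


# Usage sketch (needs network access, so it is left commented out):
# from urllib import request
# req = request.Request("http://schema.org/docs/releases.html")
# req.add_header("Accept-Encoding", "identity;q=1")
# with request.urlopen(req) as f:
#     txt = decode_body(f.read(), f.headers.get("Content-Encoding"))
```

Checking the header rather than the first byte of the body avoids guessing from the payload, but it only works if the server sets the header correctly.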
Jaap Versteegh

There are other questions in this category, but I cannot find an answer that addresses the actual cause of the problem.

Python's urlopen() (urllib.request.urlopen() in Python 3, urllib2.urlopen() in Python 2) cannot transparently handle compression. By default it also does not set the Accept-Encoding request header. Additionally, the HTTP standard's interpretation of this situation has changed over time.

As per RFC2616:

If no Accept-Encoding field is present in a request, the server MAY assume that the client will accept any content coding. In this case, if "identity" is one of the available content-codings, then the server SHOULD use the "identity" content-coding, unless it has additional information that a different content-coding is meaningful to the client.

Unfortunately for this use case, RFC 7231 changes this to:

If no Accept-Encoding field is in the request, any content-coding is considered acceptable by the user agent.

This means that when you perform a request using urlopen(), the server may respond with whatever content-coding it chooses, and the response will still be conformant.

schema.org seems to be hosted by Google, i.e. it is most likely behind a distributed front-end load balancer network. So the different responses you get may come from load balancers with slightly different configurations.

Google engineers have in the past advocated for the use of HTTP compression, so this might well be a conscious decision.

So as a lesson: when using urlopen() we need to set Accept-Encoding.
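
As a rough sketch of that lesson, a small helper can make sure every request states its acceptable content-codings explicitly. build_request is a hypothetical name introduced here for illustration; the URL is the one from the question.

```python
from urllib.request import Request


def build_request(url: str, accept_encoding: str = "identity") -> Request:
    """Return a Request that states its acceptable content-codings explicitly.

    build_request is a hypothetical helper, not part of urllib.
    """
    req = Request(url)
    req.add_header("Accept-Encoding", accept_encoding)
    return req


req = build_request("http://schema.org/docs/releases.html")
# Request stores header names via str.capitalize(), hence the lowercase "e".
print(req.get_header("Accept-encoding"))
```

The returned object can be passed to urlopen() in place of the bare URL string.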

dhke