
HTTPError when using urllib2 read()

I'm trying to scrape a web page using urllib2 and BeautifulSoup. It was working fine and then when I put in an input() in a different part of my code to try and debug something, I

Solution 1:

Changing my comment into an answer:

The page you're scraping most likely responded with a 4xx status, and urllib2 raises an HTTPError when that happens, as the docs say it does. It is your job to catch that exception and do something useful with it: log it, retry, or what have you. Your traceback doesn't show the code or reason for the HTTPError, but both are there. Look at the error's 'code' and 'reason' attributes.
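A minimal sketch of catching the exception and reading those attributes. The function name `fetch` and its `opener` parameter are made up for illustration (the parameter only exists so the function can be exercised without a network), and the import fallback is there so the same code runs where urllib2 has been split into urllib.request and urllib.error:

```python
try:  # Python 2
    from urllib2 import urlopen, HTTPError, URLError
except ImportError:  # Python 3: urllib2 was split into two modules
    from urllib.request import urlopen
    from urllib.error import HTTPError, URLError


def fetch(url, opener=urlopen):
    """Return the response body, or None if the request failed.

    `fetch` and `opener` are hypothetical names, not part of urllib2.
    """
    try:
        response = opener(url)
        return response.read()
    except HTTPError as err:
        # The server answered, but with an error status (e.g. 403, 404).
        # HTTPError must be caught before URLError, which is its base class.
        print("HTTP error %s (%s) for %s" % (err.code, err.reason, url))
        return None
    except URLError as err:
        # No HTTP response at all (DNS failure, refused connection, ...).
        print("Could not reach %s: %s" % (url, err.reason))
        return None
```

Returning None is just one way to handle the failure; depending on the scraper you might rather log and re-raise, or retry after a delay.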

Editorial: it is possible that the website you were scraping figured out that you're a robot. You might want to take a moment to rewrite your scraper around a more server-friendly library with a vastly better API. urllib2 is fine for one-off tasks, but it has numerous shortcomings that I won't get into here. Libraries worth a look are requests, mechanize, and maybe httplib2. Each has upsides and downsides, so I can't tell you which one is right for your needs.

You may also want to check which User-Agent header you're sending with your requests: if you self-identify as a robot, the server may well treat you as one.
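By default urllib2 announces itself with a User-Agent of the form "Python-urllib/x.y". A rough sketch of overriding it follows; `build_request` and the example agent string are assumptions for illustration, and whether a different header helps depends entirely on the site and its terms of use:

```python
try:  # Python 2
    from urllib2 import Request
except ImportError:  # Python 3
    from urllib.request import Request


def build_request(url, user_agent):
    """Build a Request carrying an explicit User-Agent header.

    `build_request` is a hypothetical helper, not part of urllib2.
    """
    return Request(url, headers={"User-Agent": user_agent})


# The agent string is a placeholder; pick one that honestly describes your client.
req = build_request("http://example.com/", "my-scraper/0.1")
# Request normalises header names, so the key is stored as "User-agent".
print(req.get_header("User-agent"))
```

The resulting `req` can be passed to `urlopen` in place of the bare URL.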
