Website Blocks Python Crawler. Searching For Idea To Avoid
Solution 1:
There are many methods websites can use for bot detection. They can be grouped into the following list:
Headers validation. This is the most widespread basic-level validation: the server checks HTTP request headers for missing, default, fake, or corrupted values. For example, the default User-Agent in python requests starts with python-requests/, which is easy to detect on the backend; as a result your client will be flagged as a bot and get an "error" response.
Solution: Sniff the same request from a browser (you can use Fiddler) and clone the headers from the browser. In python requests it can be done with the following code:
    import requests

    # Use a User-Agent cloned from a real browser instead of the requests default
    headers = {
        "User-Agent": "Some User-Agent"
    }
    response = requests.get(url, headers=headers)
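A fuller set of headers cloned from a browser capture might look like the sketch below; the header values here are only illustrative placeholders and should be replaced with whatever your own Fiddler capture shows:

    import requests

    # Illustrative headers copied from a hypothetical browser capture;
    # replace the values with the ones your own browser actually sends.
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                      "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.5",
        "Referer": "https://example.com/",
    }
    response = requests.get("https://example.com/page", headers=headers)
    print(response.status_code)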
Cookies validation. Yes, Cookie is also an HTTP header, but the validation method differs from the previous one: the server checks the Cookie header and validates each cookie.
Solution:
1) Sniff all requests made by the browser;
2) Check the request you're trying to repeat and take a look at its Cookie header;
3) Search for the value of each cookie in the previous requests;
4) Repeat each request which sets those cookie(s) before the main request, to collect all required cookies.
In python requests you don't need to collect the cookies manually, just use a session:

    import requests

    http_session = requests.Session()
    http_session.get(url_to_get_cookie)  # cookies will be stored inside the "http_session" object
    response = http_session.get(final_url)
IP address or provider validation. The website can check whether your IP address or provider is listed in spam databases, which is likely if you're using public proxies or a VPN.
Solution: Try other proxies or change your VPN.
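In python requests, switching to a different proxy only takes a proxies mapping. A minimal sketch, assuming a hypothetical proxy address and credentials:

    import requests

    # Hypothetical proxy address/credentials; replace with a proxy you actually use
    proxies = {
        "http": "http://user:password@proxy.example.com:8080",
        "https": "http://user:password@proxy.example.com:8080",
    }
    response = requests.get("https://example.com/", proxies=proxies)
    print(response.status_code)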
Of course, this is an oversimplified guide which doesn't cover JavaScript generation of headers/tokens, "control" requests, WebSockets, etc. But, in my opinion, it can be helpful as an entry-level guide that points someone toward where to look.