
Website Blocks Python Crawler. Searching For Idea To Avoid

I want to crawl data from listing pages on https://www.fewo-direkt.de (in the US, https://www.homeaway.com/), like this one: https://www.fewo-direkt.de/ferienwohnung-ferienhaus/p8735326 But

Solution 1:

There are many methods websites can use for bot detection. They can be grouped as follows:

  1. Header validation. This is the most widespread, basic-level check: the server inspects the HTTP request headers for missing, default, fake, or corrupted values.

    E.g. the default User-Agent in python requests starts with python-requests/, which is easy to check on the backend; as a result, your client is flagged as a bot and gets an "error" response.

    Solution: sniff the same request from a browser (you can use Fiddler) and clone the browser's headers. In python requests this looks like:

    import requests

    headers = {
        "User-Agent": "Some User-Agent"  # replace with a real browser User-Agent
    }
    response = requests.get(url, headers=headers)
    
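A fuller sketch of the header-cloning step above might look like this. The header values are purely illustrative placeholders; sniff your own browser's request and copy its exact headers. The example uses `requests.Request(...).prepare()` so you can inspect what would go on the wire without actually sending anything:

```python
import requests

# Illustrative header set cloned from a desktop browser session -- sniff your
# own browser's request (e.g. with Fiddler) and copy its exact values.
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/120.0.0.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "de-DE,de;q=0.9,en;q=0.8",
    "Referer": "https://www.fewo-direkt.de/",
}

# Build the request without sending it, to inspect exactly what gets sent
req = requests.Request("GET", "https://www.fewo-direkt.de/", headers=headers)
prepared = req.prepare()
print(prepared.headers["User-Agent"])  # no python-requests/ marker any more
```

To actually send it, pass the same headers to `requests.get(url, headers=headers)` as shown above.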
  2. Cookie validation. Yes, Cookie is also an HTTP header, but the validation method differs from the previous one: the server checks the Cookie header and validates each cookie.

    Solution:

    1) Sniff all requests made by the browser;

    2) Check the request you're trying to repeat and take a look at its Cookie header;

    3) Search for the values of each cookie in the previous responses (Set-Cookie headers);

    4) Repeat each request that sets those cookie(s) before the main request, to collect all required cookies.

    In python requests you don't need to manage cookies manually, just use a session:

    http_session = requests.Session()
    http_session.get(url_to_get_cookie)  # cookies will be stored inside the "http_session" object
    response = http_session.get(final_url)
    
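As a sketch of what the session does behind the scenes: every Set-Cookie from a response lands in the session's cookie jar and is re-sent automatically on later requests. Here the cookie is set by hand (name, value, and domain are made-up placeholders) only so the behaviour can be shown without a live request:

```python
import requests

http_session = requests.Session()

# Simulate a server-set cookie; in real use, http_session.get(url) fills the
# jar from the response's Set-Cookie headers automatically.
http_session.cookies.set("session_id", "abc123", domain="example.com")

# Every later request through this session carries the stored cookies
print(dict(http_session.cookies))  # {'session_id': 'abc123'}
```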
  3. IP address or provider validation. The website can check that your IP address and provider are not listed in spam databases. This is likely to trigger if you're using public proxies or a VPN.

    Solution: try different proxies or change your VPN server.

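In python requests, routing traffic through a proxy can be sketched like this. The proxy endpoint and credentials are placeholders; substitute an exit that is not on public spam blocklists (free public proxies usually are):

```python
import requests

# Placeholder proxy endpoint and credentials -- substitute your own
proxies = {
    "http": "http://user:password@proxy.example.com:8080",
    "https": "http://user:password@proxy.example.com:8080",
}

session = requests.Session()
session.proxies.update(proxies)  # every request via this session uses the proxy
# session.get("https://www.fewo-direkt.de/...")  # would be sent through the proxy
```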
Of course, this is an oversimplified guide that doesn't cover JavaScript-generated headers/tokens, "control" requests, WebSocket traffic, etc. But, in my opinion, it can be helpful as an entry-level guide that points someone in the right direction.
