Scrapy Doesn't Crawl All Pages
Solution 1:
1."They have suggested to use dont_filter = True however, I have no idea of where to put it in my code."
This argument is handled in BaseSpider (scrapy/spider.py), which CrawlSpider inherits from, and it is already set to True by default for the requests built from start_urls, so you do not need to add it yourself.
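For reference, below is a simplified sketch of the relevant part of scrapy/spider.py (paraphrased; the exact code depends on your Scrapy version). The point is that every start URL is already requested with dont_filter=True:

    from scrapy.http import Request

    class BaseSpider(object):

        def start_requests(self):
            # Every start URL is turned into a Request here ...
            for url in self.start_urls:
                yield self.make_requests_from_url(url)

        def make_requests_from_url(self, url):
            # ... and dont_filter=True is already passed, so the
            # start URLs are never dropped by the duplicate filter.
            return Request(url, dont_filter=True)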
2."It can scrap only 10 or 15 records."
Reason: The start_urls is not a good choice. In this problem the spider starts crawling at http://www.khmer24.com/. Assume it finds 10 URLs there that match the pattern; it then goes on to crawl those 10 pages. But because those pages themselves contain very few (or even no) links matching the pattern, the spider quickly runs out of URLs to follow and the crawl stops.
Possible solution: The reasoning above simply restates icecrime's opinion, and so does the solution.
Use the 'All ads' page as start_urls. (You could also keep the home page as start_urls and use the new rules; a full spider sketch follows the rules below.)
New rules:
    rules = (
        # Extract all links and follow links from them
        # (since no callback means follow=True by default)
        # (If "allow" is not given, it will match all links.)
        Rule(SgmlLinkExtractor()),

        # Extract links matching the "ad/any-words/67-anynumber.html" pattern
        # and parse them with the spider's method parse_item (NOT FOLLOW THEM)
        Rule(SgmlLinkExtractor(allow=r'ad/.+/67-\d+\.html'), callback='parse_item'),
    )
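For completeness, here is a minimal sketch of how these rules could sit in a whole spider, using the home page as start_urls as mentioned above. The spider name, allowed_domains, the AdItem fields, and the parse_item body are assumptions for illustration only; adapt them to your actual item and XPaths:

    from scrapy.item import Item, Field
    from scrapy.selector import HtmlXPathSelector
    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor


    class AdItem(Item):
        # Hypothetical item with a single field, just for illustration
        title = Field()


    class Khmer24Spider(CrawlSpider):
        # Name and domains are assumed, not taken from the question
        name = 'khmer24'
        allowed_domains = ['khmer24.com']
        start_urls = ['http://www.khmer24.com/']

        rules = (
            # Follow every link found (no callback means follow=True)
            Rule(SgmlLinkExtractor()),
            # Parse ad pages with parse_item, but do not follow links from them
            Rule(SgmlLinkExtractor(allow=r'ad/.+/67-\d+\.html'),
                 callback='parse_item'),
        )

        def parse_item(self, response):
            # Placeholder extraction logic; replace with the real XPaths
            hxs = HtmlXPathSelector(response)
            item = AdItem()
            item['title'] = hxs.select('//title/text()').extract()
            return item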
Refer to: SgmlLinkExtractor, CrawlSpider example