Scrapy And Celery `update_state`
Solution 1:
I'm not sure how you are firing your spiders, but I've faced the same issue you describe.
My setup is Flask as a REST API which, upon request, fires Celery tasks that start spiders. I haven't gotten around to coding it yet, but I'll tell you what I was thinking of doing:
```python
from scrapy.utils.project import get_project_settings
from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from scrapy import signals

from .your_celery import app


@app.task(bind=True)
def scraping(self):
    def my_item_scraped_handler(item, response, spider):
        meta = {
            # fill your state meta as required based on the scraped item,
            # spider, or response object passed as parameters
        }
        # here self refers to the task, so you can call update_state when using bind=True
        self.update_state(state='PROGRESS', meta=meta)

    settings = get_project_settings()
    configure_logging({'LOG_FORMAT': '%(levelname)s: %(message)s'})
    runner = CrawlerRunner(settings)

    d = runner.crawl(MySpider)  # import MySpider from your project's spiders module
    d.addBoth(lambda _: reactor.stop())

    for crawler in runner.crawlers:
        crawler.signals.connect(my_item_scraped_handler, signal=signals.item_scraped)

    reactor.run()
```
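Since I can't run this end to end yet either, here is a broker-free sketch of the PROGRESS-state contract the handler above relies on. `FakeBackend` and its methods are illustrative stand-ins I made up for this sketch, not Celery APIs; in the real setup the caller would poll the task's result backend instead:

```python
# Broker-free sketch of the PROGRESS-state contract: the task pushes
# {'state', 'meta'} updates per scraped item, and a caller polls them.
class FakeBackend:
    """Stand-in for Celery's result backend (illustrative only)."""

    def __init__(self):
        self._states = {}

    def update_state(self, task_id, state, meta):
        # what self.update_state(state=..., meta=...) does inside the task
        self._states[task_id] = {'state': state, 'meta': meta}

    def poll(self, task_id):
        # what the Flask side would do to report progress to its client
        return self._states.get(task_id, {'state': 'PENDING', 'meta': {}})


backend = FakeBackend()

# the item_scraped handler would do this once per scraped item:
for items_done in range(1, 4):
    backend.update_state('task-1', 'PROGRESS', {'items_scraped': items_done})

print(backend.poll('task-1'))  # {'state': 'PROGRESS', 'meta': {'items_scraped': 3}}
```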
I'm sorry I can't confirm that it works, but as soon as I get around to testing it I'll report back here! I currently can't dedicate as much time as I'd like to this project :(
Do not hesitate to contact me if you think I can help you any further!
Cheers, Ramiro
Sources:
- CrawlerRunner crawlers method: https://doc.scrapy.org/en/latest/topics/api.html#scrapy.crawler.CrawlerRunner.crawlers
- Celery tasks docs:
- Scrapy signals: https://doc.scrapy.org/en/latest/topics/signals.html#signals
- Running scrapy as scripts: https://doc.scrapy.org/en/latest/topics/practices.html#run-scrapy-from-a-script
Solution 2:
We'll need a lot more information to answer this.
How are you using celery with Scrapy? Is scrapy running inside of a celery task?
I would strongly suggest running Scrapy under its own server, scrapyd, if that makes sense for your project.
If not, then yes, the item_scraped signal would be good, but only if you have access to the Celery task id or the Celery Task object itself: http://docs.celeryproject.org/en/latest/reference/celery.app.task.html
From the item_scraped signal handler, issue Task.update_state(task_id, meta={}). You can also run without the task id if Scrapy happens to be running inside a Celery task itself (as the task id then defaults to that of the current task).
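As a Celery-free sketch of that last point, here is how a known task id can be bound into an item_scraped-style handler via a closure so the handler can report progress even when it doesn't run inside the task. `report_progress` and `fake_update_state` are stand-ins for Celery's `Task.update_state(task_id=..., ...)`, not real APIs:

```python
# Sketch: capture a task_id in a closure so a Scrapy signal handler can
# report per-item progress against a task it doesn't run inside of.
def make_item_scraped_handler(task_id, report_progress):
    count = {'items': 0}

    def handler(item, response=None, spider=None):
        count['items'] += 1
        # in real code: MyTask.update_state(task_id=task_id, state='PROGRESS', meta=...)
        report_progress(task_id, state='PROGRESS',
                        meta={'items_scraped': count['items']})

    return handler


# demo with a plain dict standing in for the result backend
states = {}

def fake_update_state(task_id, state, meta):
    states[task_id] = (state, meta)

handler = make_item_scraped_handler('abc123', fake_update_state)
handler({'title': 'x'})
handler({'title': 'y'})
print(states['abc123'])  # ('PROGRESS', {'items_scraped': 2})
```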