Sandro Wiggers
Scrapy with Celery not running in a Docker container
I wrote a script that runs a Scrapy crawler from code. It also uses Celery + RabbitMQ to control which URLs get scraped: the update.py script sends the URLs to RabbitMQ, and the Celery worker runs the Scrapy crawl.
When debugging in my IDE, everything runs successfully. However, when I try to run it inside a Docker container, the crawl doesn't run. I have already double-checked the network settings in docker-compose.yaml and everything looks correct. I've been working on and debugging this for days without any different result.
update.py
...
for key in urls.keys():
    site_name = key
    links = urls[key]
    logger.info(f'Insert into queue: {key} | {len(links)} records')
    crawler_task.delay({'site_name': site_name, 'links': links})
...
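For context, the urls mapping consumed by this loop is keyed by site name, and each value is a list of dicts holding a listing_id and a link (the site name and values below are illustrative, not real data):

urls = {
    'some_site': [
        {'listing_id': 1, 'link': 'https://example.com/listing/1'},
        {'listing_id': 2, 'link': 'https://example.com/listing/2'},
    ],
}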
app.py (Celery worker and setup)
import logging
import os

from billiard.context import Process
from celery import Celery
from scrapy import signals
from scrapy.crawler import Crawler
from scrapy.settings import Settings
from twisted.internet import reactor

from enums import sites

logger = logging.getLogger(__name__)


def get_broker():
    rabbitmq_user = os.getenv('RABBITMQ_USER')
    rabbitmq_password = os.getenv('RABBITMQ_PASSWORD')
    rabbitmq_host = os.getenv('RABBITMQ_HOST')
    rabbitmq_port = os.getenv('RABBITMQ_PORT')
    return f'amqp://{rabbitmq_user}:{rabbitmq_password}@{rabbitmq_host}:{rabbitmq_port}'


app = Celery('app', broker=get_broker())


class CrawlerScript(Process):
    def __init__(self, params):
        Process.__init__(self)
        settings = Settings()
        os.environ['SCRAPY_SETTINGS_MODULE'] = 'crawler.settings'
        settings_module_path = os.environ['SCRAPY_SETTINGS_MODULE']
        settings.setmodule(settings_module_path)
        links = params.get('links')
        site_name = params.get('site_name')
        site = sites.get(site_name)
        if site:
            self.spider = site.get('spider')
            spider_settings = site.get('settings')
            meta = site.get('meta')
            if spider_settings:
                settings.setdict(spider_settings)
            self.spider.urls = [{'listing_id': link.get('listing_id'), 'link': link.get('link')} for link in links]
            self.spider.meta = meta
            self.crawler = Crawler(self.spider, settings)
            self.crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
        else:
            logger.error(f'No site found: {site_name}')

    def run(self):
        self.crawler.crawl(self.spider())
        reactor.run()


@app.task(soft_time_limit=30, time_limit=60)
def crawler_task(params):
    crawler = CrawlerScript(params)
    crawler.start()
    crawler.join()


if __name__ == '__main__':
    app.start()
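The enums.sites mapping that CrawlerScript looks up is a dict keyed by site name, where each entry carries the spider class plus optional per-site settings and request meta. A rough sketch of its shape (SomeSiteSpider and the values are placeholders, not the real module contents):

# enums/sites.py (simplified sketch)
from crawler.spiders.some_site import SomeSiteSpider  # placeholder spider

sites = {
    'some_site': {
        'spider': SomeSiteSpider,           # spider class, instantiated in CrawlerScript.run()
        'settings': {'DOWNLOAD_DELAY': 1},  # optional overrides applied via settings.setdict()
        'meta': {},                         # attached to every request as meta
    },
}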
base.py (generic base class for the spiders)
import scrapy

from crawler.items import Item
from enums.listing_status import ListingStatusEnum


class BaseSpider(scrapy.Spider):
    download_timeout = 30

    def __init__(self, name=None, **kwargs):
        super().__init__(name, **kwargs)

    def start_requests(self):
        for url in self.urls:
            print(f"START REQUESTS: {url.get('link')}")  # This gets printed while running in the IDE
            yield scrapy.Request(url.get('link'),
                                 callback=self.parse,
                                 errback=self.errback,
                                 cb_kwargs=dict(listing_id=url.get('listing_id')),
                                 meta=self.meta,
                                 dont_filter=True)

    def errback(self, failure):
        self.logger.error('ERROR CALLBACK: %s', repr(failure))
        listing_id = failure.request.cb_kwargs['listing_id']
        status = ListingStatusEnum.URL_NOT_FOUND.value
        yield Item(listing_id=listing_id, name=None, price=None, status=status)
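For completeness, the Item yielded above carries at least the fields referenced in errback. A minimal sketch of crawler/items.py under that assumption (only those fields shown):

# crawler/items.py (sketch; only the fields referenced in base.py)
import scrapy


class Item(scrapy.Item):
    listing_id = scrapy.Field()
    name = scrapy.Field()
    price = scrapy.Field()
    status = scrapy.Field()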
docker-compose.yaml
version: "3.9"

services:
  rabbitmq:
    image: rabbitmq:3.6.16-management-alpine
    container_name: "rabbitmq"
    restart: unless-stopped
    environment:
      RABBITMQ_DEFAULT_USER: "${RABBITMQ_USER}"
      RABBITMQ_DEFAULT_PASS: "${RABBITMQ_PASSWORD}"
    ports:
      - "${RABBITMQ_PORT}:${RABBITMQ_PORT}"
      - "1${RABBITMQ_PORT}:1${RABBITMQ_PORT}"
    volumes:
      - ./.docker/rabbitmq/data/:/var/lib/rabbitmq/
      - ./.docker/rabbitmq/log/:/var/log/rabbitmq
    deploy:
      resources:
        limits:
          cpus: "1"
          memory: 1G
        reservations:
          memory: 512M

  crawler:
    build:
      context: .
      dockerfile: Dockerfile
    container_name: crawler
    command: bash -c "celery -A app worker --pool=threads --loglevel=INFO --concurrency=1 -n worker@%n"
    restart: unless-stopped
    environment:
      RABBITMQ_USER: "${RABBITMQ_USER}"
      RABBITMQ_PASSWORD: "${RABBITMQ_PASSWORD}"
      RABBITMQ_HOST: "${RABBITMQ_HOST}"
      RABBITMQ_PORT: "${RABBITMQ_PORT}"
    network_mode: host
    depends_on:
      - rabbitmq
    deploy:
      resources:
        limits:
          cpus: "4"
          memory: 3G
        reservations:
          memory: 512M
    logging:
      options:
        max-size: "1G"
        max-file: "30"
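Since get_broker() builds the broker URL straight from these environment variables, a quick sanity check I can run inside the crawler container is to print the URL the worker will actually use (just an illustration, reusing get_broker from app.py):

# e.g. docker exec -it crawler python -c "from app import get_broker; print(get_broker())"
from app import get_broker

print(get_broker())  # expected shape: amqp://<user>:<password>@<host>:<port>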
When running in the IDE, the crawl starts normally.
Output when running in the IDE:
[2022-04-07 08:53:54,330: INFO/MainProcess] Task app.crawler_task[e3b8401a-a6a7-4c1d-ac61-426cb8c1ccd0] received
[2022-04-07 08:54:04,732: INFO/MainProcess] Overridden settings:
{'BOT_NAME': 'CRAWLER',
'LOG_ENABLED': False,
'LOG_LEVEL': 'INFO',
'NEWSPIDER_MODULE': 'crawler.spiders',
'SPIDER_MODULES': ['crawler.spiders'],
'USER_AGENT': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, '
'like Gecko) Chrome/92.0.4515.131 Safari/537.36'}
[2022-04-07 08:54:04,762: INFO/MainProcess] Telnet Password: 5e1278a7d809124a
[2022-04-07 08:54:04,794: INFO/MainProcess] Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.logstats.LogStats']
[2022-04-07 08:54:15,748: INFO/CrawlerScript-1] Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
[2022-04-07 08:54:15,757: INFO/CrawlerScript-1] Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
[2022-04-07 08:54:16,095: INFO/CrawlerScript-1] Enabled item pipelines:
['crawler.pipelines.PricingCrawlerPipeline']
[2022-04-07 08:54:16,095: INFO/CrawlerScript-1] Spider opened
[2022-04-07 08:54:16,996: INFO/CrawlerScript-1] Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
[2022-04-07 08:54:17,000: INFO/CrawlerScript-1] Telnet console listening on 127.0.0.1:6023
(The crawl is running here: these are the START REQUESTS prints from base.py)
[2022-04-07 08:54:18,831: WARNING/CrawlerScript-1] START REQUESTS: https://xxx
[2022-04-07 08:54:18,872: WARNING/CrawlerScript-1] START REQUESTS: https://xxx
[2022-04-07 08:54:18,878: WARNING/CrawlerScript-1] START REQUESTS: https://xxx
[2022-04-07 08:54:18,884: WARNING/CrawlerScript-1] START REQUESTS: https://xxx
[2022-04-07 08:54:18,889: WARNING/CrawlerScript-1] START REQUESTS: https://xxx
[2022-04-07 08:54:18,894: WARNING/CrawlerScript-1] START REQUESTS: https://xxx
...
[2022-04-07 08:54:26,029: INFO/CrawlerScript-1] Closing spider (finished)
[2022-04-07 08:54:26,037: INFO/CrawlerScript-1] Dumping Scrapy stats:
{'downloader/request_bytes': 6401,
'downloader/request_count': 15,
'downloader/request_method_count/GET': 15,
'downloader/response_bytes': 1443579,
'downloader/response_count': 15,
'downloader/response_status_count/200': 14,
'downloader/response_status_count/302': 1,
'elapsed_time_seconds': 9.036862,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2022, 4, 7, 11, 54, 26, 33066),
'httpcompression/response_bytes': 7165556,
'httpcompression/response_count': 14,
'item_scraped_count': 14,
'log_count/INFO': 10,
'log_count/WARNING': 14,
'memusage/max': 96260096,
'memusage/startup': 96260096,
'response_received_count': 14,
'scheduler/dequeued': 15,
'scheduler/dequeued/memory': 15,
'scheduler/enqueued': 15,
'scheduler/enqueued/memory': 15,
'start_time': datetime.datetime(2022, 4, 7, 11, 54, 16, 996204)}
[2022-04-07 08:54:26,038: INFO/CrawlerScript-1] Spider closed (finished)
[2022-04-07 08:54:26,070: INFO/MainProcess] Task app.crawler_task[e3b8401a-a6a7-4c1d-ac61-426cb8c1ccd0] succeeded in 31.737013949023094s: None
When running in Docker, the crawl doesn't start.
Output when running in the Docker container:
crawler | [2022-04-07 12:18:33,000: INFO/MainProcess] Task app.crawler_task[2d05036b-ae92-488b-b7de-a6213905af48] received
crawler | [2022-04-07 12:18:33,009: INFO/MainProcess] Overridden settings:
crawler | {'BOT_NAME': 'CRAWLER',
crawler | 'LOG_ENABLED': False,
crawler | 'LOG_LEVEL': 'INFO',
crawler | 'NEWSPIDER_MODULE': 'crawler.spiders',
crawler | 'SPIDER_MODULES': ['crawler.spiders'],
crawler | 'USER_AGENT': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, '
crawler | 'like Gecko) Chrome/92.0.4515.131 Safari/537.36'}
crawler | [2022-04-07 12:18:33,027: INFO/MainProcess] Telnet Password: f64b27b6d4457920
crawler | [2022-04-07 12:18:33,081: INFO/MainProcess] Enabled extensions:
crawler | ['scrapy.extensions.corestats.CoreStats',
crawler | 'scrapy.extensions.telnet.TelnetConsole',
crawler | 'scrapy.extensions.memusage.MemoryUsage',
crawler | 'scrapy.extensions.logstats.LogStats']
crawler | [2022-04-07 12:18:33,159: INFO/CrawlerScript-1] Enabled downloader middlewares:
crawler | ['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
crawler | 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
crawler | 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
crawler | 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
crawler | 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
crawler | 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
crawler | 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
crawler | 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
crawler | 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
crawler | 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
crawler | 'scrapy.downloadermiddlewares.stats.DownloaderStats']
crawler | [2022-04-07 12:18:33,164: INFO/CrawlerScript-1] Enabled spider middlewares:
crawler | ['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
crawler | 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
crawler | 'scrapy.spidermiddlewares.referer.RefererMiddleware',
crawler | 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
crawler | 'scrapy.spidermiddlewares.depth.DepthMiddleware']
And nothing more happens.
I'm running on Python 3.8 with the following dependencies:
requirements.txt
scrapy==2.6.1
celery==5.2.3
billiard==3.6.4.0
Could it be something related to the Twisted reactor when running inside Docker?
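A minimal smoke test I could run inside the container to check that hypothesis might look like this (a sketch only, using the same billiard Process the worker uses; reactor_check.py is a hypothetical helper, not part of the project):

# reactor_check.py (hypothetical): start a Twisted reactor inside a billiard
# Process, schedule an immediate stop, and report whether it ran to completion.
from billiard.context import Process
from twisted.internet import reactor


def _run():
    reactor.callLater(0, reactor.stop)  # stop as soon as the reactor starts
    reactor.run()
    print('reactor started and stopped cleanly')


if __name__ == '__main__':
    p = Process(target=_run)
    p.start()
    p.join(timeout=10)
    print('exit code:', p.exitcode)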
Any ideas why the crawl doesn't start in the container?
python, docker, scrapy, celery, reactor