Scrapy with Celery not running on Docker container

I wrote a Scrapy CrawlerProcess that runs from a script. It also uses Celery + RabbitMQ to control the URLs to be scraped.

The update.py script sends the URLs to RabbitMQ, and the Celery worker runs the Scrapy crawl.

When debugging in my IDE, it runs successfully. However, when I try to run it inside a Docker container, the script doesn't run.

I have already double-checked the network settings in docker-compose.yaml and everything looks correct. I have been working on and debugging this for days without any change in the result.
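
To double-check broker connectivity from inside the container (independently of Celery), something like the sketch below can be used. It relies on kombu, which Celery already depends on, and reads the same environment variables as get_broker() in app.py further down.

# connectivity_check.py - minimal sketch: verify that the AMQP broker is
# reachable from inside the crawler container.
import os

from kombu import Connection

broker_url = (
    f"amqp://{os.getenv('RABBITMQ_USER')}:{os.getenv('RABBITMQ_PASSWORD')}"
    f"@{os.getenv('RABBITMQ_HOST')}:{os.getenv('RABBITMQ_PORT')}"
)

with Connection(broker_url, connect_timeout=5) as conn:
    conn.ensure_connection(max_retries=3)  # raises an error if the broker is unreachable
    print(f'Broker reachable at {os.getenv("RABBITMQ_HOST")}:{os.getenv("RABBITMQ_PORT")}')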

update.py

...
for key in urls.keys():
    site_name = key
    links = urls[key]
    logger.info(f'Insert into queue: {key} | {len(links)} records')
    crawler_task.delay({'site_name': site_name, 'links': links})
...
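
For reference, urls is a dict mapping a site name to its list of link records; the values below are placeholders just to illustrate the assumed shape:

# Assumed shape of the urls mapping (placeholder values).
urls = {
    'some_site': [  # hypothetical key; must match a key in enums.sites
        {'listing_id': 1, 'link': 'https://xxx/listing/1'},
        {'listing_id': 2, 'link': 'https://xxx/listing/2'},
    ],
}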

app.py (Celery worker and setup)

import logging
import os

from billiard.context import Process
from celery import Celery
from scrapy import signals
from scrapy.crawler import Crawler
from scrapy.settings import Settings
from twisted.internet import reactor

from enums import sites

logger = logging.getLogger(__name__)


def get_broker():
    rabbitmq_user = os.getenv('RABBITMQ_USER')
    rabbitmq_password = os.getenv('RABBITMQ_PASSWORD')
    rabbitmq_host = os.getenv('RABBITMQ_HOST')
    rabbitmq_port = os.getenv('RABBITMQ_PORT')

    return f'amqp://{rabbitmq_user}:{rabbitmq_password}@{rabbitmq_host}:{rabbitmq_port}'


app = Celery('app', broker=get_broker())


class CrawlerScript(Process):
    def __init__(self, params):
        Process.__init__(self)
        settings = Settings()

        os.environ['SCRAPY_SETTINGS_MODULE'] = 'crawler.settings'
        settings_module_path = os.environ['SCRAPY_SETTINGS_MODULE']
        settings.setmodule(settings_module_path)

        links = params.get('links')
        site_name = params.get('site_name')

        site = sites.get(site_name)

        if site:
            self.spider = site.get('spider')
            spider_settings = site.get('settings')
            meta = site.get('meta')

            if spider_settings:
                settings.setdict(spider_settings)

            self.spider.urls = [{'listing_id': link.get('listing_id'), 'link': link.get('link')} for link in links]
            self.spider.meta = meta

            self.crawler = Crawler(self.spider, settings)
            self.crawler.signals.connect(reactor.stop, signal=signals.spider_closed)

        else:
            logger.error(f'No site found: {site_name}')

    def run(self):
        self.crawler.crawl(self.spider())
        reactor.run()


@app.task(soft_time_limit=30, time_limit=60)
def crawler_task(params):
    crawler = CrawlerScript(params)
    crawler.start()
    crawler.join()


if __name__ == '__main__':
    app.start()
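
For completeness: with hypothetical values RABBITMQ_USER=guest, RABBITMQ_PASSWORD=guest, RABBITMQ_HOST=rabbitmq and RABBITMQ_PORT=5672, get_broker() returns 'amqp://guest:guest@rabbitmq:5672'. The task itself can also be exercised without going through the broker by running it eagerly, for example (placeholder data; 'some_site' is a hypothetical key in enums.sites):

# Sketch: run the task synchronously in the current process, bypassing RabbitMQ.
params = {
    'site_name': 'some_site',  # hypothetical key in enums.sites
    'links': [{'listing_id': 1, 'link': 'https://xxx/listing/1'}],
}
crawler_task.apply(args=[params])  # executes crawler_task eagerly, no worker needed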

base.py (generic base class for the spiders)

import scrapy

from crawler.items import Item
from enums.listing_status import ListingStatusEnum


class BaseSpider(scrapy.Spider):
    download_timeout = 30

    def __init__(self, name=None, **kwargs):
        super().__init__(name, **kwargs)

    def start_requests(self):
        for url in self.urls:
            print(f"START REQUESTS: {url.get('link')}") # This get printed while running on IDE
            yield scrapy.Request(url.get('link'),
                                 callback=self.parse,
                                 errback=self.errback,
                                 cb_kwargs=dict(listing_id=url.get('listing_id')),
                                 meta=self.meta,
                                 dont_filter=True)

    def errback(self, failure):
        self.logger.error('ERROR CALLBACK: %s', repr(failure))
        listing_id = failure.request.cb_kwargs['listing_id']
        status = ListingStatusEnum.URL_NOT_FOUND.value

        yield Item(listing_id=listing_id, name=None, price=None, status=status)
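
For context, the Item imported from crawler.items is assumed to be a plain scrapy.Item carrying the fields used above; a sketch (the real definition may contain more fields):

# crawler/items.py (assumed shape, based on the fields used in the spiders)
import scrapy


class Item(scrapy.Item):
    listing_id = scrapy.Field()
    name = scrapy.Field()
    price = scrapy.Field()
    status = scrapy.Field()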

docker-compose.yaml

version: "3.9"
services:
  rabbitmq:
    image: rabbitmq:3.6.16-management-alpine
    container_name: "rabbitmq"
    restart: unless-stopped
    environment:
      RABBITMQ_DEFAULT_USER: "${RABBITMQ_USER}"
      RABBITMQ_DEFAULT_PASS: "${RABBITMQ_PASSWORD}"
    ports:
      - "${RABBITMQ_PORT}:${RABBITMQ_PORT}"
      - "1${RABBITMQ_PORT}:1${RABBITMQ_PORT}"
    volumes:
      - ./.docker/rabbitmq/data/:/var/lib/rabbitmq/
      - ./.docker/rabbitmq/log/:/var/log/rabbitmq
    deploy:
      resources:
        limits:
          cpus: "1"
          memory: 1G
        reservations:
          memory: 512M

  crawler:
    build:
      context: .
      dockerfile: Dockerfile
    container_name: crawler
    command: bash -c "celery -A app worker --pool=threads --loglevel=INFO  --concurrency=1 -n worker@%n"
    restart: unless-stopped
    environment:
      RABBITMQ_USER: "${RABBITMQ_USER}"
      RABBITMQ_PASSWORD: "${RABBITMQ_PASSWORD}"
      RABBITMQ_HOST: "${RABBITMQ_HOST}"
      RABBITMQ_PORT: "${RABBITMQ_PORT}"
    network_mode: host
    depends_on:
      - rabbitmq
    deploy:
      resources:
        limits:
          cpus: "4"
          memory: 3G
        reservations:
          memory: 512M
    logging:
      options:
        max-size: "1G"
        max-file: "30"
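
To confirm that the worker actually sees the RabbitMQ variables defined above, a quick check can be run inside the crawler container (a minimal sketch; the password is deliberately not printed):

# env_check.py - print the RabbitMQ-related environment as seen inside the container
import os

for var in ('RABBITMQ_USER', 'RABBITMQ_HOST', 'RABBITMQ_PORT'):
    print(f'{var}={os.getenv(var)}')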

When running in the IDE, the script starts normally.

Output when running in the IDE

[2022-04-07 08:53:54,330: INFO/MainProcess] Task app.crawler_task[e3b8401a-a6a7-4c1d-ac61-426cb8c1ccd0] received
[2022-04-07 08:54:04,732: INFO/MainProcess] Overridden settings:
{'BOT_NAME': 'CRAWLER',
 'LOG_ENABLED': False,
 'LOG_LEVEL': 'INFO',
 'NEWSPIDER_MODULE': 'crawler.spiders',
 'SPIDER_MODULES': ['crawler.spiders'],
 'USER_AGENT': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, '
               'like Gecko) Chrome/92.0.4515.131 Safari/537.36'}
[2022-04-07 08:54:04,762: INFO/MainProcess] Telnet Password: 5e1278a7d809124a
[2022-04-07 08:54:04,794: INFO/MainProcess] Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats']
[2022-04-07 08:54:15,748: INFO/CrawlerScript-1] Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
[2022-04-07 08:54:15,757: INFO/CrawlerScript-1] Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
[2022-04-07 08:54:16,095: INFO/CrawlerScript-1] Enabled item pipelines:
['crawler.pipelines.PricingCrawlerPipeline']
[2022-04-07 08:54:16,095: INFO/CrawlerScript-1] Spider opened
[2022-04-07 08:54:16,996: INFO/CrawlerScript-1] Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
[2022-04-07 08:54:17,000: INFO/CrawlerScript-1] Telnet console listening on 127.0.0.1:6023

**The script is running - START REQUESTS log from base.py**
[2022-04-07 08:54:18,831: WARNING/CrawlerScript-1] START REQUESTS: https://xxx
[2022-04-07 08:54:18,872: WARNING/CrawlerScript-1] START REQUESTS: https://xxx
[2022-04-07 08:54:18,878: WARNING/CrawlerScript-1] START REQUESTS: https://xxx
[2022-04-07 08:54:18,884: WARNING/CrawlerScript-1] START REQUESTS: https://xxx
[2022-04-07 08:54:18,889: WARNING/CrawlerScript-1] START REQUESTS: https://xxx
[2022-04-07 08:54:18,894: WARNING/CrawlerScript-1] START REQUESTS: https://xxx
...
[2022-04-07 08:54:26,029: INFO/CrawlerScript-1] Closing spider (finished)
[2022-04-07 08:54:26,037: INFO/CrawlerScript-1] Dumping Scrapy stats:
{'downloader/request_bytes': 6401,
 'downloader/request_count': 15,
 'downloader/request_method_count/GET': 15,
 'downloader/response_bytes': 1443579,
 'downloader/response_count': 15,
 'downloader/response_status_count/200': 14,
 'downloader/response_status_count/302': 1,
 'elapsed_time_seconds': 9.036862,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2022, 4, 7, 11, 54, 26, 33066),
 'httpcompression/response_bytes': 7165556,
 'httpcompression/response_count': 14,
 'item_scraped_count': 14,
 'log_count/INFO': 10,
 'log_count/WARNING': 14,
 'memusage/max': 96260096,
 'memusage/startup': 96260096,
 'response_received_count': 14,
 'scheduler/dequeued': 15,
 'scheduler/dequeued/memory': 15,
 'scheduler/enqueued': 15,
 'scheduler/enqueued/memory': 15,
 'start_time': datetime.datetime(2022, 4, 7, 11, 54, 16, 996204)}
[2022-04-07 08:54:26,038: INFO/CrawlerScript-1] Spider closed (finished)
[2022-04-07 08:54:26,070: INFO/MainProcess] Task app.crawler_task[e3b8401a-a6a7-4c1d-ac61-426cb8c1ccd0] succeeded in 31.737013949023094s: None

When running in Docker, the script doesn't start.

Output when running in the Docker container

crawler    | [2022-04-07 12:18:33,000: INFO/MainProcess] Task app.crawler_task[2d05036b-ae92-488b-b7de-a6213905af48] received
crawler    | [2022-04-07 12:18:33,009: INFO/MainProcess] Overridden settings:
crawler    | {'BOT_NAME': 'CRAWLER',
crawler    |  'LOG_ENABLED': False,
crawler    |  'LOG_LEVEL': 'INFO',
crawler    |  'NEWSPIDER_MODULE': 'crawler.spiders',
crawler    |  'SPIDER_MODULES': ['crawler.spiders'],
crawler    |  'USER_AGENT': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, '
crawler    |                'like Gecko) Chrome/92.0.4515.131 Safari/537.36'}
crawler    | [2022-04-07 12:18:33,027: INFO/MainProcess] Telnet Password: f64b27b6d4457920
crawler    | [2022-04-07 12:18:33,081: INFO/MainProcess] Enabled extensions:
crawler    | ['scrapy.extensions.corestats.CoreStats',
crawler    |  'scrapy.extensions.telnet.TelnetConsole',
crawler    |  'scrapy.extensions.memusage.MemoryUsage',
crawler    |  'scrapy.extensions.logstats.LogStats']
crawler    | [2022-04-07 12:18:33,159: INFO/CrawlerScript-1] Enabled downloader middlewares:
crawler    | ['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
crawler    |  'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
crawler    |  'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
crawler    |  'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
crawler    |  'scrapy.downloadermiddlewares.retry.RetryMiddleware',
crawler    |  'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
crawler    |  'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
crawler    |  'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
crawler    |  'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
crawler    |  'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
crawler    |  'scrapy.downloadermiddlewares.stats.DownloaderStats']
crawler    | [2022-04-07 12:18:33,164: INFO/CrawlerScript-1] Enabled spider middlewares:
crawler    | ['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
crawler    |  'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
crawler    |  'scrapy.spidermiddlewares.referer.RefererMiddleware',
crawler    |  'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
crawler    |  'scrapy.spidermiddlewares.depth.DepthMiddleware']

And nothing more happens. Compared with the IDE run, the log stops right after the spider middlewares are listed; the 'Enabled item pipelines', 'Spider opened' and START REQUESTS lines never appear.

I'm running on Python 3.8 with the following dependencies:

requirements.txt

scrapy==2.6.1
celery==5.2.3
billiard==3.6.4.0

Could this be something related to the Twisted reactor inside Docker? Any ideas why the script doesn't start in the container?

Tags: python, docker, scrapy, celery, reactor
