When you need to do some web scraping in Python, the Scrapy framework is an excellent choice. Not only does it take care of most of the networking (HTTP, SSL, proxies, etc.), it also makes extracting data from the web easier by providing conveniences such as nifty XPath selectors.

Scrapy is built upon the Twisted networking engine. A limitation of its core component, the reactor, is that it cannot be restarted. This can cause trouble if we are trying to devise a mechanism to run Scrapy spiders independently from a Python script (and not from the Scrapy shell). Say, for example, we want to implement a Python function that receives some parameters, performs a search or scrape on some sites, and returns a list of scraped items. A naive solution such as this will not work, since each call to the function would need to restart the Twisted reactor, and this is unfortunately not possible.

A workaround for this is to run Scrapy in its own process. After searching, I could not get any of the solutions I found to work on the latest Scrapy. However, one of them used multiprocessing and came pretty close! Here is an updated version for Scrapy 0.13:

from scrapy import project, signals
from scrapy.conf import settings
from scrapy.crawler import CrawlerProcess
from scrapy.xlib.pydispatch import dispatcher
from multiprocessing import Queue
import multiprocessing

class CrawlerWorker(multiprocessing.Process):
    """Runs one spider in its own process, so the reactor is started only once per process."""

    def __init__(self, spider, result_queue):
        multiprocessing.Process.__init__(self)
        # Queue used to hand the scraped items back to the parent process
        self.result_queue = result_queue

        self.crawler = CrawlerProcess(settings)
        # Install this crawler only if the project does not have one yet
        if not hasattr(project, 'crawler'):
            self.crawler.install()
        self.crawler.configure()

        self.items = []
        self.spider = spider
        # Collect each item as the item_passed signal fires
        dispatcher.connect(self._item_passed, signals.item_passed)

    def _item_passed(self, item):
        self.items.append(item)

    def run(self):
        # Executed in the child process: crawl, then send the items back
        self.crawler.crawl(self.spider)
        self.crawler.start()
        self.crawler.stop()
        self.result_queue.put(self.items)

One way to invoke this, say inside a function, would be:

        result_queue = Queue()
        crawler = CrawlerWorker(MySpider(myArgs), result_queue)
        crawler.start()
        for item in result_queue.get():
            yield item

where MySpider is of course the class of the Spider you want to run, and myArgs are the arguments you wish to invoke the spider with.
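
The same one-process-per-call pattern can be sketched without Scrapy, using only the standard library. This is a minimal Python 3 sketch, not the code above: fake_crawl is a hypothetical stand-in for a spider run, run_in_process and _worker are illustrative names, and the "fork" start method is an assumption that only holds on POSIX systems.

```python
import multiprocessing

# Assumption: a POSIX system, where the "fork" start method is available.
_ctx = multiprocessing.get_context("fork")


def fake_crawl(query):
    """Hypothetical stand-in for a spider run; returns a list of 'items'."""
    return [{"query": query, "rank": rank} for rank in range(3)]


def _worker(func, args, result_queue):
    # Runs in the child process: report either the result or the failure,
    # so the parent never waits forever on an empty queue.
    try:
        result_queue.put((True, func(*args)))
    except Exception as exc:
        result_queue.put((False, repr(exc)))


def run_in_process(func, *args):
    """Run func(*args) in a fresh child process and return its result."""
    result_queue = _ctx.Queue()
    proc = _ctx.Process(target=_worker, args=(func, args, result_queue))
    proc.start()
    # Read the result before join(): a child that still has queued data
    # to flush may not exit until that data has been consumed.
    ok, payload = result_queue.get()
    proc.join()
    if not ok:
        raise RuntimeError("child process failed: " + payload)
    return payload


if __name__ == "__main__":
    # Each call gets its own process, so repeated calls are independent.
    print(run_in_process(fake_crawl, "first"))
    print(run_in_process(fake_crawl, "second"))
```

Because every call starts a new process, any state that can only be initialized once per process (such as a started reactor) is never reused between calls. Reporting failures through the queue is a design choice of this sketch; without it, an exception in the child would leave the parent blocked on get().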


Comments

  1. tre
    tre on 07/24/2012 1:07 a.m.
    Hey Alan, thank you for sharing this! (and for fixing the comment system)
  2. payala
    payala on 08/19/2012 6:11 p.m.
    I have tried this under Windows but I never managed to make it work. I think the problem has to do with the limitations imposed by the multiprocessing module on Windows platforms. I think this might be related: http://docs.python.org/library/multiprocessing.html#windows http://stackoverflow.com/questions/765129/hows-python-multiprocessing-implemented-on-windows
  3. akersof
    akersof on 09/15/2012 1:50 a.m.
    You do not set any environment variable? I am new to Scrapy and still get an error with "crawler = CrawlerWorker(MySpider('url=http://www.example.com'), result_queue)". What should MySpider be? The class name? The project name? The name of the crawler (name="myspider" in the class)? Regards,
  4. Serg
    Serg on 11/03/2012 1:55 p.m.
    It works only for one process running... When I run this code for two or more processes concurrently
    ...
    for spider in spiders:
        crawler = CrawlerWorker(spider(myArgs), result_queue)
        crawler.start()
    ...
    I have got errors with Twisted:
    Unhandled Error
    Traceback (most recent call last):
      File "/usr/lib64/python2.7/site-packages/twisted/python/log.py", line 84, in callWithLogger
        return callWithContext({"system": lp}, func, *args, **kw)
      File "/usr/lib64/python2.7/site-packages/twisted/python/log.py", line 69, in callWithContext
        return context.call({ILogContext: newCtx}, func, *args, **kw)
      File "/usr/lib64/python2.7/site-packages/twisted/python/context.py", line 118, in callWithContext
        return self.currentContext().callWithContext(ctx, func, *args, **kw)
      File "/usr/lib64/python2.7/site-packages/twisted/python/context.py", line 81, in callWithContext
        return func(*args,**kw)
    --- <exception caught here> ---
      File "/usr/lib64/python2.7/site-packages/twisted/internet/posixbase.py", line 631, in _doReadOrWrite
        why = selectable.doWrite()
      File "/usr/lib64/python2.7/site-packages/twisted/internet/base.py", line 1094, in doWrite
        raise RuntimeError, "doWrite called on a %s" % reflect.qual(self.__class__)
    exceptions.RuntimeError: doWrite called on a twisted.internet.tcp.Port
  5. Serg
    Serg on 11/08/2012 7:13 a.m.
    The Twisted errors in the example above were eliminated by setting WEBSERVICE_ENABLED and TELNETCONSOLE_ENABLED to False. Now I can run any number of processes, each with its own spider, without errors.
  6. sam
    sam on 03/17/2013 1:27 p.m.
    Does this technique work with Scrapy 0.16?
  7. Alan Descoins
    Alan Descoins on 03/26/2013 10:12 p.m.
    I haven't used version 0.16, but I am almost sure the code will need some changes.
  8. Rajesh Lakshmanan
    Rajesh Lakshmanan on 10/03/2013 3:55 a.m.
    Hi Alan, I am learning Scrapy and Python; I am basically a Java developer. I am using the Eclipse PyDev IDE for this development, so I need to install Scrapy in Eclipse. Please help me out with how to achieve it.
