When you need to do some web scraping in Python, an excellent choice is the Scrapy framework. Not only does it take care of most of the networking (HTTP, SSL, proxies, etc.), but it also makes extracting data from the web easier by providing things such as nifty XPath selectors.

Scrapy is built upon the Twisted networking engine. A limitation of its core component, the reactor, is that it cannot be restarted. This causes trouble if we want to run Scrapy spiders independently from a Python script (and not from the Scrapy shell). Say, for example, we want to implement a Python function that receives some parameters, searches or scrapes some sites, and returns a list of scraped items. A naive solution that runs the crawler directly inside that function will not work: every call after the first would need to restart the Twisted reactor, and this is unfortunately not possible.

A workaround for this is to run Scrapy in its own process. After searching, I could not get any of the solutions I found to work on the latest Scrapy. However, one of them used multiprocessing and came pretty close! Here is an updated version for Scrapy 0.13:

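from scrapy import project, signals
from scrapy.conf import settings
from scrapy.crawler import CrawlerProcess
from scrapy.xlib.pydispatch import dispatcher
# Queue is taken from the documented top-level multiprocessing namespace.
from multiprocessing import Queue
import multiprocessing

class CrawlerWorker(multiprocessing.Process):
    """Runs a single spider in a child process and returns its items through a queue."""

    def __init__(self, spider, result_queue):
        multiprocessing.Process.__init__(self)
        self.result_queue = result_queue

        # Set up the crawler; install it only if the project has none yet.
        self.crawler = CrawlerProcess(settings)
        if not hasattr(project, 'crawler'):
            self.crawler.install()
        self.crawler.configure()

        self.items = []
        self.spider = spider
        # Collect every item that makes it through the item pipeline.
        dispatcher.connect(self._item_passed, signals.item_passed)

    def _item_passed(self, item):
        self.items.append(item)

    def run(self):
        # Executed in the child process, so the reactor is started only once per process.
        self.crawler.crawl(self.spider)
        self.crawler.start()
        self.crawler.stop()
        # Hand the collected items back to the parent process.
        self.result_queue.put(self.items)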

One way to invoke this, say inside a function, would be:

        result_queue = Queue()
        crawler = CrawlerWorker(MySpider(myArgs), result_queue)
        crawler.start()
        for item in result_queue.get():
            yield item

where MySpider is of course the class of the Spider you want to run, and myArgs are the arguments you wish to invoke the spider with.
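
The same pattern can be tried without Scrapy installed. The following is only a minimal sketch using the standard library: JobWorker, fake_crawl and search are hypothetical names that are not part of Scrapy, and fake_crawl merely stands in for a real spider run. It shows the essential idea, which is that each call gets a fresh process, and the results travel back through a queue.

```python
import multiprocessing


class JobWorker(multiprocessing.Process):
    """Run one job in a child process and send its results back through a queue."""

    def __init__(self, job, args, result_queue):
        multiprocessing.Process.__init__(self)
        self.job = job
        self.args = args
        self.result_queue = result_queue

    def run(self):
        # When started with start(), this runs in the child process, so any
        # global state the job touches (for Scrapy, the Twisted reactor)
        # disappears when the child exits.
        self.result_queue.put(list(self.job(*self.args)))


def fake_crawl(term, count):
    # Hypothetical stand-in for a spider run; yields made-up "items".
    for index in range(count):
        yield {"term": term, "index": index}


def search(term, count):
    # Each call starts a fresh process, so it can be called repeatedly.
    result_queue = multiprocessing.Queue()
    worker = JobWorker(fake_crawl, (term, count), result_queue)
    worker.start()
    items = result_queue.get()  # read before join() so a large result cannot block the child
    worker.join()
    return items


if __name__ == "__main__":
    print(search("scrapy", 2))
    print(search("twisted", 1))
```

The `if __name__ == "__main__":` guard matters on platforms where multiprocessing starts children by re-importing the main module (Windows, for instance); without it, the child would try to launch workers of its own.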


Comments

  1. tre
    tre on 07/24/2012 1:07 a.m.
    Hey Alan, thank you for sharing this! (and for fixing the comment system)
  2. payala
    payala on 08/19/2012 6:11 p.m.
    I have tried this under windows but I never managed to make it work. I think the problem has to do with the limitations imposed by the multiprocessing module on windows platforms. I think this might be related: http://docs.python.org/library/multiprocessing.html#windows http://stackoverflow.com/questions/765129/hows-python-multiprocessing-implemented-on-windows
  3. akersof
    akersof on 09/15/2012 1:50 a.m.
    You do not set any environment variable? I am new to Scrapy and still get an error. "crawler = CrawlerWorker(MySpider('url=http://www.example.com'), result_queue)" What should MySpider be? The class name? The project name? The name of the crawler (name="myspider" in the class)? Regards,
  4. Serg
    Serg on 11/03/2012 1:55 p.m.
    It works only for one process running... When I run this code for two or more processes concurrently
    ...
    for spider in spiders:
        crawler = CrawlerWorker(spider(myArgs), result_queue)
        crawler.start()
    ...
    I get errors from Twisted:
    Unhandled Error
    Traceback (most recent call last):
      File "/usr/lib64/python2.7/site-packages/twisted/python/log.py", line 84, in callWithLogger
        return callWithContext({"system": lp}, func, *args, **kw)
      File "/usr/lib64/python2.7/site-packages/twisted/python/log.py", line 69, in callWithContext
        return context.call({ILogContext: newCtx}, func, *args, **kw)
      File "/usr/lib64/python2.7/site-packages/twisted/python/context.py", line 118, in callWithContext
        return self.currentContext().callWithContext(ctx, func, *args, **kw)
      File "/usr/lib64/python2.7/site-packages/twisted/python/context.py", line 81, in callWithContext
        return func(*args,**kw)
    --- <exception caught here> ---
      File "/usr/lib64/python2.7/site-packages/twisted/internet/posixbase.py", line 631, in _doReadOrWrite
        why = selectable.doWrite()
      File "/usr/lib64/python2.7/site-packages/twisted/internet/base.py", line 1094, in doWrite
        raise RuntimeError, "doWrite called on a %s" % reflect.qual(self.__class__)
    exceptions.RuntimeError: doWrite called on a twisted.internet.tcp.Port
  5. Serg
    Serg on 11/08/2012 7:13 a.m.
    The Twisted errors in the example above were eliminated by setting WEBSERVICE_ENABLED and TELNETCONSOLE_ENABLED to False. So I can run any number of processes, each with its own spider, without errors.
  6. sam
    sam on 03/17/2013 1:27 p.m.
    Does this technique work with Scrapy 0.16?
  7. Alan Descoins
    Alan Descoins on 03/26/2013 10:12 p.m.
    I haven't used version 0.16, but I am almost sure the code will need some changes.
  8. Rajesh Lakshmanan
    Rajesh Lakshmanan on 10/03/2013 3:55 a.m.
    Hi Alan, I am learning Scrapy and Python; I am basically a Java developer. I am using the Eclipse PyDev IDE for this development, so I need to install Scrapy in my Eclipse. Please help me out with how to achieve it.
  9. Emiliano
    Emiliano on 10/20/2014 6:35 a.m.
    Hey Alan, thanks for the example! I'm building a spider for my site Spoots to crawl all pages and get social stats, but on Mac it is not easy to install :'( Luckily, on my server all goes well. Thanks! Emiliano
