When you need to do some web scraping in Python, the Scrapy framework is an excellent choice. Not only does it take care of most of the networking (HTTP, SSL, proxies, etc.), it also makes extracting data from the web easier by providing tools such as nifty XPath selectors.

Scrapy is built upon the Twisted networking engine. A limitation of Twisted's core component, the reactor, is that it cannot be restarted once it has been stopped. This causes trouble if we want to run Scrapy spiders from a plain Python script (and not from the Scrapy shell). Say, for example, we want to implement a Python function that receives some parameters, scrapes some sites, and returns a list of scraped items. A naive solution that starts the crawler directly inside the function will not work: every call after the first would need the Twisted reactor to be restarted, and this is unfortunately not possible.

A workaround for this is to run Scrapy in its own process. After searching, I could not find a solution that worked with the latest Scrapy. However, one of them used multiprocessing and came pretty close! Here is an updated version for Scrapy 0.13:

from scrapy import project, signals
from scrapy.conf import settings
from scrapy.crawler import CrawlerProcess
from scrapy.xlib.pydispatch import dispatcher
from multiprocessing import Queue
import multiprocessing

class CrawlerWorker(multiprocessing.Process):
    """Runs a single spider in a child process and reports its items."""

    def __init__(self, spider, result_queue):
        multiprocessing.Process.__init__(self)
        # Queue used to send the scraped items back to the parent process.
        self.result_queue = result_queue

        self.crawler = CrawlerProcess(settings)
        # Install the crawler only if one has not been installed already.
        if not hasattr(project, 'crawler'):
            self.crawler.install()
        self.crawler.configure()

        self.items = []
        self.spider = spider
        # Collect every item reported through the item_passed signal.
        dispatcher.connect(self._item_passed, signals.item_passed)

    def _item_passed(self, item):
        self.items.append(item)

    def run(self):
        # Executed in the child process, so the reactor starts fresh each time.
        self.crawler.crawl(self.spider)
        self.crawler.start()
        self.crawler.stop()
        self.result_queue.put(self.items)

One way to invoke this, say inside a function, would be:

        result_queue = Queue()
        crawler = CrawlerWorker(MySpider(myArgs), result_queue)
        crawler.start()
        for item in result_queue.get():
            yield item

where MySpider is of course the class of the Spider you want to run, and myArgs are the arguments you wish to invoke the spider with.
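
The process-per-call pattern itself can be sketched without Scrapy, using only the standard library. This is an illustration under assumptions, not the code above: it targets Python 3, fake_crawl stands in for a real spider run, and run_in_fresh_process is a hypothetical helper name.

```python
# Minimal sketch using only the standard library; fake_crawl and
# run_in_fresh_process are hypothetical names, not part of Scrapy.
import multiprocessing

# Prefer "fork" where available so the child does not re-import this module.
# On Windows only "spawn" exists, and callers then need an
# `if __name__ == "__main__":` guard around the code that starts workers.
_METHOD = "fork" if "fork" in multiprocessing.get_all_start_methods() else None
_CTX = multiprocessing.get_context(_METHOD)


def fake_crawl(query):
    """Stand-in for a spider run: returns a few fake 'scraped' items."""
    return [{"query": query, "rank": rank} for rank in range(3)]


def _worker(job, args, result_queue):
    # Runs in the child process; reports either the items or the failure.
    try:
        result_queue.put(("ok", list(job(*args))))
    except Exception as exc:
        result_queue.put(("error", repr(exc)))


def run_in_fresh_process(job, *args):
    """Run job(*args) in a new process and return its list of items."""
    result_queue = _CTX.Queue()
    worker = _CTX.Process(target=_worker, args=(job, args, result_queue))
    worker.start()
    # Read before join(): a child blocked on a full pipe would never exit.
    status, payload = result_queue.get()
    worker.join()
    if status == "error":
        raise RuntimeError("worker failed: " + payload)
    return payload


if __name__ == "__main__":
    # Each call gets its own process, so calling twice is not a problem.
    print(run_in_fresh_process(fake_crawl, "scrapy"))
    print(run_in_fresh_process(fake_crawl, "twisted"))
```

Because every call starts a new process, any state that cannot be reset (such as a stopped reactor) is discarded when the child exits. Reporting failures through the queue is a design choice of this sketch, so that the parent does not wait forever on a worker that crashed.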


Comments

  1. tre
    tre on 07/24/2012 1:07 a.m.
    Hey Alan, thank you for sharing this! (and for fixing the comment system)
  2. payala
    payala on 08/19/2012 6:11 p.m.
    I have tried this under Windows but never managed to make it work. I think the problem has to do with the limitations imposed by the multiprocessing module on Windows platforms. This might be related: http://docs.python.org/library/multiprocessing.html#windows http://stackoverflow.com/questions/765129/hows-python-multiprocessing-implemented-on-windows
  3. akersof
    akersof on 09/15/2012 1:50 a.m.
    You do not set any environment variable? I am new to Scrapy and still get an error. "crawler = CrawlerWorker(MySpider('url=http://www.example.com'), result_queue)" What should MySpider be? The class name? The project name? The name of the crawler (name="myspider" in the class)? Regards,
  4. Serg
    Serg on 11/03/2012 1:55 p.m.
    It works only when a single process is running. When I run this code for two or more processes concurrently:
    ...
    for spider in spiders:
        crawler = CrawlerWorker(spider(myArgs), result_queue)
        crawler.start()
    ...
    I get errors from Twisted:
    Unhandled Error
    Traceback (most recent call last):
      File "/usr/lib64/python2.7/site-packages/twisted/python/log.py", line 84, in callWithLogger
        return callWithContext({"system": lp}, func, *args, **kw)
      File "/usr/lib64/python2.7/site-packages/twisted/python/log.py", line 69, in callWithContext
        return context.call({ILogContext: newCtx}, func, *args, **kw)
      File "/usr/lib64/python2.7/site-packages/twisted/python/context.py", line 118, in callWithContext
        return self.currentContext().callWithContext(ctx, func, *args, **kw)
      File "/usr/lib64/python2.7/site-packages/twisted/python/context.py", line 81, in callWithContext
        return func(*args,**kw)
    --- <exception caught here> ---
      File "/usr/lib64/python2.7/site-packages/twisted/internet/posixbase.py", line 631, in _doReadOrWrite
        why = selectable.doWrite()
      File "/usr/lib64/python2.7/site-packages/twisted/internet/base.py", line 1094, in doWrite
        raise RuntimeError, "doWrite called on a %s" % reflect.qual(self.__class__)
    exceptions.RuntimeError: doWrite called on a twisted.internet.tcp.Port
  5. Serg
    Serg on 11/08/2012 7:13 a.m.
    The Twisted errors in the example above were eliminated by setting WEBSERVICE_ENABLED and TELNETCONSOLE_ENABLED to False. So I can run any number of processes, each with its own spider, without errors.
  6. sam
    sam on 03/17/2013 1:27 p.m.
    Does this technique work with Scrapy 0.16?
  7. Alan Descoins
    Alan Descoins on 03/26/2013 10:12 p.m.
    I haven't used version 0.16, but I am almost sure the code will need some changes.
  8. Rajesh Lakshmanan
    Rajesh Lakshmanan on 10/03/2013 3:55 a.m.
    Hi Alan, I am learning Scrapy and Python; I am basically a Java developer. I am using the Eclipse PyDev IDE for this development, so I need to install Scrapy in my Eclipse. Please help me out with how to achieve it.
