Problem Description
For now I have 2 spiders, and what I would like to do is:
- Spider 1 goes to url1 and, if url2 appears, calls spider 2 with url2. It also saves the content of url1 by using a pipeline.
- Spider 2 goes to url2 and does something.
由于兩種蜘蛛的復(fù)雜性,我想將它們分開.
Due to the complexities of both spiders I would like to have them separated.
My attempt, using scrapy crawl:
def parse(self, response):
    p = multiprocessing.Process(
        target=self.testfunc())
    p.join()
    p.start()

def testfunc(self):
    settings = get_project_settings()
    crawler = CrawlerRunner(settings)
    crawler.crawl(<spidername>, <arguments>)
It does load the settings but doesn't crawl:
2015-08-24 14:13:32 [scrapy] INFO: Enabled extensions: CloseSpider, LogStats, CoreStats, SpiderState
2015-08-24 14:13:32 [scrapy] INFO: Enabled downloader middlewares: DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, HttpAuthMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-08-24 14:13:32 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-08-24 14:13:32 [scrapy] INFO: Spider opened
2015-08-24 14:13:32 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
The documentation has an example about launching from a script, but what I'm trying to do is launch another spider while using the scrapy crawl command.
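For reference, the launch-from-a-script recipe in the Scrapy docs looks roughly like the sketch below (the spider class, its name, and its start URL are placeholders here):

import scrapy
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from twisted.internet import reactor

class MySpider(scrapy.Spider):
    name = "myspider"                    # placeholder spider
    start_urls = ['http://example.com']

    def parse(self, response):
        pass

configure_logging()
runner = CrawlerRunner()
d = runner.crawl(MySpider)               # schedule the crawl
d.addBoth(lambda _: reactor.stop())      # stop the reactor when it finishes
reactor.run()                            # blocks until the crawl is done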
Full code
from scrapy.crawler import CrawlerRunner
from scrapy.utils.project import get_project_settings
from twisted.internet import reactor
from multiprocessing import Process
import scrapy
import os

def info(title):
    print(title)
    print('module name:', __name__)
    if hasattr(os, 'getppid'):  # only available on Unix
        print('parent process:', os.getppid())
    print('process id:', os.getpid())

class TestSpider1(scrapy.Spider):
    name = "test1"
    start_urls = ['http://www.google.com']

    def parse(self, response):
        info('parse')
        a = MyClass()
        a.start_work()

class MyClass(object):

    def start_work(self):
        info('start_work')
        p = Process(target=self.do_work)
        p.start()
        p.join()

    def do_work(self):
        info('do_work')
        settings = get_project_settings()
        runner = CrawlerRunner(settings)
        runner.crawl(TestSpider2)
        d = runner.join()
        d.addBoth(lambda _: reactor.stop())
        reactor.run()
        return

class TestSpider2(scrapy.Spider):
    name = "test2"
    start_urls = ['http://www.google.com']

    def parse(self, response):
        info('testspider2')
        return
What I would like is something like this:
- scrapy crawl test1 (and, for example, when response.status_code is 200:)
- within test1, call scrapy crawl test2
Recommended Answer
I won't go into depth since this question is really old, but I'll go ahead and drop this snippet from the official Scrapy docs... You are very close! lol
import scrapy
from scrapy.crawler import CrawlerProcess

class MySpider1(scrapy.Spider):
    # Your first spider definition
    ...

class MySpider2(scrapy.Spider):
    # Your second spider definition
    ...

process = CrawlerProcess()
process.crawl(MySpider1)
process.crawl(MySpider2)
process.start()  # the script will block here until all crawling jobs are finished
https://doc.scrapy.org/en/latest/topics/practices.html
And then, using callbacks, you can pass items between your spiders to do whatever logic functions you're talking about.
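As a rough illustration of that idea (not the answerer's code), here is a minimal sketch that runs the two spiders sequentially with CrawlerRunner, following the pattern on the same docs page: spider 1 records the discovered URL in a shared list passed in as a spider argument, and spider 2 is only scheduled if something was found. The spider classes, the selector, and the found/start_url arguments are all hypothetical.

import scrapy
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from scrapy.utils.project import get_project_settings
from twisted.internet import defer, reactor

class Spider1(scrapy.Spider):
    # hypothetical first spider: records url2 if it shows up while parsing url1
    name = "spider1"
    start_urls = ['http://example.com/url1']

    def __init__(self, found=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.found = found if found is not None else []

    def parse(self, response):
        link = response.css('a.next::attr(href)').get()  # hypothetical selector for url2
        if link:
            self.found.append(response.urljoin(link))
        yield {'page': response.url}  # items still go through the project pipeline

class Spider2(scrapy.Spider):
    # hypothetical second spider: crawls whatever URL it is given
    name = "spider2"

    def __init__(self, start_url=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.start_urls = [start_url] if start_url else []

    def parse(self, response):
        yield {'title': response.css('title::text').get()}

configure_logging()
runner = CrawlerRunner(get_project_settings())
found = []

@defer.inlineCallbacks
def crawl():
    yield runner.crawl(Spider1, found=found)              # run spider 1 first
    if found:                                             # only if url2 appeared
        yield runner.crawl(Spider2, start_url=found[0])
    reactor.stop()

crawl()
reactor.run()  # blocks until both crawls have finished

Keyword arguments passed to runner.crawl() are forwarded to the spider's constructor, which is how the shared list and the discovered URL get into the spiders.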
這篇關(guān)于是否可以從 Scrapy spider 運行另一個蜘蛛?的文章就介紹到這了,希望我們推薦的答案對大家有所幫助,也希望大家多多支持html5模板網(wǎng)!