Scrapy Splash Crawler ReactorNotRestartable

This tutorial covers how to handle the Scrapy Splash crawler ReactorNotRestartable error. It is adapted from a question and answer thread posted by programmers elsewhere; hopefully it will be of help to you. Let's get started.

Problem Description

I have developed a Scrapy Splash scraper using Visual Studio Code on Windows 10.

When I run my scraper like this, without the runner.py file, it works and produces the scraped content in out.json: scrapy crawl mytest -o out.json

However, when I run the scraper in debug mode in Visual Studio Code using the runner.py file, it fails on the execute line (full code below):

Exception has occurred: ReactorNotRestartable
exception: no description
  File "C:\scrapy\hw_spiders\spiders\runner.py", line 8, in <module>
    execute(

I have already checked:

    Scrapy - Reactor not Restartable

    Scrapy raises ReactorNotRestartable when CrawlerProcess is ran twice

    ReactorNotRestartable error in while loop with scrapy

From those posts, it seems to be a problem when a second crawl is started (e.g. crawl being called multiple times when it should be started only once); however, I can't see where I would be doing that.

I also saw there that while loops can cause potential problems with the Twisted reactor, but I can't see any of those in my code either.

So right now I don't know where my code needs to be fixed.
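
For context, ReactorNotRestartable means that Twisted's reactor, which Scrapy starts once per process, was asked to start a second time. A minimal illustration of how the error typically arises (this snippet is a sketch for explanation, not code from the question):

# A single Python process may start the Twisted reactor only once.
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())
process.crawl('mytest')
process.start()  # first start: runs the crawl, stops the reactor when done

process.crawl('mytest')
process.start()  # second start: raises ReactorNotRestartable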

runner.py

# https://newbedev.com/debugging-scrapy-project-in-visual-studio-code
import os
from scrapy.cmdline import execute

# Run from this file's directory so Scrapy can locate scrapy.cfg
# (it searches upward from the current working directory).
os.chdir(os.path.dirname(os.path.realpath(__file__)))

try:
    execute(
        [
            'scrapy',
            'crawl',
            'mytest',
            '-o',
            'out.json',
        ]
    )
except SystemExit:
    # scrapy.cmdline.execute() exits via sys.exit() when the crawl finishes.
    pass
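
As an aside, the same runner can be written with CrawlerProcess instead of scrapy.cmdline.execute, which makes the single reactor start explicit. A minimal sketch, assuming the project settings are importable and using the FEEDS setting (Scrapy >= 2.1) as the equivalent of -o out.json:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

settings = get_project_settings()
# Equivalent of the -o out.json command-line option.
settings.set('FEEDS', {'out.json': {'format': 'json'}})

process = CrawlerProcess(settings)
process.crawl('mytest')  # queue the spider by name
process.start()          # start the reactor exactly once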

launch.json

{
    "version": "0.1.0",
    "configurations": [
        {
            "name": "Python: Launch Scrapy Spider",
            "type": "python",
            "request": "launch",
            "module": "scrapy",
            "args": [
                "runspider",
                "${file}"
            ],
            "console": "integratedTerminal"
        }
    ]
}

settings.json

{
    "python.analysis.extraPaths": [
        "./hw_spiders"
    ]
}

middlewares.py

from scrapy import signals
from itemadapter import is_item, ItemAdapter


class MySpiderMiddleware:

    @classmethod
    def from_crawler(cls, crawler):
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_spider_input(self, response, spider):
        return None

    def process_spider_output(self, response, result, spider):
        for i in result:
            yield i

    def process_spider_exception(self, response, exception, spider):
        pass

    def process_start_requests(self, start_requests, spider):
        for r in start_requests:
            yield r

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)


class MyDownloaderMiddleware:

    @classmethod
    def from_crawler(cls, crawler):
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        return None

    def process_response(self, request, response, spider):
        return response

    def process_exception(self, request, exception, spider):
        pass

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)

pipelines.py

from itemadapter import ItemAdapter


class MyPipeline:
    def process_item(self, item, spider):
        return item

settings.py

BOT_NAME = 'hw_spiders'
SPIDER_MODULES = ['hw_spiders.spiders']
NEWSPIDER_MODULE = 'hw_spiders.spiders'
ROBOTSTXT_OBEY = True

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
    # 'hw_spiders.middlewares.MySpiderMiddleware': 543,
}

DOWNLOADER_MIDDLEWARES = {
    # 'hw_spiders.middlewares.MyDownloaderMiddleware': 543,
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'scrapy_useragents.downloadermiddlewares.useragents.UserAgentsMiddleware': 500,
}

SPLASH_URL = 'http://localhost:8050/'
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
ROBOTSTXT_OBEY = False  # note: overrides the ROBOTSTXT_OBEY = True set above

mytest.py

import json
import re
import os

import scrapy
import time
from scrapy_splash import SplashRequest
from scrapy.selector import Selector
from scrapy.http import HtmlResponse

from ..myitems import RentalItem


class MyTest_Spider(scrapy.Spider):
    name = 'mytest'
    start_urls = ['<hidden>']

    def start_requests(self):
        # Render the start page through Splash
        yield SplashRequest(
            self.start_urls[0], self.parse
        )

    def parse(self, response):
        object_links = response.css('div.wrapper div.inner33 > a::attr(href)').getall()

        for link in object_links:
            yield scrapy.Request(link, self.parse_object)

        next_page = response.css('div.nav-links a.next.page-numbers::attr(href)').get()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)

    def parse_object(self, response):
        item = RentalItem()

        item['url'] = response.url

        object_features = response.css('table.info tr')
        for feature in object_features:
            try:
                feature_title = feature.css('th::text').get().strip()
                feature_info = feature.css('td::text').get().strip()
            except AttributeError:
                # .get() returned None for a row without th/td cells
                continue
        item['thumbnails'] = response.css("ul#objects li a img::attr(src)").getall()

        yield item

Update 1

So I have now removed runner.py from my project and only have .vscode\launch.json:

When I open the file mytest.py in Visual Studio Code and press F5 to debug it, I see the following output:

Windows PowerShell
Copyright (C) Microsoft Corporation. All rights reserved.

Try the new cross-platform PowerShell https://aka.ms/pscore6

PS C:\scrapy\hw_spiders>  & 'C:\Users\Adam\AppData\Local\Programs\Python\Python38-32\python.exe' 'c:\Users\Adam\.vscode\extensions\ms-python.python-2021.11.1422169775\pythonFiles\lib\python\debugpy\launcher' '51812' '--' '-m' 'scrapy' 'runspider' 'c:\scrapy\hw_spiders\spiders\mytest.py'
2021-11-19 14:19:02 [scrapy.utils.log] INFO: Scrapy 2.3.0 started (bot: hw_spiders)
2021-11-19 14:19:02 [scrapy.utils.log] INFO: Versions: lxml 4.5.2.0, libxml2 2.9.5, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 20.3.0, Python 3.8.5 (tags/v3.8.5:580fbb0, Jul 20 2020, 
15:43:08) [MSC v.1926 32 bit (Intel)], pyOpenSSL 19.1.0 (OpenSSL 1.1.1g  21 Apr 2020), cryptography 3.0, Platform Windows-10-10.0.19041-SP0
2021-11-19 14:19:02 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
Usage
=====
  scrapy runspider [options] <spider_file>

runspider: error: Unable to load 'c:\scrapy\hw_spiders\spiders\mytest.py': attempted relative import with no known parent package

It must be the line from ..myitems import RentalItem, but I don't know why it fails.
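
The cause: scrapy runspider loads mytest.py as a standalone top-level module rather than as part of the hw_spiders package, so the relative import from ..myitems import RentalItem has no parent package to resolve against. Running the spider with scrapy crawl mytest (as runner.py does) imports it as hw_spiders.spiders.mytest, where the relative import works. Alternatively, an absolute import works in both modes; a minimal sketch, assuming the project root directory (the one containing hw_spiders/) is on sys.path:

# Hypothetical change in mytest.py: absolute instead of relative import,
# so the module also loads when executed via `scrapy runspider`.
from hw_spiders.myitems import RentalItem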

Recommended Answer

You should either have the runner.py file or use scrapy via launch.json (as shown in your question), not both.

It looks like the article in your question just copied all the answers from this Stackoverflow question and combined them without context.
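
In other words, pick a single entry point. If you keep runner.py, a launch configuration that debugs it directly (rather than running scrapy runspider on the currently open file) also avoids the relative-import error, because runner.py invokes scrapy crawl inside the project. A minimal sketch; the "program" path is an assumption based on the runner.py location shown in the traceback above:

{
    "version": "0.1.0",
    "configurations": [
        {
            "name": "Python: Debug Scrapy via runner.py",
            "type": "python",
            "request": "launch",
            "program": "${workspaceFolder}/spiders/runner.py",
            "console": "integratedTerminal"
        }
    ]
}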

That concludes this tutorial on the Scrapy Splash crawler ReactorNotRestartable error; hopefully this article has been of help.