How do I control the request flow in Scrapy while solving a Google reCAPTCHA v2 with the deathbycaptcha service?



Problem description

Hello :) I'm using Scrapy, a Python web-crawling framework, to scrape a website, and using the deathbycaptcha service to solve the captchas I run into on their pages. My download delay is set to 30 seconds, and I only need to crawl a few pages to get basic information, so I don't eat too much of the site's bandwidth or anything like that; the idea is that the crawl looks like what would happen in an ordinary browser.

So let's get to the problems.

Issue 1 (marked in the code below). How can I essentially stop new requests from being created, or at least keep them from disturbing the captcha too much while it is being solved? I've tried a lot of different things and none of them worked, and I'm still a beginner, so I'm not much good at editing downloader middleware or Scrapy engine code; if that's the only way, so be it, but I'm hoping for a really simple and effective solution that lets the captcha do its thing without new requests constantly interrupting it. (A sketch of one possible approach follows.)
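One possible approach (a sketch, not something from the original post): Scrapy's execution engine can be paused and resumed from inside the spider through self.crawler.engine; pause() and unpause() are the same calls the telnet console exposes. A minimal sketch, where solve_captcha is a hypothetical blocking wrapper around the deathbycaptcha call:

import scrapy

class PausingSpider(scrapy.Spider):
    name = "pausing_sketch"

    def parse(self, response):
        if "InternalCaptcha" in response.request.url:
            # Keep the engine from scheduling new downloads while solving.
            self.crawler.engine.pause()
            try:
                token = solve_captcha(response)  # hypothetical blocking helper
                # ... use the token to build the form submit, as in the full spider below
            finally:
                # Always resume, even if the solver raises.
                self.crawler.engine.unpause()

Since parse runs in the reactor thread, a blocking solver call stalls the crawl anyway; pausing just makes that intent explicit and keeps the scheduler idle until unpause() is called.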

Issue 2 (also marked in the code below, and I think it's somewhat related to the first issue). How do I fix the timer call? If the captcha times out without being solved, the timer is supposed to reset the captchaIsRunning boolean; as it stands, the flag never resets, which permanently blocks the solver from trying again. The timer was one of my attempts at fixing the first issue, but instead I get an error. I'm not sure whether it has to do with Timer being pulled in from both threading and timeit in the import statements, but I didn't think that would make much difference. Can anyone point me in the right direction to fix the Timer statement? (The traceback in the crawl log below shows the exact error; see the note after the log.)

Like I said, the deathbycaptcha API works fine when it actually gets a chance to run, but the Scrapy requests really do interfere with it, and I haven't found a relevant solution to the problem. Again, I'm no scraping expert, so some of this is well outside my comfort zone; I need a push, just not one so hard that I end up breaking everything. Thanks for the help, I really appreciate it! And sorry for the extremely long question.

Anyway: the site lets you look up a number of results, and after roughly 40-60 pages it redirects to a captcha page carrying reCAPTCHA v2. The deathbycaptcha service has an API for solving reCAPTCHA v2, but unfortunately their solve times can sometimes run over a couple of minutes, which is very disappointing, but it happens. So I naturally raised the DOWNLOAD_TIMEOUT setting to 240 seconds, giving it enough time to solve the captcha and carry on crawling without getting redirected again. My humble settings are as follows:

CONCURRENT_REQUESTS = 1
DEPTH_LIMIT = 1
DOWNLOAD_DELAY = 30
CONCURRENT_REQUESTS_PER_DOMAIN = 1
CONCURRENT_REQUESTS_PER_IP = 1
DOWNLOAD_TIMEOUT = 240
AUTOTHROTTLE_ENABLED = True
# The initial download delay
AUTOTHROTTLE_START_DELAY = 10
# The maximum download delay to be set in case of high latencies
AUTOTHROTTLE_MAX_DELAY = 60

Then obviously there are others, but I think these are the most relevant to my problem. I have one extension enabled, plus a few extra entries in the middleware settings, because I'm also using Docker and scrapy-splash in this file:

SPIDER_MIDDLEWARES = {
 'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

DOWNLOADER_MIDDLEWARES = {
 'scrapy_splash.SplashCookiesMiddleware': 723,
 'scrapy_splash.SplashMiddleware': 725,
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

MYEXT_ENABLED = False
MYEXT_ITEMCOUNT = 100

EXTENSIONS = {
 'scrapy.extensions.telnet.TelnetConsole': None,
 'scrapy.extensions.spideroclog.SpiderOpenCloseLogging': 500,
}

I don't think any of that has much effect on the captcha or the downloader middleware... but anyway, here is some of the code from my scraper:

Python:

import sys
import os
sys.path.append(r'F:\Documents\ScrapyDirectory\scrapername\scrapername\spiders')
import deathbycaptcha
import json
import scrapy
import requests
from datetime import datetime
import math
import urllib
import time
from scrapy_splash import SplashRequest
from threading import Timer
from timeit import Timer  # note: this rebinds Timer, shadowing threading.Timer (see *ISSUE 2*)

class scrapername(scrapy.Spider):
    name = "scrapername"
    start_urls = []

    global scrapeUrlList
    global charCompStorage
    global captchaIsRunning

    r = requests.get('http://example.com/examplejsonfeed.php')

    myObject = json.loads(r.text)

    #print("Loading names...")
    for o in myObject['objects']:
        #a huge block for creating basically a lot of objects and appending
        #links created from these objects to the scrapeUrlList list
        pass

    print(len(scrapeUrlList))
    for url in scrapeUrlList:
        start_urls.append(url[1])
        #add all those urls that just got created to the start_urls list

    link_collection = []

    def resetCaptchaInformation(self):
        global captchaIsRunning
        if captchaIsRunning:
            captchaIsRunning = False

    def afterCaptchaSubmit(self, response):
        global captchaIsRunning
        print("Captcha submitted: " + response.request.url)
        captchaIsRunning = False

    def parse(self, response):
        global captchaIsRunning
        self.logger.info("got response %s for %r" % (response.status, response.url))

        if "InternalCaptcha" in response.request.url:
            #checks for captcha in the url, and if it's there, starts the captcha solver API
            if not captchaIsRunning:
                #this statement is here as a deterrent, to keep the captcha solver from
                #starting again and again with every new request (which it does)  *ISSUE 1*
                if "captchasubmit" in response.request.url:
                    print("Found captcha submit in url")
                else:
                    print("Internal Captcha is activated")
                    captchaIsRunning = True
                    t = Timer(240.0, self.resetCaptchaInformation)
                    #so I have been having major issues here, not sure why?
                    #*ISSUE 2*
                    t.start()

                    username = "username"
                    password = "password"

                    print("Set username and password")

                    Captcha_dict = {
                        'googlekey': '6LcMUhgUAAAAAPn2MfvqN9KYxj7KVut-oCG2oCoK',
                        'pageurl': response.request.url}

                    print("Created captcha dict")

                    json_Captcha = json.dumps(Captcha_dict)

                    print("json.dumps on captcha dict:")
                    print(json_Captcha)

                    client = deathbycaptcha.SocketClient(username, password)

                    print("Set up client with deathbycaptcha socket client")

                    try:
                        print("Trying to solve captcha")
                        balance = client.get_balance()

                        print("Remaining Balance: " + str(balance))

                        # Put your CAPTCHA type and Json payload here:
                        captcha = client.decode(type=4, token_params=json_Captcha)

                        if captcha:
                            # The CAPTCHA was solved; captcha["captcha"] holds its
                            # numeric ID, and captcha["text"] its text token.
                            print("CAPTCHA %s solved: %s" % (captcha["captcha"], captcha["text"]))

                            data = {
                                'g-recaptcha-response': captcha["text"],
                            }

                            try:
                                dest = response.xpath("/html/body/form/@action").extract_first()
                                print("Form URL: " + dest)
                                submitURL = "https://exampleaddress.com" + dest
                                yield scrapy.FormRequest(url=submitURL, formdata=data,
                                                         callback=self.afterCaptchaSubmit,
                                                         dont_filter=True)

                                print("Yielded form request")

                                if '':  # check if the CAPTCHA was incorrectly solved
                                    client.report(captcha["captcha"])
                            except TypeError:
                                sys.exit()
                    except deathbycaptcha.AccessDeniedException:
                        # Access to DBC API denied, check your credentials and/or balance
                        print("error: Access to DBC API denied, check your credentials and/or balance")
            else:
                pass
        else:
            print("no Captcha")
            #this will run if no captcha is on the page that the redirect landed on,
            #and basically parses all the information on the page

Sorry for dumping so much code, and thanks for patiently reading through it. If you're wondering why anything is there, just ask and I can explain. So: the captcha does get solved; that isn't the problem. While the scraper runs, lots of requests are in flight. It hits a 302 redirect, then gets a 200 response, crawls the page, detects the captcha, and starts solving it. Then Scrapy sends another request, which gets a 302 redirect to the captcha page and a 200 response, detects the captcha, and tries to solve it again. It kicks off the API over and over and wastes my tokens; that's exactly what the if not captchaIsRunning: statement is there to prevent.
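As a sketch of an alternative to the module-level globals (illustrative names, not code from the post): the guard flag can live on the spider instance, which keeps the state with the object that uses it:

import scrapy

class GuardedSpider(scrapy.Spider):
    name = "guarded_sketch"
    captcha_is_running = False  # class-level default, shadowed per instance on first write

    def parse(self, response):
        if "InternalCaptcha" in response.request.url and not self.captcha_is_running:
            self.captcha_is_running = True
            # ... start the deathbycaptcha solve here, then clear the flag in
            # the FormRequest callback, as afterCaptchaSubmit does now.

    def after_captcha_submit(self, response):
        self.captcha_is_running = False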

Crawl log (bear in mind that everything before this point was fine and ran all of my parse logging):

2018-07-19 14:10:35 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://www.exampleaddress.com/InternalCaptcha?returnUrl=%2fresults%3fname%3dThomas%2520Garrett%26citystatezip%3dLas%2520Vegas%2c%2520Nv> from <GET https://www.exampleaddress.com/results?name=Thomas%20Garrett&citystatezip=Las%20Vegas,%20Nv>
2018-07-19 14:10:49 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.exampleaddress.com/InternalCaptcha?returnUrl=%2fresults%3fname%3dThomas%2520Garrett%26citystatezip%3dLas%2520Vegas%2c%2520Nv> (referer: None)
2018-07-19 14:10:49 [scrapername] INFO: got response 200 for 'https://www.exampleaddress.com/InternalCaptcha?returnUrl=%2fresults%3fname%3dThomas%2520Garrett%26citystatezip%3dLas%2520Vegas%2c%2520Nv'
Internal Captcha is activated
2018-07-19 14:10:49 [scrapy.core.scraper] ERROR: Spider error processing <GET https://www.exampleaddress.com/InternalCaptcha?returnUrl=%2fresults%3fname%3dThomas%2520Garrett%26citystatezip%3dLas%2520Vegas%2c%2520Nv> (referer: None)
Traceback (most recent call last):
  File "F:\Program Files (x86)\Anaconda3\lib\site-packages\scrapy\utils\defer.py", line 102, in iter_errback
    yield next(it)
  File "F:\Program Files (x86)\Anaconda3\lib\site-packages\scrapy_splash\middleware.py", line 156, in process_spider_output
    for el in result:
  File "F:\Program Files (x86)\Anaconda3\lib\site-packages\scrapy\spidermiddlewares\offsite.py", line 29, in process_spider_output
    for x in result:
  File "F:\Program Files (x86)\Anaconda3\lib\site-packages\scrapy\spidermiddlewares\referer.py", line 339, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "F:\Program Files (x86)\Anaconda3\lib\site-packages\scrapy\spidermiddlewares\urllength.py", line 37, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "F:\Program Files (x86)\Anaconda3\lib\site-packages\scrapy\spidermiddlewares\depth.py", line 58, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "F:\Documents\ScrapyDirectory\scraperName\scraperName\spiders\scraperName- Copy.py", line 232, in parse
    t = Timer(240.0, self.resetCaptchaInformation)
  File "F:\Program Files (x86)\Anaconda3\lib\timeit.py", line 130, in __init__
    raise ValueError("stmt is neither a string nor callable")
ValueError: stmt is neither a string nor callable
2018-07-19 14:10:53 [scrapy.extensions.logstats] INFO: Crawled 63 pages (at 2 pages/min), scraped 13 items (at 0 items/min)
2018-07-19 14:11:02 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://www.exampleaddress.com/InternalCaptcha?returnUrl=%2fresults%3fname%3dSamuel%2520Van%2520Cleave%26citystatezip%3dLas%2520Vegas%2c%2520Nv> from <GET https://www.exampleaddress.com/results?name=Samuel%20Van%20Cleave&citystatezip=Las%20Vegas,%20Nv>
2018-07-19 14:11:13 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.exampleaddress.com/InternalCaptcha?returnUrl=%2fresults%3fname%3dSamuel%2520Van%2520Cleave%26citystatezip%3dLas%2520Vegas%2c%2520Nv> (referer: None)
2018-07-19 14:11:13 [scrapername] INFO: got response 200 for 'https://www.exampleaddress.com/InternalCaptcha?returnUrl=%2fresults%3fname%3dSamuel%2520Van%2520Cleave%26citystatezip%3dLas%2520Vegas%2c%2520Nv'
#and then an endless supply of 302 redirects, and 200 response for their crawl
#nothing happens, because the Timer failed, the captcha never solved?
#I'm not sure what is going wrong with it, hence the issues I am having
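The traceback shows what is going wrong with the timer: both threading and timeit define a class named Timer, and because from timeit import Timer comes last, the name Timer is bound to timeit.Timer. Its first parameter, stmt, must be a string or a callable, so timeit.Timer(240.0, ...) raises "stmt is neither a string nor callable". A minimal sketch of the fix, assuming only the delayed-callback behavior of threading.Timer is wanted:

from threading import Timer  # keep this import and drop "from timeit import Timer"

def reset_captcha_information():
    print("captcha flag reset")

# threading.Timer(interval, function) calls function once, interval seconds later.
t = Timer(240.0, reset_captcha_information)
t.start()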

Recommended answer

I'd still need a few more weeks of experience to solve your problem outright, but I would try working with the RetryMiddleware:

Maybe adding 302 to the retry codes is already enough:

RETRY_HTTP_CODES
Default: [500, 502, 503, 504, 408]
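A minimal sketch of that suggestion in settings.py (my reading of the answer, not code it contains). One caveat: by default RedirectMiddleware handles the 302 before RetryMiddleware ever sees the response, so redirects need to be disabled for the retry to trigger:

# settings.py
RETRY_ENABLED = True
RETRY_TIMES = 5                                    # retries per request
RETRY_HTTP_CODES = [500, 502, 503, 504, 408, 302]  # 302 added to the defaults
REDIRECT_ENABLED = False                           # otherwise the 302 is followed, not retried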
