Я пытаюсь использовать SCRAPY для очистки результатов поиска этого веб-сайта для любого поискового запроса - http://www.bewakoof.com .
Веб-сайт использует AJAX (в форме XHR) для отображения результатов поиска. Мне удалось отследить XHR, и вы заметили это в моем коде, как показано ниже (внутри цикла for, в котором я сохраняю URL-адрес для temp и увеличиваю «i» в цикле):
from twisted.internet import reactor
from scrapy.crawler import CrawlerProcess, CrawlerRunner
import scrapy
from scrapy.utils.log import configure_logging
from scrapy.utils.project import get_project_settings
from scrapy.settings import Settings
import datetime
from multiprocessing import Process, Queue
import os
from scrapy.http import Request
from scrapy import signals
from scrapy.xlib.pydispatch import dispatcher
from scrapy.signalmanager import SignalManager
import re
query='shirt'
query1=query.replace(" ", "+")
class DmozItem(scrapy.Item):
productname = scrapy.Field()
product_link = scrapy.Field()
current_price = scrapy.Field()
mrp = scrapy.Field()
offer = scrapy.Field()
imageurl = scrapy.Field()
outofstock_status = scrapy.Field()
class DmozSpider(scrapy.Spider):
name = "dmoz"
allowed_domains = ["http://www.bewakoof.com"]
def start_requests(self):
task_urls = [
]
i=1
for i in range(1,2):
temp=( "http://www.bewakoof.com/search/searchload/search_text/" + query + "/page_num/" + str(i) )
task_urls.append(temp)
i=i+1
start_urls = (task_urls)
p=len(task_urls)
print 'hi'
return [ Request(url = start_url) for start_url in start_urls ]
print 'hi'
def parse(self, response):
print 'hi'
print response
items = []
for sel in response.xpath('//html/body/div[@class="main-div-of-product-item"]'):
item = DmozItem()
item['productname'] = str(sel.xpath('div[1]/span[@class="lazyImage"]/span[1]/a/img[@id="main_image"]/@title').extract())[17:-6]
item['product_link'] = "http://www.bewakoof.com"+str(sel.xpath('div[1]/span[@class="lazyImage"]/span[1]/a/img[@id="main_image"]/@href').extract())[3:-2]
item['current_price']='Rs. ' + str(sel.xpath('div[1]/div[@class="product_info"]/div[@class="product_price_nomrp"]/span[1]/text()').extract())[3:-2]
item['mrp'] = item['current_price']
item['offer'] = str('No additional offer available')
item['imageurl'] = str(sel.xpath('div[1]/span[@class="lazyImage"]/span[1]/a/img[@id="main_image"]/@data-original').extract())[3:-2]
item['outofstock_status'] = str('In Stock')
items.append(item)
spider1 = DmozSpider()
settings = Settings()
settings.set("PROJECT", "dmoz")
settings.set("DOWNLOAD_DELAY" , 5)
crawler = CrawlerProcess(settings)
crawler.crawl(spider1)
crawler.start()
Теперь, когда я выполняю это, я получаю неожиданные ошибки:
2015-07-09 11:46:01 [scrapy] INFO: Scrapy 1.0.0 started (bot: scrapybot)
2015-07-09 11:46:01 [scrapy] INFO: Optional features available: ssl, http11
2015-07-09 11:46:01 [scrapy] INFO: Overridden settings: {'DOWNLOAD_DELAY': 5}
2015-07-09 11:46:02 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
2015-07-09 11:46:02 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-07-09 11:46:02 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-07-09 11:46:02 [scrapy] INFO: Enabled item pipelines:
hi
2015-07-09 11:46:02 [scrapy] INFO: Spider opened
2015-07-09 11:46:02 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-07-09 11:46:02 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-07-09 11:46:03 [scrapy] DEBUG: Retrying <GET http://www.bewakoof.com/search/searchload/search_text/shirt/page_num/1> (failed 1 times): [<twisted.python.failure.Failure <class 'twisted.web._newclient.ParseError'>>]
2015-07-09 11:46:09 [scrapy] DEBUG: Retrying <GET http://www.bewakoof.com/search/searchload/search_text/shirt/page_num/1> (failed 2 times): [<twisted.python.failure.Failure <class 'twisted.web._newclient.ParseError'>>]
2015-07-09 11:46:13 [scrapy] DEBUG: Gave up retrying <GET http://www.bewakoof.com/search/searchload/search_text/shirt/page_num/1> (failed 3 times): [<twisted.python.failure.Failure <class 'twisted.web._newclient.ParseError'>>]
2015-07-09 11:46:13 [scrapy] ERROR: Error downloading <GET http://www.bewakoof.com/search/searchload/search_text/shirt/page_num/1>: [<twisted.python.failure.Failure <class 'twisted.web._newclient.ParseError'>>]
2015-07-09 11:46:13 [scrapy] INFO: Closing spider (finished)
2015-07-09 11:46:13 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 3,
'downloader/exception_type_count/twisted.web._newclient.ResponseFailed': 3,
'downloader/request_bytes': 780,
'downloader/request_count': 3,
'downloader/request_method_count/GET': 3,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2015, 7, 9, 6, 16, 13, 793446),
'log_count/DEBUG': 4,
'log_count/ERROR': 1,
'log_count/INFO': 7,
'scheduler/dequeued': 3,
'scheduler/dequeued/memory': 3,
'scheduler/enqueued': 3,
'scheduler/enqueued/memory': 3,
'start_time': datetime.datetime(2015, 7, 9, 6, 16, 2, 890066)}
2015-07-09 11:46:13 [scrapy] INFO: Spider closed (finished)
Если вы правильно видите мой код, я также установил DOWNLOAD_DELAY=5, но он выдает те же ошибки, что и когда я его не сохранил. Я также увеличил DOWNLOAD_DELAY=10, все равно выдает те же ошибки. Я прочитал много вопросов, связанных с этим, на Stack Overflow, а также на GitHub, но ни один из них не помогает.
Прочитал в одном из ответов, что ТОР с Polipo может помочь. Но я немного сомневаюсь в его использовании, потому что я не знаю, законно ли использовать комбинацию TOR с Polipo для очистки веб-сайтов с помощью Scrapy? (Я не хочу столкнуться с какими-либо юридическими проблемами.) Вот почему я не предпочел его использовать. Итак, если это законно, пожалуйста, предоставьте код для моего КОНКРЕТНОГО СЛУЧАЯ, используя TOR и POLIPO.
Вернее, если это незаконно, помогите мне решить это, не используя их.
Пожалуйста, помогите мне решить эти ошибки!
РЕДАКТИРОВАТЬ:
Это мой обновленный код:
from twisted.internet import reactor
from scrapy.crawler import CrawlerProcess, CrawlerRunner
import scrapy
from scrapy.utils.log import configure_logging
from scrapy.utils.project import get_project_settings
from scrapy.settings import Settings
import datetime
from multiprocessing import Process, Queue
import os
from scrapy.http import Request
from scrapy import signals
from scrapy.xlib.pydispatch import dispatcher
from scrapy.signalmanager import SignalManager
import re
query='shirt'
query1=query.replace(" ", "+")
class DmozItem(scrapy.Item):
productname = scrapy.Field()
product_link = scrapy.Field()
current_price = scrapy.Field()
mrp = scrapy.Field()
offer = scrapy.Field()
imageurl = scrapy.Field()
outofstock_status = scrapy.Field()
class DmozSpider(scrapy.Spider):
name = "dmoz"
allowed_domains = ["http://www.bewakoof.com"]
def _monkey_patching_HTTPClientParser_statusReceived(self):
from scrapy.xlib.tx._newclient import HTTPClientParser, ParseError
old_sr = HTTPClientParser.statusReceived
def statusReceived(self, status):
try:
return old_sr(self, status)
except ParseError, e:
if e.args[0] == 'wrong number of parts':
return old_sr(self, status + ' OK')
raise
statusReceived.__doc__ == old_sr.__doc__
HTTPClientParser.statusReceived = statusReceived
def start_requests(self):
task_urls = [
]
i=1
for i in range(1,2):
temp = "http://www.bewakoof.com/search/searchload/search_text/shirt/page_num/1"
task_urls.append(temp)
i=i+1
start_urls = (task_urls)
p=len(task_urls)
print 'hi'
self._monkey_patching_HTTPClientParser_statusReceived()
return [ Request(url = start_url) for start_url in start_urls ]
print 'hi'
def parse(self, response):
print 'hi'
print response
items = []
for sel in response.xpath('//html/body/div[@class="main-div-of-product-item"]'):
item = DmozItem()
item['productname'] = str(sel.xpath('div[1]/span[@class="lazyImage"]/span[1]/a/img[@id="main_image"]/@title').extract())[17:-6]
item['product_link'] = "http://www.bewakoof.com"+str(sel.xpath('div[1]/span[@class="lazyImage"]/span[1]/a/img[@id="main_image"]/@href').extract())[3:-2]
item['current_price']='Rs. ' + str(sel.xpath('div[1]/div[@class="product_info"]/div[@class="product_price_nomrp"]/span[1]/text()').extract())[3:-2]
item['mrp'] = item['current_price']
item['offer'] = str('No additional offer available')
item['imageurl'] = str(sel.xpath('div[1]/span[@class="lazyImage"]/span[1]/a/img[@id="main_image"]/@data-original').extract())[3:-2]
item['outofstock_status'] = str('In Stock')
items.append(item)
print (items)
spider1 = DmozSpider()
settings = Settings()
settings.set("PROJECT", "dmoz")
settings.set("DOWNLOAD_DELAY" , 5)
crawler = CrawlerProcess(settings)
crawler.crawl(spider1)
crawler.start()
И это мой обновленный вывод, отображаемый на терминале:
2015-07-10 13:06:00 [scrapy] INFO: Scrapy 1.0.0 started (bot: scrapybot)
2015-07-10 13:06:00 [scrapy] INFO: Optional features available: ssl, http11
2015-07-10 13:06:00 [scrapy] INFO: Overridden settings: {'DOWNLOAD_DELAY': 5}
2015-07-10 13:06:01 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
2015-07-10 13:06:01 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-07-10 13:06:01 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-07-10 13:06:01 [scrapy] INFO: Enabled item pipelines:
hi
2015-07-10 13:06:01 [scrapy] INFO: Spider opened
2015-07-10 13:06:01 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-07-10 13:06:01 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-07-10 13:06:02 [scrapy] DEBUG: Retrying <GET http://www.bewakoof.com/search/searchload/search_text/shirt/page_num/1> (failed 1 times): [<twisted.python.failure.Failure <class 'twisted.web._newclient.ParseError'>>]
2015-07-10 13:06:08 [scrapy] DEBUG: Retrying <GET http://www.bewakoof.com/search/searchload/search_text/shirt/page_num/1> (failed 2 times): [<twisted.python.failure.Failure <class 'twisted.web._newclient.ParseError'>>]
2015-07-10 13:06:12 [scrapy] DEBUG: Gave up retrying <GET http://www.bewakoof.com/search/searchload/search_text/shirt/page_num/1> (failed 3 times): [<twisted.python.failure.Failure <class 'twisted.web._newclient.ParseError'>>]
2015-07-10 13:06:12 [scrapy] ERROR: Error downloading <GET http://www.bewakoof.com/search/searchload/search_text/shirt/page_num/1>: [<twisted.python.failure.Failure <class 'twisted.web._newclient.ParseError'>>]
2015-07-10 13:06:13 [scrapy] INFO: Closing spider (finished)
2015-07-10 13:06:13 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 3,
'downloader/exception_type_count/twisted.web._newclient.ResponseFailed': 3,
'downloader/request_bytes': 780,
'downloader/request_count': 3,
'downloader/request_method_count/GET': 3,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2015, 7, 10, 7, 36, 13, 11023),
'log_count/DEBUG': 4,
'log_count/ERROR': 1,
'log_count/INFO': 7,
'scheduler/dequeued': 3,
'scheduler/dequeued/memory': 3,
'scheduler/enqueued': 3,
'scheduler/enqueued/memory': 3,
'start_time': datetime.datetime(2015, 7, 10, 7, 36, 1, 114912)}
2015-07-10 13:06:13 [scrapy] INFO: Spider closed (finished)
Итак, как видите, ошибки все те же! :( . Итак, пожалуйста, помогите мне решить эту проблему!
ОБНОВЛЕНО:
Это вывод, когда я пытаюсь поймать исключение, которое @JoeLinux предложил сделать:
>>> try:
... fetch("http://www.bewakoof.com/search/searchload/search_text/shirt/page_num/1")
... except Exception as e:
... e
...
2015-07-10 17:51:13 [scrapy] DEBUG: Retrying <GET http://www.bewakoof.com/search/searchload/search_text/shirt/page_num/1> (failed 1 times): [<twisted.python.failure.Failure <class 'twisted.web._newclient.ParseError'>>]
2015-07-10 17:51:14 [scrapy] DEBUG: Retrying <GET http://www.bewakoof.com/search/searchload/search_text/shirt/page_num/1> (failed 2 times): [<twisted.python.failure.Failure <class 'twisted.web._newclient.ParseError'>>]
2015-07-10 17:51:15 [scrapy] DEBUG: Gave up retrying <GET http://www.bewakoof.com/search/searchload/search_text/shirt/page_num/1> (failed 3 times): [<twisted.python.failure.Failure <class 'twisted.web._newclient.ParseError'>>]
ResponseFailed([<twisted.python.failure.Failure <class 'twisted.web._newclient.ParseError'>>],)
>>> print e.reasons[0].getTraceback()
Traceback (most recent call last):
File "/usr/lib/python2.7/dist-packages/twisted/internet/posixbase.py", line 614, in _doReadOrWrite
why = selectable.doRead()
File "/usr/lib/python2.7/dist-packages/twisted/internet/tcp.py", line 214, in doRead
return self._dataReceived(data)
File "/usr/lib/python2.7/dist-packages/twisted/internet/tcp.py", line 220, in _dataReceived
rval = self.protocol.dataReceived(data)
File "/usr/lib/python2.7/dist-packages/twisted/internet/endpoints.py", line 114, in dataReceived
return self._wrappedProtocol.dataReceived(data)
--- <exception caught here> ---
File "/usr/lib/python2.7/dist-packages/twisted/web/_newclient.py", line 1523, in dataReceived
self._parser.dataReceived(bytes)
File "/usr/lib/python2.7/dist-packages/twisted/web/_newclient.py", line 382, in dataReceived
HTTPParser.dataReceived(self, data)
File "/usr/lib/python2.7/dist-packages/twisted/protocols/basic.py", line 571, in dataReceived
why = self.lineReceived(line)
File "/usr/lib/python2.7/dist-packages/twisted/web/_newclient.py", line 271, in lineReceived
self.statusReceived(line)
File "/usr/lib/python2.7/dist-packages/twisted/web/_newclient.py", line 409, in statusReceived
raise ParseError("wrong number of parts", status)
twisted.web._newclient.ParseError: ('wrong number of parts', 'HTTP/1.1 500')