Issue
I am working on Google search crawling using Scrapy. This is the code, and it works well to get search results.
GoogleBot.py:
import scrapy

class GoogleBotsSpider(scrapy.Spider):
    name = 'GoogleScrapyBot'
    allowed_domains = ['google.com']
    start_urls = ['https://www.google.com/search?q=apple&hl=en&rlz=&start=0']

    def parse(self, response):
        item = {}
        all_page = response.xpath('//*[@id="main"]')
        for page in all_page:
            title = page.xpath('//*[@id="main"]/div/div/div/a/h3/div/text()').extract()
            link = page.xpath('//*[@id="main"]/div/div/div/a/@href').extract()
            for title, link in zip(title, link):
                print(title)
                print(link.lstrip("/url?q="))
My next step is to use a Scrapy pipeline to save the results to a CSV file. Here is the code I have written so far.
settings.py:
ITEM_PIPELINES = {'GoogleScrapy.pipelines.GooglePipeline': 300,}
pipelines.py:
from scrapy.exporters import CsvItemExporter

class GooglePipeline(object):
    def __init__(self):
        self.file = open("GoogleSearchResult.csv", 'wb')
        self.exporter = CsvItemExporter(self.file, encoding='utf-8')
        self.exporter.start_exporting()

    def close_spider(self, spider):
        self.exporter.finish_exporting()
        self.file.close()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item
This is my modified spider code.
GoogleBot.py:
    def parse(self, response):
        item = {}
        all_page = response.xpath('//*[@id="main"]')
        for page in all_page:
            item['title'] = page.xpath('//*[@id="main"]/div/div/div/a/h3/div/text()').extract()
            item['link'] = page.xpath('//*[@id="main"]/div/div/div/a/@href').extract()
            for title, link in zip(title, link):
                print(title)
                print(link.lstrip("/url?q="))
            yield item
It throws an error at:
for title, link in zip(title, link):
    print(title)
    print(link.lstrip("/url?q="))
I get this error:
    for title, link in zip(title, link):
UnboundLocalError: local variable 'title' referenced before assignment
Solution
The UnboundLocalError happens because your modified parse() stores the scraped lists in item['title'] and item['link'], so the bare names title and link in zip(title, link) are never assigned before they are used. Here is a working version according to your comment.
import scrapy

class GoogleBotsSpider(scrapy.Spider):
    name = 'GoogleScrapyBot'
    allowed_domains = ['google.com']
    start_urls = ['https://www.google.com/search?q=apple&hl=en&rlz=&start=0']

    def parse(self, response):
        all_page = response.xpath('//*[@id="main"]')
        for page in all_page:
            titles = page.xpath('//*[@id="main"]/div/div/div/a/h3/div//text()').extract()
            for title in titles:
                links = page.xpath('//*[@id="main"]/div/div/div/a/@href').extract()
                for link in links:
                    item = {
                        'Title': title,
                        'Link': link
                    }
                    yield item
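As a side note, if you prefer to keep your original loop structure, a smaller change is to zip the two extracted lists together instead of referring to the unassigned names title and link. This is only a sketch reusing your own XPath expressions, not something tested against live Google results:

import scrapy

class GoogleBotsSpider(scrapy.Spider):
    name = 'GoogleScrapyBot'
    allowed_domains = ['google.com']
    start_urls = ['https://www.google.com/search?q=apple&hl=en&rlz=&start=0']

    def parse(self, response):
        for page in response.xpath('//*[@id="main"]'):
            # Extract the two lists once, then pair them up by position
            titles = page.xpath('//*[@id="main"]/div/div/div/a/h3/div//text()').extract()
            links = page.xpath('//*[@id="main"]/div/div/div/a/@href').extract()
            for title, link in zip(titles, links):
                yield {'Title': title, 'Link': link}

Either way, with GooglePipeline enabled in settings.py, running scrapy crawl GoogleScrapyBot should write GoogleSearchResult.csv. Scrapy's built-in feed exports (scrapy crawl GoogleScrapyBot -o results.csv) are another option if you do not need a custom pipeline.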
Answered By - F.Hoque