Issue
I am working on Google search crawling using Scrapy. This is the code, and it works well to get search results.
GoogleBot.py:
import scrapy

class GoogleBotsSpider(scrapy.Spider):
    name = 'GoogleScrapyBot'
    allowed_domains = ['google.com']
    start_urls = ['https://www.google.com/search?q=apple&hl=en&rlz=&start=0']

    def parse(self, response):
        item = {}
        all_page = response.xpath('//*[@id="main"]')
        for page in all_page:
            title = page.xpath('//*[@id="main"]/div/div/div/a/h3/div/text()').extract()
            link = page.xpath('//*[@id="main"]/div/div/div/a/@href').extract()
            for title, link in zip(title, link):
                print(title)
                print(link.lstrip("/url?q="))
My next step is to use a Scrapy pipeline to save the results to a CSV file. Here is the code I have written so far.
settings.py:
ITEM_PIPELINES = {'GoogleScrapy.pipelines.GooglePipeline': 300,}
pipelines.py:
from scrapy.exporters import CsvItemExporter

class GooglePipeline(object):
    def __init__(self):
        self.file = open("GoogleSearchResult.csv", 'wb')
        self.exporter = CsvItemExporter(self.file, encoding='utf-8')
        self.exporter.start_exporting()

    def close_spider(self, spider):
        self.exporter.finish_exporting()
        self.file.close()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item
This is my modified spider code.
GoogleBot.py:
    def parse(self, response):
        item = {}
        all_page = response.xpath('//*[@id="main"]')
        for page in all_page:
            item['title'] = page.xpath('//*[@id="main"]/div/div/div/a/h3/div/text()').extract()
            item['link'] = page.xpath('//*[@id="main"]/div/div/div/a/@href').extract()
            for title, link in zip(title, link):
                print(title)
                print(link.lstrip("/url?q="))
            yield item
It throws an error at:
for title, link in zip(title, link):
    print(title)
    print(link.lstrip("/url?q="))
I get this error:
    for title, link in zip(title, link):
UnboundLocalError: local variable 'title' referenced before assignment
Solution
The UnboundLocalError happens because your modified parse() stores the scraped lists in item['title'] and item['link'], so the bare names title and link in zip(title, link) are never assigned before they are used. Here is a working version according to your comment.
import scrapy

class GoogleBotsSpider(scrapy.Spider):
    name = 'GoogleScrapyBot'
    allowed_domains = ['google.com']
    start_urls = ['https://www.google.com/search?q=apple&hl=en&rlz=&start=0']

    def parse(self, response):
        all_page = response.xpath('//*[@id="main"]')
        for page in all_page:
            titles = page.xpath('//*[@id="main"]/div/div/div/a/h3/div//text()').extract()
            for title in titles:
                links = page.xpath('//*[@id="main"]/div/div/div/a/@href').extract()
                for link in links:
                    item = {
                        'Title': title,
                        'Link': link
                    }
                    yield item
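As a side note, if you prefer to keep your original loop structure, a smaller change is to zip the two extracted lists together instead of referring to the unassigned names title and link. This is only a sketch reusing your own XPath expressions, not something tested against live Google results:

import scrapy

class GoogleBotsSpider(scrapy.Spider):
    name = 'GoogleScrapyBot'
    allowed_domains = ['google.com']
    start_urls = ['https://www.google.com/search?q=apple&hl=en&rlz=&start=0']

    def parse(self, response):
        for page in response.xpath('//*[@id="main"]'):
            # Extract the two lists once, then pair them up by position
            titles = page.xpath('//*[@id="main"]/div/div/div/a/h3/div//text()').extract()
            links = page.xpath('//*[@id="main"]/div/div/div/a/@href').extract()
            for title, link in zip(titles, links):
                yield {'Title': title, 'Link': link}

Either way, with GooglePipeline enabled in settings.py, running scrapy crawl GoogleScrapyBot should write GoogleSearchResult.csv. Scrapy's built-in feed exports (scrapy crawl GoogleScrapyBot -o results.csv) are another option if you do not need a custom pipeline.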
Answered By - F.Hoque