Wednesday, August 3, 2022

[FIXED] Web scraper doesn't work correctly - field does not show any data

August 03, 2022 python, scrapy No comments

Issue

I tried to do a web scraper from Stackoverflow questions, but the 3rd column doesn't download the data, can you help me please?

from scrapy.item import Field
from scrapy.item import Item
from scrapy.spiders import Spider
from scrapy.selector import Selector
from scrapy.loader import ItemLoader

class Question(Item):
    a_id = Field()
    b_question = Field()
    c_desc = Field()

class StackOverflowSpider(Spider):
    name = "MyFirstSpider"
    custom_settings = {
        'USER-AGENT': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36"
    }
    start_urls = ['https://stackoverflow.com/questions']

    def parse(self, response):
        sel = Selector(response)
        questions = sel.xpath('//div[@id="questions"]//div[@class="s-post-summary--content"]')
        i = 1
        for quest in questions:
            item = ItemLoader(Question(), quest)
            item.add_xpath('b_question', './/h3/a/text()')
            item.add_xpath('c_desc', './/div[@class="s-post-summary--content-excerpt"]/text()')
            item.add_value('a_id', i)
            i = i+1
            yield item.load_item()

picture from csv file output

picture from website and the html code

Solution

Try it like this: I added some inline notes to explain the changes

from scrapy.spiders import Spider

class StackOverflowSpider(Spider):
    name = "MyFirstSpider"
    custom_settings = {
        'USER-AGENT': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36"
    }
    start_urls = ['https://stackoverflow.com/questions']

    def parse(self, response):
        # iterate through each question as an xpath object.
        for i, question in enumerate(response.xpath("//div[@class='s-post-summary--content']")):
            # use get method to grab text
            title = question.xpath('.//h3/a/text()').get()
            content = question.xpath('.//div[@class="s-post-summary--content-excerpt"]/text()').get()
            # yielding a regular dictionary in your case is the same thing
            yield {
                "b_question": title.strip(),
                "c_desc": content.strip(),
                "a_id": i
             }

Answered By - Alexander

This Answer collected from stackoverflow and tested by PythonFixing community admins, is licensed under cc by-sa 2.5 , cc by-sa 3.0 and cc by-sa 4.0

Wednesday, August 3, 2022

[FIXED] Web scraper doesn't work correctly - field does not show any data

Issue

Solution

0 comments:

Post a Comment

Popular Posts

Labels