Issue
import scrapy
from scrapy.http import Request


class PushpaSpider(scrapy.Spider):
    name = 'pushpa'
    start_urls = ['https://davestruestories.medium.com']
    headers = {
        'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36'
    }

    # custom settings
    custom_settings = {
        'CONCURRENT_REQUESTS_PER_DOMAIN': 1,
        'DOWNLOAD_DELAY': 1
    }

    def parse(self, response):
        links = response.xpath("//h1/a/@href").extract()
        for link in links:
            url = response.urljoin(link)
            yield Request(url, callback=self.parse_book, headers=self.headers)

    def parse_book(self, response):
        title = response.xpath("//h1/text()").get()
        content = response.xpath("//section/text()").getall()
        yield {
            'title': title,
            'article': content
        }
I want to extract the article content from this blog, but the spider returns nothing. This is the page link: https://davestruestories.medium.com/?p=169d7850744a. The article body on that page is the content I want to extract.
Solution
It looks like your XPath is wrong:
import scrapy


class PushpaSpider(scrapy.Spider):
    name = 'pushpa'
    start_urls = ['https://davestruestories.medium.com']

    custom_settings = {
        'CONCURRENT_REQUESTS_PER_DOMAIN': 1,
        'DOWNLOAD_DELAY': 1,
        'USER_AGENT': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36'
    }

    def parse(self, response):
        links = response.xpath("//h1/a/@href").getall()
        for link in links:
            yield response.follow(link, callback=self.parse_book, headers=response.request.headers)

    def parse_book(self, response):
        title = response.xpath("//h1/text()").get()
        content = response.xpath("//section//p/text()").getall()
        # if you want a string instead of a list:
        # content = ''.join(content)

        # test which of these is better:
        # date = response.xpath('//div[section]//p/span/text()').get()
        date = response.xpath('//div//p/span/text()').get()

        yield {
            'title': title,
            'date': date,
            'article': content
        }
Also, either add the user-agent to the settings (as done in custom_settings above) or create a start_requests method so the user-agent is attached to the very first request as well; a sketch of the start_requests approach is shown below.
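For reference, a minimal sketch of the start_requests variant, reusing the same user-agent string as above (this method would be added inside the spider class; it is an illustration, not part of the accepted code):

    # Alternative: attach the user-agent to the initial request(s)
    # instead of (or in addition to) setting USER_AGENT in custom_settings.
    def start_requests(self):
        headers = {
            'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36'
        }
        for url in self.start_urls:
            # the rest of the crawl proceeds through parse() as before
            yield scrapy.Request(url, headers=headers, callback=self.parse)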
Answered By - SuperUser