Friday, December 2, 2022

[FIXED] Scraping only the article content in a Scrapy spider when its div includes many other things

Tags: python, scrapy, web-scraping, xpath

Issue

When I run this code, I get Article_content = ''. Can anyone fix this code so that it returns only the article's content?

Here is the URL
https://www.fodors.com/world/asia/india/experiences/news/things-you-need-to-know-before-you-visit-india

This is my code:

    # Content = {}
    # header,paragraphs  = "",[]
    # for element in response.xpath('//*[@class="entry-content content-single container "]/*'):
    #     tag = element.re(r"<(\w+)\s")  # get the tag name
    #     # if its a paragraph add it to the paragraph list
    #     if tag[0] == "p":              
    #         paragraphs += element.xpath(".//text()").getall()
    #     # if it's a heading place the heading and paragraphs in the
    #     # dictionary and start a new heading with the current text.
    #     elif tag[0] == "h3":
    #         Content[header] = ''.join(paragraphs).strip()
    #         header = ' '.join(element.xpath(".//text()").getall()).strip()
    #         paragraphs = []

    
    Article_Content = response.xpath('//*[@class="entry-content content-single container "]/text()')
    Content = '\n'.join(Article_Content.getall()).strip()

    yield{
        'Category':Category,
        'Headlines':Headlines,
        'Author': Author,
        'Source': Source,
        'Publication Date': Published_Date,
        'Feature_Image': Feature_Image,
        'Article Content': Content
    }

Solution

You need to use a more precise locator. The /text() step in your XPath selects only the text nodes that are direct children of the div, so text nested inside its child elements (paragraphs, headings) is not returned.
Instead of the parent block element locator, try the following locator:

'//*[@class="entry-content content-single container "]//p | //*[@class="entry-content content-single container "]//h2'

This locator matches all the p and h2 elements inside that block.
It gives you a list of selector objects. You then have to iterate over that list and extract the text content of each element separately.
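The iteration step might look like the minimal sketch below. It assumes `response` is the Scrapy response passed to the spider's parse callback; the helper name `extract_article_text` and the constants `BLOCK` and `LOCATOR` are hypothetical names introduced here for illustration, not part of the original question or answer.

```python
# Hypothetical helper; the XPath is the locator suggested in the answer.
BLOCK = '//*[@class="entry-content content-single container "]'
LOCATOR = f"{BLOCK}//p | {BLOCK}//h2"


def extract_article_text(response):
    """Return the article text, one paragraph or heading per line.

    `response` is assumed to be a Scrapy response, or any object
    exposing a compatible .xpath() method.
    """
    lines = []
    for node in response.xpath(LOCATOR):
        # .//text() also collects text from nested tags such as <a> or <em>
        text = "".join(node.xpath(".//text()").getall()).strip()
        if text:  # skip elements that contain only whitespace
            lines.append(text)
    return "\n".join(lines)
```

Inside the spider, the result could then replace the empty value in the yielded item, for example 'Article Content': extract_article_text(response). Whether p and h2 cover everything wanted on that page is an assumption; other tags may need to be added to the locator.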

Answered By - Prophet

This answer, collected from Stack Overflow and tested by PythonFixing community admins, is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
