Issue
When I run the code below, I get Article_content = ''.
Can anyone fix this code so that it returns only the article's content?
Here is the URL
https://www.fodors.com/world/asia/india/experiences/news/things-you-need-to-know-before-you-visit-india
This is my code:
# Content = {}
# header, paragraphs = "", []
# for element in response.xpath('//*[@class="entry-content content-single container "]/*'):
#     tag = element.re(r"<(\w+)\s")  # get the tag name
#     # if it's a paragraph, add it to the paragraph list
#     if tag[0] == "p":
#         paragraphs += element.xpath(".//text()").getall()
#     # if it's a heading, place the heading and paragraphs in the
#     # dictionary and start a new heading with the current text.
#     elif tag[0] == "h3":
#         Content[header] = ''.join(paragraphs).strip()
#         header = ' '.join(element.xpath(".//text()").getall()).strip()
#         paragraphs = []
Article_Content = response.xpath('//*[@class="entry-content content-single container "]/text()')
Content = '\n'.join(Article_Content.getall()).strip()
yield {
    'Category': Category,
    'Headlines': Headlines,
    'Author': Author,
    'Source': Source,
    'Publication Date': Published_Date,
    'Feature_Image': Feature_Image,
    'Article Content': Content
}
Solution
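You need to use a more precise locator. The /text() step on the parent block selects only the text nodes that are direct children of that element; the article text sits inside nested elements, which is most likely why the result comes back empty.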
Instead of the parent block element locator try using the following locator:
'//*[@class="entry-content content-single container "]//p | //*[@class="entry-content content-single container "]//h2'
This locator matches all the paragraph (p) and heading (h2) elements inside that block.
This will give you a list of matched elements. You will then have to iterate over that list and extract the text content of each element separately.
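The iteration step could look roughly like the sketch below. It is not the answerer's code: the helper name extract_article_text and the constant ARTICLE_XPATH are made up for illustration, and it assumes `response` is a Scrapy response (or any object whose .xpath() returns selectors that support .xpath(...).getall()).

```python
# Locator from the answer: every <p> and <h2> inside the article container.
ARTICLE_XPATH = (
    '//*[@class="entry-content content-single container "]//p'
    ' | //*[@class="entry-content content-single container "]//h2'
)


def extract_article_text(response):
    """Return the article text, one line per matched <p>/<h2> element.

    Hypothetical helper: `response` is assumed to behave like a Scrapy
    response, i.e. response.xpath(...) yields selector objects.
    """
    blocks = []
    for element in response.xpath(ARTICLE_XPATH):
        # .//text() collects the text nodes of the element and of its
        # descendants (links, bold text, ...), so join them back together.
        text = ''.join(element.xpath('.//text()').getall()).strip()
        if text:  # skip elements that contain only whitespace
            blocks.append(text)
    return '\n'.join(blocks)
```

In the spider from the question, the two lines that build Content could then be replaced with Content = extract_article_text(response), leaving the yield unchanged.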
Answered By - Prophet