I have an XML file which looks like this:
<rss version="2.0"
<title>Label: some_title"</title>
<guid isPermaLink="false"></guid>
<content:encoded><![CDATA[[vc_row][vc_column][vc_column_text]<strong>some text<a href="" target="_blank" rel="noopener noreferrer">text</a> some more text</strong><!--more-->
[caption id="attachment_344" align="aligncenter" width="524"]<img class="-image-" src="" alt="" width="524" height="316" /> <em>A <a href="" target="_blank" rel="noopener noreferrer">screenshot</a> by the people</em>[/caption]
<strong>some more text</strong>
<div class="entry-content">
<em>Leave your comments</em>
<div class="post-meta wf-mobile-collapsed">
<div class="entry-meta"></div>
[/vc_column_text][/vc_column][/vc_row][vc_row][vc_column][/vc_column][/vc_row][vc_row][vc_column][dt_quote]<strong><b>RESEARCH | ARTICLE </b></strong>University[/dt_quote][/vc_column][/vc_row]]]></content:encoded>
some more <item> </item>s here
I want to extract the raw text within the <content:encoded> section, excluding the tags and URLs. I have tried this with BeautifulSoup and Scrapy, as well as other lxml methods. Most return an empty list.
Is there a way for me to retrieve this information without having to use regex?
Much appreciated.
I opened the XML file using:
from bs4 import BeautifulSoup as bs

content = []
with open(xml_file, "r") as file:
    content = file.readlines()
content = "".join(content)
xml = bs(content, "lxml")
Then I tried this with Scrapy:
response = HtmlResponse(url=xml_file, encoding='utf-8')
which returns an empty list.
I also tried the code from the first answer:
soup = bs(xml.select_one("content:encoded").text, "html.parser")
text = "\n".join(
    s.get_text(strip=True, separator=" ") for s in soup.select("strong"))
and got this error: Only the following pseudo-classes are implemented: nth-of-type.
When I opened the file with lxml, I ran this for loop:
data = {}
n = 0
for item in xml.findall('item'):
    id = 'claim_id_' + str(n)
    keys = {}
    title = item.find('title').text
    keys['label'] = title.split(': ')[0]
    keys['claim'] = title.split(': ')[1]
    if item.find('content:encoded'):
        keys['text'] = bs(html.unescape(item.encoded.text), 'lxml')
    data[id] = keys
    n += 1
It saved the label and claim perfectly well, but nothing for the text. Now that I open the file with BeautifulSoup instead, the same loop raises this error: 'NoneType' object is not callable
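A possible reason the loop above stores nothing for the text: the `content:` prefix has to be mapped to a namespace URI before `find` can match it, and an element with no child elements evaluates as false, so `if item.find(...)` can skip a match that contains only text. Here is a minimal sketch of the same loop using the standard library's xml.etree.ElementTree rather than lxml. The namespace URI (the usual RSS content module), the `SAMPLE` feed and the `collect_items` name are assumptions for illustration, not taken from the original file.

```python
import xml.etree.ElementTree as ET

# Assumption: the feed declares the usual RSS content module namespace.
NS = {"content": "http://purl.org/rss/1.0/modules/content/"}

# Hypothetical, trimmed-down feed used only to exercise the loop.
SAMPLE = """<rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/">
<channel>
<item>
<title>Label: some_title</title>
<content:encoded><![CDATA[<strong>some text</strong> [caption]more[/caption] end]]></content:encoded>
</item>
</channel>
</rss>"""


def collect_items(root):
    """Build {claim_id_N: {label, claim, text}} from every <item>."""
    data = {}
    for n, item in enumerate(root.iter("item")):
        label, _, claim = item.findtext("title", default="").partition(": ")
        keys = {"label": label, "claim": claim}
        # Pass the prefix map so "content:encoded" resolves to the namespaced tag.
        encoded = item.find("content:encoded", NS)
        # Compare against None: a found element without children is still falsy.
        if encoded is not None:
            keys["text"] = encoded.text or ""
        data[f"claim_id_{n}"] = keys
    return data


data = collect_items(ET.fromstring(SAMPLE))
print(data["claim_id_0"])
```

The text stored here is still the raw CDATA payload, so the HTML tags and shortcodes have to be stripped in a separate step.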
I eventually got the text part using regular expressions (regex).
import re

# root is the root element of the parsed XML file
for item in root.iter('item'):
    # list(item) replaces the deprecated getchildren()
    for grandchild in list(item):
        if 'encoded' in grandchild.tag:
            text = grandchild.text
            text = re.sub(r'\[.*?\]', "", text)  # gets rid of square brackets and their content
            text = re.sub(r'\<.*?\>', "", text)  # gets rid of <> signs and their content
            text = text.replace("&nbsp;", "")  # gets rid of &nbsp; entities
            text = " ".join(text.split())  # collapses runs of whitespace
Answered By - Moe B
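For anyone who still wants to avoid regex, as the question asked, here is one possible sketch that uses only the standard library. `html.parser.HTMLParser` drops the tags and comments, and a small depth counter drops the square-bracket shortcodes. The names `TextOnly`, `drop_shortcodes` and `plain_text` and the sample string are made up for illustration. Like the regex above, this treats every bracketed span as a shortcode, so literal brackets in the prose would be removed too.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collects text nodes; tags, attributes and comments are ignored."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def drop_shortcodes(text):
    """Remove [ ... ] spans with a depth counter instead of a regex."""
    out, depth = [], 0
    for ch in text:
        if ch == "[":
            depth += 1
        elif ch == "]" and depth:
            depth -= 1
        elif depth == 0:
            out.append(ch)
    return "".join(out)


def plain_text(fragment):
    """Strip HTML tags and shortcodes, then collapse whitespace."""
    parser = TextOnly()
    parser.feed(fragment)
    parser.close()
    text = drop_shortcodes("".join(parser.parts))
    # split() also treats the non-breaking space from &nbsp; as whitespace.
    return " ".join(text.split())


# Hypothetical fragment modelled on the CDATA payload in the question.
sample = ('[vc_row][vc_column_text]<strong>some text <a href="" target="_blank">text</a> '
          'some more text</strong><!--more-->&nbsp;[caption id="attachment_344"]'
          '<img src="" alt="" /> <em>A screenshot</em>[/caption][/vc_row]')
print(plain_text(sample))
```

In the loop above, `plain_text(grandchild.text)` could stand in for the two `re.sub` calls and the `replace` call.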