Issue
I hope everything is well. I am trying to extract the content under a section called CEO Pay Ratio. I gather links to this section and pass them to a function. I want the function to collect content starting at the given tag and ending at the next thematic break:
def extract_content(soup, name):
    tag = soup.find('a', {'name': name})
    all_breaks = soup.select('hr[style*="width:100%"]')
    content = ""
    if tag:
        current_tag = tag.find_next()
        print(len(all_breaks))
        while current_tag and not (current_tag.name == 'hr' and current_tag in all_breaks):
            content += current_tag.get_text().strip() + " "
            current_tag = current_tag.find_next()
    return content.strip()
- all_breaks is a list of all the thematic breaks in the HTML file.
[<hr style="page-break-after:always;width:100%;"/>, <hr style="page-break-after:always;width:100%;"/>, <hr style="page-break-after:always;width:100%;"/>, <hr style="page-break-after:always;width:100%;"/>, .....]
HTML Portion I am trying to extract:
<p style="margin-bottom:0pt;margin-top:0pt;text-indent:0%;font-weight:bold;color:#0061F0;font-size:22pt;font-family:Segoe UI;text-transform:uppercase;font-style:normal;font-variant: normal;"><a name="_AEIOULastRenderedPageBreakAEIOU209">
<a name="CEO_PAY_RATIO"></a><font style="font-weight:bold;color:#0061F0;font-size:22pt;
font-family:Segoe UI;text-transform:uppercase;font-style:normal;font-variant: normal;">
<a name="CEO_PAY_RATIO"></a>CEO PAY</font><font style="font-weight:bold;color:#0061F0;font-size:22pt;
font-family:Segoe UI;text-transform:uppercase;font-style:normal;font-variant: normal;"> RATIO</font></p>
<some other stuff>
<hr style="page-break-after:always;width:100%;"/>
<some other we don't want>
The code I got from here:
soup = BeautifulSoup(html_text, "html.parser")
# preprocess the document (remove <font> tags)
for font in soup.select("font"):
    font.unwrap()
soup.smooth()
# find first tag
tag = soup.find('a', {'name':"CEO_PAY_RATIO"})
text = []
while tag := tag.next_sibling:
    if tag.name == "hr" and "width:100%" in tag.attrs.get("style", ""):
        break
    t = tag.get_text(strip=True, separator=" ")
    if t:
        text.append(t)
text = "\n".join(text)
print(text)
This prints nothing except CEO PAY RATIO. It does work, however, if the <a name="CEO_PAY_RATIO"> tag is not nested inside any other tag.
Thank you all for your time.
Solution
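The snippet in the question most likely stops early because .next_sibling only walks nodes that share the anchor's parent. When the anchor sits inside a <p>, the <hr> is a sibling of that <p>, not of the anchor, so the loop runs out of siblings before it ever reaches the break. A minimal sketch of that behaviour, using made-up markup (the anchor name X and the sample HTML are illustrative assumptions, not taken from the real document):

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup: the anchor is nested inside a <p>.
html = '<p><a name="X"></a>Title</p><p>Body</p><hr/>'
soup = BeautifulSoup(html, "html.parser")
anchor = soup.find("a", {"name": "X"})

# Siblings of the anchor itself: only what lives inside the same <p>.
inner = [str(node) for node in anchor.next_siblings]

# Siblings of the anchor's parent: the following content and the <hr>.
outer = [str(node) for node in anchor.parent.next_siblings]

print(inner)
print(outer)
```

The walk therefore has to start from an ancestor of the anchor that sits at the same level as the <hr>.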
Here is an example of how you can get that portion of the HTML document:
from bs4 import BeautifulSoup
html_text = """\
<p> I don't want this </p>
<p style="margin-bottom:0pt;margin-top:0pt;text-indent:0%;font-weight:bold;color:#0061F0;font-size:22pt;font-family:Segoe UI;text-transform:uppercase;font-style:normal;font-variant: normal;"><a name="_AEIOULastRenderedPageBreakAEIOU209">
<a name="CEO_PAY_RATIO"></a><font style="font-weight:bold;color:#0061F0;font-size:22pt;
font-family:Segoe UI;text-transform:uppercase;font-style:normal;font-variant: normal;">
<a name="CEO_PAY_RATIO"></a>CEO PAY</font><font style="font-weight:bold;color:#0061F0;font-size:22pt;
font-family:Segoe UI;text-transform:uppercase;font-style:normal;font-variant: normal;"> RATIO</font></p>
<p> I want this </p>
<hr style="page-break-after:always;width:100%;"/>
<p> I don't want this </p>
"""
soup = BeautifulSoup(html_text, "html.parser")
# preprocess the document (remove <font> tags)
for font in soup.select("font"):
    font.unwrap()
soup.smooth()
# find the first element that contains the anchor and has a matching <hr> as a later sibling
tag = soup.select_one('*:has(a[name="CEO_PAY_RATIO"]):has(~hr[style*="width:100"])')
text = [tag.get_text(strip=True, separator=" ")]
while tag := tag.next_sibling:
    if tag.name == "hr" and "width:100%" in tag.attrs.get("style", ""):
        break
    t = tag.get_text(strip=True, separator=" ")
    if t:
        text.append(t)
text = "\n".join(text)
print(text)
Prints:
CEO PAY RATIO
I want this
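If the same extraction is needed for several anchors, the idea above could be wrapped in a function. The following is only a sketch under stated assumptions: extract_section, is_break and node_text are hypothetical helper names, and instead of the :has() selector it climbs from the anchor through its parents until it finds an element that has a matching <hr> among its later siblings. It returns an empty string when the anchor or the break is missing.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag


def is_break(node):
    """True for an <hr> tag whose inline style contains width:100%."""
    return (
        isinstance(node, Tag)
        and node.name == "hr"
        and "width:100%" in node.attrs.get("style", "")
    )


def node_text(node):
    """Whitespace-normalised text of a tag or of a bare string node."""
    if isinstance(node, Tag):
        return node.get_text(strip=True, separator=" ")
    return str(node).strip()


def extract_section(html_text, anchor_name):
    """Hypothetical helper: text from the anchor's block up to the next break."""
    soup = BeautifulSoup(html_text, "html.parser")
    # Same preprocessing as above: drop <font> wrappers, merge adjacent strings.
    for font in soup.select("font"):
        font.unwrap()
    soup.smooth()

    anchor = soup.find("a", {"name": anchor_name})
    if anchor is None:
        return ""

    # Climb from the anchor until some element has a break among its later siblings.
    start = None
    for node in [anchor, *anchor.parents]:
        if any(is_break(sib) for sib in node.next_siblings):
            start = node
            break
    if start is None:
        return ""

    parts = [node_text(start)]
    for sib in start.next_siblings:
        if is_break(sib):
            break
        parts.append(node_text(sib))
    # Skip empty pieces such as whitespace-only strings between tags.
    return "\n".join(p for p in parts if p)
```

Calling extract_section(html_text, "CEO_PAY_RATIO") on the sample document above should give the same two lines as the printed output. Climbing through the parents makes the start element explicit, at the cost of a little more code than the single selector.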
Answered By - Andrej Kesely