Sunday, January 7, 2024

[FIXED] Scraping, content retrieving multiple times

Tags: beautifulsoup, html, python, web-scraping

Issue

I hope everything is well. I am trying to extract the content under a section called CEO Pay Ratio. I gather links to this section and pass them to a function. The function should collect content starting at the given tag and stopping at the next thematic break (an <hr> tag):

def extract_content(soup, name):
    tag = soup.find('a', {'name': name})
    all_breaks = soup.select('hr[style*="width:100%"]')

    content = ""
    if tag:
        current_tag = tag.find_next()
        print(len(all_breaks))
        while current_tag and not (current_tag.name == 'hr' and current_tag in all_breaks):
            content += current_tag.get_text().strip() + " "
            current_tag = current_tag.find_next()

    return content.strip()

all_breaks is a list of all the thematic breaks in the HTML file.

[<hr style="page-break-after:always;width:100%;"/>, <hr style="page-break-after:always;width:100%;"/>, <hr style="page-break-after:always;width:100%;"/>, <hr style="page-break-after:always;width:100%;"/>, .....]
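A possible reason this version returns the same content several times, as the title describes: find_next() visits every tag in document order, including tags nested inside a tag that was already visited, and get_text() on a parent already includes the text of its descendants. The sketch below uses made-up, simplified markup (the START anchor and the paragraph text are assumptions, not the real filing) to show the effect, together with one alternative that collects each text node only once.

```python
from bs4 import BeautifulSoup, NavigableString, Tag

# Hypothetical, simplified markup used only to illustrate the effect.
html = (
    '<div><a name="START"></a>'
    "<p>Outer <b>inner</b></p>"
    '<hr style="width:100%"/>'
    "<p>after the break</p></div>"
)
soup = BeautifulSoup(html, "html.parser")
start = soup.find("a", {"name": "START"})

# Same traversal as extract_content(): visit every tag in document order
# and take the full text of each one.
duplicated = []
current = start.find_next()
while current and current.name != "hr":
    duplicated.append(current.get_text())
    current = current.find_next()

# Alternative: walk next_elements and keep only the text nodes,
# so each piece of text is collected exactly once.
once = []
for element in start.next_elements:
    if isinstance(element, Tag) and element.name == "hr":
        break
    if isinstance(element, NavigableString) and element.strip():
        once.append(element.strip())

print(duplicated)  # "inner" shows up in the <p> text and again for <b>
print(once)
```

In the first list the word inside <b> is collected twice, once as part of the <p> and once for the <b> itself. The second loop avoids that by never calling get_text() on a container.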

The HTML portion I am trying to extract:

<p style="margin-bottom:0pt;margin-top:0pt;text-indent:0%;font-weight:bold;color:#0061F0;font-size:22pt;font-family:Segoe UI;text-transform:uppercase;font-style:normal;font-variant: normal;"><a name="_AEIOULastRenderedPageBreakAEIOU209">
<a name="CEO_PAY_RATIO"></a><font style="font-weight:bold;color:#0061F0;font-size:22pt;
font-family:Segoe UI;text-transform:uppercase;font-style:normal;font-variant: normal;">
<a name="CEO_PAY_RATIO"></a>CEO PAY</font><font style="font-weight:bold;color:#0061F0;font-size:22pt;
font-family:Segoe UI;text-transform:uppercase;font-style:normal;font-variant: normal;"> RATIO</font></p>
<some other stuff>
<hr style="page-break-after:always;width:100%;"/>
<some other stuff we don't want>

The code I got from here:

from bs4 import BeautifulSoup

# html_text holds the HTML document
soup = BeautifulSoup(html_text, "html.parser")

# preprocess the document (remove <font> tags)
for font in soup.select("font"):
    font.unwrap()

soup.smooth()

# find first tag
tag = soup.find('a', {'name':"CEO_PAY_RATIO"})

text = []
while tag := tag.next_sibling:
    if tag.name == "hr" and "width:100%" in tag.attrs.get("style", ""):
        break
    t = tag.get_text(strip=True, separator=" ")
    if t:
        text.append(t)

text = "\n".join(text)
print(text)

This prints only CEO PAY RATIO and none of the content that follows. It does work, however, if the <a name="CEO_PAY_RATIO"> tag is not nested inside another tag.

Thank you all for your time.
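The behaviour described above is consistent with how next_sibling works: it only moves among nodes that share the same parent. Starting from an <a> nested inside a <p>, the loop never leaves that <p> and so never reaches the content or the <hr> that follow it. A minimal sketch on made-up markup (the START anchor and the text are assumptions for illustration):

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup: the anchor sits inside a <p>,
# while the body text and the <hr> are siblings of that <p>.
html = (
    '<div><p><a name="START"></a>Heading</p>'
    "<p>Body text</p>"
    '<hr style="width:100%"/></div>'
)
soup = BeautifulSoup(html, "html.parser")
anchor = soup.find("a", {"name": "START"})

# Siblings of the anchor: only the nodes inside the heading <p>.
inner = [str(node) for node in anchor.next_siblings]

# Siblings of the enclosing <p>: the body paragraph and the break.
outer = [node.name for node in anchor.parent.next_siblings]

print(inner)
print(outer)
```

Walking the siblings of the enclosing element, rather than of the anchor itself, is what lets a traversal reach the <hr>.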

Solution

Here is an example of how you can get this portion of the HTML document:

from bs4 import BeautifulSoup

html_text = """\
<p> I don't want this </p>

<p style="margin-bottom:0pt;margin-top:0pt;text-indent:0%;font-weight:bold;color:#0061F0;font-size:22pt;font-family:Segoe UI;text-transform:uppercase;font-style:normal;font-variant: normal;"><a name="_AEIOULastRenderedPageBreakAEIOU209">
<a name="CEO_PAY_RATIO"></a><font style="font-weight:bold;color:#0061F0;font-size:22pt;
font-family:Segoe UI;text-transform:uppercase;font-style:normal;font-variant: normal;">
<a name="CEO_PAY_RATIO"></a>CEO PAY</font><font style="font-weight:bold;color:#0061F0;font-size:22pt;
font-family:Segoe UI;text-transform:uppercase;font-style:normal;font-variant: normal;"> RATIO</font></p>

<p> I want this </p>

<hr style="page-break-after:always;width:100%;"/>

<p> I don't want this </p>
"""

soup = BeautifulSoup(html_text, "html.parser")

# preprocess the document (remove <font> tags)
for font in soup.select("font"):
    font.unwrap()

soup.smooth()

# find the element that contains the anchor and has an <hr> break among its following siblings
tag = soup.select_one('*:has(a[name="CEO_PAY_RATIO"]):has(~hr[style*="width:100"])')

# start with the heading text, then walk the following siblings until the break
text = [tag.get_text(strip=True, separator=" ")]
while tag := tag.next_sibling:
    if tag.name == "hr" and "width:100%" in tag.attrs.get("style", ""):
        break
    t = tag.get_text(strip=True, separator=" ")
    if t:
        text.append(t)

text = "\n".join(text)
print(text)

Prints:

CEO PAY RATIO
I want this
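
For reuse across several anchors, the same steps could be wrapped in a function. This is only a sketch: extract_section and the sample markup are hypothetical names introduced here, and the anchor name is assumed to be a plain identifier without quotes. Like the selector above, it picks the first element in document order that contains the anchor and has a full-width <hr> among its later siblings.

```python
from bs4 import BeautifulSoup, Tag


def extract_section(html_text, name):
    """Return the text from the element containing <a name=...> up to the next full-width <hr>.

    Hypothetical helper; returns an empty string when no matching element is found.
    """
    soup = BeautifulSoup(html_text, "html.parser")

    # Remove <font> wrappers and merge the adjacent text nodes they leave behind.
    for font in soup.select("font"):
        font.unwrap()
    soup.smooth()

    # Element that contains the anchor and has a later <hr> sibling.
    selector = f'*:has(a[name="{name}"]):has(~hr[style*="width:100"])'
    block = soup.select_one(selector)
    if block is None:
        return ""

    lines = [block.get_text(strip=True, separator=" ")]
    for sibling in block.next_siblings:
        if isinstance(sibling, Tag):
            if sibling.name == "hr" and "width:100%" in sibling.get("style", ""):
                break  # reached the thematic break
            text = sibling.get_text(strip=True, separator=" ")
        else:
            text = str(sibling).strip()  # bare text node between tags
        if text:
            lines.append(text)
    return "\n".join(lines)


# Made-up sample document for illustration.
sample = """\
<p>before</p>
<p><a name="CEO_PAY_RATIO"></a><font>CEO PAY</font><font> RATIO</font></p>
<p>I want this</p>
<hr style="page-break-after:always;width:100%;"/>
<p>after</p>
"""

result = extract_section(sample, "CEO_PAY_RATIO")
missing = extract_section(sample, "NOT_THERE")
print(result)
```

Returning an empty string for a missing anchor is a design choice of this sketch. Raising an exception instead may be preferable when a missing section should be treated as an error.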

Answered By - Andrej Kesely

This answer was collected from Stack Overflow and tested by PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
