Sunday, January 30, 2022

[FIXED] Unable to print once to get all the data altogether

Labels: beautifulsoup, python, python-3.x, web-scraping

Issue

I've written a script in Python to scrape the tabular content from a webpage. The first column of the main table holds names. Some names link to another page; others are plain text with no link. When a name has no link, I want to parse just that row. When a name does link to another page, the script should first parse that row from the main table and then follow the link to parse the information associated with that name from the table at the bottom of the page, under the title Companies. Finally, everything should be written to a CSV file.

site link

I've tried so far:

import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

link = "https://suite.endole.co.uk/insight/company/ajax_people.php?ajax_url=ajax_people&page=1&company_number=03512889"
base = "https://suite.endole.co.uk"

res = requests.get(link)
soup = BeautifulSoup(res.text,"lxml")
for item in soup.select("table tr")[1:]:
    if not item.select_one("td a[href]"):
        first_table = [i.text for i in item.select("td")]

        print(first_table)

    else:
        first_table = [i.text for i in item.select("td")]

        print(first_table)

        url = urljoin(base,item.select_one("td a[href]").get("href"))
        resp = requests.get(url)
        soup_ano = BeautifulSoup(resp.text,"lxml")
        for elems in soup_ano.select(".content:contains(Companies) table tr")[1:]:
            associated_info = [elem.text for elem in elems.select("td")]

            print(associated_info)

My script above does almost everything, but I can't work out the logic to print once, rather than three times, so that all the data comes out together and I can write it to a CSV file.

Solution

Put all your scraped data into a single list (here I've called it associated_info). Then all the data is in one place, and you can iterate over the list to write it out to a CSV file if you like:

import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

link = "https://suite.endole.co.uk/insight/company/ajax_people.php?ajax_url=ajax_people&page=1&company_number=03512889"
base = "https://suite.endole.co.uk"

res = requests.get(link)
soup = BeautifulSoup(res.text,"lxml")
associated_info = []

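for item in soup.select("table tr")[1:]:  # skip the header row
    # Every row of the main table is stored, whether or not its name is a link.
    associated_info.append([i.text for i in item.select("td")])

    name_link = item.select_one("td a[href]")
    if not name_link:
        continue

    # The name links to a detail page: fetch it and collect its Companies table.
    url = urljoin(base, name_link.get("href"))
    resp = requests.get(url)
    soup_ano = BeautifulSoup(resp.text, "lxml")

    # Note: recent soupsieve releases deprecate :contains in favour of :-soup-contains.
    for elems in soup_ano.select(".content:contains(Companies) table tr")[1:]:
        associated_info.append([elem.text for elem in elems.select("td")])

# All rows are now in one list, so a single print (or CSV write) covers everything.
print(associated_info)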
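
The answer stops at printing the list. As a minimal sketch of the CSV step it mentions, the rows could be passed to Python's standard csv module. The helper name rows_to_csv and the sample rows are assumptions made for illustration; they are not part of the original answer, and the sample data merely stands in for the scraped associated_info list.

```python
import csv
import io


def rows_to_csv(rows, handle):
    """Write each scraped row (a list of cell strings) as one CSV line."""
    writer = csv.writer(handle)
    for row in rows:
        # .text often carries stray whitespace over from the HTML, so trim it.
        writer.writerow([cell.strip() for cell in row])


# Hypothetical sample standing in for the scraped associated_info list.
# Rows may differ in length (main-table rows vs. Companies rows);
# csv.writer accepts that and writes each row as-is.
sample_rows = [
    ["Jane Doe", "Director", "Active"],
    ["  Example Ltd ", "01234567"],
]

buffer = io.StringIO()
rows_to_csv(sample_rows, buffer)
print(buffer.getvalue())
```

To write a real file instead of an in-memory buffer, open it with newline="" as the csv documentation recommends, for example: with open("output.csv", "w", newline="", encoding="utf-8") as f: rows_to_csv(associated_info, f). The file name output.csv is likewise only a placeholder.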

Answered By - DrBwts

This answer was collected from Stack Overflow and tested by PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
