Sunday, January 9, 2022

[FIXED] How do I scrape data from URLs in a python-scraped list of URLs?


Issue

I'm trying to use BeautifulSoup4 in Orange to scrape data from a list of URLs scraped from that same website.

I have managed to scrape the data from a single page when I set the URL manually:

import requests
from bs4 import BeautifulSoup

url = "https://data.ushja.org/awards-standings/zone-points.aspx?year=2021&zone=1&section=1901"
req = requests.get(url)
soup = BeautifulSoup(req.text, "html.parser")

rank = soup.find("table", class_="table-standings-body")
for child in rank.children:
    print(url,child)

I have also been able to scrape the list of URLs I need:

import requests
from bs4 import BeautifulSoup

url = "https://data.ushja.org/awards-standings/zones.aspx?year=2021&zone=1"
req = requests.get(url)
soup = BeautifulSoup(req.text, "html.parser")

# the section links sit inside the div with class "contentSection"
link = soup.find("div", class_="contentSection")

# print the href of every link in that div
for a in link.find_all("a"):
    print(a.get("href"))

But so far I haven't been able to combine the two to scrape the data from every URL in that list. Can I do that only by nesting for loops, and if so, how? If not, how else can I do it?

I am sorry if this is a stupid question, but I only started with Python and web scraping yesterday, and I have not been able to figure this out from similar topics.

Solution

Try:

import requests
import pandas as pd
from bs4 import BeautifulSoup

url = "https://data.ushja.org/awards-standings/zones.aspx?year=2021&zone=1"
req = requests.get(url)
soup = BeautifulSoup(req.text, "html.parser")

# get all links
url_list = []
for a in soup.find("div", class_="contentSection").find_all("a"):
    # "&sect" in "&section=..." comes back from the parser as "§"; restore it
    url_list.append(a["href"].replace("§", "&sect"))

# get all data from URLs
all_data = []
for url in url_list:
    print(url)

    req = requests.get(url)
    soup = BeautifulSoup(req.text, "html.parser")

    h2 = soup.h2
    sub = h2.find_next("p")

    for tr in soup.select("tr:has(td)"):
        all_data.append(
            [
                h2.get_text(strip=True),
                sub.get_text(strip=True),
                *[td.get_text(strip=True) for td in tr.select("td")],
            ]
        )

# save data to CSV
df = pd.DataFrame(
    all_data,
    columns=[
        "title",
        "sub_title",
        "Rank",
        "Horse / Owner",
        "Points",
        "Total Comps",
    ],
)
print(df)
df.to_csv("data.csv", index=False)

This traverses all URLs and saves all data to data.csv.
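
The `.replace("§", "&sect")` call in the link loop can look odd at first. A minimal sketch of what it appears to undo, using only the standard library: `html.unescape` treats `&sect` (even without a trailing semicolon) as the section sign, so `&section=1901` becomes `§ion=1901`. That `html.parser` applies comparable decoding to `href` values is an assumption here, and it may vary between Python versions. The helper `normalize_link` and the relative `raw` link are hypothetical, not part of the original answer; `urljoin` is included only in case the page ever returns relative links.

```python
from html import unescape
from urllib.parse import urljoin


# Hypothetical helper, not part of the original answer.
def normalize_link(base_url: str, href: str) -> str:
    """Return an absolute URL with a mis-decoded '&sect' restored."""
    # "§ion=" is what is left of "&section=" once "&sect" has been
    # decoded as the section sign, so put the ampersand form back.
    fixed = href.replace("§", "&sect")
    # urljoin leaves absolute URLs unchanged and resolves relative ones.
    return urljoin(base_url, fixed)


base = "https://data.ushja.org/awards-standings/zones.aspx?year=2021&zone=1"
# Assumed example of a raw href as it might appear in the page source.
raw = "zone-points.aspx?year=2021&zone=1&section=1901"

# Mimic HTML entity decoding of the attribute value.
decoded = unescape(raw)
print(decoded)
print(normalize_link(base, decoded))
```

Note that the replacement would also rewrite a URL that legitimately contains "§", so it is only reasonable for links like the ones on this site.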

Answered By - Andrej Kesely

This answer was collected from Stack Overflow and tested by PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
