Issue
Similar to this thread and task: scrape Wikipedia text with BS4 (pairing each heading with its associated paragraphs) and output it in CSV format.
I have a question: how do I iterate over a set of 700 URLs to get the data of 700 digital hubs into CSV (or Excel) format?
See the page where we have the datasets:
https://s3platform.jrc.ec.europa.eu/digital-innovation-hubs-tool
with a list of URLs like these:
https://s3platform-legacy.jrc.ec.europa.eu/digital-innovation-hubs-tool/-/dih/3480/view
https://s3platform-legacy.jrc.ec.europa.eu/digital-innovation-hubs-tool/-/dih/13281/view
https://s3platform-legacy.jrc.ec.europa.eu/digital-innovation-hubs-tool/-/dih/1417/view
https://s3platform-legacy.jrc.ec.europa.eu/digital-innovation-hubs-tool/-/dih/1349/view
and so on.
Question: can we apply the same technique to this similar task? I have applied a scraper to a single page and it works, but how do I add CSV output to a scraper that iterates over the URLs?
I want to pair the scraped paragraphs with the most recently scraped heading from the hub cards. I am currently scraping the hub cards as single pages to work out the method, but I would like to scrape all 700 cards with their headings so I can see the data together and write the results to an appropriate format, which may be a CSV file.
Note: we have the following headings on each HubCard:
Title: (probably an h4 tag)
Contact:
Description:
'Organization',
'Evolutionary Stage',
'Geographical Scope',
'Funding',
'Partners',
'Technologies'
What I have for a single page is this:
from bs4 import BeautifulSoup
import requests

page_link = 'https://s3platform-legacy.jrc.ec.europa.eu/digital-innovation-hubs-tool/-/dih/3480/view'
page_response = requests.get(page_link, verify=False, timeout=5)
page_content = BeautifulSoup(page_response.content, "html.parser")

textContent = []
for tag in page_content.find_all('h4')[1:]:
    texth4 = tag.text.strip()
    textContent.append(texth4)
    for item in tag.find_next_siblings('p'):
        if texth4 in item.find_previous_siblings('h4')[0].text.strip():
            textContent.append(item.text.strip())
print(textContent)
Output in the console:
Description', 'Link to national or regional initiatives for digitising industry', 'Market and Services', 'Service Examples', 'Leveraging the holding system "EndoTAIX" from scientific development to ready-to -market', 'For one of SurgiTAIX AG\'s products, the holding system "EndoTAIX" for surgical instrument fixation, the SurgiTAIX AG cooperated very closely with the RWTH University\'s Helmholtz institute. The services provided comprised the complete first phase of scientific development. Besides, after the first concepts of the holding system took shape, a prototype was successfully build in the scope of a feasibility study. In the role regarding the self-conception as a transfer service provider offering services itself, the SurgiTAIX AG refined the technology to market level and successfully performed all the steps necessary within the process to the approval and certification of the product. Afterwards, the product was delivered to another vendor with SurgiTAIX AG carrying out the production process as an OEM.', 'Development of a self-adapting robotic rehabilitation system', 'Based on the expertise of different partners of the hub, DIERS International GmbH (SME) was enabled to develop a self-adapting robotic rehabilitation system that allows patients after stroke to relearn motion patterns autonomously. The particular challenge of this cooperation was to adjust the robot to the individual and actual needs of the patient at any particular time of the exercise. Therefore, different sensors have been utilized to detect the actual movement performance of the patient. Feature extraction algorithms have been developed to identify the actual needs of the individual patient and intelligent predicting control algorithms enable the robot to independently adapt the movement task to the needs of the patient. 
These challenges could be solved only by the services provided by different partners of the hub which include the transfer of the newly developed technologies, access to patient data, acquisition of knowledge and demands from healthcare personal and coordinating the application for public funding.', 'Establishment of a robotic couch lab and test facility for radiotherapy', 'With the help of services provided by different partners of the hub, the robotic integrator SME BEC GmbH was given the opportunity to enhance their robotic patient positioning device "ExaMove" to allow for compensation of lung tumor movements during free breathing. The provided services solved the need to establish a test facility within the intended environment (the radiotherapy department) and provided the transfer of necessary innovative technologies such as new sensors and intelligent automatic control algorithms. Furthermore, the provided services included the coordination of the consortium, identifying, preparing and coordinating the application for public funding, provision of access to the hospital’s infrastructure and the acquisition of knowledge and demands from healthcare personal.', 'Organization', 'Evolutionary Stage', 'Geographical Scope', 'Funding', 'Partners', 'Technologies']
So far, so good. What is needed now is a clean solution: how do I iterate over the set of 700 URLs (in other words, the 700 hub cards) to get the data of all 700 digital hubs into CSV (or Excel) format?
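The heading-to-paragraph pairing the single-page script aims for can also be sketched with only the standard library. This is a simplified, self-contained stand-in for the BeautifulSoup sibling walk, run on a small hypothetical snippet of HTML rather than a real hub card:

```python
from html.parser import HTMLParser

# Minimal sketch, stdlib only: pair each <h4> heading with the <p>
# blocks that follow it. The HTML below is hypothetical; the real
# pages are parsed with BeautifulSoup as shown above.
class HeadingPairer(HTMLParser):
    def __init__(self):
        super().__init__()
        self.pairs = {}      # heading text -> list of paragraph texts
        self.current = None  # most recently seen heading
        self._in = None      # tag we are currently inside ('h4' or 'p')

    def handle_starttag(self, tag, attrs):
        if tag in ("h4", "p"):
            self._in = tag

    def handle_endtag(self, tag):
        self._in = None

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self._in == "h4":
            self.current = text
            self.pairs[text] = []
        elif self._in == "p" and self.current is not None:
            self.pairs[self.current].append(text)

html = "<h4>Description</h4><p>A hub.</p><p>More.</p><h4>Funding</h4><p>EU</p>"
parser = HeadingPairer()
parser.feed(html)
print(parser.pairs)
# {'Description': ['A hub.', 'More.'], 'Funding': ['EU']}
```

Each paragraph is attached to the most recent heading seen before it, which is exactly the grouping the question asks for.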
Solution
You can select the tags with class="hubCardTitle" along with the element immediately following each one, and pair them up with zip():
import requests
import pandas as pd
from bs4 import BeautifulSoup

urls = [
    "https://s3platform-legacy.jrc.ec.europa.eu/digital-innovation-hubs-tool/-/dih/3480/view",
    "https://s3platform-legacy.jrc.ec.europa.eu/digital-innovation-hubs-tool/-/dih/13281/view",
    "https://s3platform-legacy.jrc.ec.europa.eu/digital-innovation-hubs-tool/-/dih/1417/view",
    "https://s3platform-legacy.jrc.ec.europa.eu/digital-innovation-hubs-tool/-/dih/1349/view",
]

out = []
for url in urls:
    print(f"Getting {url}")
    soup = BeautifulSoup(requests.get(url).content, "html.parser")

    d = {"URL": url, "Title": soup.h2.text}

    titles = soup.select("div.hubCardTitle")
    content = soup.select("div.hubCardTitle + div")
    for t, c in zip(titles, content):
        t = t.get_text(strip=True)
        c = c.get_text(strip=True, separator="\n")
        d[t] = c

    out.append(d)

df = pd.DataFrame(out)
df.to_csv("data.csv", index=False)
Creates data.csv (screenshot from LibreOffice omitted).
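If pandas is not available, the same list of dicts can be written with the standard-library csv module. csv.DictWriter needs the full set of column names up front, so collect the union of keys first. This is a minimal sketch; the rows below are hypothetical stand-ins for the scraped data:

```python
import csv

# Hypothetical rows standing in for the scraped hub data; keys can
# differ per hub, just as the hub cards can differ per page.
out = [
    {"URL": "https://s3platform-legacy.jrc.ec.europa.eu/digital-innovation-hubs-tool/-/dih/3480/view",
     "Title": "Hub A", "Funding": "EU"},
    {"URL": "https://s3platform-legacy.jrc.ec.europa.eu/digital-innovation-hubs-tool/-/dih/1417/view",
     "Title": "Hub B", "Partners": "RWTH"},
]

# Union of all keys, preserving first-seen order, so every column
# appears in the header even if some hubs lack it.
fieldnames = []
for row in out:
    for key in row:
        if key not in fieldnames:
            fieldnames.append(key)

with open("data.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=fieldnames, restval="")
    writer.writeheader()
    writer.writerows(out)
```

restval="" fills the cells for columns a given hub does not have, which matches how pandas leaves missing values blank in the CSV.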
Answered By - Andrej Kesely