Issue
I would like to scrape a table from the URLs below. The scraping works but the problem I have is that it only shows the information from the first URL. How can I fix my code so that it adds the information of the second URL as well? I hope my question is clear.
import pandas as pd
import requests
from bs4 import BeautifulSoup

urls = ['https://www.funda.nl/en/koop/ridderkerk/huis-42649106-natalstraat-15/', 'https://www.funda.nl/en/en/koop/rotterdam/huis-42648673-courzandseweg-67/']
#df = pd.DataFrame()
dl = []  # Storage for data
dt = []  # Storage for column names
for url in urls:
    headers = {"user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.51 Safari/537.36",}
    soup = BeautifulSoup(requests.get(url, headers=headers).text, "html.parser")
    dl_data = soup.find_all("dd")  # Scraping the data
    for dlitem in dl_data:
        dl.append(dlitem.text.strip())
    dt_data = soup.find_all("dt")  # Scraping the column names
    for dtitem in dt_data:
        dt.append(dtitem.text.strip())
df = pd.DataFrame(dl)  # Creating the dataframe
df = df.T  # Transposing it because otherwise it is 1D
df.columns = dt  # Giving the column names to the dataframe
Solution
Avoid the multiple lists and choose a leaner approach: store each page's data in a more structured form, e.g. a dict.
The dict comprehension below selects every <dd> that directly follows a <dt>, uses the <dt> text as the key and the <dd> text as the value, and appends the resulting dict to data. Each URL then contributes one dict, so you can simply create a DataFrame from this list of dicts and get one row per URL:
data.append({e.find_previous_sibling('dt').text.strip(): e.text.strip() for e in soup.select('dt + dd')})
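To see what that comprehension produces without hitting the site, here is a minimal offline sketch. The HTML fragment and the helper name parse_definition_list are made up for illustration and are not taken from funda.nl; only the selector and the find_previous_sibling call come from the answer.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for one listing page (not real funda.nl markup).
SAMPLE_HTML = """
<dl>
  <dt>Asking price</dt><dd> EUR 350,000 </dd>
  <dt>Living area</dt><dd>120 m2</dd>
  <dd>orphan value</dd>
</dl>
"""


def parse_definition_list(html):
    """Return {dt text: dd text} for every <dd> directly preceded by a <dt>."""
    soup = BeautifulSoup(html, "html.parser")
    return {
        dd.find_previous_sibling("dt").text.strip(): dd.text.strip()
        for dd in soup.select("dt + dd")  # adjacent sibling combinator
    }


record = parse_definition_list(SAMPLE_HTML)
print(record)
```

The last <dd> in the sample has no <dt> immediately before it, so 'dt + dd' skips it. With find_all("dd") and find_all("dt") collected separately, as in the question, such a value would shift every later value under the wrong column name.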
Example
import pandas as pd
import requests
from bs4 import BeautifulSoup

urls = ['https://www.funda.nl/en/koop/ridderkerk/huis-42649106-natalstraat-15/', 'https://www.funda.nl/en/en/koop/rotterdam/huis-42648673-courzandseweg-67/']
data = []  # One dict per URL

for url in urls:
    headers = {"user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.51 Safari/537.36",}
    soup = BeautifulSoup(requests.get(url, headers=headers).text, "html.parser")
    # Map each <dt> label to the <dd> value that directly follows it
    data.append({e.find_previous_sibling('dt').text.strip(): e.text.strip() for e in soup.select('dt + dd')})

pd.DataFrame(data)
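The two listings will not necessarily expose the same fields. As a rough sketch of how pd.DataFrame handles that, the records below are invented values (not scraped data), and the "url" key is an assumption added here so each row can be traced back to its page:

```python
import pandas as pd

# Invented records standing in for the dicts collected in the loop above.
records = [
    {"url": "listing-1", "Asking price": "EUR 350,000", "Living area": "120 m2"},
    {"url": "listing-2", "Asking price": "EUR 425,000", "Number of rooms": "5"},
]

# One row per dict; the columns are the union of all keys,
# and a key missing from a dict becomes NaN in that row.
df = pd.DataFrame(records)
print(df)
```

In the scraping loop, the same effect could be had by adding the URL to each dict before appending it, for example by building the dict first and then setting its "url" key.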
Answered By - HedgeHog