Issue
This was part of another question (Reading URLs from .csv and appending scrape results below previous with Python, BeautifulSoup, Pandas) which was generously answered by @HedgeHog and contributed to by @QHarr. I'm now posting this part as a separate question.
In the code below, I've pasted 3 example source URLs into the code, and it works. But I have a long list of URLs (1000+) to scrape, and they are stored in the first column of a .csv file (let's call it 'urls.csv'). I would prefer to read directly from that file.
I think I know the basic structure of 'with open' (e.g. the way @bguest answered it below), but I'm having trouble linking that to the rest of the code so that the rest continues to work. How can I replace the list of URLs with iterative reading of the .csv, so that the URLs are passed correctly into the code?
import requests
from bs4 import BeautifulSoup
import pandas as pd

urls = ['https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Induction-Hobs-30196623/',
        'https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Human-Capital-Management-30196628/',
        'https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Probe-Card-30196643/']

data = []

for url in urls:
    page = requests.get(url)
    soup = BeautifulSoup(page.text, 'html.parser')
    toc = soup.find("div", id="toc")

    def get_drivers():
        data.append({
            'url': url,
            'type': 'driver',
            'list': [x.get_text(strip=True) for x in toc.select('li:-soup-contains-own("Market drivers") li')]
        })

    get_drivers()

    def get_challenges():
        data.append({
            'url': url,
            'type': 'challenges',
            'list': [x.get_text(strip=True) for x in toc.select('li:-soup-contains-own("Market challenges") ul li')
                     if 'Table Impact of drivers and challenges' not in x.get_text(strip=True)]
        })

    get_challenges()

pd.concat([pd.DataFrame(data)[['url', 'type']], pd.DataFrame(pd.DataFrame(data).list.tolist())],
          axis=1).to_csv('output.csv')
Solution
Since you're already using pandas, read_csv will do the trick for you: https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html
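For example, a minimal sketch of the pandas route, assuming urls.csv has a header row named url (the demo contents below stand in for your real file):

```python
import io

import pandas as pd

# Demo contents standing in for your real urls.csv (header row "url").
csv_text = "url\nhttps://www.example.com/a\nhttps://www.example.com/b\n"

# In your script, read the file directly instead:
#   urls = pd.read_csv('urls.csv')['url'].tolist()
urls = pd.read_csv(io.StringIO(csv_text))['url'].tolist()
print(urls)  # a plain list of URL strings
```

The resulting list plugs straight into the existing `for url in urls:` loop.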
If you want to write it on your own, you could use the built-in csv library:
import csv

with open('urls.csv', newline='') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        print(row["url"])
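Note that DictReader expects a header row. If your file is headerless (just one URL per line in the first column), csv.reader works the same way; a sketch, with demo contents standing in for the real file:

```python
import csv
import io

# Demo contents for a headerless urls.csv: one URL per row, first column.
csv_text = "https://www.example.com/a\nhttps://www.example.com/b\n"

# In your script: use open('urls.csv', newline='') instead of StringIO.
# Take the first column of each non-empty row.
urls = [row[0] for row in csv.reader(io.StringIO(csv_text)) if row]
print(urls)
```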
Edit: I was asked how to make the rest of the code use the URLs from the csv.
First, put the URLs in a urls.csv file:
url
https://www.google.com
https://www.facebook.com
Now gather the URLs from the csv:
import csv

with open('urls.csv', newline='') as csvfile:
    reader = csv.DictReader(csvfile)
    urls = [row["url"] for row in reader]

# remove the following lines
urls = ['https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Induction-Hobs-30196623/',
        'https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Human-Capital-Management-30196628/',
        'https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Probe-Card-30196643/']
Now the URLs from the csv will be used by the rest of the code, and the `for url in urls:` loop runs unchanged.
Answered By - bguest