Issue
I have a list of items that I scraped from GitHub, stored in df_actionname['ActionName']. Each ['ActionName'] can be converted into a ['Weblink'] column holding a website URL. I am trying to loop through each weblink and scrape data from it.
My code:
# Code to create input data
import pandas as pd
import requests

actionnameListFinal = ['TruffleHog OSS', 'Metrics embed', 'Super-Linter']
# Create dataframe
df_actionname = pd.DataFrame(actionnameListFinal, columns=['ActionName'])
# Create new column for parsed action names
df_actionname['Parsed'] = df_actionname['ActionName'].str.replace(r'[^A-Za-z0-9]+', '-', regex=True)
df_actionname['Weblink'] = 'https://github.com/marketplace/actions/' + df_actionname['Parsed']

for website in df_actionname['Weblink']:
    URL = df_actionname['Weblink']
    detailpage = requests.get(URL)
My code is failing at "detailpage = requests.get(URL)". The error message I am getting is:
in get_adapter raise InvalidSchema(f"No connection adapters were found for {url!r}") requests.exceptions.InvalidSchema: No connection adapters were found for '0 https://github.com/marketplace/actions/Truffle...\n1 https://github.com/marketplace/actions/Metrics...\n2 https://github.com/marketplace/actions/Super-L...\n3 https://github.com/marketplace/actions/Swift-Doc\nName: Weblink, dtype: object'
Solution
You need to pass a single valid URL. Changing your for loop to
import requests
from bs4 import BeautifulSoup

for website in df_actionname['Weblink']:
    detailpage = requests.get(website)
    pageSoup = BeautifulSoup(detailpage.content, 'html.parser')
    print(f'scraped "{pageSoup.title.text}" from {website}')
gives me the output
scraped "TruffleHog OSS · Actions · GitHub Marketplace · GitHub" from https://github.com/marketplace/actions/TruffleHog-OSS
scraped "Metrics embed · Actions · GitHub Marketplace · GitHub" from https://github.com/marketplace/actions/Metrics-embed
scraped "Super-Linter · Actions · GitHub Marketplace · GitHub" from https://github.com/marketplace/actions/Super-Linter
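As a side note, the slug step from the question can be verified in isolation. Here is a minimal sketch using plain re.sub with the same pattern the question passes to str.replace:

```python
import re

# Same character class as the question's str.replace call: any run of
# non-alphanumeric characters collapses into a single hyphen.
names = ['TruffleHog OSS', 'Metrics embed', 'Super-Linter']
slugs = [re.sub(r'[^A-Za-z0-9]+', '-', name) for name in names]
print(slugs)  # ['TruffleHog-OSS', 'Metrics-embed', 'Super-Linter']
```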
The way you were doing it, your code was not only sending the same GET request on every loop iteration (since URL did not depend on website at all), but the input to requests.get was also not a single URL, as you can see if you add a print before the request:
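A minimal sketch of that check, using a two-row stand-in DataFrame (the URLs here are placeholders, not the question's real links):

```python
import pandas as pd

df = pd.DataFrame({'Weblink': ['https://example.com/a', 'https://example.com/b']})

# The original loop body ignored the loop variable and re-read the whole
# column, so requests.get received a pandas Series rather than a string:
URL = df['Weblink']
print(type(URL))  # <class 'pandas.core.series.Series'>

# Iterating over the column yields one plain string per row, which is
# what requests.get expects:
for website in df['Weblink']:
    print(type(website), website)
```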
Answered By - PerpetuallyConfused