Issue
Getting the code to this point has been a step-by-step process. The goal was to visit a list of URLs and scrape specific data, and that has been accomplished with the script below:
import requests
from bs4 import BeautifulSoup as bs
import json

headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.5112.79 Safari/537.36'
}

urls = ['https://www.nba.com/game/bkn-vs-phi-0022100993',
        'https://www.nba.com/game/was-vs-lac-0022100992']

for url in urls:
    r = requests.get(url, headers=headers)
    soup = bs(r.text, 'html.parser')
    page_obj = soup.select_one('script#__NEXT_DATA__')
    json_obj = json.loads(page_obj.text)
    print('Title:', json_obj['props']['pageProps']['story']['header']['headline'])
    print('Date:', json_obj['props']['pageProps']['story']['date'])
    print('Content:', json_obj['props']['pageProps']['story']['content'])
I had an idea I hoped to implement, and I feel I'm very close, but I'm not sure why it isn't running. Basically, rather than keeping a static list of URLs in the script, I wanted to use a Google Sheet as the source of URLs: a column on the first tab holds the list of URLs that need to be scraped.
From there, when run, the script will pull the URLs from the first tab, scrape the data, and push the results to the second tab.
I've been able to print the URLs in the terminal, basically by getting to the source and requesting all records.
I thought I'd then be able to loop through those links in the same way (new code):
from unittest import skip
import requests
from bs4 import BeautifulSoup as bs
import json
import gspread

gc = gspread.service_account(filename='creds.json')
sh = gc.open_by_key('1NFrhsJT7T0zm3dRaP5J8OY0FryBHy5W_wEEGvwBg58I')
worksheet = sh.sheet1
freshurls = gc.open("NBA Stories").get_worksheet(1)

headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.5112.79 Safari/537.36'
}

urls = freshurls.get_all_records()

for url in urls:
    try:
        r = requests.get(url, headers=headers)
        soup = bs(r.text, 'html.parser')
        page_obj = soup.select_one('script#__NEXT_DATA__')
        json_obj = json.loads(page_obj.text)
        title = json_obj['props']['pageProps']['story']['header']['headline']
        date = json_obj['props']['pageProps']['story']['date']
        content = str(json_obj['props']['pageProps']['story']['content'])
        AddData = [url, title, date, content]
        worksheet.append_row(AddData)
    except:
        skip
Even if I switch the ending actions (building AddData and appending rows) to just print the results, I'm not seeing anything.
It seems like I'm missing a step. Is there something I could do differently here to use those URLs right from the sheet, instead of having to paste them into the script every time?
Solution
SUGGESTION
You can try using the batch_get method in a separate script file to get the URL data from a sheet tab, and then import that URL data into your scraping script and loop through it there. This reduces complexity and keeps the script readable. For more context, see the sample script below.
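As a side note, this also explains why the original loop appears to do nothing: get_all_records() returns one dictionary per row, keyed by the header row, rather than plain URL strings, so requests.get() fails on every iteration and the bare except hides the error. A minimal sketch of the difference (the spreadsheet key is a placeholder, and it is assumed the URLs sit in column A under a header row):
import gspread

gc = gspread.service_account(filename='creds.json')
sh = gc.open_by_key('1XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX')
urlsheet = sh.worksheet("Sheet1")

# One dict per row, keyed by the header row, e.g.
# [{'URL': 'https://www.nba.com/game/bkn-vs-phi-0022100993'}, ...]
print(urlsheet.get_all_records())

# Plain cell values as a list of rows, e.g.
# [['https://www.nba.com/game/bkn-vs-phi-0022100993'], ...]
print(urlsheet.batch_get(('A2:A',))[0])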
In my understanding, here is your goal:
- Put a list of URLs on a specific sheet tab in a spreadsheet file.
- Get the URL data from that sheet tab in Python.
- Loop through it in your Python script and scrape the data for each URL.
- Append each scraped result to a second sheet tab.
Sample Script
The getURLsFromSpreadsheet.py file:
import gspread
gc = gspread.service_account(filename='creds.json')
# Open a spreadsheet by ID
sh = gc.open_by_key('1XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX')
# Get the sheets
wk = sh.worksheet("Sheet1")
apprendWk = sh.worksheet("Sheet2")
# e.g. the URLs are listed on Sheet1 in column A
urls = wk.batch_get(('A2:A',))[0]
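A quick note on the shape of urls here: batch_get returns one entry per requested range, and each entry is a list of rows, where each row is itself a list of cell values. Using the two URLs from the question as example cell values, the result looks roughly like this:
# urls == [['https://www.nba.com/game/bkn-vs-phi-0022100993'],
#          ['https://www.nba.com/game/was-vs-lac-0022100992']]
Each row is a one-element list, which is why the scraping loop below indexes url[0]. Importing this module with from getURLsFromSpreadsheet import * runs it once and makes urls and apprendWk available to the scraping script.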
The scrapeScript.py file:
from getURLsFromSpreadsheet import *
import requests
from bs4 import BeautifulSoup as bs
import json

headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.5112.79 Safari/537.36'
}

for url in urls:
    r = requests.get(url[0], headers=headers)
    soup = bs(r.text, 'html.parser')
    page_obj = soup.select_one('script#__NEXT_DATA__')
    json_obj = json.loads(page_obj.text)
    samplelist = [[str(json_obj['props']['pageProps']['story']['header']['headline']),
                   str(json_obj['props']['pageProps']['story']['date']),
                   str(json_obj['props']['pageProps']['story']['content'])[2:-1]
                   ]]
    apprendWk.append_rows(samplelist)
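If you also want the source URL written out with each row, matching the AddData list from the question, the loop could build the row like this (a small sketch along the same lines, under the same assumptions as the sample script):
for url in urls:
    r = requests.get(url[0], headers=headers)
    soup = bs(r.text, 'html.parser')
    json_obj = json.loads(soup.select_one('script#__NEXT_DATA__').text)
    story = json_obj['props']['pageProps']['story']
    # URL first, then headline, date, and content
    apprendWk.append_rows([[url[0],
                            str(story['header']['headline']),
                            str(story['date']),
                            str(story['content'])[2:-1]]])
Running python scrapeScript.py then pulls the URLs from Sheet1 and appends one row per story to Sheet2.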
Demonstration
Sample spreadsheet file, with the URLs listed in Column A.
The Sheet 2 tab after running the scrapeScript.py file.
Answered By - Irvin Jay G.