Issue
Getting the code to this point has been a step-by-step process. The goal was to visit a list of URLs and scrape specific data, and that has been accomplished with the script below:
import requests
from bs4 import BeautifulSoup as bs
import json

headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.5112.79 Safari/537.36'
}

urls = ['https://www.nba.com/game/bkn-vs-phi-0022100993',
        'https://www.nba.com/game/was-vs-lac-0022100992']

for url in urls:
    r = requests.get(url, headers=headers)
    soup = bs(r.text, 'html.parser')
    page_obj = soup.select_one('script#__NEXT_DATA__')
    json_obj = json.loads(page_obj.text)
    print('Title:', json_obj['props']['pageProps']['story']['header']['headline'])
    print('Date:', json_obj['props']['pageProps']['story']['date'])
    print('Content:', json_obj['props']['pageProps']['story']['content'])
I had an idea I hoped to implement, and I feel I'm very close, but I'm not sure why it isn't running. Basically, rather than keeping a static list of URLs in the script, I wanted to use a Google Sheet as the source of URLs: a column on the first tab holds the list of URLs that need to be scraped.
From there, when run, the script will pull the URLs from the first tab, scrape the data, and push the results to the second tab.
I've been able to print the URLs in the terminal, basically by getting to the source and requesting all records.
I thought I'd then be able to loop through those links in the same way (new code):
from unittest import skip
import requests
from bs4 import BeautifulSoup as bs
import json
import gspread

gc = gspread.service_account(filename='creds.json')
sh = gc.open_by_key('1NFrhsJT7T0zm3dRaP5J8OY0FryBHy5W_wEEGvwBg58I')
worksheet = sh.sheet1
freshurls = gc.open("NBA Stories").get_worksheet(1)

headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.5112.79 Safari/537.36'
}

urls = freshurls.get_all_records()

for url in urls:
    try:
        r = requests.get(url, headers=headers)
        soup = bs(r.text, 'html.parser')
        page_obj = soup.select_one('script#__NEXT_DATA__')
        json_obj = json.loads(page_obj.text)
        title = json_obj['props']['pageProps']['story']['header']['headline']
        date = json_obj['props']['pageProps']['story']['date']
        content = str(json_obj['props']['pageProps']['story']['content'])
        AddData = [url, title, date, content]
        worksheet.append_row(AddData)
    except:
        skip
Even if I switch the ending actions (building AddData and appending rows) to just print the results, I'm not seeing anything.
It seems like I'm missing a step. Is there something I could do differently here to use those URLs right from the sheet, instead of having to paste them into the script every time?
Solution
SUGGESTION
You can try using the batch_get method in a separate script file to get the URL data from a sheet tab, and then import that URL data into your scraping script and loop through it there. This reduces complexity and keeps the script readable. For more context, see the sample script below.
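As a side note, this also explains why the original loop appears to do nothing: get_all_records() returns one dictionary per row, keyed by the header row, rather than plain URL strings, so requests.get() fails on every iteration and the bare except hides the error. A minimal sketch of the difference (the spreadsheet key is a placeholder, and it is assumed the URLs sit in column A under a header row):
import gspread

gc = gspread.service_account(filename='creds.json')
sh = gc.open_by_key('1XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX')
urlsheet = sh.worksheet("Sheet1")

# One dict per row, keyed by the header row, e.g.
# [{'URL': 'https://www.nba.com/game/bkn-vs-phi-0022100993'}, ...]
print(urlsheet.get_all_records())

# Plain cell values as a list of rows, e.g.
# [['https://www.nba.com/game/bkn-vs-phi-0022100993'], ...]
print(urlsheet.batch_get(('A2:A',))[0])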
In my understanding, here is your goal:
- Put a list of URLs on a specific sheet tab in a spreadsheet file.
- Get the URL data from that sheet tab in Python.
- Loop through it in your Python script and scrape the data for each URL.
- Append each scraped result to a second sheet tab.
Sample Script
The getURLsFromSpreadsheet.py file:
import gspread
gc = gspread.service_account(filename='creds.json')
# Open a spreadsheet by ID
sh = gc.open_by_key('1XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX')
# Get the sheets
wk = sh.worksheet("Sheet1")
apprendWk = sh.worksheet("Sheet2")
# e.g. the URLs are listed on Sheet1 in column A
urls = wk.batch_get(('A2:A',))[0]
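A quick note on the shape of urls here: batch_get returns one entry per requested range, and each entry is a list of rows, where each row is itself a list of cell values. Using the two URLs from the question as example cell values, the result looks roughly like this:
# urls == [['https://www.nba.com/game/bkn-vs-phi-0022100993'],
#          ['https://www.nba.com/game/was-vs-lac-0022100992']]
Each row is a one-element list, which is why the scraping loop below indexes url[0]. Importing this module with from getURLsFromSpreadsheet import * runs it once and makes urls and apprendWk available to the scraping script.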
The scrapeScript.py file:
from getURLsFromSpreadsheet import *
import requests
from bs4 import BeautifulSoup as bs
import json

headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.5112.79 Safari/537.36'
}

for url in urls:
    r = requests.get(url[0], headers=headers)
    soup = bs(r.text, 'html.parser')
    page_obj = soup.select_one('script#__NEXT_DATA__')
    json_obj = json.loads(page_obj.text)
    samplelist = [[str(json_obj['props']['pageProps']['story']['header']['headline']),
                   str(json_obj['props']['pageProps']['story']['date']),
                   str(json_obj['props']['pageProps']['story']['content'])[2:-1]
                   ]]
    apprendWk.append_rows(samplelist)
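If you also want the source URL written out with each row, matching the AddData list from the question, the loop could build the row like this (a small sketch along the same lines, under the same assumptions as the sample script):
for url in urls:
    r = requests.get(url[0], headers=headers)
    soup = bs(r.text, 'html.parser')
    json_obj = json.loads(soup.select_one('script#__NEXT_DATA__').text)
    story = json_obj['props']['pageProps']['story']
    # URL first, then headline, date, and content
    apprendWk.append_rows([[url[0],
                            str(story['header']['headline']),
                            str(story['date']),
                            str(story['content'])[2:-1]]])
Running python scrapeScript.py then pulls the URLs from Sheet1 and appends one row per story to Sheet2.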
Demonstration
Sample spreadsheet file, with the URLs listed in Column A.
The Sheet 2 tab after running the scrapeScript.py file.
Answered By - Irvin Jay G.