Wednesday, November 2, 2022

[FIXED] Trying to web scrape text from a table on a website

Labels: beautifulsoup, python, scrapy, web-scraping

Issue

I am a novice at this. I've been trying to scrape data from a website (https://awards.decanter.com/DWWA/2022/search/wines?competitionType=DWWA), but I keep coming up empty. I've tried BeautifulSoup and Scrapy, but I can't get the text out.

Eventually I want to get the row of each individual wine in the table into a dataframe/csv (from all pages) but currently I can't even get the first wine producer name.

If you inspect the webpage, all the details are in <td> tags with no id or class.

My BeautifulSoup attempt

import requests
from bs4 import BeautifulSoup

URL = 'https://awards.decanter.com/DWWA/2022/search/wines?competitionType=DWWA'
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) \
Chrome/106.0.0.0 Safari/537.36 Edg/106.0.1370.52"}

page = requests.get(URL, headers=headers)

soup = BeautifulSoup(page.content, "html.parser")
soup2 = soup.prettify()

producer = soup2.find_all('td').get_text()

print(producer)

This throws the error:

producer = soup2.find_all('td').get_text()
AttributeError: 'str' object has no attribute 'find_all'
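
The traceback points at two separate problems in the snippet above: `prettify()` returns a plain string meant for printing, so it has no `find_all`, and `find_all()` returns a list of tags, so `get_text()` has to be called on each tag rather than on the list. A minimal sketch of the corrected calls, using a small inline HTML string (made up for illustration) instead of the live page:

```python
from bs4 import BeautifulSoup

# Hypothetical inline HTML standing in for a page that already contains the table.
html = """
<table>
  <tr><td>Producer A</td><td>Wine A</td></tr>
  <tr><td>Producer B</td><td>Wine B</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# prettify() returns a str: fine for printing, but it cannot be searched.
pretty = soup.prettify()

# Search the soup object itself; find_all() returns a list-like ResultSet,
# so get_text() is called on each <td> element in turn.
cells = [td.get_text(strip=True) for td in soup.find_all("td")]
print(cells)  # ['Producer A', 'Wine A', 'Producer B', 'Wine B']
```

Fixing these calls removes the AttributeError, but the result may still be an empty list for the Decanter page if the table rows are not present in the HTML that `requests` downloads.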

My Scrapy attempt

import pandas as pd
import scrapy
from scrapy.selector import Selector

winedf = pd.DataFrame()

class WineSpider(scrapy.Spider):
    name = 'wine_spider'

    def start_requests(self):
        dwwa_url = "https://awards.decanter.com/DWWA/2022/search/wines?competitionType=DWWA"
        yield scrapy.Request(url=dwwa_url, callback=self.parse_front)

    def parse_front(self, response):
        table = response.xpath('//*[@id="root"]/div/div[2]/div[4]/div[2]/table')
        page_links = table.xpath('//*[@id="root"]/div/div[2]/div[4]/div[2]/div[2]/div[1]/ul/li[3]/a(@class,\
        "dwwa-page-link") @href')
        links_to_follow = page_links.extract()
        for url in links_to_follow:
            yield response.follow(url=url, callback=self.parse_pages)

    def parse_pages(self, response):
        wine_name = Selector(response=response).xpath('//*[@id="root"]/div/div[2]/div[4]/div[2]/table/tbody/\
        tr[1]/td[1]/text()').get()
        wine_name_ext = wine_name.extract().strip()
        winedf.append(wine_name_ext)
        medal = Selector(response=response).xpath('//*[@id="root"]/div/div[2]/div[4]/div[2]/table/tbody/tr[1]/\
        td[4]/text()').get()
        medal_ext = medal.extract().strip()
        winedf.append(medal_ext)

This produces an empty DataFrame.

Any help would be greatly appreciated.

Thank you!

Solution

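The table rows do not appear to be in the downloaded HTML; the page seems to load them from a JSON API. Try reading that API directly with pandas: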

import pandas as pd

url = "https://decanterresultsapi.decanter.com/api/DWWA/2022/wines/search?competitionType=DWWA"
df = pd.read_json(url)

# print the last five rows (to_markdown() needs the optional tabulate package)
print(df.tail().to_markdown())

Prints:

|       | producer            | name                          | id     | competition | award | score | country | region         | subRegion           | vintage | color | style                                                 | priceBandLetter | competitionYear | competitionType |
|------:|:--------------------|:------------------------------|-------:|:------------|------:|------:|:--------|:---------------|:--------------------|--------:|:------|:------------------------------------------------------|:----------------|----------------:|:----------------|
| 14853 | Telavi Wine Cellar  | Marani                        | 718257 | DWWA 2022   | 7     | 86    | Georgia | Kakheti        | Kindzmarauli        | 2021    | Red   | Still - Medium (between 19 and 44 g/L residual sugar) | B               | 2022            | DWWA            |
| 14854 | Štrigova            | Muškat Žuti                   | 716526 | DWWA 2022   | 7     | 87    | Croatia | Continental    | Zagorje - Međimurje | 2021    | White | Still - Medium (between 19 and 44 g/L residual sugar) | C               | 2022            | DWWA            |
| 14855 | Kopjar              | Muscat žUti                   | 717754 | DWWA 2022   | 7     | 86    | Croatia | Continental    | Zagorje - Međimurje | 2021    | White | Still - Medium (between 19 and 44 g/L residual sugar) | C               | 2022            | DWWA            |
| 14856 | Cleebronn-Güglingen | Blanc De Noir Fein & Fruchtig | 719836 | DWWA 2022   | 7     | 87    | Germany | Württemberg    | Not Applicable      | 2021    | White | Still - Medium (between 19 and 44 g/L residual sugar) | B               | 2022            | DWWA            |
| 14857 | Winnice Czajkowski  | Thoma 8 Grand Selection       | 719891 | DWWA 2022   | 6     | 90    | Poland  | Not Applicable | Not Applicable      | 2021    | White | Still - Medium (between 19 and 44 g/L residual sugar) | D               | 2022            | DWWA            |
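
Since the original goal was a dataframe/CSV, here is a minimal sketch of the follow-up step. To stay runnable offline it parses two records copied from the output above instead of calling the API; the assumption is that the live endpoint returns a JSON array of objects shaped like these. The column selection and the score filter are illustrative choices, not part of the original answer.

```python
import io
import pandas as pd

# Two records copied from the output above (only a few fields kept).
# Assumption: the live API returns a JSON array of objects like these.
sample_json = """[
  {"producer": "Telavi Wine Cellar", "name": "Marani", "score": 86, "country": "Georgia"},
  {"producer": "Winnice Czajkowski", "name": "Thoma 8 Grand Selection", "score": 90, "country": "Poland"}
]"""

df = pd.read_json(io.StringIO(sample_json))

# Keep only the columns of interest, in a chosen order.
subset = df[["producer", "name", "country", "score"]]

# Example filter: wines scoring 90 or more.
top = subset[subset["score"] >= 90]

# With no path, to_csv() returns the CSV text; pass a filename to write a file.
csv_text = subset.to_csv(index=False)
print(csv_text)
```

With the real data, the same `to_csv` call with a filename (for example `df.to_csv("wines.csv", index=False)`, where the filename is a placeholder) would write every row to disk.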

Answered By - Andrej Kesely

This answer was collected from Stack Overflow and tested by PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
