Issue
I am a novice at this. I've been trying to scrape data from a website (https://awards.decanter.com/DWWA/2022/search/wines?competitionType=DWWA), but I keep coming up empty. I've tried both BeautifulSoup and Scrapy, and I can't get the text out.
Eventually I want to get the row for each individual wine in the table (from all pages) into a dataframe/CSV, but currently I can't even get the first wine producer's name.
If you inspect the webpage, all the details are in tags with no id or class.
My BeautifulSoup attempt
URL = 'https://awards.decanter.com/DWWA/2022/search/wines?competitionType=DWWA'
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) \
Chrome/106.0.0.0 Safari/537.36 Edg/106.0.1370.52"}
page = requests.get(URL, headers=headers)
soup = BeautifulSoup(page.content, "html.parser")
soup2 = soup.prettify()
producer = soup2.find_all('td').get_text()
print(producer)
Which is throwing the error:
producer = soup2.find_all('td').get_text()
AttributeError: 'str' object has no attribute 'find_all'
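The error itself comes from calling find_all() on the return value of prettify(), which is a plain string rather than a parsed document. In addition, find_all() returns a list-like ResultSet, so get_text() has to be called on each tag. A minimal sketch of the corrected calls follows, using a small made-up HTML snippet (an assumption, not the real page markup) in place of the downloaded page. Fixing this alone may still return no rows from the live site if the table is filled in by JavaScript after the page loads, which the JSON API used in the solution suggests.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page; the real markup may differ.
html = """
<table>
  <tr><td>Producer A</td><td>Wine A</td></tr>
  <tr><td>Producer B</td><td>Wine B</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# prettify() returns a plain str, so it is only useful for printing.
pretty = soup.prettify()

# Search the soup object itself. find_all() returns a list-like ResultSet,
# so get_text() is called on each <td> tag in turn.
cells = [td.get_text(strip=True) for td in soup.find_all("td")]
print(cells)
```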
My Scrapy attempt
winedf = pd.DataFrame()
class WineSpider(scrapy.Spider):
    name = 'wine_spider'
    def start_requests(self):
        dwwa_url = "https://awards.decanter.com/DWWA/2022/search/wines?competitionType=DWWA"
        yield scrapy.Request(url=dwwa_url, callback=self.parse_front)
    def parse_front(self, response):
        table = response.xpath('//*[@id="root"]/div/div[2]/div[4]/div[2]/table')
        page_links = table.xpath('//*[@id="root"]/div/div[2]/div[4]/div[2]/div[2]/div[1]/ul/li[3]/a(@class,\
"dwwa-page-link") @href')
        links_to_follow = page_links.extract()
        for url in links_to_follow:
            yield response.follow(url=url, callback=self.parse_pages)
    def parse_pages(self, response):
        wine_name = Selector(response=response).xpath('//*[@id="root"]/div/div[2]/div[4]/div[2]/table/tbody/\
tr[1]/td[1]/text()').get()
        wine_name_ext = wine_name.extract().strip()
        winedf.append(wine_name_ext)
        medal = Selector(response=response).xpath('//*[@id="root"]/div/div[2]/div[4]/div[2]/table/tbody/tr[1]/\
td[4]/text()').get()
        medal_ext = medal.extract().strip()
        winedf.append(medal_ext)
Which produces an empty DataFrame.
Any help would be greatly appreciated.
Thank you!
Solution
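The table appears to be filled in by JavaScript from a JSON API, so the HTML that requests or Scrapy downloads does not contain the rows. You can load the API response directly with pandas. Try: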
import pandas as pd
url = "https://decanterresultsapi.decanter.com/api/DWWA/2022/wines/search?competitionType=DWWA"
df = pd.read_json(url)
# print the last rows of df (to_markdown() requires the tabulate package):
print(df.tail().to_markdown())
Prints:
| | producer | name | id | competition | award | score | country | region | subRegion | vintage | color | style | priceBandLetter | competitionYear | competitionType |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 14853 | Telavi Wine Cellar | Marani | 718257 | DWWA 2022 | 7 | 86 | Georgia | Kakheti | Kindzmarauli | 2021 | Red | Still - Medium (between 19 and 44 g/L residual sugar) | B | 2022 | DWWA |
| 14854 | Štrigova | Muškat Žuti | 716526 | DWWA 2022 | 7 | 87 | Croatia | Continental | Zagorje - Međimurje | 2021 | White | Still - Medium (between 19 and 44 g/L residual sugar) | C | 2022 | DWWA |
| 14855 | Kopjar | Muscat žUti | 717754 | DWWA 2022 | 7 | 86 | Croatia | Continental | Zagorje - Međimurje | 2021 | White | Still - Medium (between 19 and 44 g/L residual sugar) | C | 2022 | DWWA |
| 14856 | Cleebronn-Güglingen | Blanc De Noir Fein & Fruchtig | 719836 | DWWA 2022 | 7 | 87 | Germany | Württemberg | Not Applicable | 2021 | White | Still - Medium (between 19 and 44 g/L residual sugar) | B | 2022 | DWWA |
| 14857 | Winnice Czajkowski | Thoma 8 Grand Selection | 719891 | DWWA 2022 | 6 | 90 | Poland | Not Applicable | Not Applicable | 2021 | White | Still - Medium (between 19 and 44 g/L residual sugar) | D | 2022 | DWWA |
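Since the stated goal is a dataframe/CSV, the loaded frame can be trimmed and written out with to_csv(). The sketch below is only an illustration: it builds a small DataFrame from three of the rows printed above instead of calling the API, and the chosen columns and the file name dwwa_2022_wines.csv are assumptions.

```python
import pandas as pd

# Sample rows copied from the output above (a subset of the columns);
# in practice df would come from pd.read_json(url).
df = pd.DataFrame(
    [
        {"producer": "Telavi Wine Cellar", "name": "Marani", "score": 86, "country": "Georgia"},
        {"producer": "Štrigova", "name": "Muškat Žuti", "score": 87, "country": "Croatia"},
        {"producer": "Winnice Czajkowski", "name": "Thoma 8 Grand Selection", "score": 90, "country": "Poland"},
    ]
)

# Keep only the columns of interest and sort by score, highest first.
subset = df[["producer", "name", "country", "score"]].sort_values("score", ascending=False)

# index=False drops the row numbers; utf-8-sig helps Excel display accented names.
# The file name is a hypothetical choice.
subset.to_csv("dwwa_2022_wines.csv", index=False, encoding="utf-8-sig")
print(subset.iloc[0]["producer"])
```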
Answered By - Andrej Kesely