Issue
I'm trying to use Scrapy to return the results and statistics from live games on SofaScore.
Site: https://www.sofascore.com/
The code is below:
import scrapy


class SofascoreSpider(scrapy.Spider):
    name = 'SofaScore'
    allowed_domains = ['sofascore.com']
    start_urls = ['http://sofascore.com/']

    def parse(self, response):
        time1 = response.xpath("/html/body/div[1]/main/div/div[2]/div/div[3]/div[2]/div/div/div/div/div[2]/a/div/div").extract()
        print(time1)
        pass
I also tried response.xpath("//html/body/div[1]/main/div/div[2]/div/div[3]/div[2]/div/div/div/div/div[2]/a/div/div").getall(), but it returns nothing. I have tried many different XPaths and none of them returned anything. What am I doing wrong?
For example, today (10/06) the first match on the page is France vs Austria, with XPath: /html/body/div[1]/main/div/div[2]/div/div[3]/div[2]/div/div/div/div/div[2]/a/div/div
Solution
The data is generated with JavaScript, so it is not in the HTML that Scrapy downloads, but you can get it from the API.
Open devtools in the browser and click on the Network tab. Then click the Live button on the page and look at where it loads the data from. Then look at the JSON response to see its structure.
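If the JSON is large, listing its key paths can make the structure easier to read than scrolling through it in devtools. The following is only a sketch: `SAMPLE` is a hypothetical, heavily trimmed payload (the real response has many more fields and may be laid out differently), and `describe` is a helper name made up for this example.

```python
import json

# Hypothetical, trimmed sample; the real API response is much larger
# and its layout may differ.
SAMPLE = '{"events": [{"tournament": {"name": "Friendly"}, "time": {}}]}'


def describe(node, prefix=""):
    """Return a sorted list of dotted key paths found in a parsed JSON value."""
    paths = set()
    if isinstance(node, dict):
        for key, value in node.items():
            path = f"{prefix}.{key}" if prefix else key
            paths.add(path)
            paths.update(describe(value, path))
    elif isinstance(node, list):
        # Mark list levels with "[]" so nested items stay readable.
        for item in node:
            paths.update(describe(item, prefix + "[]"))
    return sorted(paths)


if __name__ == "__main__":
    for path in describe(json.loads(SAMPLE)):
        print(path)
```

Once the structure is known, the spider can request the endpoint directly: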
import scrapy


class SofascoreSpider(scrapy.Spider):
    name = 'SofaScore'
    allowed_domains = ['sofascore.com']
    start_urls = ['https://api.sofascore.com/api/v1/sport/football/events/live']
    custom_settings = {'DOWNLOAD_DELAY': 0.4}

    def start_requests(self):
        # Headers copied from the browser request seen in devtools
        headers = {
            "Accept": "*/*",
            "Accept-Encoding": "gzip, deflate, br",
            "Accept-Language": "en-US,en;q=0.5",
            "Cache-Control": "no-cache",
            "Connection": "keep-alive",
            "DNT": "1",
            "Host": "api.sofascore.com",
            "Origin": "https://www.sofascore.com",
            "Pragma": "no-cache",
            "Referer": "https://www.sofascore.com/",
            "Sec-Fetch-Dest": "empty",
            "Sec-Fetch-Mode": "cors",
            "Sec-Fetch-Site": "same-site",
            "Sec-GPC": "1",
            "TE": "trailers",
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36"
        }
        yield scrapy.Request(url=self.start_urls[0], headers=headers)

    def parse(self, response):
        events = response.json()
        events = events['events']

        # now iterate through the list and get what you want from it
        # example:
        for event in events:
            yield {
                'event name': event['tournament']['name'],
                'time': event['time']
            }
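Indexing with `event['tournament']['name']` raises `KeyError` if an event happens to lack one of those keys. A minimal sketch of a more tolerant version, assuming the same two fields as above (`extract_item` is a hypothetical helper name, not part of Scrapy):

```python
def extract_item(event):
    """Build a flat item from one event dict, tolerating missing keys."""
    # Fall back to an empty dict if "tournament" is absent or null.
    tournament = event.get("tournament") or {}
    return {
        "event name": tournament.get("name"),
        "time": event.get("time"),
    }


if __name__ == "__main__":
    # Made-up events for illustration only.
    print(extract_item({"tournament": {"name": "Friendly"}, "time": {}}))
    print(extract_item({}))
```

Inside `parse`, the loop body would then be `yield extract_item(event)`. The spider can be run and its items saved with `scrapy runspider sofascore_spider.py -o live.json`, where the file names are placeholders.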
Answered By - SuperUser