Issue
The code below locates most of the elements I am looking for. However, the temperature and wind speed have tags that vary depending on weather severity. How can I get the code below to consistently return the right TempProb and windspeeds values on the page?
import scrapy


class NflweatherdataSpider(scrapy.Spider):
    name = 'NFLWeatherData'
    allowed_domains = ['nflweather.com']
    start_urls = ['http://nflweather.com/']

    def parse(self, response):
        # Extracting the content using XPath selectors
        Datetimes = response.xpath(
            '//div[@class="fw-bold text-wrap"]/text()').extract()
        awayTeams = response.xpath('//span[@class="fw-bold"]/text()').extract()
        homeTeams = response.xpath(
            '//span[@class="fw-bold ms-1"]/text()').extract()
        TempProbs = response.xpath(
            '//div[@class="mx-2"]/span/text()').extract()
        windspeeds = response.xpath(
            '//div[@class="text-break col-md-4 mb-1 px-1 flex-centered"]/span/text()').extract()
        # winddirection =
        # Give the extracted content row wise
        for item in zip(Datetimes, awayTeams, homeTeams, TempProbs, windspeeds):
            # create a dictionary to store the scraped info
            scraped_info = {
                'Datetime': item[0],
                'awayTeam': item[1],
                'homeTeam': item[2],
                'TempProb': item[3],
                'windspeeds': item[4]
            }
            # yield or give the scraped info to scrapy
            yield scraped_info
Solution
Below is the modified Scrapy code. It loops over each game box and reads the date, teams, temperature/probability, and wind speed from inside that box, which makes the extraction more consistent. Comments explain each section of the code:
import scrapy


class NflweatherdataSpider(scrapy.Spider):
    name = 'NFLWeatherData'
    allowed_domains = ['nflweather.com']
    start_urls = ['http://nflweather.com/']

    def parse(self, response):
        # Select each game container so all fields come from the same game
        game_boxes = response.css('div.game-box')
        for game_box in game_boxes:
            # Extracting date and time information
            # (default='' keeps .strip() below from failing on a missing element)
            Datetimes = game_box.css('.col-12 .fw-bold::text').get(default='')
            # Extracting team information
            team_game_boxes = game_box.css('.team-game-box')
            awayTeams = team_game_boxes.css('.fw-bold::text').get()
            homeTeams = team_game_boxes.css('.fw-bold.ms-1::text').get()
            # Extracting temperature and probability information
            TempProbs = game_box.css('.col-md-4 .mx-2 span::text').get()
            # Extracting wind speed information
            windspeeds = game_box.css('.icon-weather + span::text').get(default='')
            # Create a dictionary to store the scraped info
            scraped_info = {
                'Datetime': Datetimes.strip(),
                'awayTeam': awayTeams,
                'homeTeam': homeTeams,
                'TempProb': TempProbs,
                'windspeeds': windspeeds.strip()
            }
            # Yield or give the scraped info to Scrapy
            yield scraped_info
I modified the selectors for team information to make them more specific. Instead of using page-wide selectors for team names, each lookup is scoped to the .team-game-box elements inside the current game box: the first .fw-bold match is taken as the away team, and .fw-bold.ms-1 as the home team.
For temperature and probability, I kept essentially the same selector (the span inside .mx-2), now scoped to the game box, assuming that it's still valid based on your updated HTML snippet. If the structure changes, you may need to modify this selector.
For wind speed, I modified the selector to target the span that immediately follows the .icon-weather element within the game box, instead of matching on a class that can vary with weather severity. This should make the extraction more consistent.
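One reason the per-game loop tends to be more robust than zipping page-wide lists is worth spelling out. The sketch below uses plain Python and made-up sample data (the team names and wind values are hypothetical, not taken from the site) to show what happens when one game has no matching wind element: zip() pairs values by position and stops at the shortest list, so later rows shift and the last one is dropped.

```python
# Hypothetical page-wide lists, as a zip()-based spider would build them.
# Assume game 2 has no wind element matching the selector, so the wind
# list is one entry short.
home_teams = ["Bills", "Chiefs", "Packers"]
wind_speeds = ["12 mph", "5 mph"]  # really belong to games 1 and 3

# zip() pairs by position and stops at the shortest input.
zipped = list(zip(home_teams, wind_speeds))
# The Chiefs row now carries the Packers' wind speed,
# and the Packers row is silently dropped.

# Per-game records: each game keeps its own (possibly missing) value,
# which is what looping over game boxes achieves.
games = [
    {"homeTeam": "Bills", "wind": "12 mph"},
    {"homeTeam": "Chiefs", "wind": None},
    {"homeTeam": "Packers", "wind": "5 mph"},
]
per_game = [(g["homeTeam"], g["wind"]) for g in games]

print(zipped)
print(per_game)
```

With per-game extraction, a missing value shows up as None (or an empty string) on the row it belongs to, rather than shifting every row after it.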
Answered By - Aniruddhsinh