Friday, January 19, 2024

[FIXED] Scrapy Conditional HTML values

python, scrapy

Issue

The code below locates most of the elements I am looking for. However, the temperature and wind speed have tags that vary depending on weather severity. How can I get the code below to consistently get the right TempProb and windspeeds values on the page?

import scrapy


class NflweatherdataSpider(scrapy.Spider):
    name = 'NFLWeatherData'
    allowed_domains = ['nflweather.com']
    start_urls = ['http://nflweather.com/']

    def parse(self, response):
        # Extracting the content using XPath selectors
        Datetimes = response.xpath(
            '//div[@class="fw-bold text-wrap"]/text()').extract()
        awayTeams = response.xpath('//span[@class="fw-bold"]/text()').extract()
        homeTeams = response.xpath(
            '//span[@class="fw-bold ms-1"]/text()').extract()
        TempProbs = response.xpath(
            '//div[@class="mx-2"]/span/text()').extract()
        windspeeds = response.xpath(
            '//div[@class="text-break col-md-4 mb-1 px-1 flex-centered"]/span/text()').extract()
        # winddirection =

        # Give the extracted content row wise
        for item in zip(Datetimes, awayTeams, homeTeams, TempProbs, windspeeds):
            # create a dictionary to store the scraped info
            scraped_info = {
                'Datetime': item[0],
                'awayTeam': item[1],
                'homeTeam': item[2],
                'TempProb': item[3],
                'windspeeds': item[4]
            }

            # yield or give the scraped info to scrapy
            yield scraped_info

Solution

Below is the modified Scrapy code. I've changed it to loop over each game box so that the temperature, probability, and wind speed are extracted more consistently, and I've included comments explaining each section of the code:

 import scrapy

 class NflweatherdataSpider(scrapy.Spider):
     name = 'NFLWeatherData'
     allowed_domains = ['nflweather.com']
     start_urls = ['http://nflweather.com/']

     def parse(self, response):
         # Select each game box so that every value comes from the same game
         game_boxes = response.css('div.game-box')

         for game_box in game_boxes:
             # Extracting date and time information
             # (default='' avoids calling .strip() on None when nothing matches)
             Datetimes = game_box.css('.col-12 .fw-bold::text').get(default='')

             # Extracting team information
             team_game_boxes = game_box.css('.team-game-box')
             awayTeams = team_game_boxes.css('.fw-bold::text').get()
             homeTeams = team_game_boxes.css('.fw-bold.ms-1::text').get()

             # Extracting temperature and probability information
             TempProbs = game_box.css('.col-md-4 .mx-2 span::text').get()

             # Extracting wind speed information
             windspeeds = game_box.css('.icon-weather + span::text').get(default='')

             # Create a dictionary to store the scraped info
             scraped_info = {
                 'Datetime': Datetimes.strip(),
                 'awayTeam': awayTeams,
                 'homeTeam': homeTeams,
                 'TempProb': TempProbs,
                 'windspeeds': windspeeds.strip()
             }

             # Yield or give the scraped info to Scrapy
             yield scraped_info

I modified the selectors for team information to make them more specific. Instead of collecting every team name on the page at once, the code looks only inside the .team-game-box elements of the current game box: the first .fw-bold text is taken as the away team, and .fw-bold.ms-1 as the home team.
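To illustrate why reading values one game box at a time helps, here is a minimal sketch using made-up data rather than the real page. It assumes a page-wide query returns no wind value for one game (for example, because that game's tag differs); zip then pairs the remaining values with the wrong games. All names and values below are hypothetical.

```python
# Hypothetical values for three games; a page-wide query found only two
# wind readings because the second game's markup did not match.
datetimes = ["Sun 1:00 PM", "Sun 4:25 PM", "Mon 8:15 PM"]
wind_speeds = ["12 mph", "5 mph"]

# zip pairs items by position and stops at the shortest list, so the
# third game's wind is attached to the second game and the third game
# is dropped entirely.
rows = list(zip(datetimes, wind_speeds))
print(rows)

# Reading each game as its own unit keeps values aligned: a missing
# reading becomes None for that game only.
games = [
    {"datetime": "Sun 1:00 PM", "wind": "12 mph"},
    {"datetime": "Sun 4:25 PM"},
    {"datetime": "Mon 8:15 PM", "wind": "5 mph"},
]
scoped_rows = [(g["datetime"], g.get("wind")) for g in games]
print(scoped_rows)
```

Looping over game boxes in the spider plays the same role as the per-game dictionaries here.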

For temperature and probability, I kept the selector as it is, assuming that it's still valid based on your updated HTML snippet. If the structure changes, you may need to modify this selector.

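For wind speed, I modified the selector to target the span that immediately follows the .icon-weather element (.icon-weather + span) instead of matching on the span's own class, such as "text-danger", which can vary with weather severity. This should make the extraction more consistent.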

Answered By - Aniruddhsinh

This answer was collected from Stack Overflow and tested by PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
