Issue
Work requires us to stay up to date on developments regarding customs regulations. Instead of manually visiting websites, I attempted to create a simple web scraper that goes to defined websites, gets the latest items there and writes them to an Excel file.
import re
import requests
import openpyxl
from bs4 import BeautifulSoup

urls = ['https://www.evofenedex.nl/actualiteiten/', 'https://douaneinfo.nl/index.php/nieuws']

myworkbook = openpyxl.load_workbook('output.xlsx')
worksheet = myworkbook['Output']  # get_sheet_by_name() is deprecated in openpyxl

row = 1  # keep the row counter outside the loop so the second site doesn't overwrite the first
for index, url in enumerate(urls):
    response = requests.get(url)
    if response.status_code == 200:
        # empty list to store the links in
        links = []
        # Parse the HTML content
        soup = BeautifulSoup(response.text, 'html.parser')
        # evofenedex
        if index == 0:
            # Find the elements containing the news items
            title_element = soup.find_all('p', class_='list__title')
            date_element = soup.find_all('p', class_='list__subtitle')
            # Find the anchor whose title attribute matches each news title,
            # then pull the href out of the stringified tag
            link_elements = [soup.find_all('a', title=title.text) for title in title_element]
            for link_element in link_elements:
                reg_str = re.findall(r'"(.*?)"', str(link_element))
                links.append(f"www.evofenedex.nl{reg_str[1]}")
        # douaneinfo
        if index == 1:
            title_element = soup.find_all('th', class_='list-title')
            date_element = soup.find_all('td', class_='list-date small')
            for element in title_element:
                href = re.findall(r'"(.*?)"', str(element))[2]
                links.append(f"www.douaneinfo.nl{href}")
        if title_element and date_element:
            # Loops through the elements and writes title, date and link to the Excel file
            for y, element in enumerate(title_element):
                worksheet.cell(row=row, column=1).value = element.text.strip()
                worksheet.cell(row=row, column=2).value = date_element[y].text.strip()
                worksheet.cell(row=row, column=3).value = links[y]
                row += 1

myworkbook.save('output.xlsx')
print('The scraping is complete')
The issue I'm having is that the scrape of the first website doesn't return the latest information but instead starts with information that's a few months old.
If you go to the first website, the first row of data I'm scraping is (currently) on the second page of the news URL.
Solution
The data comes from an API and is rendered dynamically, so there are at least two options:

1. Fetch your data with requests via the API:

import requests

json_data = requests.get('https://www.evofenedex.nl/api/v1/pages/news?page=1').json()

for item in json_data.get('Resources'):
    print(
        item.get('Resource').get('Title'),
        item.get('Resource').get('Created'),
        item.get('Resource').get('AbsolutePath')
    )
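To feed this back into the Excel workflow from the question, the same JSON can be written straight to the worksheet. A minimal sketch, assuming output.xlsx with an 'Output' sheet exists and that the Resources/Resource field names match the response shown above:

import requests
import openpyxl

myworkbook = openpyxl.load_workbook('output.xlsx')
worksheet = myworkbook['Output']

json_data = requests.get('https://www.evofenedex.nl/api/v1/pages/news?page=1').json()

# Write title, date and link into columns A-C, one row per news item
for row, item in enumerate(json_data.get('Resources'), start=1):
    resource = item.get('Resource')
    worksheet.cell(row=row, column=1).value = resource.get('Title')
    worksheet.cell(row=row, column=2).value = resource.get('Created')
    worksheet.cell(row=row, column=3).value = f"www.evofenedex.nl{resource.get('AbsolutePath')}"

myworkbook.save('output.xlsx')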
2. Use selenium, which mimics a browser, and wait until the content is rendered before processing it.
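A minimal sketch of the selenium route, assuming Chrome with selenium 4+ and that the p.list__title elements from the question's parsing code are what the page renders:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get('https://www.evofenedex.nl/actualiteiten/')

# Block until the dynamically rendered news titles are present in the DOM
WebDriverWait(driver, 10).until(
    EC.presence_of_all_elements_located((By.CSS_SELECTOR, 'p.list__title'))
)

for title in driver.find_elements(By.CSS_SELECTOR, 'p.list__title'):
    print(title.text)

driver.quit()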
Answered By - HedgeHog