Issue
Work requires us to stay up to date on developments regarding customs regulations. Instead of manually visiting websites, I attempted to create a simple web scraper that goes to defined websites, gets the latest items there and writes them to an Excel file.
import re
import requests
import openpyxl
from bs4 import BeautifulSoup

urls = ['https://www.evofenedex.nl/actualiteiten/', 'https://douaneinfo.nl/index.php/nieuws']

myworkbook = openpyxl.load_workbook('output.xlsx')
worksheet = myworkbook['Output']  # get_sheet_by_name() is deprecated in openpyxl

row = 1  # keep the row counter outside the loop so the second site doesn't overwrite the first
for index, url in enumerate(urls):
    response = requests.get(url)
    if response.status_code == 200:
        # empty list to store the links in
        links = []
        # Parse the HTML content
        soup = BeautifulSoup(response.text, 'html.parser')
        # evofenedex
        if index == 0:
            # Find the elements containing the news items
            title_element = soup.find_all('p', class_='list__title')
            date_element = soup.find_all('p', class_='list__subtitle')
            # Find the anchor whose title attribute matches each news title,
            # then pull the href out of the stringified tag
            link_elements = [soup.find_all('a', title=title.text) for title in title_element]
            for link_element in link_elements:
                reg_str = re.findall(r'"(.*?)"', str(link_element))
                links.append(f"www.evofenedex.nl{reg_str[1]}")
        # douaneinfo
        if index == 1:
            title_element = soup.find_all('th', class_='list-title')
            date_element = soup.find_all('td', class_='list-date small')
            for element in title_element:
                href = re.findall(r'"(.*?)"', str(element))[2]
                links.append(f"www.douaneinfo.nl{href}")
        if title_element and date_element:
            # Loops through the elements and writes title, date and link to the Excel file
            for y, element in enumerate(title_element):
                worksheet.cell(row=row, column=1).value = element.text.strip()
                worksheet.cell(row=row, column=2).value = date_element[y].text.strip()
                worksheet.cell(row=row, column=3).value = links[y]
                row += 1

myworkbook.save('output.xlsx')
print('The scraping is complete')
The issue I'm having is that the scrape of the first website doesn't return the latest information but instead starts with information that's a few months old.
If you go to the first website, the first row of data I'm scraping is (currently) on the second page of the news URL.
Solution
The data comes from an API and is rendered dynamically, so there are at least two options:

1. Fetch your data with requests via the API:

import requests

json_data = requests.get('https://www.evofenedex.nl/api/v1/pages/news?page=1').json()

for item in json_data.get('Resources'):
    print(
        item.get('Resource').get('Title'),
        item.get('Resource').get('Created'),
        item.get('Resource').get('AbsolutePath')
    )
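To feed this back into the Excel workflow from the question, the same JSON can be written straight to the worksheet. A minimal sketch, assuming output.xlsx with an 'Output' sheet exists and that the Resources/Resource field names match the response shown above:

import requests
import openpyxl

myworkbook = openpyxl.load_workbook('output.xlsx')
worksheet = myworkbook['Output']

json_data = requests.get('https://www.evofenedex.nl/api/v1/pages/news?page=1').json()

# Write title, date and link into columns A-C, one row per news item
for row, item in enumerate(json_data.get('Resources'), start=1):
    resource = item.get('Resource')
    worksheet.cell(row=row, column=1).value = resource.get('Title')
    worksheet.cell(row=row, column=2).value = resource.get('Created')
    worksheet.cell(row=row, column=3).value = f"www.evofenedex.nl{resource.get('AbsolutePath')}"

myworkbook.save('output.xlsx')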
2. Use selenium, which mimics a browser, and wait until the content is rendered before processing it.
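A minimal sketch of the selenium route, assuming Chrome with selenium 4+ and that the p.list__title elements from the question's parsing code are what the page renders:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get('https://www.evofenedex.nl/actualiteiten/')

# Block until the dynamically rendered news titles are present in the DOM
WebDriverWait(driver, 10).until(
    EC.presence_of_all_elements_located((By.CSS_SELECTOR, 'p.list__title'))
)

for title in driver.find_elements(By.CSS_SELECTOR, 'p.list__title'):
    print(title.text)

driver.quit()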
Answered By - HedgeHog