Issue
I'm trying to create an automation script that will log on to my work's scheduling software, parse the table data, and then print out all my scheduled shifts for the week.
I'm struggling to get the data to parse correctly. I've been able to get the times I'm scheduled but not the days. I've tried using ChatGPT to help but haven't been able to find a solution to my problem.
Any help would be appreciated.
Below is the table structure:
<table class="sc-bTmccw sc-laFCIP sc-ctuRtO ibflim hoCFsG gsXXhe">
<thead>
<tr>
<th class="sc-ccJWcV pRLto">America/Chicago</th>
<th class="sc-gZVZiL kofAkN">
<div class="sc-bgnbwU cSZwXw">Monday</div>
<div class="sc-bgnbwU cSZwXw">Sep 25, 2023</div>
</th>
<th class="sc-gZVZiL kofAkN">
<div class="sc-bgnbwU cSZwXw">Tuesday</div>
<div class="sc-bgnbwU cSZwXw">Sep 26, 2023</div>
</th>
<th class="sc-gZVZiL kofAkN">
<div class="sc-bgnbwU cSZwXw">Wednesday</div>
<div class="sc-bgnbwU cSZwXw">Sep 27, 2023</div>
</th>
<th class="sc-gZVZiL kofAkN">
<div class="sc-bgnbwU cSZwXw">Thursday</div>
<div class="sc-bgnbwU cSZwXw">Sep 28, 2023</div>
</th>
<th class="sc-gZVZiL kofAkN">
<div class="sc-bgnbwU cSZwXw">Friday</div>
<div class="sc-bgnbwU cSZwXw">Sep 29, 2023</div>
</th>
<th class="sc-gZVZiL kofAkN">
<div class="sc-bgnbwU cSZwXw">Saturday</div>
<div class="sc-bgnbwU cSZwXw">Sep 30, 2023</div>
</th>
<th class="sc-gZVZiL kofAkN">
<div class="sc-bgnbwU cSZwXw">Sunday</div>
<div class="sc-bgnbwU cSZwXw">Oct 1, 2023</div>
</th>
</tr>
</thead>
<tbody>
<tr>
<td class="sc-gVAlfg bWxTWJ">
<div class="sc-iRaSfU kkorew">12:00 AM</div>
<div class="sc-iRaSfU kkorew">1:00 AM</div>
<div class="sc-iRaSfU kkorew">2:00 AM</div>
<div class="sc-iRaSfU kkorew">3:00 AM</div>
<div class="sc-iRaSfU kkorew">4:00 AM</div>
<div class="sc-iRaSfU kkorew">5:00 AM</div>
<div class="sc-iRaSfU kkorew">6:00 AM</div>
<div class="sc-iRaSfU kkorew">7:00 AM</div>
<div class="sc-iRaSfU kkorew">8:00 AM</div>
<div class="sc-iRaSfU kkorew">9:00 AM</div>
<div class="sc-iRaSfU kkorew">10:00 AM</div>
<div class="sc-iRaSfU kkorew">11:00 AM</div>
<div class="sc-iRaSfU kkorew">12:00 PM</div>
<div class="sc-iRaSfU kkorew">1:00 PM</div>
<div class="sc-iRaSfU kkorew">2:00 PM</div>
<div class="sc-iRaSfU kkorew">3:00 PM</div>
<div class="sc-iRaSfU kkorew">4:00 PM</div>
<div class="sc-iRaSfU kkorew">5:00 PM</div>
<div class="sc-iRaSfU kkorew">6:00 PM</div>
<div class="sc-iRaSfU kkorew">7:00 PM</div>
<div class="sc-iRaSfU kkorew">8:00 PM</div>
<div class="sc-iRaSfU kkorew">9:00 PM</div>
<div class="sc-iRaSfU kkorew">10:00 PM</div>
<div class="sc-iRaSfU kkorew">11:00 PM</div>
</td>
<td class="sc-gVAlfg sc-bRDiDf bWxTWJ hISnGZ"></td>
<td class="sc-gVAlfg sc-bRDiDf bWxTWJ hISnGZ">
<a class="sc-qxRCG sc-gBMFIu iTTzGz ifHVoI" style="height: 325px; transform: translateY(775px);">
<span class="sc-cQAbrN bHxBCY">
<h4 class="sc-gdnLxT sc-dMxQAh fbZdSM tkxgA">Coverage</h4>
<span class="sc-jRztuO sc-hAJyxc bGJrvP hdkZRS"><time>3:30 PM - 10:00 PM</time></span>
<aside></aside>
</span>
</a>
</td>
<td class="sc-gVAlfg sc-bRDiDf bWxTWJ hISnGZ"></td>
<td class="sc-gVAlfg sc-bRDiDf bWxTWJ hISnGZ">
<a class="sc-qxRCG sc-gBMFIu iTTzGz ifHVoI" style="height: 350px; transform: translateY(750px);">
<span class="sc-cQAbrN bHxBCY">
<h4 class="sc-gdnLxT sc-dMxQAh fbZdSM tkxgA">Keyholder</h4>
<span class="sc-jRztuO sc-hAJyxc bGJrvP hdkZRS"><time>3:00 PM - 10:00 PM</time></span>
<aside></aside>
</span>
</a>
</td>
<td class="sc-gVAlfg sc-bRDiDf bWxTWJ hJzsUD">
<a class="sc-qxRCG sc-gBMFIu iTTzGz ifHVoI" style="height: 225px; transform: translateY(900px);">
<span class="sc-cQAbrN bHxBCY">
<h4 class="sc-gdnLxT sc-dMxQAh fbZdSM tkxgA">Coverage</h4>
<span class="sc-jRztuO sc-hAJyxc bGJrvP hdkZRS"><time>6:00 PM - 10:30 PM</time></span>
<aside></aside>
</span>
</a>
</td>
<td class="sc-gVAlfg sc-bRDiDf bWxTWJ hISnGZ">
<a class="sc-qxRCG sc-gBMFIu iTTzGz ifHVoI" style="height: 350px; transform: translateY(500px);">
<span class="sc-cQAbrN bHxBCY">
<h4 class="sc-gdnLxT sc-dMxQAh fbZdSM tkxgA">Coverage</h4>
<span class="sc-jRztuO sc-hAJyxc bGJrvP hdkZRS"><time>10:00 AM - 5:00 PM</time></span>
<aside></aside>
</span>
</a>
</td>
<td class="sc-gVAlfg sc-bRDiDf bWxTWJ hISnGZ">
<a class="sc-qxRCG sc-gBMFIu iTTzGz ifHVoI" style="height: 325px; transform: translateY(250px);">
<span class="sc-cQAbrN bHxBCY">
<h4 class="sc-gdnLxT sc-dMxQAh fbZdSM tkxgA">Keyholder</h4>
<span class="sc-jRztuO sc-hAJyxc bGJrvP hdkZRS"><time>5:00 AM - 11:30 AM</time></span>
<aside></aside>
</span>
</a>
</td>
</tr>
</tbody>
</table>
Here is my current parsing code:
soup = BeautifulSoup(browser.page_source, 'html.parser')
# Find the table headers
table_headers = soup.find_all('th', class_='sc-gZVZiL kofAkN')
# Initialize a dictionary to store the data
data = {}
# Extract column headers (dates) from the table headers
column_headers = [header.find('div', class_='sc-bgnbwU cSZwXw').text.strip() for header in table_headers]
# Find the rows
rows = soup.find_all('tr')
# Initialize variables to store date and time
current_date = ""
current_time_ranges = []
# Extract the data from the rows
for row in rows:
    cells = row.find_all('td')
    if cells:
        # Check if the cell contains coverage information
        coverage_cell = cells[1].find('span', class_='sc-cNAXLz bGJrvP PQKTa')
        if coverage_cell:
            # Extract the time range from the coverage cell
            time_range = coverage_cell.find('time').text.strip()
            current_time_ranges.append(time_range)
        else:
            # If the cell doesn't contain coverage information, use an empty string for the time range
            time_range = ""
        # Extract the data for the remaining cells
        data_row = [column_headers[1]] + current_time_ranges + [info.text.strip() for info in cells[2:]]
        # Store the data in the dictionary
        data[row] = data_row
# Print the extracted data
for key, value in data.items():
    print(key)
    for item in value:
        print(item)
This is the output I'm currently getting:
Coverage - 3:30 PM - 10:00 PM
Keyholder - 3:00 PM - 10:00 PM
Coverage - 6:00 PM - 10:30 PM
Coverage - 10:00 AM - 5:00 PM
Keyholder - 5:00 AM - 11:30 AM
This is the output that I want:
Tue 3:30 - 10:00 PM and so on
Solution
If the variable html_text contains the HTML snippet from the question, you can try:
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_text, "html.parser")

# Day/date headers, skipping the first (time zone) column
headers = [th.get_text(strip=True, separator=" ") for th in soup.table.select("th")[1:]]

for row in soup.tbody.select("tr"):
    # Skip the first cell (hour labels) so cell i lines up with headers[i]
    for i, td in enumerate(row.select("td")[1:]):
        for a in td.select("a"):
            name, time = a.h4.text, a.time.text
            print(headers[i], name, time, "-" * 80, sep="\n")
Prints:
Tuesday Sep 26, 2023
Coverage
3:30 PM - 10:00 PM
--------------------------------------------------------------------------------
Thursday Sep 28, 2023
Keyholder
3:00 PM - 10:00 PM
--------------------------------------------------------------------------------
Friday Sep 29, 2023
Coverage
6:00 PM - 10:30 PM
--------------------------------------------------------------------------------
Saturday Sep 30, 2023
Coverage
10:00 AM - 5:00 PM
--------------------------------------------------------------------------------
Sunday Oct 1, 2023
Keyholder
5:00 AM - 11:30 AM
--------------------------------------------------------------------------------
Answered By - Andrej Kesely
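The question asks for a shorter line such as "Tue 3:30 - 10:00 PM". As a minimal sketch, not part of the answer above, the header and time strings produced by that loop could be condensed with a small helper. The function name short_shift is hypothetical, and the sketch assumes the header always starts with the full weekday name and the time range always has the form "H:MM AM - H:MM PM".

```python
def short_shift(header: str, time_range: str) -> str:
    """Condense a header like 'Tuesday Sep 26, 2023' and a time range
    like '3:30 PM - 10:00 PM' into 'Tue 3:30 - 10:00 PM'."""
    # Assumption: the first word of the header is the full weekday name.
    day = header.split()[0][:3]

    # Assumption: start and end are separated by a single hyphen.
    start, end = (part.strip() for part in time_range.split("-", 1))

    # Drop AM/PM from the start time when both ends share the same suffix.
    if start[-2:] == end[-2:]:
        start = start[:-2].strip()

    return f"{day} {start} - {end}"


# Sample pairs copied from the printed output above.
shifts = [
    ("Tuesday Sep 26, 2023", "3:30 PM - 10:00 PM"),
    ("Saturday Sep 30, 2023", "10:00 AM - 5:00 PM"),
]
for header, time_range in shifts:
    print(short_shift(header, time_range))
```

In the answer's loop, this would replace the print call with print(short_shift(headers[i], time)). A shift that spans noon, such as Saturday's, keeps both AM and PM so the line stays unambiguous.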