Issue
I am trying to scrape Glassdoor, but I get an IndexError. My code has the following form:
html = requests.get('https://www.glassdoor.com/Job/germany-data-science-jobs-SRCH_IL.0,7_IN96_KO8,20_IP1.htm?includeNoSalaryJobs=true', timeout = 5)
soup = BeautifulSoup(html.content, 'lxml')

# extracts the hyperlinks in each jobposting
link = []
for i in soup.find_all('div', class_ = 'd-flex flex-column pl-sm css-1buaf54 job-search-key-1mn3dn8 e1rrn5ka0'):
    li = 'https://www.glassdoor.com' + i.a['href']
    link.append(li)

# extracts the job descriptions by creating a new soup from each link extracted above
description = []
for links in link:
    page = requests.get(links, headers=headers)
    soup = BeautifulSoup(page.content, 'lxml')
    for job in soup.find_all('div', class_ = 'desc css-58vpdc ecgq1xb5')[0]:
        try:
            description.append(job.text.strip())
        except:
            description.append(None)
I want to extract the job description of all jobs, which are within div or p tags inside the div tag matched by ('div', class_ = 'desc css-58vpdc ecgq1xb5'). When running the code I get the following error:
Traceback (most recent call last):
  File "C:\Users\aedan\PycharmProjects\Data_Science_Job_Openings\main.py", line 47, in <module>
    for job in soup.find_all('div', class_ = 'desc css-58vpdc ecgq1xb5')[0]:
IndexError: list index out of range

Process finished with exit code 1
I used try and except to append None when the lookup fails, as shown above, but it didn't work. I also tried html.parser instead of lxml. I also tried the solution from the post "how to fix error in BeautifulSoup IndexError: list index out of range", but I was not able to restructure my code to apply it.
Solution
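Use find() or select_one() if you want to select only one element, or check that your ResultSet actually contains an element before indexing it. find_all() returns an empty ResultSet when nothing matches, so [0] raises the IndexError. Because that indexing happens in the for statement, outside the try block, the except clause never runs.
If you use find(), check that the element you searched for was actually found (find() returns None otherwise) and handle that case.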
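To make the difference concrete, here is a minimal offline sketch. The HTML string and the shortened class name 'desc' are made up for illustration (they stand in for a job page where the description div is missing), and html.parser is used only so the snippet does not depend on lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for a job page that lacks the description div
html_without_desc = "<html><body><p>Please verify you are human</p></body></html>"
soup = BeautifulSoup(html_without_desc, "html.parser")

# find_all() returns an empty ResultSet here, so indexing it raises IndexError.
# The try block has to wrap the indexing itself to catch that.
matches = soup.find_all("div", class_="desc")
try:
    first = matches[0]
except IndexError:
    first = None

# find() returns None instead of raising when nothing matches
found = soup.find("div", class_="desc")
description = found.get_text(strip=True) if found is not None else None
```

Both approaches end with None for a missing element; the find() version simply avoids the exception in the first place.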
Example
import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36'}
html = requests.get('https://www.glassdoor.com/Job/germany-data-science-jobs-SRCH_IL.0,7_IN96_KO8,20_IP1.htm?includeNoSalaryJobs=true', timeout=5)
soup = BeautifulSoup(html.content, 'lxml')

data = []
# build the absolute URL of each job posting from the first link in every result item
for url in ['https://www.glassdoor.com' + a.get('href') for a in soup.select('li[data-id] a:first-of-type')]:
    page = requests.get(url, headers=headers)
    soup = BeautifulSoup(page.content, 'lxml')
    # find() returns None when the container is missing, so check before reading its text
    container = soup.find('div', {'id': 'JobDescriptionContainer'})
    data.append({
        'url': url,
        'desc': container.get_text(strip=True) if container is not None else None
    })

data
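The same check can also be written with select_one(), which likewise returns None when nothing matches. The sketch below wraps it in a small helper so it can be tried on plain HTML strings without sending requests. The function name extract_description and the sample markup are hypothetical; only the id JobDescriptionContainer comes from the example above.

```python
from typing import Optional

from bs4 import BeautifulSoup


def extract_description(html: str) -> Optional[str]:
    """Return the job description text, or None if the container is absent."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.select_one("#JobDescriptionContainer")
    if container is None:
        return None
    # join the text of nested div/p tags with single spaces
    return container.get_text(" ", strip=True)


# Made-up markup for illustration only
sample = '<div id="JobDescriptionContainer"><p>Build models</p><p>Ship them</p></div>'
with_desc = extract_description(sample)
without_desc = extract_description("<html><body></body></html>")
```

In the scraping loop, this helper could be called with page.text for each job URL, assuming the page still uses that container id.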
Answered By - HedgeHog