Issue
I have a basic Indeed web scraper set up using BeautifulSoup. It returns the job title and company of each job from the first page of the Indeed job search URL I am using:
import requests
from bs4 import BeautifulSoup

def extract():
    # `headers` is assumed to be a dict of request headers defined at module level
    url = 'https://www.indeed.com/jobs?q=Network%20Architect&start=&vjk=e8bcf3fbe7498a5f'
    # Pass headers by keyword; the second positional argument of requests.get is `params`
    r = requests.get(url, headers=headers)
    # return r.status_code
    soup = BeautifulSoup(r.content, 'html.parser')
    return soup

def transform(soup):
    for job in soup.select('.result'):
        title = job.select_one('.jobTitle').get_text(' ')
        company = job.find(class_='companyName').text
        print(f'title: {title}')
        print(f'company: {company}')

c = extract()
transform(c)
Output
title: new Network Architect
company: MetroSys
title: new Network Architect
company: Federal Working Group
title: new REMOTE Network Architect - CCIE
company: CyberCoders
title: new Network Architect SME
company: Emergere Technologies
title: Cybersecurity Apprentice
company: IBM
title: Network Engineer (NEW YORK) ONSITE ONLY NEED TO APPLY
company: QnA Tech
title: new Network Architect
company: EdgeCo Holdings
title: new Network Architect
company: JKL Technologies, Inc.
title: Network Architect
company: OTELCO
title: new Network Architect
company: Illinois Municipal Retirement Fund (IMRF)
title: new Network Architect, Google Enterprise Network
company: Google
title: new Network Infrastructure Lead Or Architect- Menlo Park CA -Ful...
company: Xforia Technologies
title: Network Architect
company: Fairfax County Public Schools
title: new Network Engineer
company: Labatt Food Service
title: new Network Architect (5056-3)
company: JND
On Indeed, each job appears to have a unique ID. I am trying to access this ID along with each job so that I can use it later in an SQL database to avoid adding duplicate jobs. I am able to access the job IDs with the following code:
for tag in soup.find_all('a', class_='result'):
    print(tag.get('id'))
Output:
job_a678f3bfc20cb753
job_eef3e4c10d979c1e
job_faedfdbadab2f19b
job_190a6b55b99c78f0
job_32d20498e8fbf692
job_aeaabb9af50f36d6
job_92432325a24212d0
job_819ce9d7ec6e5890
job_d979bf7daac01528
job_0879369d166a9b94
job_2d377bc2e5085ad7
job_bb8e5d0f651c072f
job_dcff58df466f1ecb
job_f70d55871eb1df3f
sj_54a09e5e34e08948
When I try to combine this with my working code, I can access the IDs, but they either all get printed together instead of one at a time with the corresponding job, or all of them get printed with every job posting (15x15 lines instead of 15 in total). I have tried this way:
def transform(soup):
    for job in soup.select('.result'):
        title = job.select_one('.jobTitle').get_text(' ')
        company = job.find(class_='companyName').text
        tag = soup.find_all('a', class_='result')
        for x in tag:
            print(x.get('id'))
        print(f'title: {title}')
        print(f'company: {company}')
And this way:
def transform(soup):
    for job in soup.select('.result'):
        title = job.select_one('.jobTitle').get_text(' ')
        company = job.find(class_='companyName').text
        tag = soup.find_all('a', class_='result')
        for x in tag:
            print(x.get('id'))
            print(f'title: {title}')
            print(f'company: {company}')
The second way is the closest to the result I want. However, instead of getting 1 title, 1 company, and 1 ID per posting, adding up to 15 job postings in total, I get every ID printed with each job posting, so 15x15.
The desired result is just to get it returned as:
title
company
ID
title
company
ID
Solution
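You already have the job element inside the loop and extract information from it, so why not simply extract the id from it as well? Calling soup.find_all(...) inside the loop searches the whole page again on every iteration, which is why all IDs are printed for each job. Reading the attribute from the current element with job.get('id') should work for you: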
def transform(soup):
    for job in soup.select('.result'):
        title = job.select_one('.jobTitle').get_text(' ')
        company = job.find(class_='companyName').text
        # Read the id from the current job element; named job_id to avoid shadowing the built-in id()
        job_id = job.get('id')
        print(f'title: {title}')
        print(f'company: {company}')
        print(f'id: {job_id}')
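Since the goal is to avoid storing duplicate jobs, here is a minimal sketch of how the per-job id could be used for that. It is only an illustration under a few assumptions: SAMPLE_HTML is made-up markup that mimics the class names used by the selectors above (the real Indeed page may differ), SQLite stands in for whatever SQL database you end up using, and the save() helper and the jobs table are hypothetical names chosen for this example.

```python
import sqlite3

from bs4 import BeautifulSoup

# Hypothetical markup that mimics the structure the selectors above rely on;
# the real Indeed page may look different.
SAMPLE_HTML = """
<a class="result" id="job_a678f3bfc20cb753">
  <h2 class="jobTitle"><span>new</span><span>Network Architect</span></h2>
  <span class="companyName">MetroSys</span>
</a>
<a class="result" id="job_eef3e4c10d979c1e">
  <h2 class="jobTitle"><span>Network Architect</span></h2>
  <span class="companyName">Federal Working Group</span>
</a>
"""


def transform(soup):
    """Return one dict per job card, reading every field from the same tag."""
    jobs = []
    for job in soup.select('.result'):
        jobs.append({
            'id': job.get('id'),
            'title': job.select_one('.jobTitle').get_text(' ', strip=True),
            'company': job.select_one('.companyName').get_text(strip=True),
        })
    return jobs


def save(conn, jobs):
    """Insert jobs, skipping any id that is already stored (hypothetical helper)."""
    conn.execute(
        'CREATE TABLE IF NOT EXISTS jobs '
        '(id TEXT PRIMARY KEY, title TEXT, company TEXT)'
    )
    # The PRIMARY KEY on id plus INSERT OR IGNORE makes repeated inserts harmless.
    conn.executemany(
        'INSERT OR IGNORE INTO jobs (id, title, company) '
        'VALUES (:id, :title, :company)',
        jobs,
    )
    conn.commit()


soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
jobs = transform(soup)

conn = sqlite3.connect(':memory:')
save(conn, jobs)
save(conn, jobs)  # running the scraper again does not add duplicate rows
row_count = conn.execute('SELECT COUNT(*) FROM jobs').fetchone()[0]

for job in jobs:
    print(job)
print(f'rows stored: {row_count}')
```

Returning a list of dicts instead of printing keeps title, company, and id together for each posting, so they can be passed straight to the database layer.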
Answered By - HedgeHog