Thursday, March 10, 2022

[FIXED] Python / BeautifulSoup return ids with indeed jobs


Issue

I have a basic Indeed web scraper set up using BeautifulSoup. It returns the job title and company of each job on the first page of the Indeed job search URL I am using:

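import requests
from bs4 import BeautifulSoup

def extract():
    # `headers` is assumed to be a dict of request headers defined at module level.
    url = 'https://www.indeed.com/jobs?q=Network%20Architect&start=&vjk=e8bcf3fbe7498a5f'
    # Pass headers by keyword; the second positional argument of requests.get() is params.
    r = requests.get(url, headers=headers)
    #return r.status_code
    soup = BeautifulSoup(r.content, 'html.parser')
    return soup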

def transform(soup):
    for job in soup.select('.result'):
        title = job.select_one('.jobTitle').get_text(' ')
        company = job.find(class_='companyName').text
        print(f'title: {title}')
        print(f'company: {company}')


c = extract()
transform(c)

Output

title: new Network Architect
company: MetroSys
title: new Network Architect
company: Federal Working Group
title: new REMOTE Network Architect - CCIE
company: CyberCoders
title: new Network Architect SME
company: Emergere Technologies
title: Cybersecurity Apprentice
company: IBM
title: Network Engineer (NEW YORK) ONSITE ONLY NEED TO APPLY
company: QnA Tech
title: new Network Architect
company: EdgeCo Holdings
title: new Network Architect
company: JKL Technologies, Inc.
title: Network Architect
company: OTELCO
title: new Network Architect
company: Illinois Municipal Retirement Fund (IMRF)
title: new Network Architect, Google Enterprise Network
company: Google
title: new Network Infrastructure Lead Or Architect- Menlo Park CA -Ful...
company: Xforia Technologies
title: Network Architect
company: Fairfax County Public Schools
title: new Network Engineer
company: Labatt Food Service
title: new Network Architect (5056-3)
company: JND

Indeed appears to assign a unique ID to each job. I am trying to capture this ID together with each job so that I can later use it in an SQL database and avoid adding duplicate jobs. I am able to access the job IDs with the following code:

for tag in soup.find_all('a', class_='result'):
    print(tag.get('id'))

Output:

job_a678f3bfc20cb753
job_eef3e4c10d979c1e
job_faedfdbadab2f19b
job_190a6b55b99c78f0
job_32d20498e8fbf692
job_aeaabb9af50f36d6
job_92432325a24212d0
job_819ce9d7ec6e5890
job_d979bf7daac01528
job_0879369d166a9b94
job_2d377bc2e5085ad7
job_bb8e5d0f651c072f
job_dcff58df466f1ecb
job_f70d55871eb1df3f
sj_54a09e5e34e08948

When I try to combine this with my working code, I can access the IDs, but they are all returned together instead of one at a time with the corresponding job. Instead of 15 job postings with one ID each, I get 15x15 IDs. I have tried this way:

def transform(soup):
    for job in soup.select('.result'):
        title = job.select_one('.jobTitle').get_text(' ')
        company = job.find(class_='companyName').text
        tag = soup.find_all('a', class_='result')
        for x in tag:
            print(x.get('id'))
        print(f'title: {title}')
        print(f'company: {company}')

And this way:

def transform(soup):
    for job in soup.select('.result'):
        title = job.select_one('.jobTitle').get_text(' ')
        company = job.find(class_='companyName').text 
        tag = soup.find_all('a', class_='result')
        for x in tag:
            print(x.get('id'))
            print(f'title: {title}')
            print(f'company: {company}')

The second way is the closest to my desired result. However, instead of getting one title, one company, and one ID per posting (15 job postings in total), I get every ID returned with each job posting, so 15x15.

The desired result is just to get it returned as:

title
company
ID
title
company
ID

Solution

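Calling soup.find_all('a', class_='result') inside the loop searches the whole document again on every iteration, which is why every ID is printed for every job. You already have each job element and extract information from it, so simply read the ID from that same element; job.get('id') should work for you: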

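def transform(soup):
    for job in soup.select('.result'):
        title = job.select_one('.jobTitle').get_text(' ')
        company = job.find(class_='companyName').text
        # job is the matched element itself, so its id attribute is read directly.
        # Named job_id to avoid shadowing the built-in id().
        job_id = job.get('id')
        print(f'title: {title}')
        print(f'company: {company}')
        print(f'id: {job_id}')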
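
The question mentions storing the IDs in an SQL database to avoid duplicate jobs. That step goes beyond the answer above, but a minimal sketch of it might look like the following. It assumes SQLite via the standard-library sqlite3 module; the save_jobs function, the jobs table, and its column names are hypothetical, and the sample rows are illustrative values rather than real pairings of IDs and postings.

```python
import sqlite3


def save_jobs(conn, jobs):
    """Insert (job_id, title, company) rows, skipping IDs already stored.

    Returns the number of rows that were actually added.
    """
    # Hypothetical schema: the job ID is the primary key, so it must be unique.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS jobs ("
        "job_id TEXT PRIMARY KEY, title TEXT, company TEXT)"
    )
    before = conn.execute("SELECT COUNT(*) FROM jobs").fetchone()[0]
    # INSERT OR IGNORE skips any row whose primary key already exists.
    conn.executemany(
        "INSERT OR IGNORE INTO jobs (job_id, title, company) VALUES (?, ?, ?)",
        jobs,
    )
    conn.commit()
    after = conn.execute("SELECT COUNT(*) FROM jobs").fetchone()[0]
    return after - before


# Illustrative rows only; the ID/title/company pairing is assumed.
conn = sqlite3.connect(":memory:")
first = save_jobs(conn, [
    ("job_a678f3bfc20cb753", "Network Architect", "MetroSys"),
    ("job_eef3e4c10d979c1e", "Network Architect", "Federal Working Group"),
])
# Saving an already-stored ID a second time adds nothing.
second = save_jobs(conn, [
    ("job_a678f3bfc20cb753", "Network Architect", "MetroSys"),
])
print(first, second)  # prints: 2 0
```

To use this with the scraper, transform() could collect (job_id, title, company) tuples in a list and return it instead of printing, and that list could then be passed to save_jobs.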

Answered By - HedgeHog

This answer was collected from Stack Overflow and tested by PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
