Friday, December 29, 2023

[FIXED] How to cycle through all job postings using scrapy?

Labels: python, scrapy

Issue

I am trying to loop through the job postings at this link: "https://www.jobs.bg/front_job_search.php?subm=1&categories%5B%5D=56&techs%5B%5D=Python". The search returns only postings that list Python. I have tried about 100 different XPath combinations, but I always get either just the first job posting or several copies of it. This is the code I have. Everything else works fine; the only problem is getting all of the job postings.

import scrapy
import re
from jobscraper.items import JobscraperItem
from datetime import datetime
from datetime import date


class ScraperNameSpider(scrapy.Spider):
    name = "scraper_name"
    start_urls = [
        'https://www.jobs.bg/front_job_search.php?subm=1&categories%5B%5D=56&techs%5B%5D=Python',
    ]

    def parse(self, response, **kwargs):
        # Get today's date
        today = date.today()
        date_string = today.strftime("%Y-%m-%d")
        
        for job in response.xpath('//*[@id="listContainerScrollable"]'): 
            item = JobscraperItem() 
            item['position'] = job.xpath('//*[@id="listContainer"]/ul[1]/li[3]/div/div/div/div[2]/div/div[1]/a/div/span/text()').get()
            item['company_name'] = job.xpath('//*[@id="listContainer"]/ul[1]/li[3]/div/div/div/div[2]/div/div[2]/a/div/div[2]/div[1]/text()[normalize-space()]').get()
            item['location'] = job.xpath('//*[@id="listContainer"]/ul[1]/li[3]/div/div/div/div[2]/div/div[1]/a/div[2]/text()[1]').get().strip()
            item['number_of_employees'] = job.xpath('//*[@id="listContainer"]/ul[1]/li[4]/div/div/div/div[2]/div/div[2]/a/div/div[2]/div[2]/div/span[1]/text()').get().strip()
            item['todays_date'] = date_string

            item['date_posted'] = job.xpath('//*[@id="listContainer"]/ul[1]/li[3]/div/div/div/div[1]/div[1]/text()').get()
            if 'днес' in item['date_posted']:
                todays_date = datetime.now().strftime("%Y-%m-%d")
            else:
                todays_date = date_posted
            item['date_posted'] = todays_date

            item['job_url'] = job.xpath('//*[@id="listContainer"]/ul[1]/li[3]/div/div/div/div[2]/div/div[1]/a').get()
            match = re.search(r'href=[\'"]?([^\'" >]+)', item['job_url'])
            if match:
                item['job_url'] = match.group(1)
            else:
                item['job_url'] = ''

            item['work_from_home'] = job.xpath('//*[@id="listContainer"]/ul[1]/li[3]/div/div/div/div[2]/div/div[1]/a/div[2]/span[1]/text()').get()

            # extract skills list
            skills_list = job.xpath('//*[@id="listContainer"]/ul[1]/li[4]/div/div/div/div[2]/div/div[1]/a/div[3]/div/div[@class="skill"]')
            skills = []
            for skill in skills_list:
                img_tag = skill.xpath('./img')
                if img_tag:
                    alt_attr = img_tag[0].xpath('./@alt').get()
                    if alt_attr:
                        skills.append(alt_attr)
                else:
                    skill_name = skill.xpath('./div[@class="skill-not-img"]/text()').get()
                    if skill_name:
                        skills.append(skill_name.strip())

            item['skills_list'] = skills

            yield item

I tried different combinations of XPaths with //ul and //li inside the bigger container, but nothing worked. I expect to get every job posting that has Python in its tech stack. Interestingly, when I tried the XPath for job posting #14 or #20, it still returned the first one.

Solution

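You are only getting one result because your for loop iterates over a list that contains a single element: the XPath //*[@id="listContainerScrollable"] matches one container, so the loop body runs once. On top of that, all of the XPaths inside the loop are absolute, so each one targets a single fixed element instead of something relative to the current loop item.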

The solution is to loop over the individual job rows instead of the whole section, so that the loop has exactly as many iterations as there are jobs. Doing so also lets you use relative XPaths (starting with .//), which are evaluated against the current row.

The page is not in my language, so I wasn't able to translate all of it, but I did what I could to get you started. I commented out the items I wasn't able to translate.

import scrapy
from datetime import date, datetime


class ScraperNameSpider(scrapy.Spider):
    name = "scraper_name"
    start_urls = [
        'https://www.jobs.bg/front_job_search.php?subm=1&categories%5B%5D=56&techs%5B%5D=Python',
    ]

    def parse(self, response, **kwargs):
        # Get today's date
        today = date.today()
        date_string = today.strftime("%Y-%m-%d")
        for page in response.xpath('//ul[contains(@class, "page")]'):
            for job in page.xpath('.//div[@class="mdc-card"]'):
                item = {}
                item['position'] = job.xpath('.//div[contains(@class, "card-title")]/span/text()').getall()
                item['company_name'] = job.xpath('.//div[@class="right"]/a/@title').get()
                # item['location'] = job.xpath('//*[@id="listContainer"]/ul[1]/li[3]/div/div/div/div[2]/div/div[1]/a/div[2]/text()[1]').get().strip()
                # item['number_of_employees'] = job.xpath('//*[@id="listContainer"]/ul[1]/li[4]/div/div/div/div[2]/div/div[2]/a/div/div[2]/div[2]/div/span[1]/text()').get().strip()
                item['todays_date'] = date_string

                item['date_posted'] = job.css('div.card-date::text').get().strip()
                if 'днес' in item['date_posted']:
                    todays_date = datetime.now().strftime("%Y-%m-%d")
                else:
                    todays_date = item["date_posted"]
                item['date_posted'] = todays_date

                item['job_url'] = job.css('div.left a').attrib['href']
                # item['work_from_home'] = job.xpath('//*[@id="listContainer"]/ul[1]/li[3]/div/div/div/div[2]/div/div[1]/a/div[2]/span[1]/text()').get()

                # extract skills list
                # skills_list = job.xpath('//*[@id="listContainer"]/ul[1]/li[4]/div/div/div/div[2]/div/div[1]/a/div[3]/div/div[@class="skill"]')
                # skills = []
                # for skill in skills_list:
                #     img_tag = skill.xpath('./img')
                #     if img_tag:
                #         alt_attr = img_tag[0].xpath('./@alt').get()
                #         if alt_attr:
                #             skills.append(alt_attr)
                #     else:
                #         skill_name = skill.xpath('./div[@class="skill-not-img"]/text()').get()
                #         if skill_name:
                #             skills.append(skill_name.strip())

                # item['skills_list'] = skills

                yield item

output:

{'position': ['Senior CAD (EDA) Engineer'], 'company_name': 'MELEXIS BULGARIA', 'todays_date': '2023-03-10', 'date_posted': 'вчера', 'job_url': 'https://www.jobs.bg/job/6748683'}
2023-03-10 20:32:03 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.jobs.bg/front_job_search.php?subm=1&categories%5B%5D=56&techs%5B%5D=Python>
{'position': ['IT Security Analyst'], 'company_name': 'GfK Bulgaria, Market Research Institute', 'todays_date': '2023-03-10', 'date_posted': 'вчера', 'job_url': 'https://www.jobs.bg/job/6748657'}
2023-03-10 20:32:03 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.jobs.bg/front_job_search.php?subm=1&categories%5B%5D=56&techs%5B%5D=Python>
{'position': ['Data Engineer'], 'company_name': 'KBC Global Services BG', 'todays_date': '2023-03-10', 'date_posted': 'вчера', 'job_url': 'https://www.jobs.bg/job/6748374'}
2023-03-10 20:32:03 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.jobs.bg/front_job_search.php?subm=1&categories%5B%5D=56&techs%5B%5D=Python>
{'position': ['Data Engineer'], 'company_name': 'KBC Global Services BG', 'todays_date': '2023-03-10', 'date_posted': 'вчера', 'job_url': 'https://www.jobs.bg/job/6748350'}
2023-03-10 20:32:03 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.jobs.bg/front_job_search.php?subm=1&categories%5B%5D=56&techs%5B%5D=Python>
{'position': ['Rookie to Data Engineer Rockstar'], 'company_name': 'MentorMate Bulgaria Ltd.', 'todays_date': '2023-03-10', 'date_posted': 'вчера', 'job_url': 'https://www.jobs.bg/job/6747734'}
2023-03-10 20:32:03 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.jobs.bg/front_job_search.php?subm=1&categories%5B%5D=56&techs%5B%5D=Python>
{'position': ['Rookie to Data Engineer Rockstar'], 'company_name': 'MentorMate Bulgaria Ltd.', 'todays_date': '2023-03-10', 'date_posted': '09.03.23', 'job_url': 'https://www.jobs.bg/job/6743270'}
2023-03-10 20:32:03 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.jobs.bg/front_job_search.php?subm=1&categories%5B%5D=56&techs%5B%5D=Python>
{'position': ['Business Intelligence Analyst'], 'company_name': 'Хемисфиър Комърс ЕООД', 'todays_date': '2023-03-10', 'date_posted': '09.03.23', 'job_url': 'https://www.jobs.bg/job/6747414'}
2023-03-10 20:32:03 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.jobs.bg/front_job_search.php?subm=1&categories%5B%5D=56&techs%5B%5D=Python>
{'position': ['Rookie to Software Engineer Rockstar'], 'company_name': 'MentorMate Bulgaria Ltd.', 'todays_date': '2023-03-10', 'date_posted': '09.03.23', 'job_url': 'https://www.jobs.bg/job/6743430'}
2023-03-10 20:32:03 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.jobs.bg/front_job_search.php?subm=1&categories%5B%5D=56&techs%5B%5D=Python>
{'position': ['KUBERNETES OPERATIONS ENGINEER WITH PYTHON'], 'company_name': 'Schwarz Global Services Bulgaria EOOD', 'todays_date': '2023-03-10', 'date_posted': '09.03.23', 'job_url': 'https://www.jobs.bg/job/6746537'}
2023-03-10 20:32:03 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.jobs.bg/front_job_search.php?subm=1&categories%5B%5D=56&techs%5B%5D=Python>
{'position': ['Automation QA Engineer'], 'company_name': 'Hilscher Development and Test Center Ltd', 'todays_date': '2023-03-10', 'date_posted': '09.03.23', 'job_url': 'https://www.jobs.bg/job/6718564'}
2023-03-10 20:32:03 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.jobs.bg/front_job_search.php?subm=1&categories%5B%5D=56&techs%5B%5D=Python>
{'position': ['Electronic System Test Engineer'], 'company_name': 'ЛИБХЕР - ХАУСГЕРЕТЕ МАРИЦА ЕООД', 'todays_date': '2023-03-10', 'date_posted': '08.03.23', 'job_url': 'https://www.jobs.bg/job/6745212'}
2023-03-10 20:32:03 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.jobs.bg/front_job_search.php?subm=1&categories%5B%5D=56&techs%5B%5D=Python>
{'position': ['Lead Technical Artist - New Project - CA Sofia'], 'company_name': 'Creative Assembly Sofia / SEGA Black Sea Ltd.', 'todays_date': '2023-03-10', 'date_posted': '08.03.23', 'job_url': 'https://www.jobs.bg/job/6744695'}
2023-03-10 20:32:03 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.jobs.bg/front_job_search.php?subm=1&categories%5B%5D=56&techs%5B%5D=Python>
{'position': ['Technical Animator/ Rigger - New Project - CA Sofia'], 'company_name': 'Creative Assembly Sofia / SEGA Black Sea Ltd.', 'todays_date': '2023-03-10', 'date_posted': '08.03.23', 'job_url': 'https://www.jobs.bg/job/6744698'}
2023-03-10 20:32:03 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.jobs.bg/front_job_search.php?subm=1&categories%5B%5D=56&techs%5B%5D=Python>
{'position': ['Data Engineer'], 'company_name': 'GfK Bulgaria, Market Research Institute', 'todays_date': '2023-03-10', 'date_posted': '08.03.23', 'job_url': 'https://www.jobs.bg/job/6744021'}
2023-03-10 20:32:03 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.jobs.bg/front_job_search.php?subm=1&categories%5B%5D=56&techs%5B%5D=Python>
{'position': ['Junior Data Engineer'], 'company_name': 'UNIQA SOFTWARE - SERVICE BULGARIA Ltd.', 'todays_date': '2023-03-10', 'date_posted': '08.03.23', 'job_url': 'https://www.jobs.bg/job/6743899'}
2023-03-10 20:32:03 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.jobs.bg/front_job_search.php?subm=1&categories%5B%5D=56&techs%5B%5D=Python>
{'position': ['Data Engineer'], 'company_name': 'UNIQA SOFTWARE - SERVICE BULGARIA Ltd.', 'todays_date': '2023-03-10', 'date_posted': '08.03.23', 'job_url': 'https://www.jobs.bg/job/6743904'}
2023-03-10 20:32:03 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.jobs.bg/front_job_search.php?subm=1&categories%5B%5D=56&techs%5B%5D=Python>
{'position': ['SEGA QA Data Analyst'], 'company_name': 'Creative Assembly Sofia / SEGA Black Sea Ltd.', 'todays_date': '2023-03-10', 'date_posted': '08.03.23', 'job_url': 'https://www.jobs.bg/job/6743707'}
2023-03-10 20:32:03 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.jobs.bg/front_job_search.php?subm=1&categories%5B%5D=56&techs%5B%5D=Python>
{'position': ['Network Administrator'], 'company_name': 'Paysafe Bulgaria EOOD', 'todays_date': '2023-03-10', 'date_posted': '08.03.23', 'job_url': 'https://www.jobs.bg/job/6743353'}
2023-03-10 20:32:03 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.jobs.bg/front_job_search.php?subm=1&categories%5B%5D=56&techs%5B%5D=Python>
{'position': ['Senior Web Administrator'], 'company_name': 'АЙГЕЙМИНГ.КОМ ЕООД', 'todays_date': '2023-03-10', 'date_posted': '07.03.23', 'job_url': 'https://www.jobs.bg/job/6743120'}
2023-03-10 20:32:03 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.jobs.bg/front_job_search.php?subm=1&categories%5B%5D=56&techs%5B%5D=Python>
{'position': ['Инженер роботика'], 'company_name': 'КООПЕРАЦИЯ ПАНДА/Office 1', 'todays_date': '2023-03-10', 'date_posted': '07.03.23', 'job_url': 'https://www.jobs.bg/job/6742980'}
2023-03-10 20:32:03 [scrapy.core.engine] INFO: Closing spider (finished)
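
The spider above only converts 'днес' ("today") to an ISO date, so the output still contains 'вчера' ("yesterday") and values such as '09.03.23'. The following is a possible way to normalize all three forms. The helper name normalize_posted_date is hypothetical, and the day.month.year format is an assumption inferred from the sample output, not something confirmed by the site.

```python
from datetime import date, datetime, timedelta


def normalize_posted_date(raw, today):
    """Return an ISO date string for a scraped 'date_posted' value.

    Falls back to the stripped input when the format is not recognized.
    """
    text = raw.strip()
    if "днес" in text:       # Bulgarian for "today"
        return today.isoformat()
    if "вчера" in text:      # Bulgarian for "yesterday"
        return (today - timedelta(days=1)).isoformat()
    try:
        # Assumed format: day.month.two-digit-year, e.g. '09.03.23'.
        return datetime.strptime(text, "%d.%m.%y").date().isoformat()
    except ValueError:
        return text


# Fixed reference date so the example is reproducible.
today = date(2023, 3, 10)
for raw in ("днес", "вчера", "08.03.23", "unknown"):
    print(raw, "->", normalize_posted_date(raw, today))
```

Inside the spider, this could be called as normalize_posted_date(item['date_posted'], date.today()) in place of the if/else on 'днес'.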

Answered By - Alexander

This answer was collected from Stack Overflow and tested by PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.
