Issue
I am currently learning Python for web scraping and am running into an issue with my current script. After closing the pop-up on page 2 of Indeed and cycling through the pages, the script only writes one page of data to the CSV, even though it prints each page in my terminal. It also occasionally returns only part of the data from a page: for example, page 2 will show info for the first 3 jobs in my print(df_da) output, but nothing for the next 12. Additionally, the script takes a very long time to run (around 6 minutes 45 seconds for the 5 pages, roughly 1 to 1.5 minutes per page). Any suggestions? I've attached my script and can also attach the output of print(df_da) if needed. Thank you in advance!
import pandas as pd
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

options = Options()
options.add_argument("window-size=1400,1400")

PATH = "C://Program Files (x86)//chromedriver.exe"
driver = webdriver.Chrome(PATH)

for i in range(0, 50, 10):
    driver.get('https://www.indeed.com/jobs?q=chemical%20engineer&l=united%20states&start=' + str(i))
    driver.implicitly_wait(5)
    jobtitles = []
    companies = []
    locations = []
    descriptions = []
    jobs = driver.find_elements_by_class_name("slider_container")
    for job in jobs:
        jobtitle = job.find_element_by_class_name('jobTitle').text.replace("new", "").strip()
        jobtitles.append(jobtitle)
        company = job.find_element_by_class_name('companyName').text.replace("new", "").strip()
        companies.append(company)
        location = job.find_element_by_class_name('companyLocation').text.replace("new", "").strip()
        locations.append(location)
        description = job.find_element_by_class_name('job-snippet').text.replace("new", "").strip()
        descriptions.append(description)
    try:
        WebDriverWait(driver, 5).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "button.popover-x-button-close.icl-CloseButton"))).click()
    except:
        pass
    df_da = pd.DataFrame()
    df_da['JobTitle'] = jobtitles
    df_da['Company'] = companies
    df_da['Location'] = locations
    df_da['Description'] = descriptions
    print(df_da)
    df_da.to_csv('C:/Users/Dan/Desktop/AZNext/file_name1.csv')
Solution
You are building df_da inside the outer for loop, so by the time the loop finishes, df_da (and the CSV written from it) contains only the data from the last page. Instead, define the result lists once before the loops, accumulate into them on every page, and build the DataFrame only after all the data has been collected.
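In isolation, the accumulate-then-build pattern looks like this (a minimal sketch with placeholder data standing in for the scraped values):

```python
import pandas as pd

# Define the result lists ONCE, before any looping
jobtitles = []
companies = []

# Simulate scraping several pages: each pass appends to the shared lists
for page in range(3):
    jobtitles.append(f"Engineer {page}")
    companies.append(f"Company {page}")

# Build the DataFrame once, after every page has been collected
df_da = pd.DataFrame({'JobTitle': jobtitles, 'Company': companies})
print(len(df_da))  # 3 -- one row per page, not just the last page
```

Had the DataFrame been created inside the loop, each iteration would overwrite the previous one and only the final page would survive.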
I suspect you are not getting all the jobs on the second page because of the pop-up, so you should close it before collecting the job details on that page.
Also, you can avoid waiting for the pop-up element on every loop iteration and only check for it on the second iteration.
Your code can be something like this:
import pandas as pd
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

options = Options()
options.add_argument("window-size=1400,1400")

PATH = "C://Program Files (x86)//chromedriver.exe"
driver = webdriver.Chrome(PATH)

# Result lists are defined once, outside the loops
jobtitles = []
companies = []
locations = []
descriptions = []

for i in range(0, 50, 10):
    driver.get('https://www.indeed.com/jobs?q=chemical%20engineer&l=united%20states&start=' + str(i))
    driver.implicitly_wait(5)
    jobs = driver.find_elements_by_class_name("slider_container")
    for idx, job in enumerate(jobs):
        # Close the pop-up (if present) before scraping further job details
        if idx == 1:
            try:
                WebDriverWait(driver, 5).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "button.popover-x-button-close.icl-CloseButton"))).click()
            except:
                pass
        jobtitle = job.find_element_by_class_name('jobTitle').text.replace("new", "").strip()
        jobtitles.append(jobtitle)
        company = job.find_element_by_class_name('companyName').text.replace("new", "").strip()
        companies.append(company)
        location = job.find_element_by_class_name('companyLocation').text.replace("new", "").strip()
        locations.append(location)
        description = job.find_element_by_class_name('job-snippet').text.replace("new", "").strip()
        descriptions.append(description)

# Build the DataFrame once, after all pages have been scraped
df_da = pd.DataFrame()
df_da['JobTitle'] = jobtitles
df_da['Company'] = companies
df_da['Location'] = locations
df_da['Description'] = descriptions
print(df_da)
df_da.to_csv('C:/Users/Dan/Desktop/AZNext/file_name1.csv')
Answered By - Prophet