Issue
I'm practicing scraping a job page with BeautifulSoup, but my print call returns None. Any ideas why? Thanks in advance!
from bs4 import BeautifulSoup
import requests
import csv
url = 'https://jobgether.com/es/oferta/63083ece6d137a0ac6e701e6-part-time-business-psychologist-intern'
website = requests.get(url)
Soup = BeautifulSoup(website.content, 'html.parser')
Title = Soup.find('h5', class_="mb-0 p-2 w-100 bd-highlight fs-22")
print(Title)
Solution
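That page is hydrated with data by JavaScript, which fetches it from an API. You can find that API in the browser's dev tools (Network tab), where you can see the information being pulled as JSON from the API endpoint. The h5 element you are looking for is therefore not in the HTML that requests downloads, which is why find returns None. This is one way to obtain that data, using requests: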
import requests
import pandas as pd
headers = {
'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.5112.79 Safari/537.36'
}
url = 'https://filter-api.jobgether.com/api/offer/63083ece6d137a0ac6e701e6?%24populate%5B0%5D%5Bpath%5D=meta.continents&%24populate%5B0%5D%5Bselect%5D=name&%24populate%5B1%5D=meta.countries&%24populate%5B2%5D=meta.regions&%24populate%5B3%5D=meta.cities&%24populate%5B4%5D=meta.studiesArea&%24populate%5B5%5D=meta.salary&%24populate%5B6%5D=meta.languages&%24populate%5B7%5D=meta.hardSkills&%24populate%5B8%5D=meta.industries&%24populate%5B9%5D=meta.technologies&%24populate%5B10%5D%5Bpath%5D=company&%24populate%5B10%5D%5Bselect%5D=name%20meta.logo%20meta.industries%20meta.companyType%20meta.flexiblePolicy%20meta.employees%20meta.mainOfficeLocation%20meta.subOfficeLocation%20status%20description%20meta.mission%20meta.description%20meta.hardSkills%20meta.technologies%20meta.slug&%24populate%5B10%5D%5Bpopulate%5D%5B0%5D=meta.industries&%24populate%5B10%5D%5Bpopulate%5D%5B1%5D=meta.mainOfficeLocation&%24populate%5B10%5D%5Bpopulate%5D%5B2%5D=meta.subOfficeLocation'
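r = requests.get(url, headers=headers)
r.raise_for_status()  # stop early on an HTTP error instead of failing in .json()
obj = r.json()
print(obj['title'])
print(obj['meta']['apply_url'])
print(obj['meta']['countries'])
# flatten the list of hard-skill dicts into a DataFrame, one row per skill
df = pd.json_normalize(obj['meta']['hardSkills'])
print(df)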
This prints the following in the terminal:
Part-Time Business Psychologist Intern
https://it.linkedin.com/jobs/view/externalApply/3221880417?url=https%3A%2F%2Fteamtailor%2Eassessfirst%2Ecom%2Fjobs%2F1462616-uk-part-time-business-psychologist-student-intern%3Fpromotion%3D464724-trackable-share-link-uk-business-psychologist-li&urlHash=dzk3&trk=public_jobs_apply-link-offsite
[{'_id': '622a65b4671f2c8b98fac83f', 'name': 'United Kingdom', 'alpha_code': 'GBR', 'continent': '622a659af0bac38678ed1398', 'geo': [-0.127758, 51.507351], 'name_es': 'Reino Unido', 'name_fr': 'Royaume-Uni', 'deleted_at': None, 'amount_of_use': 11407, 'alpha_2_code': 'GB'}]
                        _id    id              name           name_es           name_fr  category_id  status            createdAt            updatedAt  deletedAt  hard_skill_categories       hard_skill_category
0  623ca7112198fdff24e1a1b0     5            Design            Design            Design            1       1  0000-00-00 00:00:00  0000-00-00 00:00:00       None              Marketing  621d2a97058dc9445a92c4be
1  623ca7112198fdff24e1a249   173          Research     Investigación         Recherche            8       1  0000-00-00 00:00:00  0000-00-00 00:00:00       None               Business  621d2a97058dc9445a92c4c5
2  623ca7112198fdff24e1a24a   174           Science           Ciencia           Science            8       1  0000-00-00 00:00:00  0000-00-00 00:00:00       None               Business  621d2a97058dc9445a92c4c5
3  623ca7112198fdff24e1a292  1165  Customer Success  Customer Success  Customer Success            4       1  2021-07-07 10:53:19  2021-07-07 10:53:19       None                  Sales  621d2a97058dc9445a92c4c1
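The long query string in the API URL above is only a set of URL-encoded parameters, so it can help to decode it and see what is being requested. Below is a minimal sketch using the standard library. The helper name decode_query is made up for illustration, and short_url is a shortened copy of the URL above that keeps only its first two parameters.

```python
from urllib.parse import urlsplit, parse_qsl

def decode_query(url):
    """Return the query string of `url` as a list of decoded (key, value) pairs."""
    return parse_qsl(urlsplit(url).query)

# Shortened copy of the API URL used above (first two parameters only).
short_url = (
    'https://filter-api.jobgether.com/api/offer/63083ece6d137a0ac6e701e6'
    '?%24populate%5B0%5D%5Bpath%5D=meta.continents'
    '&%24populate%5B0%5D%5Bselect%5D=name'
)

for key, value in decode_query(short_url):
    print(key, '=', value)
# $populate[0][path] = meta.continents
# $populate[0][select] = name
```

Running the same helper on the full URL lists every field the page asks the API to populate.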
You can print out the full JSON response, inspect it, and extract the relevant information from it (it's quite comprehensive). Relevant documentation for requests:
https://requests.readthedocs.io/en/latest/
And the pandas documentation for json_normalize:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.json_normalize.html
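As a starting point for inspecting the response, here is a small sketch that lists the keys of a nested dict with their value types and then pretty-prints it. The summarize helper is hypothetical, and sample is a tiny stand-in for the object returned by r.json(); the real response has many more keys.

```python
import json

def summarize(obj, depth=0, max_depth=2):
    """Return one line per dict key (indented by nesting level) with the value's type."""
    lines = []
    if isinstance(obj, dict) and depth < max_depth:
        for key, value in obj.items():
            lines.append('  ' * depth + f'{key}: {type(value).__name__}')
            lines.extend(summarize(value, depth + 1, max_depth))
    return lines

# Stand-in for obj = r.json(); only a few fields are reproduced here.
sample = {
    'title': 'Part-Time Business Psychologist Intern',
    'meta': {'countries': [{'name': 'United Kingdom', 'alpha_2_code': 'GB'}]},
}

print('\n'.join(summarize(sample)))
# Pretty-print the whole structure; ensure_ascii=False keeps accented text readable.
print(json.dumps(sample, indent=2, ensure_ascii=False))
```

With the real response, replace sample with obj and raise max_depth to look further into the nested structure.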
Answered By - Barry the Platipus