Issue
I am trying to write code that scrapes information from a list of web pages. My goal is to collect all the data and save it in a JSON file. The result should look like this:
[
    {
        "title": "Python developer",
        "place": "Slovensko",
        "salary": "od 1000 €",
        "contract_type": "dohoda",
        "contact_email": "[email protected]"
    },
    ...
]
I wrote code that collects all the links from a seed page, and it works fine, but I am stuck at scraping the data. Here is the code I wrote:
from bs4 import BeautifulSoup
import requests
import re

zaciatok = "https://www.hyperia.sk/kariera"

def getHTMLdocument(zaciatok):
    response = requests.get(zaciatok)
    return response.text

vsetky_linky = []
html_document = getHTMLdocument(zaciatok)
soup = BeautifulSoup(html_document, "html.parser")
for link in soup.find_all("a", attrs={'href', "arrow-link"}):
    vsetky_linky.append(link.get("href"))
vsetky_linky.pop()

urls = []
for x in vsetky_linky:
    urls.append("https://www.hyperia.sk" + x)

daaata = []
for url in urls:
    print(url)
    req = requests.get(url)
    req.encoding = "utf-8-sig"
    polievka = BeautifulSoup(req.text, "html.parser")
    nadpis = polievka.find("div", attrs={'class': 'hero-text col-lg-12'})
    br = polievka.find("br")
    for p in polievka.select("p:has(br)"):
        daaata.append(
            [
                nadpis.get_text(strip=True),
                br.get_text(strip=True),
            ]
        )
print(daaata)
At the end I print the scraped data, and I see that it also pulled the text from under the header (I need only the header, "Python developer", not the text under it). Can you help me?
Solution
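Try to select your elements more specifically. The div you are selecting apparently wraps both the heading and the text below it, so get_text() returns both. In your case, select the <h1> instead: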
"title": polievka.h1.text,
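To see why the wrapper `div` returns more than the heading, here is a minimal sketch on a made-up HTML fragment. The markup is an assumption for illustration only, not the real page source: it simply puts an `<h1>` and a paragraph inside the same hero `div`.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the heading and the teaser text share one wrapper div.
html = """
<div class="hero-text col-lg-12">
  <h1>Python developer</h1>
  <p>Join our team.</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# get_text() on the wrapper concatenates the text of every descendant.
wrapper = soup.find("div", attrs={"class": "hero-text col-lg-12"})
wrapper_text = wrapper.get_text(strip=True)

# Selecting the <h1> itself returns only the heading.
title = soup.h1.get_text(strip=True)

print(wrapper_text)  # Python developerJoin our team.
print(title)         # Python developer
```

If a page had no `<h1>` at all, `soup.h1` would be `None`, so a real script may want to check for that before reading the text.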
Here is an example of how to use it in your for loop.
Feel free to adapt it to your final needs; my Slovak is not that good, so I do not know which fields matter ;)
...
daaata = []
for url in urls:
    print(url)
    req = requests.get(url)
    req.encoding = "utf-8-sig"
    polievka = BeautifulSoup(req.text, "html.parser")
    daaata.append({
        "title": polievka.h1.text,
        "place": polievka.select_one('img[alt="place"] + p br').next,
        "salary": polievka.select_one('img[alt="wage"] + p br').next,
        "contract_type": polievka.select_one('img[alt="work"] + p br').next,
        "contact_email": polievka.select_one('a[href^="mailto"]').get('href').split(':')[-1]
    })
daaata
Output
[{'title': 'Python developer - študent', 'place': 'Slovensko', 'salary': '6 € / hodina', 'contract_type': 'dohoda o brig. práci študenta', 'contact_email': '[email protected]'},
 {'title': 'Senior PPC špecialista', 'place': 'Slovensko', 'salary': 'od 1 800,- €', 'contract_type': 'TPP, živnosť', 'contact_email': '[email protected]'},
 {'title': 'Product owner', 'place': 'Slovensko', 'salary': 'od 2 000 ,- €', 'contract_type': 'TPP, živnosť', 'contact_email': '[email protected]'},
 {'title': 'Lead Frontend developer', 'place': 'Slovensko', 'salary': '2 000 - 4 000 ,- €', 'contract_type': 'TPP, živnosť', 'contact_email': '[email protected]'},
 {'title': 'Frontend developer (medior/senior)', 'place': 'Slovensko', 'salary': '2 000 - 4 000 ,- €', 'contract_type': 'TPP, živnosť', 'contact_email': '[email protected]'},
 {'title': 'Kimbino senior PHP developer', 'place': 'Slovensko', 'salary': 'od 2 000 ,- €', 'contract_type': 'TPP, živnosť', 'contact_email': '[email protected]'}]
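The selectors in the loop assume that every job page has the same icon-plus-paragraph layout and a `mailto:` link. If one of them is missing, `select_one` returns `None` and the loop stops with an `AttributeError`. A more defensive variant might look like the sketch below. The helper names `text_after_br` and `parse_job`, the sample markup, and the `example.com` address are all hypothetical, reconstructed only from the selectors used in the answer.

```python
from bs4 import BeautifulSoup


def text_after_br(soup, alt):
    """Return the text following the <br> in the <p> right after <img alt=...>."""
    # "img + p" is the adjacent-sibling combinator: the <p> directly after the <img>.
    br = soup.select_one(f'img[alt="{alt}"] + p br')
    if br is None or br.next_element is None:
        return None
    return str(br.next_element).strip()


def parse_job(html):
    """Build one job record, using None for any field that is not found."""
    soup = BeautifulSoup(html, "html.parser")
    h1 = soup.h1
    mail = soup.select_one('a[href^="mailto:"]')
    return {
        "title": h1.get_text(strip=True) if h1 else None,
        "place": text_after_br(soup, "place"),
        "salary": text_after_br(soup, "wage"),
        "contract_type": text_after_br(soup, "work"),
        # Drop the "mailto:" scheme and any "?subject=..." query part.
        "contact_email": mail["href"].split(":", 1)[1].split("?", 1)[0] if mail else None,
    }


# Hypothetical page fragment; note that the "work" block is deliberately missing.
sample = """
<h1>Python developer</h1>
<img alt="place"><p>Miesto<br>Slovensko</p>
<img alt="wage"><p>Plat<br>od 1000 €</p>
<a href="mailto:jobs@example.com">Contact</a>
"""

job = parse_job(sample)
print(job)
```

In the loop, `daaata.append(parse_job(req.text))` would then replace the inline dictionary, and a missing field shows up as `None` instead of aborting the whole run.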
Answered By - HedgeHog
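The question also asks for the result to be saved in a JSON file, which the answer does not cover. One way to do that with the standard library is sketched below. The file name `jobs.json` and the sample record are placeholders; in the real script `daaata` would be the list built by the scraping loop.

```python
import json

# Placeholder record standing in for the list built by the scraping loop.
daaata = [
    {
        "title": "Python developer",
        "place": "Slovensko",
        "salary": "od 1000 €",
        "contract_type": "dohoda",
    }
]

# ensure_ascii=False keeps characters such as "€" and "š" readable in the file;
# encoding="utf-8" makes the file encoding explicit.
with open("jobs.json", "w", encoding="utf-8") as f:
    json.dump(daaata, f, ensure_ascii=False, indent=2)

# Read the file back to confirm that it round-trips.
with open("jobs.json", encoding="utf-8") as f:
    loaded = json.load(f)
```

Without `ensure_ascii=False` the file is still valid JSON, but non-ASCII characters are written as `\uXXXX` escapes, which is harder to read for Slovak text.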