Issue
I'm trying to scrape data from this page with BeautifulSoup: https://www.ycombinator.com/jobs/role/software-engineer
This is the script I am using:
from bs4 import BeautifulSoup
import requests
ycombinator_website_all = requests.get("https://www.ycombinator.com/jobs/role/software-engineer")
ycomb_website = ycombinator_website_all.text
soup = BeautifulSoup(ycomb_website, "html.parser")
print(soup)
I expected to see the entire website's contents when running this code, and then to individually grab all the anchor tags to get the job titles, but all I get is a few HTML tags and some CSS. How can I actually get the full page contents? When I right-click a job title and choose Inspect, I can see the anchor tags, but they don't appear in what BeautifulSoup receives:
<!DOCTYPE html>
<html lang="en">
<head>
<title>Y Combinator | File Not Found</title>
<meta charset="utf-8">
<meta content="IE=edge;chrome=1" http-equiv="X-UA-Compatible">
<meta content="initial-scale=1, width=device-width, height=device-height" name="viewport">
<link href="/favicon.ico" rel="shortcut icon" type="image/x-icon">
<!--[if lt IE 9]><script src="//html5shim.googlecode.com/svn/trunk/html5.js"></script><![endif]-->
<style>
* {
margin: 0;
padding: 0;
}
html, body {
height: 100%;
background: #FDFDF8;
color: #de662c;
font-family: 'Avenir', sans-serif;
display: flex;
flex-direction: column;
align-items: center;
justify-content: center;
}
h1 {
font-size: 80px;
line-height: 80px;
}
a {
color: #268bd2;
font-weight: 500;
}
p {
position: absolute;
bottom: 20px;
background: inherit;
}
</style>
</head>
<body class="nojs">
<h1>404</h1>
<h2>File Not Found</h2>
<a href="/" id="try">Back to the homepage</a>
<p>For support please contact <a href='mailto:[email protected]'>[email protected]</a></p>
</body>
</html>
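The output above is the server's 404 page, not the job listing. A quick sanity check before scraping is to inspect what actually came back: on a live request, `response.status_code`, or the parsed `<title>` of the page. A minimal sketch against a trimmed copy of the 404 page shown above:

```python
from bs4 import BeautifulSoup

# A trimmed copy of the 404 page returned above; on a live request you would
# also check `response.status_code != 200` before doing any parsing.
html = """<!DOCTYPE html>
<html lang="en">
<head><title>Y Combinator | File Not Found</title></head>
<body class="nojs"><h1>404</h1><h2>File Not Found</h2></body>
</html>"""

soup = BeautifulSoup(html, "html.parser")
title = soup.title.string
print(title)  # Y Combinator | File Not Found

# Flag the error page instead of silently scraping it:
is_error_page = "Not Found" in title
print(is_error_page)  # True
```

This makes the failure explicit: the script was parsing an error page all along, which is why no job anchors were found.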
Solution
The server is returning its 404 page because the request doesn't look like it comes from a browser. To get the correct response, set the Accept HTTP header (along with a browser-like User-Agent):
import pandas as pd
import requests
from bs4 import BeautifulSoup

url = "https://www.ycombinator.com/jobs/role/software-engineer"
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:120.0) Gecko/20100101 Firefox/120.0",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8",
}

soup = BeautifulSoup(requests.get(url, headers=headers).content, "html.parser")

all_data = []
# Each job row is an <li> that contains an "Apply" link:
for job in soup.select('li:has(a:-soup-contains("Apply"))'):
    # Join the row's text fragments with a unique separator, then split into columns:
    row = job.get_text(strip=True, separator="|||").split("|||")
    all_data.append(row)

df = pd.DataFrame(all_data)
# Drop uninteresting columns and merge the "posted ... ago" fragments into one:
df = df.drop(columns=[2, 11])
df["time"] = df.pop(4) + " " + df.pop(5) + " " + df.pop(6)
print(df.head(10).to_markdown(index=False))
Prints:
0 | 1 | 3 | 7 | 8 | 9 | 10 | time |
---|---|---|---|---|---|---|---|
Prelim | (S17) | Software for banks to open bank accounts | Software Engineer | Full-time | US / San Francisco, CA, US / New York, NY, US / Remote (US; San Francisco, CA, US; New York, NY, US) | Full Stack | (about 1 minute ago) |
Terra | (W21) | API for dynamic health data from wearables and sensors | Software Engineer | Full-time | London, UK | Full Stack | (about about 1 hour ago) |
Roboflow | (S20) | 🖼️ Give your software the sense of sight. | Full Stack Machine Learning Engineer | Full-time | Remote / New York, NY, US / San Francisco, CA, US / Des Moines, IA, US / Remote (US) | Machine Learning | (about 15 minutes ago) |
Fair Square | (W20) | Bringing delight, health, and financial security to people over 65 | Product Engineer @ Fair Square Medicare | Full-time | New York, NY, US / Remote (US) | Full Stack | (about 20 minutes ago) |
Invert | (W22) | Data analytics software for biomanufacturing. | Senior Software Engineer | Full-time | US / Remote (US) | Full Stack | (about about 2 hours ago) |
Optery | (W22) | Opt out software that removes your private info from the internet | Full Stack Developer - NodeJS | Contract | MX / DO / BO / CR / CL / AR / EC / PE / VE / IL / UA / Remote (MX; DO; BO; CR; CL; AR; EC; PE; VE; IL; UA) | Full Stack | (about 16 minutes ago) |
Govdash | (W22) | AI proposal and capture solution for GovCon. | Software Engineer, Full Stack | Full-time | San Francisco, CA, US | Backend | (about 17 minutes ago) |
Moonvalley | (W21) | Generate cinematic videos with AI | Head of AI | Full-time | Remote | Machine Learning | (about about 3 hours ago) |
Forage | (S21) | Payments infrastructure for government benefits | Principal Backend Engineer | Full-time | US / MX / CA / Remote (US; MX; CA) | Full Stack | (about 2 minutes ago) |
Kalshi | (W19) | 1st federally regulated exchange where people can trade on events | Data at Kalshi | Full-time | New York, NY, US / Remote | Data Science | (about about 2 hours ago) |
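Note that both the `li:has(a:-soup-contains("Apply"))` selector and the positional `"|||"` split are tied to the page's current markup, so they may break if Y Combinator changes the layout. The selector logic itself can be verified offline on a small sample; the markup below is a made-up stand-in for illustration, not the real page:

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the job-list markup (the real page differs):
html = """
<ul>
  <li><span>Acme</span><span>(S20)</span><span>Software Engineer</span><a href="#">Apply</a></li>
  <li><span>Not a job row</span><a href="#">Learn more</a></li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")

# Only <li> elements containing an "Apply" link are treated as job rows;
# :-soup-contains is a soupsieve extension bundled with BeautifulSoup 4.
rows = [
    li.get_text(strip=True, separator="|||").split("|||")
    for li in soup.select('li:has(a:-soup-contains("Apply"))')
]
print(rows)  # [['Acme', '(S20)', 'Software Engineer', 'Apply']]
```

The second `<li>` is skipped because its link text is not "Apply", which is exactly how the answer's selector filters non-job rows on the live page.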
Answered By - Andrej Kesely