Issue
I need to scrape the following data:
- The name of the company that is hiring
- The location of the company
- The position that the ad is for
This is the website that I want to scrape from: link. I was able to get the td data, but I need to start from a specific tag (i.e., start from this tr tag)
<tr style="height:14px"></tr>
<tr class='athing' id='20463814'>
<td align="right" valign="top" class="title"><span class="rank"></span></td> <td></td><td class="title"><a href="https://mino-games.workable.com/j/69BCF95C8F" class="storylink" rel="nofollow">Mino Games (YC W11) Is Hiring Game Developers in Montreal</a><span class="sitebit comhead"> (<a href="from?site=workable.com"><span class="sitestr">workable.com</span></a>)</span></td></tr><tr><td colspan="2"></td><td class="subtext">
<span class="age"><a href="item?id=20463814">11 hours ago</a></span> </td></tr>
and then keep moving through the following rows, storing the company name, location, and position of each ad in separate variables. I know it's a lot to ask for, but I would appreciate any help that you can provide.
This is what I tried:
import requests
from bs4 import BeautifulSoup
url = 'https://news.ycombinator.com/jobs'
plain_html_text = requests.get(url);
soup = BeautifulSoup(plain_html_text.text, "html.parser")
table_body = soup.find('tbody')
rows = soup.find('tr')
for row in rows:
    cols = row.find_all('td')
    cols = [x.text.strip() for x in cols]
    print (cols)
Solution
What you want is not an easy problem, because the titles are free text, but this script could get you started:
import re
import requests
from bs4 import BeautifulSoup

url = 'https://news.ycombinator.com/jobs'
plain_html_text = requests.get(url)
soup = BeautifulSoup(plain_html_text.text, "html.parser")

rows = []
# Select every job title link, skipping the cell that holds the "More" link
for title in soup.select('.title:not(:has(.morelink)) .storylink'):
    t = title.get_text(strip=True)

    # Company: everything before the first hiring keyword
    company = re.findall(r'^(.*?)(?:is hiring|is looking|seeking|hiring)', t, flags=re.I)
    if company:
        company = company[0].strip()
    else:
        company = '-'

    # Position: text between the hiring keyword and the word "in" (or the end of the title)
    position = re.findall(r'(?:is hiring|is looking|seeking|hiring)(.*?)(?=\bin\b|$)', t, flags=re.I)
    if position:
        position = position[0].strip()
    else:
        position = '-'

    # Location: everything after the first standalone word "in"
    location = re.findall(r'(?:\bin\b)(.*)', t, flags=re.I)
    if location:
        location = location[0].strip()
    else:
        location = '-'

    rows.append([company, position, location])

# Print a centered header, then one left-aligned line per job
print('{: ^50}{: ^80}{: ^20}'.format('Company', 'Position', 'Location'))
for row in rows:
    c, p, l = row
    print('{: <50}{: <80}{: <20}'.format(c, p, l))
Prints:
                     Company                                                          Position                                          Location
Scale AI                                          engineers to accelerate the development of AI                                   -
Mino Games (YC W11)                               Game Developers                                                                 Montreal
BuildZoom (YC W13)                                – Help us un-break construction                                                 -
Bitmovin (YC S15)                                 a Video Solutions Architect/Software Engineer                                   Brazil
Streak – CRM for Gmail (YC S11)                                                                                                   Vancouver
ZeroCater (YC W11)                                a Director of Engineer                                                          SF
UpCodes (YC S17)                                  engineers to automate compliance for architects                                 -
Tech Nonprofit Upsolve (YC W19)                   a Software Engineer                                                             -
Gitlab (YC W15)                                   an Engineering Manager, Ecosystem                                               -
Saleswhale (YC S16)                               Our First U.S. Strategic Account Executive                                      -
Jerry (YC S17)                                    for a Director of Ops and Growth                                                -
Sourceress (YC S17)                               Product and ML Engineers (Remote OK, No Prior ML OK)                            -
GiveCampus (YC S15)                               a Product Designer who cares about education                                    -
Iris Automation                                   an Account Executive for B2B Flying Vehicle Software                            -
LogDNA (YC W15)                                   Software Engineers – DevOps Monitoring at Scale                                 -
Flexport                                          software engineers to work on our trucking apps                                 Chicago
Mux                                               an ML engineer to help train our machines to deliver better video               -
The Muse (YC W12)                                 a Product Director for Growth                                                   -
OneSignal                                         an SRE to scale our bare-metal infrastructure                                   -
Atomwise (YC W15)                                 a Senior Systems/Cloud Engineer                                                 -
Demodesk (YC W19)                                 Software Engineers                                                              Munich
Gusto                                             for Android and iOS developers to build our native mobile app                   -
Fond (YC W12)                                     an Engineering Manager                                                          Portland
ReadMe (YC W15)                                   – Help us make APIs easy to use                                                 -
Keeper (YC W19)                                   a lead engineer – help save gig workers money on taxes                          -
Asseta (YC S13)                                   a technical lead                                                                -
Tesorio (YC S15)                                  Engineering Managers, Senior Engineers                                          -
Standard Cognition (YC S17)                       – Work on vision systems                                                        Rust
Curebase (YC S18)                                 first sales hire – distributed clinical research                                -
Mashgin (YC W15)                                  a Fullstack SWE Interested                                                      Computer Vision/AI
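To experiment with the title-splitting step without hitting the network, the three regular expressions can be pulled into one helper. This is only a sketch under a few assumptions: the names split_title and HIRING are made up for illustration, re.search is used in place of re.findall (it returns the same first match), and the "Acme" title is a hypothetical example, not taken from the site.

```python
import re

# Keywords that separate the company name from the role
# (same list as in the script above; HIRING is an illustrative name).
HIRING = r'(?:is hiring|is looking|seeking|hiring)'


def split_title(title):
    """Split a job title into (company, position, location).

    A part whose pattern does not match is returned as '-'.
    """
    company = re.search(r'^(.*?)' + HIRING, title, flags=re.I)
    position = re.search(HIRING + r'(.*?)(?=\bin\b|$)', title, flags=re.I)
    location = re.search(r'\bin\b(.*)', title, flags=re.I)
    return tuple(m.group(1).strip() if m else '-'
                 for m in (company, position, location))


# Title taken from the HTML snippet in the question
print(split_title('Mino Games (YC W11) Is Hiring Game Developers in Montreal'))
# ('Mino Games (YC W11)', 'Game Developers', 'Montreal')

# Hypothetical title: the first standalone "in" is treated as the location marker
print(split_title('Acme Is Hiring Engineers Interested in Robotics'))
# ('Acme', 'Engineers Interested', 'Robotics')
```

The second call shows the main limitation of this approach: any standalone "in" is read as the start of the location, which is why rows such as Mashgin and Standard Cognition in the output above end up with a "location" that is not a place.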
Answered By - Andrej Kesely