Issue
I am having trouble saving URLs extracted from a page.
I have tried something like this:
url = "https://in.indeed.com/jobs?q=software%20engineer%20&l=Kerala"
soup = BeautifulSoup(requests.get(url, headers=headers).content, "html.parser")
Links1 = soup.find_all("div",{"class:","pagination"})
url = [Links1.find(('a')['href'] for tag in Links1)]
WEbsite=f'https://in.indeed.com{url[0]}'
but it's not returning the full URL list. I need the URL to navigate to the next page.
Solution
Are you just after the "next page" link, or do you want all of the pagination links?
That is, do you want just:
/jobs?q=software+engineer+&l=Kerala&start=10
or are you after all of these?
/jobs?q=software+engineer+&l=Kerala&start=10
/jobs?q=software+engineer+&l=Kerala&start=20
/jobs?q=software+engineer+&l=Kerala&start=30
/jobs?q=software+engineer+&l=Kerala&start=40
/jobs?q=software+engineer+&l=Kerala&start=50
A few issues:
- Links1 is a list of elements, and you are then calling .find('a') on that list, which won't work. Call .find() on each element instead.
- Since you want the href attributes, use find('a', href=True).
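To see why the original fails: find_all returns a ResultSet (a list of tags), and .find() only works on an individual tag. A minimal sketch with made-up pagination markup:

```python
from bs4 import BeautifulSoup

# Made-up pagination markup, just to illustrate the ResultSet-vs-Tag issue.
html = '<div class="pagination"><a href="/jobs?start=10">2</a></div>'
soup = BeautifulSoup(html, "html.parser")

divs = soup.find_all("div", {"class": "pagination"})  # ResultSet (list-like)
# divs.find('a') raises an AttributeError; call .find() on each element instead:
hrefs = [div.find("a", href=True)["href"] for div in divs]
print(hrefs)  # ['/jobs?start=10']
```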
So here's how I would go about it:
import requests
from bs4 import BeautifulSoup
url = "https://in.indeed.com/jobs?q=software%20engineer%20&l=Kerala"
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36'}
soup = BeautifulSoup(requests.get(url, headers=headers).content, "html.parser")
Links1 = soup.find_all("div",{"class":"pagination"})
urls = [tag.find('a', href=True)['href'] for tag in Links1]  # avoid shadowing the url string
website = f'https://in.indeed.com{urls[0]}'
Output:
print(website)
https://in.indeed.com/jobs?q=software+engineer+&l=Kerala&start=10
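As an aside, rather than building the full URL with an f-string, the standard library's urljoin handles relative hrefs safely:

```python
from urllib.parse import urljoin

base = "https://in.indeed.com"
next_href = "/jobs?q=software+engineer+&l=Kerala&start=10"
next_url = urljoin(base, next_href)
print(next_url)  # https://in.indeed.com/jobs?q=software+engineer+&l=Kerala&start=10
```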
To get all those links:
import requests
from bs4 import BeautifulSoup
url = "https://in.indeed.com/jobs?q=software%20engineer%20&l=Kerala"
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36'}
soup = BeautifulSoup(requests.get(url, headers=headers).content, "html.parser")
Links1 = soup.find("div",{"class":"pagination"})
urls = [tag['href'] for tag in Links1.find_all('a', href=True)]
websites = [f'https://in.indeed.com{u}' for u in urls]
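Alternatively, since Indeed appears to paginate via a start query parameter in steps of 10, you could generate the page URLs directly rather than scraping them from the pagination div (a sketch under that assumption):

```python
# Assumes Indeed's pagination uses start offsets of 10, 20, 30, ...
base = "https://in.indeed.com/jobs?q=software+engineer+&l=Kerala"
pages = [f"{base}&start={offset}" for offset in range(10, 60, 10)]
print(pages[0])  # https://in.indeed.com/jobs?q=software+engineer+&l=Kerala&start=10
```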
Answered By - chitown88