Issue
I was writing a small script to crawl a website of mine. The script crawls the site and checks whether each page is broken based on its status code; it also checks whether each URL contains a certain word. The problem is that the site has "tel:" links (phone numbers) and "mailto:" links as well.
I've tried adding a couple of if conditions to handle this sort of thing, but I keep getting stuck with the error:
InvalidSchema: No connection adapters were found for 'tel:0412345678'
Here is the snippet of my code:
```python
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin, urlparse

def check_url(url):
    if url.startswith(('http://', 'https://')):
        response = requests.get(url)
        return response.status_code
    else:
        return None  # Skip unsupported schemes

def extract_url(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    urls = []
    for link in soup.find_all('a'):
        href = link.get('href')
        if href:
            parsed_url = urlparse(url)
            if parsed_url.scheme != '' and parsed_url.scheme != 'tel:':
                absolute_url = urljoin(url, href)
                urls.append(absolute_url)
    return urls

def crawl(url, word):
    visited = set()
    dead_pages = set()

    def crawl_helper(url, word):  # Add the 'word' parameter
        if url in visited:
            return
        visited.add(url)
        status_code = check_url(url)
        if status_code == 404:
            dead_pages.add(url)
        if word in url:
            print(f'URL "{url}" contains "{word}".')
        urls = extract_url(url)
        for u in urls:
            crawl_helper(u, word)  # Pass the 'word' parameter

    crawl_helper(url, word)  # Pass the 'word' parameter
    return dead_pages

starting_url = 'https://google.com'  # this is just a placeholder
word_to_search = 'test'  # this is also a placeholder

dead_pages = crawl(starting_url, word_to_search)
for page in dead_pages:
    print(f'Dead page: {page}')
```
I'm still new to BeautifulSoup but eager to learn.
Solution
You need to check for a valid URL before using `requests.get`, at the beginning of `extract_url` - maybe redefine it to something like:
```python
def extract_url(url):
    ## check BEFORE making the request
    parsed_url = urlparse(url)
    if parsed_url.scheme not in ['http', 'https']:
        return []
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    urls = []
    ## add href=True to find_all --> no need to check if href
    for link in soup.find_all('a', href=True):
        urls.append(urljoin(url, link.get('href')))
    return urls
```
If you want to also parse and check each `href`:
```python
for link in soup.find_all('a', href=True):
    parsed_href = urlparse(href := link.get('href'))
    if parsed_href.scheme in ['', 'http', 'https']:
        urls.append(urljoin(url, href))
```
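This also explains the original error: `urlparse` reports the scheme without the trailing colon (so it's `'tel'`, never `'tel:'`), and the question's code parses `url` instead of `href`, so the filter never triggers and `requests` ends up with a scheme it has no adapter for. A small standalone sketch of the behaviour (the phone number is just the one from the error message):

```python
from urllib.parse import urlparse
import requests

print(urlparse('tel:0412345678').scheme)      # -> 'tel' (no trailing colon)
print(urlparse('mailto:foo@bar.com').scheme)  # -> 'mailto'
print(urlparse('/contact').scheme)            # -> '' (relative link)

# requests only ships adapters for http/https, so anything else raises InvalidSchema
try:
    requests.get('tel:0412345678')
except requests.exceptions.InvalidSchema as e:
    print(e)  # No connection adapters were found for 'tel:0412345678'
```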
On another note:

- It's a bit dangerous to crawl recursively, as you are with `crawl_helper`, since you risk running into a `RecursionError`.
- A queue-based crawler might be better - one where you loop through a list but add new URLs to that list within the loop. Even then, you risk getting stuck in an infinite loop unless you set some kind of limit on the size of the queue or on the size of `visited`. Something like:

```python
def crawl(url, word, max_visits=1000):
    visited = set()
    dead_pages = set()
    url_queue = [url]
    for url in url_queue:  # loop instead of recursion
        if url in visited:
            continue  # skip already-crawled URLs instead of returning
        visited.add(url)
        status_code = check_url(url)
        if status_code == 404:
            dead_pages.add(url)
        if word in url:
            print(f'URL "{url}" contains "{word}".')
        urls = extract_url(url)
        ## add to queue instead of recursive call
        url_queue += [u for u in urls if u not in url_queue]
        if len(visited) >= max_visits:
            break  ## stop when limit is reached
    return dead_pages
```
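For reference, it can be called the same way as before (using the placeholder URL and word from the question; `max_visits` is the new optional limit, and 200 here is just an illustrative value):

```python
# Example usage, reusing the question's placeholders and the revised crawl above
starting_url = 'https://google.com'  # placeholder
word_to_search = 'test'              # placeholder

dead_pages = crawl(starting_url, word_to_search, max_visits=200)
for page in dead_pages:
    print(f'Dead page: {page}')
```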
Answered By - Driftr95