Issue
I was writing a small script to crawl a website of mine. The script crawls the site and checks whether each page is broken based on its status code; it also checks whether each URL contains a certain word. The problem is that the site has "tel:" links (phone numbers) and "mailto:" links as well.
I've tried adding a couple of if conditions to handle this sort of thing, but I keep getting stuck with the error:
InvalidSchema: No connection adapters were found for 'tel:0412345678'
Here is the snippet of my code:
```python
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin, urlparse

def check_url(url):
    if url.startswith(('http://', 'https://')):
        response = requests.get(url)
        return response.status_code
    else:
        return None  # Skip unsupported schemes

def extract_url(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    urls = []
    for link in soup.find_all('a'):
        href = link.get('href')
        if href:
            parsed_url = urlparse(url)
            if parsed_url.scheme != '' and parsed_url.scheme != 'tel:':
                absolute_url = urljoin(url, href)
                urls.append(absolute_url)
    return urls

def crawl(url, word):
    visited = set()
    dead_pages = set()

    def crawl_helper(url, word):  # Add the 'word' parameter
        if url in visited:
            return
        visited.add(url)
        status_code = check_url(url)
        if status_code == 404:
            dead_pages.add(url)
        if word in url:
            print(f'URL "{url}" contains "{word}".')
        urls = extract_url(url)
        for u in urls:
            crawl_helper(u, word)  # Pass the 'word' parameter

    crawl_helper(url, word)  # Pass the 'word' parameter
    return dead_pages

starting_url = 'https://google.com'  # this is just a placeholder
word_to_search = 'test'  # this is also a placeholder

dead_pages = crawl(starting_url, word_to_search)
for page in dead_pages:
    print(f'Dead page: {page}')
```
I'm still new to BeautifulSoup but eager to learn.
Solution
You need to check for a valid URL before using `requests.get`, at the beginning of `extract_url` - maybe redefine it to something like:
```python
def extract_url(url):
    ## check BEFORE making the request
    parsed_url = urlparse(url)
    if parsed_url.scheme not in ['http', 'https']:
        return []
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    urls = []
    ## add href=True to find_all --> no need to check if href
    for link in soup.find_all('a', href=True):
        urls.append(urljoin(url, link.get('href')))
    return urls
```
If you want to also parse and check each `href`:
```python
for link in soup.find_all('a', href=True):
    parsed_href = urlparse(href := link.get('href'))
    if parsed_href.scheme in ['', 'http', 'https']:
        urls.append(urljoin(url, href))
```
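This also explains the original error: `urlparse` reports the scheme without the trailing colon (so it's `'tel'`, never `'tel:'`), and the question's code parses `url` instead of `href`, so the filter never triggers and `requests` ends up with a scheme it has no adapter for. A small standalone sketch of the behaviour (the phone number is just the one from the error message):

```python
from urllib.parse import urlparse
import requests

print(urlparse('tel:0412345678').scheme)      # -> 'tel' (no trailing colon)
print(urlparse('mailto:foo@bar.com').scheme)  # -> 'mailto'
print(urlparse('/contact').scheme)            # -> '' (relative link)

# requests only ships adapters for http/https, so anything else raises InvalidSchema
try:
    requests.get('tel:0412345678')
except requests.exceptions.InvalidSchema as e:
    print(e)  # No connection adapters were found for 'tel:0412345678'
```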
On another note:

- It's a bit dangerous to crawl recursively, as you are with `crawl_helper`, since you risk running into a `RecursionError`.
- A queue-based crawler might be better - one where you loop through a list but add new URLs to that list within the loop. Even then, you risk getting stuck in an infinite loop unless you set some kind of limit on the size of the queue or on the size of `visited`. Something like:

```python
def crawl(url, word, max_visits=1000):
    visited = set()
    dead_pages = set()
    url_queue = [url]
    for url in url_queue:  # loop instead of recursion
        if url in visited:
            continue  # skip already-crawled URLs instead of returning
        visited.add(url)
        status_code = check_url(url)
        if status_code == 404:
            dead_pages.add(url)
        if word in url:
            print(f'URL "{url}" contains "{word}".')
        urls = extract_url(url)
        ## add to queue instead of recursive call
        url_queue += [u for u in urls if u not in url_queue]
        if len(visited) >= max_visits:
            break  ## stop when limit is reached
    return dead_pages
```
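For reference, it can be called the same way as before (using the placeholder URL and word from the question; `max_visits` is the new optional limit, and 200 here is just an illustrative value):

```python
# Example usage, reusing the question's placeholders and the revised crawl above
starting_url = 'https://google.com'  # placeholder
word_to_search = 'test'              # placeholder

dead_pages = crawl(starting_url, word_to_search, max_visits=200)
for page in dead_pages:
    print(f'Dead page: {page}')
```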
Answered By - Driftr95