Monday, June 20, 2022

[FIXED] Emails Scraping using BeautifulSoup from a list of URLs

Issue

Below is a simple script that uses BeautifulSoup to scrape emails from a single website. How do I modify it to read a list of URLs saved in an Excel file and save the results to a CSV file?

Should I read the list of URLs with pandas, so that it is loaded as a DataFrame?

from bs4 import BeautifulSoup
import re
import csv
from urllib.request import urlopen

f = urlopen('http://www.nus.edu.sg/contact')

s = BeautifulSoup(f, 'html.parser')
s = s.get_text()

phone = re.findall(r"((?:\d{3}|\(\d{3}\))?(?:\s|-|\.)?\d{3}(?:\s|-|\.)\d{4})",s)
emails = re.findall(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,3}",s)

if len(phone) == 0:
    print ("Sorry, no phone number found.")

    print('------------')
    print ()
else :
    count = 1
    for item in phone:
        print ( count, ' phone number(s) found : ',item )
        count += 1

print('------------')
print()

if len(emails) == 0:
    print("Sorry, no email address found.")
    print('------------')
    print()
else:
    count = 1
    for item in emails:
        print(count, ' email address(es) found : ', item)
        count += 1

Solution

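find_all (findAll is its older alias) can match text nodes against a regex pattern.

Compile the email pattern with re.compile(email-pattern) and pass the result to findAll. Each result is the whole text node that contains a match, not just the matched address.

findAll(text=email_pattern)

The email pattern used below is based on RFC 5322.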

Email scraping from a single url

from bs4 import BeautifulSoup, Comment
import re
from urllib.request import urlopen
email_pattern = re.compile(r"""(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:(2(5[0-5]|[0-4][0-9])|1[0-9][0-9]|[1-9]?[0-9]))\.){3}(?:(2(5[0-5]|[0-4][0-9])|1[0-9][0-9]|[1-9]?[0-9])|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])""")
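# Download the page and parse it (html5lib must be installed separately)
f = urlopen('http://www.nus.edu.sg/contact').read()
soup = BeautifulSoup(f, 'html5lib')
# Keep the text nodes that contain an email address, skipping HTML comments
emails = [x for x in soup.findAll(text=email_pattern) if not isinstance(x, Comment)]
print(emails)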

Output:

[' [email protected]', ' [email protected]', ' [email protected]', ' [email protected]', ' [email protected]', ' [email protected]', ' [email protected]', ' [email protected]', ' [email protected]', ' [email protected]']
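Each entry in the output above is a complete text node, which is why the values carry leading whitespace. A minimal sketch of pulling out only the matched addresses is shown here. It assumes a simplified pattern rather than the RFC 5322 one, and the name extract_addresses and the sample strings are hypothetical.

```python
import re

# Simplified pattern for illustration only; not the RFC 5322 pattern used above.
SIMPLE_EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def extract_addresses(text_nodes, pattern=SIMPLE_EMAIL):
    """Return the unique matched substrings, in first-seen order."""
    seen = []
    for node in text_nodes:
        # finditer + group(0) gives the whole match even if the
        # pattern contains capturing groups.
        for match in pattern.finditer(str(node)):
            address = match.group(0)
            if address not in seen:
                seen.append(address)
    return seen


# Hypothetical text nodes, standing in for the result of findAll.
nodes = [" Contact: info@example.com or info@example.com ", "sales@example.org."]
print(extract_addresses(nodes))
```

Using finditer instead of findall matters if the RFC 5322 pattern is passed in, because that pattern has capturing groups and findall would return tuples of groups rather than full addresses.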

Reading URLs from Excel and saving to CSV

You can read the URLs from a column of the Excel file, loop over them, extract the emails from each page, and write them to the CSV file. You don't have to use pandas for this (although you can); openpyxl is enough to read the Excel file.

websites.xlsx

Code

from bs4 import BeautifulSoup, Comment
import re
import csv
from urllib.request import urlopen
from openpyxl import load_workbook
email_pattern = re.compile(r"""(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:(2(5[0-5]|[0-4][0-9])|1[0-9][0-9]|[1-9]?[0-9]))\.){3}(?:(2(5[0-5]|[0-4][0-9])|1[0-9][0-9]|[1-9]?[0-9])|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])""")
# The source xlsx file is named websites.xlsx
wb = load_workbook("websites.xlsx")
ws = wb.active
# The first column is named A by default;
# change the letter if your URLs are in a different column
first_column = ws['A']
# newline='' stops the csv module from writing blank lines between rows on Windows
with open('output.csv', 'w', newline='') as output_file:
    writer = csv.writer(output_file, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
    for cell in first_column:
        # Skip empty cells
        if cell.value is None:
            continue
        link = str(cell.value).strip()
        f = urlopen(link).read()
        soup = BeautifulSoup(f, 'html5lib')
        emails = [x for x in soup.findAll(text=email_pattern) if not isinstance(x, Comment)]
        # Put the link in the first column of the row
        emails.insert(0, link)
        writer.writerow(emails)
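
The loop above stops at the first URL that fails to load. One possible way to keep going is to wrap the download in a small helper that returns None on failure, sketched below. The name fetch_html and the 10-second timeout are assumptions, not part of the original answer.

```python
from urllib.request import urlopen


def fetch_html(link, timeout=10):
    """Return the page body as bytes, or None if the request fails.

    Hypothetical helper; the timeout default is an arbitrary choice.
    """
    try:
        with urlopen(link, timeout=timeout) as response:
            return response.read()
    except (OSError, ValueError) as exc:
        # OSError covers URLError, HTTPError and socket timeouts;
        # ValueError is raised for strings that are not valid URLs
        # (for example, a missing scheme).
        print(f"Skipping {link}: {exc}")
        return None
```

Inside the loop, the call f = urlopen(link).read() could then become f = fetch_html(link), followed by a continue when f is None.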

output.csv

http://www.nus.edu.sg/contact, [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected]
https://sloanreview.mit.edu/contact/,[email protected],[email protected]
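
The rows above have a different number of columns for each URL, which some spreadsheet and data tools handle poorly. An alternative layout, sketched here under the assumption that one row per (url, email) pair is acceptable, writes a fixed two-column file. The function name write_long_rows and the sample data are hypothetical.

```python
import csv
import io


def write_long_rows(results, output_file):
    """Write a header and then one (url, email) row per address.

    `results` maps each URL to the list of text matches found on it.
    """
    writer = csv.writer(output_file)
    writer.writerow(["url", "email"])
    for link, emails in results.items():
        for email in emails:
            # strip() removes the surrounding whitespace of the text node
            writer.writerow([link, email.strip()])


# Hypothetical data, for illustration only; an in-memory buffer
# stands in for open('output.csv', 'w', newline='').
buffer = io.StringIO()
write_long_rows(
    {"http://example.com/contact": [" info@example.com", "sales@example.com"]},
    buffer,
)
print(buffer.getvalue())
```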

Ref:

How to validate an email address using a regular expression?

Answered By - Bitto Bennichan

This answer was collected from Stack Overflow and tested by PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
