Issue
Below is a simple script that uses BeautifulSoup to scrape emails from a single website. How do I modify it if I have a list of URLs saved in an Excel file and want to save the results to a CSV file?
I am wondering whether I should read the list of URLs with pandas, so that it is loaded as a DataFrame.
from bs4 import BeautifulSoup
import re
import csv
from urllib.request import urlopen

f = urlopen('http://www.nus.edu.sg/contact')
s = BeautifulSoup(f, 'html.parser')
s = s.get_text()

phone = re.findall(r"((?:\d{3}|\(\d{3}\))?(?:\s|-|\.)?\d{3}(?:\s|-|\.)\d{4})",s)
emails = re.findall(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,3}",s)

if len(phone) == 0:
    print("Sorry, no phone number found.")
    print('------------')
    print()
else:
    count = 1
    for item in phone:
        print(count, ' phone number(s) found : ', item)
        count += 1
    print('------------')
    print()

if len(emails) == 0:
    print("Sorry, no email address found.")
    print('------------')
    print()
else:
    count = 1
    for item in emails:
        print(count, ' email address(es) found : ', item)
        count += 1
Solution
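find_all (older name: findAll) can search for text nodes that match a regex pattern.
Compile the email pattern with re.compile(...) and pass the compiled pattern to find_all:
findAll(text=email_pattern)
In Beautiful Soup 4.4.0 and later the text argument is also available under the name string, so find_all(string=email_pattern) is equivalent.
The email pattern used here is the commonly cited regex approximation of RFC 5322 (see Ref below). It contains only lowercase character classes, so pass re.IGNORECASE to re.compile if addresses with uppercase letters should match as well.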
Email scraping from a single URL
from bs4 import BeautifulSoup, Comment
import re
from urllib.request import urlopen
email_pattern = re.compile(r"""(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:(2(5[0-5]|[0-4][0-9])|1[0-9][0-9]|[1-9]?[0-9]))\.){3}(?:(2(5[0-5]|[0-4][0-9])|1[0-9][0-9]|[1-9]?[0-9])|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])""")
f = urlopen('http://www.nus.edu.sg/contact').read()
soup = BeautifulSoup(f, 'html5lib')
emails = [x for x in soup.findAll(text=email_pattern) if not isinstance(x, Comment)]
print(emails)
Output:
[' [email protected]', ' [email protected]', ' [email protected]', ' [email protected]', ' [email protected]', ' [email protected]', ' [email protected]', ' [email protected]', ' [email protected]', ' [email protected]']
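Each entry in the output above is a whole text node rather than a bare address, which is why the strings carry leading spaces. If only the addresses are wanted, one option is to run a pattern over each returned string. The sketch below is a minimal illustration under assumptions: SIMPLE_EMAIL is a deliberately simplified pattern, not the RFC 5322 pattern from this answer, and extract_emails and the sample strings are hypothetical.

```python
import re

# Simplified pattern for illustration only (an assumption);
# it is not the RFC 5322 pattern used in the answer.
SIMPLE_EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def extract_emails(text_nodes):
    """Return the unique addresses found in an iterable of strings, in first-seen order."""
    seen = []
    for node in text_nodes:
        for address in SIMPLE_EMAIL.findall(node):
            if address not in seen:
                seen.append(address)
    return seen


# Hypothetical text nodes, shaped like what find_all returns for a contact page
nodes = [" Contact: info@example.com ", "Write to info@example.com or help@example.org."]
print(extract_emails(nodes))
```

In the answer's code, the list passed in would be the emails list built from soup.findAll.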
Reading URLs from Excel and saving to CSV
Read the URLs from a column of the Excel file, loop over them, collect the emails from each page, and write one row per URL to the CSV file. You don't need pandas for this (although you can use it); openpyxl is enough to read the Excel file.
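For readers who do prefer pandas, as the question suggests, here is a rough sketch of pulling the URL list out of a DataFrame. The helper urls_from_frame is hypothetical, and the commented read_excel call assumes the sheet has no header row and keeps the URLs in its first column.

```python
import pandas as pd


def urls_from_frame(frame, column=0):
    """Return the non-empty, stripped URL strings from one column of a DataFrame."""
    values = frame[column].dropna().astype(str).str.strip()
    return [v for v in values if v]


# In real use the frame would come from the workbook, for example:
#   frame = pd.read_excel("websites.xlsx", header=None)
# header=None is an assumption that the sheet has no header row.
frame = pd.DataFrame(
    {0: ["http://www.nus.edu.sg/contact", None, "  https://sloanreview.mit.edu/contact/ "]}
)
print(urls_from_frame(frame))
```

The returned list can then be looped over in the same way as the openpyxl column.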
websites.xlsx (one URL per row in column A)
Code
from bs4 import BeautifulSoup, Comment
import re
import csv
from urllib.request import urlopen
from openpyxl import load_workbook
email_pattern = re.compile(r"""(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:(2(5[0-5]|[0-4][0-9])|1[0-9][0-9]|[1-9]?[0-9]))\.){3}(?:(2(5[0-5]|[0-4][0-9])|1[0-9][0-9]|[1-9]?[0-9])|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])""")
# The source workbook is named websites.xlsx
wb = load_workbook("websites.xlsx")
ws = wb.active
# The first column is column A;
# change the letter if your URLs are in a different column
first_column = ws['A']
# newline='' stops the csv module from writing blank rows on Windows
with open('output.csv', 'w', newline='') as output_file:
    writer = csv.writer(output_file, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
    for cell in first_column:
        # Skip empty cells
        if not cell.value:
            continue
        link = str(cell.value).strip()
        f = urlopen(link).read()
        soup = BeautifulSoup(f, 'html5lib')
        emails = [x for x in soup.findAll(text=email_pattern) if not isinstance(x, Comment)]
        # Add the link also
        emails.insert(0, link)
        writer.writerow(emails)
output.csv
http://www.nus.edu.sg/contact, [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected]
https://sloanreview.mit.edu/contact/,[email protected],[email protected]
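With a long list of URLs, a single unreachable site would stop the loop above partway through. One possible way to keep going is sketched below, under assumptions: write_rows and fake_get_emails are hypothetical names, and get_emails stands in for the urlopen plus BeautifulSoup steps so that the example runs without network access.

```python
import csv
import io


def write_rows(links, get_emails, output_file):
    """Write one CSV row per link: the link followed by its emails.

    get_emails is any callable that takes a URL and returns a list of strings.
    A link that fails is recorded with an error marker instead of stopping the run.
    """
    writer = csv.writer(output_file)
    failed = []
    for link in links:
        try:
            emails = get_emails(link)
        except (OSError, ValueError) as exc:
            # urllib.error.URLError is a subclass of OSError;
            # urlopen raises ValueError for a malformed URL.
            writer.writerow([link, "ERROR: " + str(exc)])
            failed.append(link)
            continue
        # Strip the surrounding whitespace that text nodes often carry
        writer.writerow([link] + [e.strip() for e in emails])
    return failed


def fake_get_emails(link):
    """Stand-in fetcher (hypothetical) so the sketch runs offline."""
    if "broken" in link:
        raise OSError("connection refused")
    return [" info@example.com", "help@example.com "]


buffer = io.StringIO()
failed = write_rows(
    ["http://example.com/contact", "http://broken.example/"], fake_get_emails, buffer
)
print(buffer.getvalue())
print(failed)
```

In real use, output_file would be the file opened with open('output.csv', 'w', newline=''), and get_emails would wrap the fetching and parsing code from the answer.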
Ref:
How to validate an email address using a regular expression?
Answered By - Bitto Bennichan