Issue
I have tried several iterations from other posts and nothing seems to work for my needs.
I have a list of URLs that I want to loop through, pulling every associated URL that contains an email address. I then want to store the URLs and email addresses in a CSV file.
For example, if I went to 10torr.com, the program should find each of the pages within the main URL (e.g. 10torr.com/about) and pull any emails.
Below is a list of 5 example websites that are currently in a data frame format when run through my code. They are saved under the variable small_site.
A helpful answer will include the use of the user-defined function listed below called get_info(). Hard-coding the websites into the Spider itself is not a feasible option, as this will be used by many other people with different website list lengths.
Website
http://10torr.com/
https://www.10000drops.com/
https://www.11wells.com/
https://117westspirits.com/
https://www.onpointdistillery.com/
Below is the code that I am running. The spider seems to run, but there is no output in my CSV file.
import os
import pandas as pd
import re
import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor
small_site = site.head()
#%% Start Spider
class MailSpider(scrapy.Spider):
    name = 'email'

    def parse(self, response):
        links = LxmlLinkExtractor(allow=()).extract_links(response)
        links = [str(link.url) for link in links]
        links.append(str(response.url))
        for link in links:
            yield scrapy.Request(url=link, callback=self.parse_link)

    def parse_link(self, response):
        for word in self.reject:
            if word in str(response.url):
                return
        html_text = str(response.text)
        mail_list = re.findall(r'\w+@\w+\.\w+', html_text)
        dic = {'email': mail_list, 'link': str(response.url)}
        df = pd.DataFrame(dic)
        df.to_csv(self.path, mode='a', header=False)
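As an aside, the email pattern used in parse_link simply matches word characters on either side of an @ sign. A quick standalone check (with made-up sample text) shows both what it catches and a limitation: it truncates multi-dot domains at the first dot group.

```python
import re

# Same pattern as in parse_link (raw-string form avoids escape warnings)
pattern = r'\w+@\w+\.\w+'

sample = 'Contact sales@example.com or support@example.co.uk for help.'
emails = re.findall(pattern, sample)
print(emails)  # → ['sales@example.com', 'support@example.co']
```

Note that 'support@example.co.uk' comes back as 'support@example.co', so addresses on multi-part domains are clipped by this pattern.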
#%% Preps a CSV File
def ask_user(question):
    response = input(question + ' y/n' + '\n')
    if response == 'y':
        return True
    else:
        return False

def create_file(path):
    response = False
    if os.path.exists(path):
        response = ask_user('File already exists, replace?')
        if response == False: return

    with open(path, 'wb') as file:
        file.close()
#%% Defines function that will extract emails and enter it into CSV
def get_info(url_list, path, reject=[]):
    create_file(path)
    df = pd.DataFrame(columns=['email', 'link'], index=[0])
    df.to_csv(path, mode='w', header=True)

    print('Collecting Google urls...')
    google_urls = url_list

    print('Searching for emails...')
    process = CrawlerProcess({'USER_AGENT': 'Mozilla/5.0'})
    process.start()

    for i in small_site.Website.iteritems():
        print('Searching for emails...')
        process.crawl(MailSpider, start_urls=google_urls, path=path, reject=reject)
        ##process.start()

    print('Cleaning emails...')
    df = pd.read_csv(path, index_col=0)
    df.columns = ['email', 'link']
    df = df.drop_duplicates(subset='email')
    df = df.reset_index(drop=True)
    df.to_csv(path, mode='w', header=True)
    return df

url_list = small_site
path = 'email.csv'
df = get_info(url_list, path)
I am not certain where I am going wrong, as I am not getting any error messages. If you need additional information, please just ask. I have been trying to get this working for almost a month now, and I feel like I am banging my head against the wall at this point.
The majority of this code came from the article Web scraping to extract contact information — Part 1: Mailing Lists. However, I have not been successful in expanding it to my needs. It worked fine for one-off runs when using the article's Google search function to get the base URLs.
Thank you in advance for any assistance you are able to provide.
Solution
It took a while, but the answer finally came to me. The following is how the final answer came to be. This will work with a changing list, as in the original question.
The change ended up being very minor. I needed to add the following user-defined function.
def get_urls(io, sheet_name):
    data = pd.read_excel(io, sheet_name)
    urls = data['Website'].to_list()
    return urls
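The heavy lifting in get_urls is just pandas column extraction. As an illustration, the same .to_list() step on an in-memory frame (no Excel file needed; the values here are made up) behaves like this:

```python
import pandas as pd

# Stand-in for pd.read_excel(io, sheet_name): a frame with a 'Website' column
data = pd.DataFrame({'Website': ['http://10torr.com/', 'https://www.11wells.com/']})

urls = data['Website'].to_list()
print(urls)  # a plain Python list, ready to pass as start_urls
```

The only requirement on the spreadsheet is that the sheet contains a column named Website.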
From there, it was a simple change to the get_info() user-defined function. We needed to set google_urls in this function to the result of get_urls() and pass in the spreadsheet arguments. The full code for this function is below.
def get_info(io, sheet_name, path, reject=[]):
    create_file(path)
    df = pd.DataFrame(columns=['email', 'link'], index=[0])
    df.to_csv(path, mode='w', header=True)

    print('Collecting Google urls...')
    google_urls = get_urls(io, sheet_name)

    print('Searching for emails...')
    process = CrawlerProcess({'USER_AGENT': 'Mozilla/5.0'})
    process.crawl(MailSpider, start_urls=google_urls, path=path, reject=reject)
    process.start()

    print('Cleaning emails...')
    df = pd.read_csv(path, index_col=0)
    df.columns = ['email', 'link']
    df = df.drop_duplicates(subset='email')
    df = df.reset_index(drop=True)
    df.to_csv(path, mode='w', header=True)
    return df
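The cleaning step at the end of get_info is plain pandas. As a small sketch with made-up rows, this is what dropping duplicate emails does (two pages yielding the same address collapse to one row):

```python
import pandas as pd

# Made-up scraped rows: two pages yielded the same address
df = pd.DataFrame({
    'email': ['info@example.com', 'info@example.com', 'sales@example.com'],
    'link': ['http://example.com/', 'http://example.com/about', 'http://example.com/'],
})

df = df.drop_duplicates(subset='email').reset_index(drop=True)
print(len(df))  # → 2 unique emails remain
```

Note that drop_duplicates keeps the first link seen for each email, so the surviving row's link is whichever page was scraped first.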
No other changes were needed to get this to run. Hopefully this helps.
Answered By - Chris