Issue
I have tried several iterations from other posts and nothing seems to work for my needs.
I have a list of URLs that I want to loop through, pulling every associated URL that contains an email address. I then want to store the URLs and email addresses in a CSV file.
For example, if I went to 10torr.com, the program should find each of the pages within the main URL (e.g. 10torr.com/about) and pull any emails.
Below is a list of 5 example websites that are currently in a data frame format when run through my code. They are saved under the variable small_site.
A helpful answer will include the use of the user-defined function listed below called get_info(). Hard-coding the websites into the Spider itself is not a feasible option, as this will be used by many other people with different website list lengths.
Website
http://10torr.com/
https://www.10000drops.com/
https://www.11wells.com/
https://117westspirits.com/
https://www.onpointdistillery.com/
Below is the code that I am running. The spider seems to run, but there is no output in my CSV file.
import os
import pandas as pd
import re
import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor
small_site = site.head()
#%% Start Spider
class MailSpider(scrapy.Spider):
    name = 'email'

    def parse(self, response):
        links = LxmlLinkExtractor(allow=()).extract_links(response)
        links = [str(link.url) for link in links]
        links.append(str(response.url))
        for link in links:
            yield scrapy.Request(url=link, callback=self.parse_link)

    def parse_link(self, response):
        for word in self.reject:
            if word in str(response.url):
                return
        html_text = str(response.text)
        mail_list = re.findall(r'\w+@\w+\.\w+', html_text)
        dic = {'email': mail_list, 'link': str(response.url)}
        df = pd.DataFrame(dic)
        df.to_csv(self.path, mode='a', header=False)
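As an aside, the email pattern used in parse_link simply matches word characters on either side of an @ sign. A quick standalone check (with made-up sample text) shows both what it catches and a limitation: it truncates multi-dot domains at the first dot group.

```python
import re

# Same pattern as in parse_link (raw-string form avoids escape warnings)
pattern = r'\w+@\w+\.\w+'

sample = 'Contact sales@example.com or support@example.co.uk for help.'
emails = re.findall(pattern, sample)
print(emails)  # → ['sales@example.com', 'support@example.co']
```

Note that 'support@example.co.uk' comes back as 'support@example.co', so addresses on multi-part domains are clipped by this pattern.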
#%% Preps a CSV File
def ask_user(question):
    response = input(question + ' y/n' + '\n')
    if response == 'y':
        return True
    else:
        return False

def create_file(path):
    response = False
    if os.path.exists(path):
        response = ask_user('File already exists, replace?')
        if response == False: return

    with open(path, 'wb') as file:
        file.close()
#%% Defines function that will extract emails and enter it into CSV
def get_info(url_list, path, reject=[]):
    create_file(path)
    df = pd.DataFrame(columns=['email', 'link'], index=[0])
    df.to_csv(path, mode='w', header=True)

    print('Collecting Google urls...')
    google_urls = url_list

    print('Searching for emails...')
    process = CrawlerProcess({'USER_AGENT': 'Mozilla/5.0'})
    process.start()

    for i in small_site.Website.iteritems():
        print('Searching for emails...')
        process.crawl(MailSpider, start_urls=google_urls, path=path, reject=reject)
        ##process.start()

    print('Cleaning emails...')
    df = pd.read_csv(path, index_col=0)
    df.columns = ['email', 'link']
    df = df.drop_duplicates(subset='email')
    df = df.reset_index(drop=True)
    df.to_csv(path, mode='w', header=True)
    return df

url_list = small_site
path = 'email.csv'
df = get_info(url_list, path)
I am not certain where I am going wrong, as I am not getting any error messages. If you need additional information, please just ask. I have been trying to get this working for almost a month now, and I feel like I am banging my head against the wall at this point.
The majority of this code came from the article Web scraping to extract contact information — Part 1: Mailing Lists. However, I have not been successful in expanding it to my needs. It worked fine for one-off runs when using the article's Google search function to get the base URLs.
Thank you in advance for any assistance you are able to provide.
Solution
It took a while, but the answer finally came to me. The following is how the final answer came to be. This will work with a changing list, as in the original question.
The change ended up being very minor. I needed to add the following user-defined function.
def get_urls(io, sheet_name):
    data = pd.read_excel(io, sheet_name)
    urls = data['Website'].to_list()
    return urls
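The heavy lifting in get_urls is just pandas column extraction. As an illustration, the same .to_list() step on an in-memory frame (no Excel file needed; the values here are made up) behaves like this:

```python
import pandas as pd

# Stand-in for pd.read_excel(io, sheet_name): a frame with a 'Website' column
data = pd.DataFrame({'Website': ['http://10torr.com/', 'https://www.11wells.com/']})

urls = data['Website'].to_list()
print(urls)  # a plain Python list, ready to pass as start_urls
```

The only requirement on the spreadsheet is that the sheet contains a column named Website.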
From there, it was a simple change to the get_info() user-defined function. We needed to set google_urls in this function to the result of get_urls() and pass in the spreadsheet arguments. The full code for this function is below.
def get_info(io, sheet_name, path, reject=[]):
    create_file(path)
    df = pd.DataFrame(columns=['email', 'link'], index=[0])
    df.to_csv(path, mode='w', header=True)

    print('Collecting Google urls...')
    google_urls = get_urls(io, sheet_name)

    print('Searching for emails...')
    process = CrawlerProcess({'USER_AGENT': 'Mozilla/5.0'})
    process.crawl(MailSpider, start_urls=google_urls, path=path, reject=reject)
    process.start()

    print('Cleaning emails...')
    df = pd.read_csv(path, index_col=0)
    df.columns = ['email', 'link']
    df = df.drop_duplicates(subset='email')
    df = df.reset_index(drop=True)
    df.to_csv(path, mode='w', header=True)
    return df
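The cleaning step at the end of get_info is plain pandas. As a small sketch with made-up rows, this is what dropping duplicate emails does (two pages yielding the same address collapse to one row):

```python
import pandas as pd

# Made-up scraped rows: two pages yielded the same address
df = pd.DataFrame({
    'email': ['info@example.com', 'info@example.com', 'sales@example.com'],
    'link': ['http://example.com/', 'http://example.com/about', 'http://example.com/'],
})

df = df.drop_duplicates(subset='email').reset_index(drop=True)
print(len(df))  # → 2 unique emails remain
```

Note that drop_duplicates keeps the first link seen for each email, so the surviving row's link is whichever page was scraped first.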
No other changes were needed to get this to run. Hopefully this helps.
Answered By - Chris