Issue
I am building an email scraper and am having trouble yielding items. My yield prints as:
{'email': ['[email protected]', '[email protected]', '[email protected]']}
Whenever I export this into CSV I have an email header and then the three emails are listed in the same cell. How would I separate these into individual cells?
from urllib.parse import urlparse

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

# EmailscrapeItem comes from the project's own items module (import not shown in the original)


class EmailSpider(CrawlSpider):
    name = 'emails'
    start_urls = ['https://example.com']
    parsed_url = urlparse(start_urls[0])
    rules = [Rule(LinkExtractor(allow_domains=parsed_url), callback='parse', follow=True)]

    def parse(self, response):
        # Scrape page for email links
        items = EmailscrapeItem()
        hrefs = [response.xpath("//a[starts-with(@href, 'mailto')]/text()").getall()]
        # Removes hrefs that are empty or None
        hrefs = [d for d in hrefs if d]
        # TODO: Add code to capture non-mailto emails as well
        # hrefs.append(response.xpath("//*[contains(text(), '@')]/text()"))
        for href in hrefs:
            items['email'] = href
            yield items
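The TODO in the spider mentions capturing addresses that are not inside mailto links. A minimal sketch of one way to do that, assuming a simple regular expression is good enough for the pages being scraped. The `find_emails` helper and the pattern are illustrative, not part of the original spider, and the pattern does not cover every valid address:

```python
import re

# Deliberately simple pattern (an assumption, not a full RFC 5322 matcher)
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def find_emails(text):
    """Return the unique email-like strings in text, in order of first appearance."""
    matches = EMAIL_RE.findall(text)
    # dict.fromkeys drops duplicates while keeping insertion order
    return list(dict.fromkeys(matches))


if __name__ == "__main__":
    sample = "Contact a@example.com or b@example.org. Again: a@example.com"
    print(find_emails(sample))
```

Inside a spider, the text passed to such a helper could be the strings extracted from the page; each returned address would then be yielded as its own item.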
Solution
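Figured out what I did wrong. Wrapping the .getall() call in square brackets gave me a list containing a single list, so the loop ran only once and assigned all the emails to one item.
I changed my parse to: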
for res in response.xpath("//a[starts-with(@href, 'mailto')]/text()"):
    item = EmailscrapeItem()
    item['email'] = res.get()
    yield item
This yielded the proper results: one item per email, so each address ends up in its own row of the CSV.
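The difference between the two versions can be reproduced without Scrapy. The sketch below is a plain-Python stand-in: `SAMPLE` is hypothetical data in place of the selector result, and `to_csv` joins list values with commas to imitate how a list-valued field lands in a single cell. The real exporter may differ in detail.

```python
import csv
import io

# Hypothetical sample data standing in for response.xpath(...).getall()
SAMPLE = ["a@example.com", "b@example.com", "c@example.com"]


def items_nested(texts):
    """Mimic the original parse(): the result list is wrapped in another list."""
    hrefs = [texts]                  # a list with ONE element: the whole list
    hrefs = [d for d in hrefs if d]
    for href in hrefs:               # runs exactly once
        yield {"email": href}


def items_flat(texts):
    """Mimic the corrected parse(): one item per extracted string."""
    for text in texts:
        yield {"email": text}


def to_csv(items):
    """Write items as CSV text; list values are comma-joined into one cell."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["email"])
    for item in items:
        value = item["email"]
        if isinstance(value, list):
            value = ",".join(value)
        writer.writerow([value])
    return buf.getvalue()


if __name__ == "__main__":
    print(to_csv(items_nested(SAMPLE)))  # header plus a single row
    print(to_csv(items_flat(SAMPLE)))    # header plus one row per email
```

The nested version produces a single data row holding all three addresses, while the flat version produces three rows, which matches the behaviour described in the question and the fix.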
Answered By - howshotwebs