Issue
I am making an email scraper using Scrapy and I keep getting this error: TypeError: cannot use a string pattern on a bytes-like object
Here is the Python code I am using:
import re
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class EmailSpider(CrawlSpider):
    name = 'EmailScraper'
    emailHistory = {}
    custom_settings = {
        'ROBOTSTXT_OBEY': False
        # ,'DEPTH_LIMIT' : 6
    }
    emailRegex = re.compile((r"([a-zA-Z0-9_{|}~-]+(?:\.[a-zA-Z0-9_"
                             r"{|}~-]+)*(@)(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9]){2,}?(\."
                             r"))+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?)"))

    def __init__(self, url=None, *args, **kwargs):
        super(EmailSpider, self).__init__(*args, **kwargs)
        self.start_urls = [url]
        self.allowed_domains = [url.replace(
            "http://", "").replace("www.", "").replace("/", "")]
    rules = (Rule(LinkExtractor(), callback="parse_item", follow=True),)

    def parse_item(self, response):
        emails = re.findall(EmailSpider.emailRegex, response._body)
        for email in emails:
            if email[0] in EmailSpider.emailHistory:
                continue
            else:
                EmailSpider.emailHistory[email[0]] = True
                yield {
                    'site': response.url,
                    'email': email[0]
                }
I have seen a lot of answers, but I am very new to Python, so I'm not sure how to apply the code they give to my own code.
So if you don't mind, could you also tell me where to put the code?
Thanks, Jude Wilson
Solution
response._body is not a str (string object) but a bytes object, so a regex compiled from a str pattern cannot be applied to it. Checking its type confirms this:
>>> type(response._body)
<class 'bytes'>
Decoding it with an encoding such as UTF-8 gives a str, which should solve the problem:
>>> type(response._body.decode('utf-8'))
<class 'str'>
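The mismatch can be reproduced without Scrapy. The sketch below is only an illustration: the short pattern and the sample bytes are made up for the demo and are not the regex or data from the question. It shows the same TypeError, the decode fix, and the alternative of keeping the data as bytes and using a bytes pattern.

```python
import re

# Hypothetical, simplified pattern for the demo (not the regex from the question).
pattern = re.compile(r"\w+@\w+\.\w+")       # compiled from a str pattern
body = b"contact: jude@example.com"         # bytes, standing in for response._body

# A str pattern applied to bytes raises the error from the question.
try:
    pattern.findall(body)
    error_message = None
except TypeError as exc:
    error_message = str(exc)

# Fix 1: decode the bytes to str before matching.
decoded_matches = pattern.findall(body.decode("utf-8"))

# Fix 2: keep the bytes and use a bytes pattern instead; matches are then bytes too.
bytes_matches = re.findall(rb"\w+@\w+\.\w+", body)

print(error_message)
print(decoded_matches)
print(bytes_matches)
```

Decoding is usually the more convenient option here, since the scraped items are meant to hold text rather than raw bytes.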
The final re.findall call in parse_item would look like this:
emails = re.findall(EmailSpider.emailRegex, response._body.decode('utf-8'))
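As a rough sketch of where the decoding step fits, the helper below decodes a raw body and returns only addresses that have not been seen before, which is roughly what parse_item does with emailHistory. The function name find_new_emails, the simplified pattern, and the sample bodies are assumptions made for illustration and are not part of the original spider. Note also that Scrapy text responses expose the already decoded body as response.text, which avoids reaching into the private _body attribute.

```python
import re

# Simplified pattern for illustration only; not the regex from the question.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def find_new_emails(body, seen, encoding="utf-8"):
    """Decode a raw bytes body and return the addresses not already in `seen`."""
    # errors="replace" avoids a UnicodeDecodeError on pages that are not valid UTF-8.
    text = body.decode(encoding, errors="replace")
    fresh = []
    for address in EMAIL_RE.findall(text):
        if address not in seen:
            seen.add(address)       # remember it so later pages skip it
            fresh.append(address)
    return fresh


seen = set()
first = find_new_emails(b"Mail a@example.com or b@example.com", seen)
second = find_new_emails(b"Mail b@example.com or c@example.com", seen)
print(first)
print(second)
```

Inside the spider, parse_item could call such a helper with response.body (or match directly against response.text) and yield one item per returned address.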
Answered By - Moein Kameli