Wednesday, October 20, 2021

[FIXED] TypeError: cannot use a string pattern on a bytes-like object in Python

Labels: python, scrapy

Issue

I am making an email scraper using Scrapy, and I keep getting this error: TypeError: cannot use a string pattern on a bytes-like object

Here is the Python code I am using:

import re
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class EmailSpider(CrawlSpider):
    name = 'EmailScraper'
    emailHistory = {}
    custom_settings = {
        'ROBOTSTXT_OBEY': False
        #  ,'DEPTH_LIMIT' : 6
    }

    emailRegex = re.compile((r"([a-zA-Z0-9_{|}~-]+(?:\.[a-zA-Z0-9_"
                             r"{|}~-]+)*(@)(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9]){2,}?(\."
                             r"))+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?)"))

    def __init__(self, url=None, *args, **kwargs):
        super(EmailSpider, self).__init__(*args, **kwargs)
        self.start_urls = [url]
        self.allowed_domains = [url.replace(
            "http://", "").replace("www.", "").replace("/", "")]

    rules = (Rule(LinkExtractor(), callback="parse_item", follow=True),)

    def parse_item(self, response):
        emails = re.findall(EmailSpider.emailRegex, response._body)
        for email in emails:
            if email[0] in EmailSpider.emailHistory:
                continue
            else:
                EmailSpider.emailHistory[email[0]] = True
                yield {
                    'site': response.url,
                    'email': email[0]
                }

I have seen a lot of answers, but I am very new to Python, so I'm not sure how to apply them to my code.

So if you don't mind, could you also tell me where to put the fix?

Thanks, Jude Wilson

Solution

response._body is a bytes object, not a str, so a regex compiled from a str pattern cannot be applied to it. You can confirm this by checking its type:

>>> type(response._body)
<class 'bytes'>
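
The mismatch can be reproduced without Scrapy. The following is a minimal sketch in which `body` is a made-up stand-in for response._body and the pattern is a deliberately simplified email regex, not the one from the question. It shows that a str pattern fails on bytes, while a bytes pattern (note the `rb` prefix) is accepted and returns bytes matches.

```python
import re

# Hypothetical sample data standing in for response._body.
body = b"Contact: jude@example.com"

# Simplified illustrative pattern compiled from a str.
str_pattern = re.compile(r"[\w.-]+@[\w.-]+")

try:
    str_pattern.findall(body)
    error_message = None
except TypeError as exc:
    # Same TypeError as in the question.
    error_message = str(exc)

# The same pattern compiled from bytes works on bytes input.
bytes_pattern = re.compile(rb"[\w.-]+@[\w.-]+")
matches = bytes_pattern.findall(body)

print(error_message)
print(matches)
```

Matching on bytes avoids decoding, but every result is then a bytes object, which is usually less convenient than str for the yielded items.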

Decoding it to a str, for example as UTF-8, should solve the problem:

>>> type(response._body.decode('utf-8'))
<class 'str'>
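
One caveat: not every page is valid UTF-8, and a strict decode raises UnicodeDecodeError on bytes it cannot interpret. The sketch below uses made-up Latin-1 sample bytes to illustrate this, and shows that passing errors="replace" substitutes U+FFFD for undecodable bytes instead of raising. Whether lossy decoding is acceptable depends on the crawl; this is an assumption, not part of the original answer.

```python
# Hypothetical sample: Latin-1 encoded bytes that are not valid UTF-8.
raw = "café@example.com".encode("latin-1")

try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# Lenient decoding never raises; bad bytes become U+FFFD.
lenient = raw.decode("utf-8", errors="replace")

print(strict_ok)
print(type(lenient))
```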

The final re.findall call would look like this:

emails = re.findall(EmailSpider.emailRegex, response._body.decode('utf-8'))
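
To see the decode-then-match step end to end without running a spider, here is a small standalone sketch. The helper name `extract_new_emails`, the simplified `EMAIL_RE` pattern, and the `seen` set are all hypothetical and chosen for illustration; they are not taken from the question's code. Inside a Scrapy callback, text responses (TextResponse and subclasses such as HtmlResponse) also expose the decoded body as response.text, which may be preferable to reading the private _body attribute.

```python
import re

# Simplified illustrative pattern with no capture groups,
# so findall returns plain strings rather than tuples.
EMAIL_RE = re.compile(r"[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+(?:\.[a-zA-Z0-9-]+)+")


def extract_new_emails(body, seen, encoding="utf-8"):
    """Decode a bytes body and return addresses not already in `seen`."""
    # Decode first so the str pattern receives a str.
    text = body.decode(encoding, errors="replace")
    fresh = []
    for email in EMAIL_RE.findall(text):
        if email not in seen:
            seen.add(email)
            fresh.append(email)
    return fresh


seen = set()
page = b"a@example.com b@example.org a@example.com"
result = extract_new_emails(page, seen)
print(result)
```

A set plays the role of the emailHistory dictionary here, since only membership is checked.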

Answered By - Moein Kameli

This answer was collected from Stack Overflow and tested by PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.
