Issue
I want to check the log messages emitted by the [scrapy.spidermiddlewares.httperror] logger and have a function take a specific action based on them. Essentially, I want to assign the message to a variable as a string and then search for a keyword in that string.
I didn't find a way to do that in the documentation; it only covers formatting the logs.
import scrapy

class spider1(scrapy.Spider):
    name = 'spider1'
    allowed_domains = []
    custom_settings = {'CONCURRENT_REQUESTS_PER_DOMAIN': 2}
    start_urls = ['https://quotes.toscrape.com/']

    def parse(self, response):
        print(response.text)
Log example
2022-02-03 03:11:42 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <402 https://quotes.toscrape.com/>: HTTP status code is not handled or not allowed
I want to assign the above log message to a variable.
I know I could write the whole log to a .txt file, but since I'll have multiple spiders running in an infinite loop, that would produce a huge amount of data to iterate over.
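For reference, once a log line like the one above is available as a string, pulling the status code (or any keyword) out of it is a short regex. This is a minimal sketch; the variable names are illustrative:

```python
import re

# Log line copied from the example above
log_line = ("2022-02-03 03:11:42 [scrapy.spidermiddlewares.httperror] INFO: "
            "Ignoring response <402 https://quotes.toscrape.com/>: "
            "HTTP status code is not handled or not allowed")

# Capture the numeric status code from the "<402 https://...>" part
match = re.search(r'Ignoring response <(\d+) ', log_line)
if match:
    status_code = int(match.group(1))
    print(status_code)  # 402
```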
Solution
You can use a logging filter and attach it to the specific scrapy.spidermiddlewares.httperror logger. Inside the filter you can use a regex to match the exact type of error you want to capture and write it to a file. See the sample code below:
import scrapy
import logging
import re

class ContentFilter(logging.Filter):
    def filter(self, record):
        # Use record.getMessage(): it returns the fully formatted message,
        # whereas record.msg may still be the template with % placeholders
        message = record.getMessage()
        # Note the ":" after ">" — it is part of the actual log line
        match = re.search(r'Ignoring response <.*>: HTTP status code is not handled or not allowed', message)
        if match:
            with open("logged_messages.log", "a") as f:
                f.write(message + '\n')
        return True  # return True so the record is still logged normally

class spider1(scrapy.Spider):
    name = 'spider1'
    allowed_domains = []
    custom_settings = {'CONCURRENT_REQUESTS_PER_DOMAIN': 2}
    start_urls = ['https://quotes.toscrape.com/']

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        logger = logging.getLogger('scrapy.spidermiddlewares.httperror')
        logger.addFilter(ContentFilter())

    def parse(self, response):
        yield {
            "title": response.css("title::text").get()
        }
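Since you want the message in a variable rather than (or in addition to) a file, the filter can simply collect matching messages into a list you inspect later, e.g. from the spider's closed() callback. A minimal sketch; the CaptureFilter name and the captured attribute are illustrative:

```python
import logging
import re

class CaptureFilter(logging.Filter):
    """Stores matching log messages in a list instead of writing them to a file."""
    def __init__(self):
        super().__init__()
        self.captured = []  # each matching message ends up here as a string

    def filter(self, record):
        # getMessage() interpolates the record's args into the final string
        message = record.getMessage()
        if 'HTTP status code is not handled or not allowed' in message:
            self.captured.append(message)
        return True  # never suppress the record, just observe it

# Attach to the same logger the spider middleware uses
capture = CaptureFilter()
logging.getLogger('scrapy.spidermiddlewares.httperror').addFilter(capture)

# Later (e.g. in spider.closed()), inspect the captured messages:
for message in capture.captured:
    status = re.search(r'<(\d+) ', message)
    if status:
        print('Got status', status.group(1))
```

Because the filter returns True, the messages still reach the normal log output; the filter only observes them.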
Read more about the logging module and the customization it allows in the Scrapy docs.
Answered By - msenior_