Issue
I am trying to use the scrapy library to run a broad crawl, i.e. a crawl where I parse millions of websites. The spider is connected to a PostgreSQL database. This is how I load unprocessed urls before starting the spider:
def get_unprocessed_urls(self, suffix):
    """
    Fetch unprocessed urls.
    """
    print(f'Fetching unprocessed urls for suffix {suffix}...')

    cursor = self.connection.cursor('unprocessed_urls_cursor', withhold=True)
    cursor.itersize = 1000
    cursor.execute(f"""
        SELECT su.id, su.url FROM seed_url su
        LEFT JOIN footer_seed_url_status fsus ON su.id = fsus.seed_url_id
        WHERE su.url LIKE '%.{suffix}' AND fsus.seed_url_id IS NULL;
    """)

    ID = 0
    URL = 1
    urls = [Url(url_row[ID], self.validate_url(url_row[URL])) for url_row in cursor]

    print('len urls:', len(urls))
    return urls
This is my spider:
class FooterSpider(scrapy.Spider):

    ...

    def start_requests(self):
        urls = self.handler.get_unprocessed_urls(self.suffix)
        for url in urls:
            yield scrapy.Request(
                url=url.url,
                callback=self.parse,
                errback=self.errback,
                meta={
                    'seed_url_id': url.id,
                }
            )

    def parse(self, response):
        try:
            seed_url_id = response.meta.get('seed_url_id')
            print(response.url)

            soup = BeautifulSoup(response.text, 'html.parser')
            footer = soup.find('footer')

            item = FooterItem(
                seed_url_id=seed_url_id,
                html=str(footer) if footer is not None else None,
                url=response.url
            )
            yield item
            print(f'Successfully processed url {response.url}')
        except Exception as e:
            print('Error while processing url', response.url)
            print(e)

            seed_url_id = response.meta.get('seed_url_id')
            cursor = self.handler.connection.cursor()
            cursor.execute(
                "INSERT INTO footer_seed_url_status(seed_url_id, status) VALUES(%s, %s)",
                (seed_url_id, str(e)))
            self.handler.connection.commit()

    def errback(self, failure):
        print(failure.value)
        try:
            error = repr(failure.value)
            request = failure.request
            seed_url_id = request.meta.get('seed_url_id')

            cursor = self.handler.connection.cursor()
            cursor.execute(
                "INSERT INTO footer_seed_url_status(seed_url_id, status) VALUES(%s, %s)",
                (seed_url_id, error))
            self.handler.connection.commit()
        except Exception as e:
            print(e)
These are my custom settings for the crawl (taken from Scrapy's broad crawl documentation page):
SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleFifoDiskQueue'
SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.FifoMemoryQueue'
CONCURRENT_REQUESTS = 100
CONCURRENT_ITEMS = 1000
SCHEDULER_PRIORITY_QUEUE = 'scrapy.pqueues.DownloaderAwarePriorityQueue'
REACTOR_THREADPOOL_MAXSIZE = 20
COOKIES_ENABLED = False
DOWNLOAD_DELAY = 0.2
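For reference, a minimal sketch of how such settings can be attached per spider via Scrapy's custom_settings class attribute (assuming they are not already defined in settings.py; the spider name here is just a placeholder):

import scrapy

class FooterSpider(scrapy.Spider):
    name = 'footer'  # hypothetical name for this sketch

    # Per-spider overrides; a project-wide settings.py works just as well.
    custom_settings = {
        'SCHEDULER_DISK_QUEUE': 'scrapy.squeues.PickleFifoDiskQueue',
        'SCHEDULER_MEMORY_QUEUE': 'scrapy.squeues.FifoMemoryQueue',
        'SCHEDULER_PRIORITY_QUEUE': 'scrapy.pqueues.DownloaderAwarePriorityQueue',
        'CONCURRENT_REQUESTS': 100,
        'CONCURRENT_ITEMS': 1000,
        'REACTOR_THREADPOOL_MAXSIZE': 20,
        'COOKIES_ENABLED': False,
        'DOWNLOAD_DELAY': 0.2,
    }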
My problem is: the spider does not crawl all urls but stops after crawling only a few hundred (or a few thousand; the number seems to vary). No warnings or errors are shown in the logs. These are example logs after "finishing" the crawl:
{'downloader/exception_count': 2,
'downloader/exception_type_count/twisted.internet.error.DNSLookupError': 1,
'downloader/exception_type_count/twisted.web._newclient.ResponseNeverReceived': 1,
'downloader/request_bytes': 345073,
'downloader/request_count': 1481,
'downloader/request_method_count/GET': 1481,
'downloader/response_bytes': 1977255,
'downloader/response_count': 1479,
'downloader/response_status_count/200': 46,
'downloader/response_status_count/301': 791,
'downloader/response_status_count/302': 512,
'downloader/response_status_count/303': 104,
'downloader/response_status_count/308': 2,
'downloader/response_status_count/403': 2,
'downloader/response_status_count/404': 22,
'dupefilter/filtered': 64,
'elapsed_time_seconds': 113.895788,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2023, 8, 3, 11, 46, 31, 889491),
'httpcompression/response_bytes': 136378,
'httpcompression/response_count': 46,
'log_count/ERROR': 3,
'log_count/INFO': 11,
'log_count/WARNING': 7,
'response_received_count': 43,
"robotstxt/exception_count/<class 'twisted.internet.error.DNSLookupError'>": 1,
"robotstxt/exception_count/<class 'twisted.web._newclient.ResponseNeverReceived'>": 1,
'robotstxt/request_count': 105,
'robotstxt/response_count': 43,
'robotstxt/response_status_count/200': 21,
'robotstxt/response_status_count/403': 2,
'robotstxt/response_status_count/404': 20,
'scheduler/dequeued': 151,
'scheduler/dequeued/memory': 151,
'scheduler/enqueued': 151,
'scheduler/enqueued/memory': 151,
'start_time': datetime.datetime(2023, 8, 3, 11, 44, 37, 993703)}
2023-08-03 11:46:31 [scrapy.core.engine] INFO: Spider closed (finished)
Peculiarly enough, this problem seems to appear on only one of the two machines I tried to use for crawling. When I run the crawl locally on my PC (Windows 11), it does not stop. However, when I run the code on our company's server (a Microsoft Azure Windows 10 machine), the crawl stops prematurely, as described above.
EDIT: Full logs can be found here. In this case the process stops after a few urls.
Solution
I finally found the problem. Scrapy requires all start urls to have an HTTP scheme, e.g. stackoverflow.com would not work, but https://stackoverflow.com would.
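For illustration, a minimal sketch of what (as far as I can tell) was going on: building a Request from a schemeless url raises a ValueError.

import scrapy

# A schemeless url is rejected when the Request object is built; inside
# start_requests this exception can end the generator, so the remaining
# urls are never scheduled and the spider finishes "normally".
try:
    scrapy.Request(url='stackoverflow.com')
except ValueError as e:
    print(e)  # e.g. "Missing scheme in request url: stackoverflow.com"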
I used the following code to validate whether a url contains a scheme:
if not url.startswith("http"):
    url = 'http://' + url
However, this validation is wrong. My data contained millions of urls, and some of them were degenerate or unconventional (http.gay seems to be a valid redirect domain), such as:
httpsf52u5bids65u.xyz
httppollenmap.com
http.gay
These urls would pass my scheme check even though they do not contain a scheme, and they would break my crawling process.
I changed the validation to this and the problem disappeared:
if not (url.startswith("http://") or url.startswith('https://')):
    url = 'http://' + url
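Alternatively, a stricter check could parse the url and look at the scheme explicitly (a sketch using the standard library, not part of my original fix):

from urllib.parse import urlsplit

def ensure_scheme(url):
    # Prepend http:// only when the url has no explicit http(s) scheme.
    if urlsplit(url).scheme not in ('http', 'https'):
        return 'http://' + url
    return url

print(ensure_scheme('stackoverflow.com'))          # http://stackoverflow.com
print(ensure_scheme('http.gay'))                   # http://http.gay
print(ensure_scheme('https://stackoverflow.com'))  # https://stackoverflow.com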
Answered By - druskacik