Issue
I've got a Scrapy spider hosted on Zyte using Smart Proxies.
My spider is fairly simple: it starts crawling from a list of URLs.
The parse method uses a simple LinkExtractor to extract links on the same domain and then crawls those links.
Simplified parse method:
def parse(self, response):
    internal_le = LinkExtractor(
        allow_domains=tld_t,  # try to stay on domain (this is a tldextract of response.url)
        unique=True,  # de-dup
        # deny_extensions=self.deny_extensions
    )
    in_links = internal_le.extract_links(response)
    for link in in_links:
        if link.url:
            yield Request(
                link.url,
                callback=self.parse,
            )
Because deny_extensions defaults to scrapy.linkextractors.IGNORED_EXTENSIONS, which includes pdf, I assumed the spider would not crawl a PDF link. However, I have internal links that redirect to externally hosted PDF files.
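To see why an extension filter does not help here, the following sketch mimics an extension check on the URL path using only the standard library. It is not Scrapy's implementation: the `DENIED` set and the helper name `has_denied_extension` are assumptions for illustration. The internal URL is a resource-manager style link of the kind that shows up in this site's crawl logs; which internal link maps to which PDF is not known here.

```python
from posixpath import splitext
from urllib.parse import urlsplit

# Illustrative subset of extensions to deny (an assumption, not Scrapy's full default list).
DENIED = {".pdf", ".docx"}


def has_denied_extension(url: str, denied: set[str] = DENIED) -> bool:
    """Return True if the URL path ends in one of the denied extensions."""
    path = urlsplit(url).path  # ignore query string and fragment
    return splitext(path)[1].lower() in denied


# An internal link with no file extension: an extension filter lets it through.
internal_link = (
    "https://west.usd262.net/fs/resource-manager/view/"
    "446cdd83-e743-495f-b0f1-91318deef052"
)
# A redirect target taken from the error log in this question: only this URL reveals the file type.
redirect_target = (
    "https://resources.finalsite.net/images/v1676649887/usd262net/"
    "adlo2wuxxpqa7pmnxmkx/MiddleSchoolBellSchedule22_23docx.pdf"
)

print(has_denied_extension(internal_link))    # False
print(has_denied_extension(redirect_target))  # True
```

The link extractor only ever sees the first URL, so the file type becomes visible only once the redirect response arrives.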
Here are some extracts from logs with examples:
2023-11-27 23:41:01 ERROR [scrapy.core.scraper] Spider error processing <GET https://resources.finalsite.net/images/v1691073836/usd262net/renyendq5njmpmol8iko/2023-2024USD262ElementarySchoolStudentHandbookFinaldocx.pdf> (referer: https://west.usd262.net/about)
2023-11-27 23:41:02 ERROR [scrapy.core.scraper] Spider error processing <GET https://resources.finalsite.net/files/v1676910235/usd262net/kgtnfuk7buzu8zthtixk/102422RevisedSpanish22-23ElementaryHandbookSP4.docx> (referer: https://west.usd262.net/about)
2023-11-27 23:41:05 ERROR [scrapy.core.scraper] Spider error processing <GET https://resources.finalsite.net/images/v1676649887/usd262net/adlo2wuxxpqa7pmnxmkx/MiddleSchoolBellSchedule22_23docx.pdf> (referer: https://vcms.usd262.net/about)
2023-11-27 23:41:10 ERROR [scrapy.core.scraper] Spider error processing <GET https://resources.finalsite.net/images/v1691073617/usd262net/zjuysts6fymaf5gjumlc/VCMSStudentHandbook23-24Finaldocx.pdf> (referer: https://vcms.usd262.net/about)
And here is a single trace:
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/scrapy/utils/defer.py", line 279, in iter_errback
    yield next(it)
          ^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/scrapy/utils/python.py", line 350, in __next__
    return next(self.data)
           ^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/scrapy/utils/python.py", line 350, in __next__
    return next(self.data)
           ^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/scrapy/core/spidermw.py", line 106, in process_sync
    for r in iterable:
  File "/usr/local/lib/python3.11/site-packages/sh_scrapy/middlewares.py", line 30, in process_spider_output
    for x in result:
  File "/usr/local/lib/python3.11/site-packages/scrapy/core/spidermw.py", line 106, in process_sync
    for r in iterable:
  File "/usr/local/lib/python3.11/site-packages/scrapy/spidermiddlewares/offsite.py", line 28, in <genexpr>
    return (r for r in result or () if self._filter(r, spider))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/scrapy/core/spidermw.py", line 106, in process_sync
    for r in iterable:
  File "/usr/local/lib/python3.11/site-packages/scrapy/spidermiddlewares/referer.py", line 352, in <genexpr>
    return (self._set_referer(r, response) for r in result or ())
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/scrapy/core/spidermw.py", line 106, in process_sync
    for r in iterable:
  File "/usr/local/lib/python3.11/site-packages/scrapy/spidermiddlewares/urllength.py", line 27, in <genexpr>
    return (r for r in result or () if self._filter(r, spider))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/scrapy/core/spidermw.py", line 106, in process_sync
    for r in iterable:
  File "/usr/local/lib/python3.11/site-packages/scrapy/spidermiddlewares/depth.py", line 31, in <genexpr>
    return (r for r in result or () if self._filter(r, response, spider))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/scrapy/core/spidermw.py", line 106, in process_sync
    for r in iterable:
  File "/tmp/unpacked-eggs/__main__.egg/edtech/spiders/edcrawler.py", line 117, in parse
    ex_links = external_le.extract_links(response)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/scrapy/linkextractors/lxmlhtml.py", line 239, in extract_links
    base_url = get_base_url(response)
               ^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/scrapy/utils/response.py", line 26, in get_base_url
    text = response.text[0:4096]
           ^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/scrapy/http/response/__init__.py", line 137, in text
    raise AttributeError("Response content isn't text")
AttributeError: Response content isn't text
I've tried various changes to my link extractor, but presumably the link looks fine to the link extractor. It's the redirect that points to the PDF file, which then gets downloaded and produces the error.
Example: the start URL contains a link that is extracted into 'in_links' as an internal link, and that link redirects to a PDF document on the web host.
The only fix I can think of is a custom middleware that replaces the redirect middleware (see the Scrapy docs on the redirect middleware) and looks for r"\.pdf$" in the request.url.
Am I missing something? I'm using the latest Scrapy, 2.11.0. I have also logged an issue on the Scrapy GitHub: github/6159.
Solution
I think your best option in this situation would be to subclass RedirectMiddleware and add a few lines that check the Location header of the initial response for the .pdf extension, raising the IgnoreRequest exception if it is found.
This can all be done in just a handful of lines.
Example:
import scrapy
from scrapy.downloadermiddlewares.redirect import RedirectMiddleware
from scrapy.exceptions import IgnoreRequest
from scrapy.linkextractors import LinkExtractor


class PDFRedirect(RedirectMiddleware):
    """Redirect middleware that drops redirects pointing at PDF or DOCX files."""

    def process_response(self, request, response, spider):
        # The Location header holds the redirect target; default to b"" when it is absent.
        location = response.headers.get("Location", b"").decode()
        if location.lower().endswith((".pdf", ".docx")):
            print(f"IGNORING PDF {location}")
            raise IgnoreRequest(f"Redirect to document ignored: {location}")
        # Otherwise fall back to the normal redirect handling.
        return super().process_response(request, response, spider)


class PdfRedirectSpider(scrapy.Spider):
    name = 'nopdfs'
    allowed_domains = ['west.usd262.net']
    start_urls = ['https://west.usd262.net/about']
    custom_settings = {
        "DOWNLOADER_MIDDLEWARES": {
            # Disable the built-in redirect middleware and use the subclass in its place.
            "scrapy.downloadermiddlewares.redirect.RedirectMiddleware": None,
            PDFRedirect: 600,
        }
    }

    def parse(self, response):
        internal_le = LinkExtractor(unique=True)
        in_links = internal_le.extract_links(response)
        for link in in_links:
            if link.url:
                yield scrapy.Request(link.url, callback=self.parse)
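If the middleware lives in a project module rather than in the spider file, the same override could go in the project's settings, using import-path strings instead of the class object. A minimal sketch, assuming the class is saved in a hypothetical module `edtech/middlewares.py` (the project name `edtech` appears in the traceback path in the question; the module name is an assumption):

```python
# settings.py (sketch; "edtech.middlewares" is a hypothetical module path)
DOWNLOADER_MIDDLEWARES = {
    # Disable the built-in redirect middleware so redirects are not handled twice.
    "scrapy.downloadermiddlewares.redirect.RedirectMiddleware": None,
    # Enable the subclass at the built-in middleware's default position (600).
    "edtech.middlewares.PDFRedirect": 600,
}
```

The single-file spider shown here needs no such module, since it passes the class object directly in custom_settings.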
OUTPUT
2023-11-30 15:00:35 [scrapy.core.engine] INFO: Spider opened
2023-11-30 15:00:35 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2023-11-30 15:00:35 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2023-11-30 15:00:36 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://west.usd262.net/about> (referer: None)
2023-11-30 15:00:37 [scrapy.dupefilters] DEBUG: Filtered duplicate request: <GET https://west.usd262.net/about> - no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates)
2023-11-30 15:00:37 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'www.usd262.net': <GET https://www.usd262.net/staff-links1>
2023-11-30 15:00:37 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'abilene.usd262.net': <GET https://abilene.usd262.net>
2023-11-30 15:00:37 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'wheatland.usd262.net': <GET https://wheatland.usd262.net>
2023-11-30 15:00:37 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'vcis.usd262.net': <GET https://vcis.usd262.net>
2023-11-30 15:00:37 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'vcms.usd262.net': <GET https://vcms.usd262.net>
2023-11-30 15:00:37 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'vchs.usd262.net': <GET https://vchs.usd262.net>
2023-11-30 15:00:37 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'tlc.usd262.net': <GET https://tlc.usd262.net>
2023-11-30 15:00:37 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'www.facebook.com': <GET https://www.facebook.com/profile.php?id=100061273524317>
2023-11-30 15:00:37 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'twitter.com': <GET https://twitter.com/USD262>
2023-11-30 15:00:37 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'www.youtube.com': <GET https://www.youtube.com/channel/UCD8AdyKpM44gpFzqIqBG9tw>
2023-11-30 15:00:37 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'usd262net-22-us-central1-01.preview.finalsitecdn.com': <GET https://usd262net-22-us-central1-01.preview.finalsitecdn.com/about/calendar1>
2023-11-30 15:00:37 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'www.finalsite.com': <GET https://www.finalsite.com>
2023-11-30 15:00:37 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://west.usd262.net/about#fsPageContent> (referer: https://west.usd262.net/about)
IGNORING PDF https://resources.finalsite.net/files/v1676910235/usd262net/kgtnfuk7buzu8zthtixk/102422RevisedSpanish22-23ElementaryHandbookSP4.docx
2023-11-30 15:00:37 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://west.usd262.net/privacy-policy> (referer: https://west.usd262.net/about)
2023-11-30 15:00:37 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://west.usd262.net/> (referer: https://west.usd262.net/about)
IGNORING PDF https://resources.finalsite.net/images/v1686234716/usd262net/hdkhsv6qg1jzbobmkrxs/23-24elementaryschoolsupplylist8511in.pdf
2023-11-30 15:00:37 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.usd262.net/about/contact645-clone> from <GET https://west.usd262.net/fs/pages/3813>
2023-11-30 15:00:37 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://west.usd262.net/accessibility-statement> (referer: https://west.usd262.net/about)
2023-11-30 15:00:37 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'www.valleycenterhornets.net': <GET https://www.valleycenterhornets.net>
2023-11-30 15:00:37 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'sideline.bsnsports.com': <GET https://sideline.bsnsports.com/schools/kansas/valleycenter/valley-center-high-school/design/picker>
2023-11-30 15:00:37 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'usd262net-34-us-central1-01.preview.finalsitecdn.com': <GET https://usd262net-34-us-central1-01.preview.finalsitecdn.com/about>
2023-11-30 15:00:37 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'calendar.google.com': <GET https://calendar.google.com/calendar/embed?src=usd262.net_b07qmrijq7dq09a7s93u4qq7u0%40group.calendar.google.com&ctz=America%2FChicago>
2023-11-30 15:00:37 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'datacentral.ksde.org': <GET https://datacentral.ksde.org/accountability.aspx>
IGNORING PDF https://resources.finalsite.net/images/v1691073836/usd262net/renyendq5njmpmol8iko/2023-2024USD262ElementarySchoolStudentHandbookFinaldocx.pdf
2023-11-30 15:00:37 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'www.w3.org': <GET http://www.w3.org/TR/WCAG/>
2023-11-30 15:00:37 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'accessibilitystatementgenerator.com': <GET http://accessibilitystatementgenerator.com>
2023-11-30 15:00:38 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://west.usd262.net/parent756> (referer: https://west.usd262.net/about)
2023-11-30 15:00:38 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://west.usd262.net/about/pto> (referer: https://west.usd262.net/about)
2023-11-30 15:00:38 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://west.usd262.net/site-map> (referer: https://west.usd262.net/about)
2023-11-30 15:00:38 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://west.usd262.net/footer-links> (referer: https://west.usd262.net/about)
2023-11-30 15:00:38 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'usd262.infinitecampus.org': <GET https://usd262.infinitecampus.org/campus/portal/valleycenter.jsp>
2023-11-30 15:00:38 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'usd262net.finalsite.com': <GET https://usd262net.finalsite.com/fs/resource-manager/view/383a8f18-5ef9-4f48-815e-030300759293>
2023-11-30 15:00:38 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'docs.google.com': <GET https://docs.google.com/spreadsheets/u/1/d/e/2PACX-1vRi840waukqIIVzL9eM4X9EoxwIsGKyuwsu83A852Mv6dMnPmjQSF0HKFRrMmpw1g/pubhtml>
2023-11-30 15:00:38 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'usd262.incidentiq.com': <GET https://usd262.incidentiq.com/>
2023-11-30 15:00:38 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'educatekansas.org': <GET https://educatekansas.org/>
2023-11-30 15:00:38 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://west.usd262.net/volunteering> (referer: https://west.usd262.net/about)
2023-11-30 15:00:38 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://west.usd262.net/ymca-childcare> (referer: https://west.usd262.net/about)
2023-11-30 15:00:38 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'ymcawichita.org': <GET https://ymcawichita.org/programs/child-care-and-camps/before-and-after-school>
2023-11-30 15:00:39 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://west.usd262.net/about/emergency-safety-interventions-bullying> (referer: https://west.usd262.net/about)
2023-11-30 15:00:39 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://west.usd262.net/librarymedia-center> (referer: https://west.usd262.net/about)
2023-11-30 15:00:39 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://west.usd262.net/about/volunteer-information> (referer: https://west.usd262.net/about)
2023-11-30 15:00:40 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'search.follettsoftware.com': <GET https://search.follettsoftware.com/metasearch/ui/43691>
2023-11-30 15:00:40 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'bookfairs.scholastic.com': <GET https://bookfairs.scholastic.com/bf/westelementaryschool11>
2023-11-30 15:00:40 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'www.commonsensemedia.org': <GET https://www.commonsensemedia.org/>
2023-11-30 15:00:40 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.usd262.net/about/news> from <GET https://west.usd262.net/fs/pages/3814>
IGNORING PDF https://resources.finalsite.net/images/v1680193574/usd262net/skenieqeiwealjrpl210/33023ActivationInstructionforCampusPortal3.pdf
IGNORING PDF https://resources.finalsite.net/images/v1673804004/usd262net/i0mi93dw4rp63jsem0jt/PTOMeetingMinutes1220docx.pdf
2023-11-30 15:00:40 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.usd262.net/about/contact645-clone> (referer: https://west.usd262.net/about)
2023-11-30 15:00:40 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://west.usd262.net/about/sraff-directory> (referer: https://west.usd262.net/about)
2023-11-30 15:00:40 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.usd262.net/schools> from <GET https://west.usd262.net/fs/pages/2799>
IGNORING PDF https://resources.finalsite.net/images/v1673803989/usd262net/bvokssior5jikny5ggwk/PTOMeetingMinutes2120docx.pdf
2023-11-30 15:00:41 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.usd262.net/about/report-bullying-safety-concerns> from <GET https://west.usd262.net/fs/pages/3560>
2023-11-30 15:00:41 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://west.usd262.net/about/counseling> (referer: https://west.usd262.net/about)
2023-11-30 15:00:41 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'www.ksde.org': <GET http://www.ksde.org/Default.aspx?tabid=149>
2023-11-30 15:00:41 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'www.homeworkkansas.org': <GET http://www.homeworkkansas.org/>
IGNORING PDF https://resources.finalsite.net/images/v1673803943/usd262net/okuntylyovx2hn260gmt/PTOMeetingMinutes1919docx.pdf
2023-11-30 15:00:41 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://west.usd262.net/about/nurses-page> (referer: https://west.usd262.net/about)
2023-11-30 15:00:41 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'www.kidshealth.org': <GET http://www.kidshealth.org/parent/firstaid_safe/>
IGNORING PDF https://resources.finalsite.net/images/v1673803972/usd262net/s8sipel9qrbd1kwqrklg/FebPTOMeetingMinutes1120docx.pdf
2023-11-30 15:00:41 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://west.usd262.net/about/document-library> (referer: https://west.usd262.net/about)
IGNORING PDF https://resources.finalsite.net/images/v1673803909/usd262net/zcygtqo4nk94alxapei2/PTOMeetingMinutes1719.pdf
IGNORING PDF https://resources.finalsite.net/images/v1673803957/usd262net/kpvyrmpdxbbwic1o9mkw/1-21-20PTOMeetingMinutes21201docx.pdf
2023-11-30 15:00:41 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://west.usd262.net/about/administration> (referer: https://west.usd262.net/about)
IGNORING PDF https://resources.finalsite.net/images/v1673803928/usd262net/aprcr3g9v0x76agcz81m/PTOMeetingMinutes2219docx.pdf
2023-11-30 15:00:41 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://west.usd262.net/about/sraff-directory> from <GET https://west.usd262.net/staff-directory>
2023-11-30 15:00:42 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://west.usd262.net> from <GET https://west.usd262.net/fs/resource-manager/view/446cdd83-e743-495f-b0f1-91318deef052>
IGNORING PDF https://resources.finalsite.net/images/v1673803888/usd262net/sun8frlao9rk4gftotnp/PTOMeetingMinutes2719.pdf
IGNORING PDF https://resources.finalsite.net/images/v1673803867/usd262net/cev2livmjpacfgyq0qrc/4202021PTOMeetingminutes.pdf
IGNORING PDF https://resources.finalsite.net/images/v1673803137/usd262net/km3nodsbggl5taziszk3/MicrosoftWord-TotallyCoolElementarySchool_1.pdf
IGNORING PDF https://resources.finalsite.net/images/v1673803819/usd262net/k5xboy8whfnymanvvuyk/MeetingminutesFeb.pdf
IGNORING PDF https://resources.finalsite.net/images/v1673803121/usd262net/u5ctbelnubnhgz9gw6wa/WestElementaryCounselingBrochurefinal-2008_1.pdf
IGNORING PDF https://resources.finalsite.net/images/v1673784917/usd262net/lzwphtnhcoqjp9thds6n/FactSheet-TitleI-ParentInvolvement.pdf
IGNORING PDF https://resources.finalsite.net/images/v1673803778/usd262net/rwb5tlbdaap8e1wjiizl/NovemberPTOMeetingMinutes.pdf
2023-11-30 15:00:42 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.usd262.net/enrollment/student-health-information> from <GET https://west.usd262.net/fs/pages/3541>
IGNORING PDF https://resources.finalsite.net/images/v1673784914/usd262net/dkc6smzfpcylihjyl0mx/ESIBoardPolicies-19.pdf
IGNORING PDF https://resources.finalsite.net/images/v1673803487/usd262net/eyojl1bd1qdj3lp8bjki/RICE-RestIceCompresionElevation_1.pdf
IGNORING PDF https://resources.finalsite.net/images/v1673784913/usd262net/mlrg9xwsotm3a6ccmazy/ESI-DocumentsforWebsite-19.pdf
IGNORING PDF https://resources.finalsite.net/images/v1673803799/usd262net/rczvldr6kah713hisfwx/JanuaryPTOMeetingMinutes.pdf
2023-11-30 15:00:43 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.usd262.net/about/report-bullying-safety-concerns> (referer: https://west.usd262.net/about/emergency-safety-interventions-bullying)
IGNORING PDF https://resources.finalsite.net/images/v1673784915/usd262net/u4efohzm82jnbzzsqxd3/FERPANotificationofRights.pdf
2023-11-30 15:00:43 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'www.p3tips.com': <GET https://www.p3tips.com/tipform.aspx?ID=217>
2023-11-30 15:00:43 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'www.crisistextline.org': <GET https://www.crisistextline.org/texting-in/>
2023-11-30 15:00:43 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'www.kbi.ks.gov': <GET https://www.kbi.ks.gov/sar>
2023-11-30 15:00:43 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'usd262.onlinesafetyhub.io': <GET https://usd262.onlinesafetyhub.io/>
IGNORING PDF https://resources.finalsite.net/images/v1673803764/usd262net/yswpmxj1ivn5dr4onfue/OctoberPTOmeeting.pdf
IGNORING PDF https://resources.finalsite.net/images/v1673803749/usd262net/cfcylzqvhzvvsacorltx/SeptemberPTOmeetingnotes.pdf
2023-11-30 15:00:43 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.usd262.net/about/news> (referer: https://west.usd262.net/)
2023-11-30 15:00:43 [scrapy.core.engine] DEBUG: Crawled (404) <GET https://west.usd262.net/fs/pages/3508> (referer: https://west.usd262.net/about/document-library)
2023-11-30 15:00:43 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.usd262.net/schools> (referer: https://west.usd262.net/)
IGNORING PDF https://resources.finalsite.net/images/v1673803706/usd262net/doln0ockhdm39lkfntxm/NovPTOmeetingminutes162021.pdf
IGNORING PDF https://resources.finalsite.net/images/v1673803735/usd262net/zltnhhnyt2jz1fi8k8gy/MarchPTOMeetingMinutes222022.pdf
IGNORING PDF https://resources.finalsite.net/images/v1673803693/usd262net/fel78cnko0opxf96lefx/OctPTOMeetingminutes192021.pdf
IGNORING PDF https://resources.finalsite.net/images/v1673803721/usd262net/idowphs1sgrl2xrnellg/JanPTOMeetingMinutes1820221.pdf
IGNORING PDF https://resources.finalsite.net/images/v1673803679/usd262net/e6jc2mep0odspayjzxmo/SeptPTOMeetingMinutes.pdf
IGNORING PDF https://resources.finalsite.net/images/v1673803649/usd262net/eultmjehz33n29yf5nqt/PTOMeetingMinutes2020221.pdf
IGNORING PDF https://resources.finalsite.net/images/v1673803664/usd262net/x6uh9b9s0lxpm3h8nmdx/AugustthPTOMinutes.pdf
IGNORING PDF https://resources.finalsite.net/images/v1673803634/usd262net/izumknwsghgbzuouu4ui/PTOMeetingMinutes2320221.pdf
2023-11-30 15:00:44 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.usd262.net/enrollment/student-health-information> (referer: https://west.usd262.net/about/nurses-page)
2023-11-30 15:00:44 [scrapy.core.engine] INFO: Closing spider (finished)
2023-11-30 15:00:44 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 38365,
'downloader/request_count': 65,
'downloader/request_method_count/GET': 65,
'downloader/response_bytes': 248536,
'downloader/response_count': 65,
'downloader/response_status_count/200': 24,
'downloader/response_status_count/301': 6,
'downloader/response_status_count/302': 34,
'downloader/response_status_count/404': 1,
'dupefilter/filtered': 402,
'elapsed_time_seconds': 8.931808,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2023, 11, 30, 23, 0, 44, 907376),
'httpcompression/response_bytes': 795436,
'httpcompression/response_count': 25,
'log_count/DEBUG': 69,
'log_count/INFO': 10,
'offsite/domains': 35,
'offsite/filtered': 962,
'request_depth_max': 3,
'response_received_count': 25,
'scheduler/dequeued': 65,
'scheduler/dequeued/memory': 65,
'scheduler/enqueued': 65,
'scheduler/enqueued/memory': 65,
'start_time': datetime.datetime(2023, 11, 30, 23, 0, 35, 975568)}
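As a complementary safeguard, the parse callback itself could skip any response whose body is not text, which would also cover binary files reached without a redirect. The sketch below is runnable without Scrapy, so it relies on duck typing: the two response classes are hypothetical stand-ins (not Scrapy classes) that mimic the AttributeError seen in the traceback, and `is_text_response` is an illustrative helper name. In a real spider, the more usual check would likely be `isinstance(response, scrapy.http.TextResponse)`.

```python
class FakeBinaryResponse:
    """Hypothetical stand-in for a non-text response (not a Scrapy class)."""

    @property
    def text(self):
        # Mirrors the error shown in the traceback from the question.
        raise AttributeError("Response content isn't text")


class FakeHtmlResponse:
    """Hypothetical stand-in for a text response (not a Scrapy class)."""

    def __init__(self, text):
        self.text = text


def is_text_response(response) -> bool:
    """Return True if reading response.text would not raise AttributeError."""
    # hasattr() returns False when the property raises AttributeError.
    return hasattr(response, "text")


def parse(response):
    """Generator callback that skips responses without a text body."""
    if not is_text_response(response):
        return  # nothing to extract from a binary body
    yield response.text  # placeholder for the link-extraction logic


print(list(parse(FakeBinaryResponse())))        # []
print(list(parse(FakeHtmlResponse("<html>"))))  # ['<html>']
```

This guard does not stop the file from being downloaded; it only keeps the link extractor from being called on it, so the redirect middleware remains the cheaper fix for the redirect case.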
Answered By - Alexander