Issue
I am attempting to extract bid information from this site. I am a Scrapy newbie and a bit stuck as to why I am not getting any output; instead, I get Crawled (200)...(referer: None) and no items. I can't figure out what I am missing or need to change. Can anyone please help me figure this out?
Thank you!!
Here is my spider code:
from ..items import GovernmentItem
import scrapy


class GeorgiaSpider(scrapy.Spider):
    name = 'georgia'
    allowed_domains = ['ssl.doas.state.ga.us']

    def start_requests(self):
        url = 'https://ssl.doas.state.ga.us/gpr/'
        yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        for row in response.xpath('//*[@class="table table-striped table-bordered"]//tbody//tr'):
            item = GovernmentItem()
            item['description'] = row.xpath('./td[@class=" all"][2]').extract_first()
            item['begin_date'] = row.xpath('./td[@class=" desktop"]').extract_first()
            item['end_date'] = row.xpath('./td[@class="desktop tablet mobile sorting_1"]').extract_first()
            item['file_urls'] = row.xpath('./td[@class=" all"]/a//@href').extract_first()
            yield item
Here is my crawl log file:
2021-07-23 05:49:13 [scrapy.utils.log] INFO: Scrapy 2.5.0 started (bot: government)
2021-07-23 05:49:13 [scrapy.utils.log] INFO: Versions: lxml 4.6.3.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 21.2.0, Python 3.8.10 (default, Jun 2 2021, 10:49:15) - [GCC 9.4.0], pyOpenSSL 20.0.1 (OpenSSL 1.1.1k 25 Mar 2021), cryptography 3.4.7, Platform Linux-5.8.0-63-generic-x86_64-with-glibc2.29
2021-07-23 05:49:13 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.epollreactor.EPollReactor
2021-07-23 05:49:13 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'government',
'DOWNLOAD_DELAY': 1,
'NEWSPIDER_MODULE': 'government.spiders',
'SPIDER_MODULES': ['government.spiders']}
2021-07-23 05:49:13 [scrapy.extensions.telnet] INFO: Telnet Password: 1196e88aa45a90c1
2021-07-23 05:49:13 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.feedexport.FeedExporter',
'scrapy.extensions.logstats.LogStats']
2021-07-23 05:49:13 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2021-07-23 05:49:13 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2021-07-23 05:49:13 [scrapy.middleware] INFO: Enabled item pipelines:
['government.pipelines.GovernmentPipeline',
'scrapy.pipelines.files.FilesPipeline']
2021-07-23 05:49:13 [scrapy.core.engine] INFO: Spider opened
2021-07-23 05:49:13 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2021-07-23 05:49:13 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2021-07-23 05:49:14 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://ssl.doas.state.ga.us/gpr/unsupported?browser=> from <GET https://ssl.doas.state.ga.us/gpr/>
2021-07-23 05:49:15 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://ssl.doas.state.ga.us/gpr/unsupported?browser=> (referer: None)
2021-07-23 05:49:15 [scrapy.core.engine] INFO: Closing spider (finished)
2021-07-23 05:49:15 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 468,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 6169,
'downloader/response_count': 2,
'downloader/response_status_count/200': 1,
'downloader/response_status_count/302': 1,
'elapsed_time_seconds': 1.564505,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2021, 7, 23, 10, 49, 15, 561300),
'log_count/DEBUG': 2,
'log_count/INFO': 10,
'memusage/max': 55824384,
'memusage/startup': 55824384,
'response_received_count': 1,
'scheduler/dequeued': 2,
'scheduler/dequeued/memory': 2,
'scheduler/enqueued': 2,
'scheduler/enqueued/memory': 2,
'start_time': datetime.datetime(2021, 7, 23, 10, 49, 13, 996795)}
2021-07-23 05:49:15 [scrapy.core.engine] INFO: Spider closed (finished)
Solution
As SuperUser mentioned, your original URL is being redirected because the website expects the request to come from a real browser. To mimic browser behaviour in Scrapy, send a browser user-agent, either in settings.py or as a header in your spider.py; that will get you the page source HTML.
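Concretely, either option looks like this (the user-agent string below is just an example; any current browser UA works):

```python
# Option 1: in settings.py -- applied to every request the spider sends
USER_AGENT = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
              "AppleWebKit/537.36 (KHTML, like Gecko) "
              "Chrome/92.0.4515.107 Safari/537.36")

# Option 2: per request, in spider.py -- pass it as a header
# yield scrapy.Request(url=url, headers={"User-Agent": USER_AGENT}, callback=self.parse)
```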
Even then, your XPath still won't work, because the data you are looking for is generated dynamically. Instead, reproduce the request the page makes (find it with your browser's dev tools) and query that API directly to get the desired results.
The code below returns a JSON response. For demonstration, I have extracted only one field; you can get the other fields the same way.
Code
import json

import scrapy

from ..items import GovernmentItem


class Test(scrapy.Spider):
    name = 'test'

    headers = {
        "authority": "ssl.doas.state.ga.us",
        "pragma": "no-cache",
        "cache-control": "no-cache",
        "sec-ch-ua": "\"Chromium\";v=\"92\", \" Not A;Brand\";v=\"99\", \"Google Chrome\";v=\"92\"",
        "accept": "application/json, text/javascript, */*; q=0.01",
        "x-requested-with": "XMLHttpRequest",
        "sec-ch-ua-mobile": "?0",
        "user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36",
        "content-type": "application/x-www-form-urlencoded; charset=UTF-8",
        "origin": "https://ssl.doas.state.ga.us",
        "sec-fetch-site": "same-origin",
        "sec-fetch-mode": "cors",
        "sec-fetch-dest": "empty",
        "referer": "https://ssl.doas.state.ga.us/gpr/",
        "accept-language": "en-US,en;q=0.9"
    }

    body = 'draw=1&columns%5B0%5D%5Bdata%5D=function&columns%5B0%5D%5Bname%5D=&columns%5B0%5D%5Bsearchable%5D=true&columns%5B0%5D%5Borderable%5D=false&columns%5B0%5D%5Bsearch%5D%5Bvalue%5D=&columns%5B0%5D%5Bsearch%5D%5Bregex%5D=false&columns%5B1%5D%5Bdata%5D=function&columns%5B1%5D%5Bname%5D=&columns%5B1%5D%5Bsearchable%5D=true&columns%5B1%5D%5Borderable%5D=true&columns%5B1%5D%5Bsearch%5D%5Bvalue%5D=&columns%5B1%5D%5Bsearch%5D%5Bregex%5D=false&columns%5B2%5D%5Bdata%5D=title&columns%5B2%5D%5Bname%5D=&columns%5B2%5D%5Bsearchable%5D=true&columns%5B2%5D%5Borderable%5D=true&columns%5B2%5D%5Bsearch%5D%5Bvalue%5D=&columns%5B2%5D%5Bsearch%5D%5Bregex%5D=false&columns%5B3%5D%5Bdata%5D=agencyName&columns%5B3%5D%5Bname%5D=&columns%5B3%5D%5Bsearchable%5D=true&columns%5B3%5D%5Borderable%5D=true&columns%5B3%5D%5Bsearch%5D%5Bvalue%5D=&columns%5B3%5D%5Bsearch%5D%5Bregex%5D=false&columns%5B4%5D%5Bdata%5D=postingDateStr&columns%5B4%5D%5Bname%5D=&columns%5B4%5D%5Bsearchable%5D=true&columns%5B4%5D%5Borderable%5D=true&columns%5B4%5D%5Bsearch%5D%5Bvalue%5D=&columns%5B4%5D%5Bsearch%5D%5Bregex%5D=false&columns%5B5%5D%5Bdata%5D=closingDateStr&columns%5B5%5D%5Bname%5D=&columns%5B5%5D%5Bsearchable%5D=true&columns%5B5%5D%5Borderable%5D=true&columns%5B5%5D%5Bsearch%5D%5Bvalue%5D=&columns%5B5%5D%5Bsearch%5D%5Bregex%5D=false&columns%5B6%5D%5Bdata%5D=function&columns%5B6%5D%5Bname%5D=&columns%5B6%5D%5Bsearchable%5D=true&columns%5B6%5D%5Borderable%5D=false&columns%5B6%5D%5Bsearch%5D%5Bvalue%5D=&columns%5B6%5D%5Bsearch%5D%5Bregex%5D=false&columns%5B7%5D%5Bdata%5D=status&columns%5B7%5D%5Bname%5D=&columns%5B7%5D%5Bsearchable%5D=true&columns%5B7%5D%5Borderable%5D=false&columns%5B7%5D%5Bsearch%5D%5Bvalue%5D=&columns%5B7%5D%5Bsearch%5D%5Bregex%5D=false&order%5B0%5D%5Bcolumn%5D=5&order%5B0%5D%5Bdir%5D=asc&start=0&length=50&search%5Bvalue%5D=&search%5Bregex%5D=false&responseType=ALL&eventStatus=OPEN&eventIdTitle=&govType=ALL&govEntity=&eventProcessType=ALL&dateRangeType=&rangeStartDate=&rangeEndDate=&isReset=false&persisted=&refreshSearchData=false'

    def start_requests(self):
        url = 'https://ssl.doas.state.ga.us/gpr/eventSearch'
        yield scrapy.Request(url=url, method='POST', headers=self.headers, body=self.body, callback=self.parse)

    def parse(self, response):
        data = json.loads(response.body)
        for i in data.get('data'):
            # Create a fresh item per row; reusing one item object across
            # iterations would yield the same mutated item repeatedly.
            item = GovernmentItem()
            item['title'] = i.get('title')
            yield item
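The other column names (agencyName, postingDateStr, closingDateStr) are visible in the request body above. A small sketch of pulling them out of the JSON payload, independent of Scrapy so it is easy to test (the exact row shape is an assumption based on those column definitions):

```python
import json

# Column names taken from the DataTables request body above.
FIELDS = ['title', 'agencyName', 'postingDateStr', 'closingDateStr']

def rows_from_response(body: bytes):
    """Yield one dict per bid event from the JSON API response body."""
    payload = json.loads(body)
    for row in payload.get('data', []):
        # Missing keys come back as None rather than raising KeyError.
        yield {field: row.get(field) for field in FIELDS}

# Example with a minimal fake payload:
sample = json.dumps({'data': [{'title': 'RFQ 123', 'agencyName': 'DOAS'}]}).encode()
rows = list(rows_from_response(sample))
```

In the spider, the same loop would run inside parse, copying each dict's values into a GovernmentItem before yielding it.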
Answered By - Shivam