Issue
My intention is to follow the Next link generated by `href="javascript:submitAction_win0(document.win0,'HRS_APPL_WRK_HRS_LST_NEXT')"`. Just as an example I am taking [this url][1]. On that page you can see the Next link at the bottom; if you inspect its HTML, the link is written through `href="javascript:submitAction_win0(document.win0,'HRS_APPL_WRK_HRS_LST_NEXT')"`, so the effective `href` value is `#`. I am trying to collect those `href` values even though they are `#`.
from urlparse import urljoin

from scrapy import Request, Selector


def parse(self, response):
    selector = Selector(response)
    # Collect the numeric ids shown in the listing
    for link in selector.css('span.PSEDITBOX_DISPONLY').re(r'.*>(\d+)<.*'):
        abc = 'xxxx'
        yield Request(abc, callback=self.parse_listing_page, dont_filter=True)
    nav_pages = selector.css('div#win0divHRS_APPL_WRK_HRS_LST_NEXT a').extract()
    print nav_pages
    for nav_page in nav_pages:
        # Pass the url back to parse to follow the pagination
        yield Request(urljoin('xxx', nav_page), self.parse, dont_filter=True)
When I run the above code I get "HTTP status code is not handled or not allowed", i.e. nothing is crawled. Can anyone tell me how to follow Next through that `href="javascript:submitAction_win0(document.win0,'HRS_APPL_WRK_HRS_LST_NEXT')"` function, and why the result is empty? I am also seeing something weird in the HTML: for example, on one page the Next anchor is `<a id="HRS_APPL_WRK_HRS_LST_NEXT" class="PSHYPERLINK" href="javascript:submitAction_win0(document.win0,'HRS_APPL_WRK_HRS_LST_NEXT');" tabindex="74" ptlinktgt="pt_replace" name="HRS_APPL_WRK_HRS_LST_NEXT"></a>`.
Thanks in advance
output :
[u'<a name="HRS_APPL_WRK_HRS_LST_NEXT" id="HRS_APPL_WRK_HRS_LST_NEXT" ptlinktgt="pt_replace" tabindex="74" href="javascript:submitAction_win0(document.win0,\'HRS_APPL_WRK_HRS_LST_NEXT\');" class="PSHYPERLINK">Next</a>']
Solution
Scrapy doesn't support JavaScript by itself, but there are several mechanisms you can use to deal with JavaScript:
- Splash - Splash is a JavaScript rendering service with an HTTP API. It is a lightweight browser, implemented in Python using Twisted and Qt.
- Scrapyjs - This library provides Scrapy/JavaScript integration through two different mechanisms: a Scrapy download handler and a Scrapy downloader middleware.
- SpiderMonkey - Executes arbitrary JavaScript code from Python, and allows you to reference arbitrary Python objects and functions in the JavaScript VM.
- spynner - Spynner is a stateful programmatic web browser module for Python, based on PyQt and WebKit. It supports JavaScript, AJAX, and every other technology that WebKit is able to handle (Flash, SVG, ...). Spynner takes advantage of jQuery, a powerful JavaScript library that makes interaction with pages and event simulation really easy.
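Before reaching for a full JavaScript engine, it can be enough to notice that the `javascript:` href only encodes a form-action name. A sketch (pure-Python, no Scrapy needed) that extracts that name from the anchor in the question; the idea, assuming the page follows the usual PeopleSoft convention of posting the action name in an `ICAction` form field, is that you could then replicate the submit with Scrapy's `FormRequest.from_response` instead of executing the JavaScript:

```python
import re

def extract_action(js_href):
    """Pull the action name out of a submitAction_win0(...) javascript href."""
    m = re.search(r"submitAction_win0\([^,]+,\s*'([^']+)'\)", js_href)
    return m.group(1) if m else None

# Example href taken verbatim from the question's output
href = "javascript:submitAction_win0(document.win0,'HRS_APPL_WRK_HRS_LST_NEXT');"
print(extract_action(href))  # HRS_APPL_WRK_HRS_LST_NEXT
```

In a spider you might then issue something like `FormRequest.from_response(response, formdata={'ICAction': extract_action(href)}, callback=self.parse)`; the `ICAction` field name is an assumption about this particular PeopleSoft page and should be verified in the browser's network tab.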
Answered By - Nima Soroush