Issue
I want to scrape a Web Forms site, but the links aren't regular hrefs; they look like the one below. I want Scrapy to read that link and go to the page it points to.
<a id="ctl00_ContentPlaceHolder1_DtGrdAttraf_ctl06_LnkBtnDisplayHadith" title="some title" class="Txt" onmouseover="changeStyle(this, 'lnk')" onmouseout="changeStyle(this, 'Txt TxtSmall')" href="javascript:WebForm_DoPostBackWithOptions(new WebForm_PostBackOptions("ctl00$ContentPlaceHolder1$DtGrdAttraf$ctl06$LnkBtnDisplayHadith", "", false, "", "http://www.sonnaonline.com/DisplayResults.aspx?Menu=1&ParentID=13&Flag=dbID&Selid=8483", false, true))">the link text</a>
Solution
ASP.NET Web Forms is a form-driven framework. So, to get to the page the link leads to, you have to fill in the form and post it yourself.
How do you do that?
First, you can have a look at my Scrapy code here: https://github.com/Timezone-design/python-scrapy-asp-net/blob/master/scrapy_spider/spiders/burzarada_spider.py
You should first find out what WebForm_DoPostBackWithOptions() does on the page. Open the page source with Ctrl+U and search for the function.
You will soon find out what it does and which form fields it fills these values into: "ctl00$ContentPlaceHolder1$DtGrdAttraf$ctl06$LnkBtnDisplayHadith", "", false, "", "http://www.sonnaonline.com/DisplayResults.aspx?Menu=1&ParentID=13&Flag=dbID&Selid=8483", false, true
Then the rest is clear.
You extract the href of the a tag as a string with
response.css('... a::attr(href)').extract()[0]  # [0] takes the first match, assuming there are many <a>s there
Then split the argument string "ctl00$ContentPlaceHolder1$DtGrdAttraf$ctl06$LnkBtnDisplayHadith", "", false, "", "http://www.sonnaonline.com/DisplayResults.aspx?Menu=1&ParentID=13&Flag=dbID&Selid=8483", false, true
by commas. The first value is the event target, the second is the event argument, and the fifth is the URL the form is posted to. Fill them into the proper input elements and post the form with scrapy.FormRequest.
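The splitting step could be sketched roughly like this for the link in the question. This is only an illustration, not code from the linked spider: the function name parse_postback_href and the dict keys are made up here, and instead of a plain comma split it reads just the quoted arguments with a regular expression, on the assumption that the action URL might itself contain a comma.

```python
import re


def parse_postback_href(href):
    """Pull the quoted arguments out of a WebForm_DoPostBackWithOptions href.

    Hypothetical helper for illustration; assumes the href has the same
    shape as the link shown in the question.
    """
    inner = re.search(r"WebForm_PostBackOptions\((.*)\)\)", href)
    if inner is None:
        raise ValueError("not a WebForm_DoPostBackWithOptions link")

    # Keep only the quoted arguments; the bare true/false flags are skipped.
    quoted = [m.group(2) for m in re.finditer(r"""(["'])(.*?)\1""", inner.group(1))]
    if len(quoted) != 4:
        raise ValueError("expected 4 quoted arguments, got %d" % len(quoted))

    event_target, event_argument, validation_group, action_url = quoted
    return {
        "event_target": event_target,
        "event_argument": event_argument,
        "validation_group": validation_group,
        "action_url": action_url,
    }
```

The event target and action URL recovered this way are the kind of values that go into a form post such as the following one.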
yield scrapy.FormRequest(
    'https://burzarada.hzz.hr/Posloprimac_RadnaMjesta.aspx',
    formdata={
        '__EVENTTARGET': eventTarget,
        '__EVENTARGUMENT': eventArgument,
        '__LASTFOCUS': lastFocus,
        '__VIEWSTATE': viewState,
        '__VIEWSTATEGENERATOR': viewStateGenerator,
        '__VIEWSTATEENCRYPTED': viewStateEncrypted,
        'ctl00$MainContent$ddlPageSize': pageSize,
        'ctl00$MainContent$ddlSort': sort,
    },
    callback=self.parse_multiple_pages
)
Explanation:
https://burzarada.hzz.hr/Posloprimac_RadnaMjesta.aspx # URL to post the form to.
formdata # form data as a dict. Keys are the input names, values are strings.
callback # function that receives the response and does the next things.
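Hidden inputs such as __VIEWSTATE have to be read from the page you are posting from. A minimal sketch of collecting them with the standard library and building the formdata dict is below. The names HiddenInputCollector and build_postback_formdata are hypothetical and do not come from the linked spider, and it assumes the hidden fields appear as plain input elements in the page HTML.

```python
from html.parser import HTMLParser


class HiddenInputCollector(HTMLParser):
    """Collect name/value pairs of <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr = dict(attrs)
        is_hidden = (attr.get("type") or "").lower() == "hidden"
        if is_hidden and attr.get("name"):
            # A missing value attribute is treated as an empty string.
            self.fields[attr["name"]] = attr.get("value") or ""


def build_postback_formdata(page_html, event_target, event_argument=""):
    """Copy the page's hidden fields and set the postback target on top."""
    collector = HiddenInputCollector()
    collector.feed(page_html)
    formdata = dict(collector.fields)
    formdata["__EVENTTARGET"] = event_target
    formdata["__EVENTARGUMENT"] = event_argument
    return formdata
```

In a spider callback, the page HTML would be response.text and the returned dict would be passed as formdata to scrapy.FormRequest. Scrapy's FormRequest.from_response(response, formdata={...}) can also prefill the hidden fields of a form for you, which may be enough when the page has a single form.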
Voilà! You get to the page, and the response is passed as an argument to the function you gave as callback.
You can see some examples in the link above.
Answered By - Nikita