Issue
I'm crawling this page (https://cve.mitre.org/cgi-bin/cvekey.cgi?keyword=hp) with Scrapy and I'm trying to extract all the rows of the main table.
The following XPath expression should give me the wanted result:
//div[@id='TableWithRules']//tbody/tr
Testing with the scrapy shell made me notice that this expression returns an empty list:
# This one returns an empty list: []
response.xpath("//div[@id='TableWithRules']//tbody").extract()
# This one does not:
response.xpath("//div[@id='TableWithRules']//thead").extract()
I guess the website owners are trying to limit scraping of the table data, but is there any way to work around this?
Solution
This is happening because you are trying to query a non-existent element. The tbody element is often injected into the HTML by the browser and doesn't actually exist in the source HTML before the page is rendered. You can see this if you inspect the page source.
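As a quick sanity check (this snippet is my own illustration, not part of the original answer, and assumes the response body decodes as text), you can search the raw, unrendered HTML from the same scrapy shell session; the tbody tag is simply not there while thead is:

In [1]: "<tbody" in response.text
Out[1]: False

In [2]: "<thead" in response.text
Out[2]: True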
A possible workaround to get all of the rows would be to simply bypass the tbody tag and query the rows directly:
Example:
scrapy shell https://cve.mitre.org/cgi-bin/cvekey.cgi?keyword=hp
In [1]: rows = response.xpath("//div[@id='TableWithRules']//tr")
In [2]: len(rows)
Out[2]: 3366
Or, if you want to skip the header row, you can select only the rows that contain td cells:
In [1]: rows = response.xpath("//div[@id='TableWithRules']//tr[td]")
In [2]: len(rows)
Out[2]: 3365
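Putting it together, here is a minimal spider sketch of how that row selector could be used outside the shell. This is my own illustration rather than the original answer's code, and the spider name and item field names ("cve_id", "description") are hypothetical; I'm assuming the table's first column holds the CVE identifier and the second its description.

import scrapy


class CveSpider(scrapy.Spider):
    name = "cve_hp"
    start_urls = ["https://cve.mitre.org/cgi-bin/cvekey.cgi?keyword=hp"]

    def parse(self, response):
        # Query the rows directly, bypassing the tbody tag that only the
        # browser injects; tr[td] also skips the header row.
        for row in response.xpath("//div[@id='TableWithRules']//tr[td]"):
            cells = row.xpath("./td//text()").getall()
            if not cells:
                continue
            yield {
                # Hypothetical field names based on the assumed column layout.
                "cve_id": cells[0].strip(),
                "description": " ".join(c.strip() for c in cells[1:]).strip(),
            }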
Answered By - Alexander