Issue
Let's say I have this structure:
<div data-next="link0">
<a href="link1"/>
<a href="link2"/>
<a href="link3"/>
<a href="link4"/>
</div>
and with my rule object I want to access only link0, without accessing link1, link2, link3, link4.
How can I do that?
I tried
Rule(LinkExtractor(restrict_xpaths=('//div[@data-next]/@data-next')), callback='parse_item'),
but it doesn't work, because restrict_xpaths needs a reference to an element, not to the link itself. But if I remove /@data-next, then link1, link2, link3 and link4 are scraped too.
So, is there any way to scrape just link0 using the Rule object in this context?
Solution
Rule(LinkExtractor(restrict_xpaths='//div[@data-next]', tags='div', attrs='data-next'), callback='parse_item'),
LinkExtractor looks for <a> and <area> tags and the href attribute by default. In this case, you have to specify which tags and attributes it should include in the search. More on that from the Scrapy docs:
Parameters:
(...)
tags (str or list) – a tag or a list of tags to consider when extracting links. Defaults to ('a', 'area').
attrs (list) – an attribute or list of attributes which should be considered when looking for links to extract (only for those tags specified in the tags parameter). Defaults to ('href',).
Answered By - Thiago Curvelo