Issue
Trying to follow this github page in order to learn crawl nested divs in facebook. https://github.com/talhashraf/major-scrapy-spiders/blob/master/mss/spiders/facebook_profile.py
parse_info_text_only
or parse_info_has_image
in the file works fine getting the span information
I have a similar page that I am trying to get the result_id
from a nested div, however result_id
is in div itself.
From what I understand div I am trying to scrape is in 2nd row, so I try something like
def parse_profile(self, response):
item["BrowseResultsContainer"] = self.parse_info_has_id(response.css('#BrowseResultsContainer'))
return item
def parse_info_has_id(self, css_path):
text = css_path.xpath('div/div').extract()
text = [t.strip() for t in text]
text = [t for t in text if re.search('result_id', t)]
return "\n".join(text)
How can I get the data-xt from above nested div?
Solution
with css:
import json
...
def parse_info_has_id(self, css_path):
text = css_path.xpath('div::attr(data-gt)').extract_first()
d = json.loads(text)
return d['result_id']
Answered By - eLRuLL
0 comments:
Post a Comment
Note: Only a member of this blog may post a comment.