Issue
Using Scrapy, I want to extract the data shown in a dynamic table on a webpage. Because the table is populated dynamically, an XPath query on the tbody tag returns no data:
In [1]: response.xpath('//table/tbody').getall()
Out[1]: ['<tbody></tbody>']
On the other hand, an XPath query on the table tag returns all of the data, already in structured form:
In [2]: response.xpath('//table').getall()
Out[2]: ['<table class="table icms-dt rs_preserve" cellspacing="0" width="100%" id="publikation" data-webpack-module="datatables" data-entity-type="publikation" data-entities="{"emptyColumns":["privatKategorie","_thumbnail"],"data":[{"name":"<a href=\\"\\/_rte\\/publikation\\/35897\\">Nutzungsbedingungen<\\/a>","name-sort":"nutzungsbedingungen","herausgeber":"Informatikdienst","herausgeber-sort":"informatikdienst","datum":"16.12.2010","datum-sort":"2010-12-16","kategorieId":"publikation","kategorieId-sort":"publikation","privatKategorie":"","privatKategorie-sort":"","_thumbnail":"","_downloadBtn
I want to extract the table data in a structured way, e.g. by row and column. Is there a way to do this, with BeautifulSoup for instance? Any ideas and help are highly appreciated.
The table can be examined with scrapy shell as follows:
scrapy shell "rapperswil-jona.ch/publikationen"
Solution
Here you go:
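import json

# The table's JSON payload is stored in its data-entities attribute
raw_data = response.xpath('//table/@data-entities').get()
# Parse the JSON string into a Python dict
data = json.loads(raw_data)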
The data is in the data-entities attribute. You can extract it using the XPath above, which returns a string. This string can then be converted to a dict using json.loads().
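To make the parsing step concrete, here is a minimal sketch that runs without Scrapy. The string below is an abbreviated sample trimmed from the output in the question (the real attribute is much longer), so treat it as an assumption about the payload's shape rather than the full data.

```python
import json

# Abbreviated sample of the data-entities value, trimmed from the question's output.
# A raw string keeps the JSON escapes (\" and \/) exactly as they appear in the attribute.
raw_data = r'''{"emptyColumns":["privatKategorie","_thumbnail"],"data":[{"name":"<a href=\"\/_rte\/publikation\/35897\">Nutzungsbedingungen<\/a>","name-sort":"nutzungsbedingungen","herausgeber":"Informatikdienst","datum":"16.12.2010","datum-sort":"2010-12-16"}]}'''

data = json.loads(raw_data)

print(type(data).__name__)  # the parsed payload is a dict
print(sorted(data))         # its top-level keys

first_row = data["data"][0]
# json.loads decodes the \" and \/ escapes; the HTML link itself is left untouched
print(first_row["name"])
```

Note that the name field still contains an HTML link after parsing, so it may need a further cleaning step if only the link text is wanted.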
Expanding on this, the actual data is under the key data. If you access it, you will get a list. You can loop over it, export it to CSV, or process it further as you wish:
for item in data['data']:
    print(item['name-sort'])
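For the CSV export mentioned above, one possible sketch uses only the standard library. The helper names strip_tags and rows_to_csv are hypothetical, the sample row is taken from the question's output, and the choice of columns is an assumption. Since the name field holds an HTML link, the tags are stripped with html.parser here; BeautifulSoup's get_text() would serve the same purpose.

```python
import csv
import io
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_tags(fragment):
    """Hypothetical helper: return the fragment's text with all tags removed."""
    parser = _TextExtractor()
    parser.feed(fragment)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts).strip()


def rows_to_csv(rows, columns):
    """Hypothetical helper: write the chosen columns of each row as CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns)
    writer.writeheader()
    for row in rows:
        # Missing keys become empty cells; HTML in any cell is reduced to its text
        writer.writerow({col: strip_tags(str(row.get(col, ""))) for col in columns})
    return buffer.getvalue()


# Sample row shaped like one entry of data['data'] in the question
rows = [
    {
        "name": '<a href="/_rte/publikation/35897">Nutzungsbedingungen</a>',
        "name-sort": "nutzungsbedingungen",
        "herausgeber": "Informatikdienst",
        "datum": "16.12.2010",
    }
]

csv_text = rows_to_csv(rows, ["name", "herausgeber", "datum"])
print(csv_text)
```

In a real run, rows would be data['data'] from the parsed attribute, and the CSV text could be written to a file instead of printed.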
Answered By - Upendra