Issue
I'm trying to scrape every title and score from the page https://myanimelist.net/animelist/MoonlessMidnite?status=7 and return the data in this form:
{"user" : moonlessmidnite, "anime" : A, "score" : x
"user" : moonlessmidnite, "anime" : B, "score" : x
"user" : moonlessmidnite, "anime" : C, "score" : x }
...etc.
I managed to get the table:
table = response.xpath('.//tr[@class = "list-table-data"]')
score = table.xpath('.//td[@class = "data score"]//a/text()').extract()
title = table.xpath('.//td//a[@class = "link sort"]').extract()
But when I try to scrape the title or score, I get some weird output like:
['\n ', '\n ', '${ item.anime_title }']
Solution
Look at the raw HTML of the page: you will see that it does indeed contain the literal text ${ item.anime_title }.
That indicates that the content is rendered with JavaScript, so the table cells only hold template placeholders when Scrapy downloads the page. In general there is no easy solution for that: you have to look at the XHR requests the page makes and see whether they return something meaningful.
In this case, though, if you look closely at the HTML, you will see that the data is contained in a big JSON string in the table's data-items attribute.
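To make the idea concrete, here is a minimal sketch using only the Python standard library. The HTML snippet is a made-up, heavily shortened stand-in for the real page (an assumption, not a copy of it), and DataItemsParser is a hypothetical helper name. It shows that the attribute value is HTML-escaped JSON, which the parser unescapes before json.loads sees it.

```python
import json
from html.parser import HTMLParser


class DataItemsParser(HTMLParser):
    """Collects the JSON stored in the data-items attribute of the list table."""

    def __init__(self):
        super().__init__()
        self.items = None

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        classes = (attr_map.get("class") or "").split()
        if tag == "table" and "list-table" in classes:
            raw = attr_map.get("data-items")
            if raw is not None:
                # HTMLParser has already turned &quot; back into plain quotes.
                self.items = json.loads(raw)


# Hypothetical, shortened sample of the markup (not the real page source).
sample_html = (
    '<table class="list-table" data-items="[{&quot;anime_title&quot;:'
    '&quot;.hack//Roots&quot;,&quot;score&quot;:0}]"></table>'
)

parser = DataItemsParser()
parser.feed(sample_html)
print(parser.items)
```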
Try this in the scrapy shell:
fetch('https://myanimelist.net/animelist/MoonlessMidnite?status=7')
import json
json.loads(response.xpath('//table[@class="list-table"]/@data-items').extract_first())
This outputs something like this:
{'status': 2,
'score': 0,
'tags': '',
'is_rewatching': 0,
'num_watched_episodes': 1,
'anime_title': 'Hidan no Aria Special',
'anime_num_episodes': 1,
'anime_airing_status': 2,
'anime_id': 10604,
'anime_studios': None,
'anime_licensors': None,
'anime_season': None,
'has_episode_video': False,
'has_promotion_video': True,
'has_video': True,
'video_url': '/anime/10604/Hidan_no_Aria_Special/video',
'anime_url': '/anime/10604/Hidan_no_Aria_Special',
'anime_image_path': 'https://cdn.myanimelist.net/r/96x136/images/anime/2/29138.jpg?s=90cb8381c58c92d39862ac700c43f7b5',
'is_added_to_list': False,
'anime_media_type_string': 'Special',
'anime_mpaa_rating_string': 'PG-13',
'start_date_string': None,
'finish_date_string': None,
'anime_start_date_string': '12-21-11',
'anime_end_date_string': '12-21-11',
'days_string': None,
'storage_string': '',
'priority_string': 'Low'},
{'status': 6,
'score': 0,
'tags': '',
'is_rewatching': 0,
'num_watched_episodes': 0,
'anime_title': '.hack//Roots',
'anime_num_episodes': 26,
'anime_airing_status': 2,
'anime_id': 873,
'anime_studios': None,
'anime_licensors': None,
'anime_season': None,
'has_episode_video': False,
'has_promotion_video': True,
'has_video': True,
'video_url': '/anime/873/hack__Roots/video',
'anime_url': '/anime/873/hack__Roots',
'anime_image_path': 'https://cdn.myanimelist.net/r/96x136/images/anime/3/13050.jpg?s=db9ff70bf19742172f1d0140c95c4a65',
'is_added_to_list': False,
'anime_media_type_string': 'TV',
'anime_mpaa_rating_string': 'PG-13',
'start_date_string': None,
'finish_date_string': None,
'anime_start_date_string': '04-06-06',
'anime_end_date_string': '09-28-06',
'days_string': None,
'storage_string': '',
'priority_string': 'Low'}
json.loads returns a list with one dict per anime, so you then just have to read the keys you need (here anime_title and score) from each dict.
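As a rough sketch of that last step, the parsed entries can be reshaped into the user/anime/score records asked for in the question. The function name build_records and the inline sample data are assumptions made for illustration; in a spider you would pass in the list obtained from the data-items attribute and yield each record instead of collecting them.

```python
import json


def build_records(user, items):
    """Reduce each parsed list entry to the three fields the question asks for."""
    return [
        {"user": user, "anime": item["anime_title"], "score": item["score"]}
        for item in items
    ]


# Shortened sample of the parsed data; the real entries carry many more keys.
sample_items = json.loads(
    '[{"anime_title": "Hidan no Aria Special", "score": 0},'
    ' {"anime_title": ".hack//Roots", "score": 0}]'
)

records = build_records("MoonlessMidnite", sample_items)
for record in records:
    print(record)
```

A list of records is used here rather than the single dict sketched in the question, because a dict cannot hold the same "user", "anime" and "score" keys more than once.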
Answered By - Guillaume