Thursday, January 27, 2022

[FIXED] Scraping a list with scrapy and structure it

python, scrapy, web-scraping

Issue

I'm trying to scrape every title and score from the page https://myanimelist.net/animelist/MoonlessMidnite?status=7 and return the data in this form:

{"user" : moonlessmidnite, "anime" : A, "score" : x 
"user" : moonlessmidnite, "anime" : B, "score" : x 
"user" : moonlessmidnite, "anime" : C, "score" : x }

...etc.

I managed to get the table rows:

table = response.xpath('.//tr[@class = "list-table-data"]')

score = table.xpath('.//td[@class =  "data score"]//a/text()').extract()
title = table.xpath('.//td//a[@class = "link sort"]').extract()

But when I try to scrape the title or score, I get some weird output like:

['\n            ', '\n          ', '${ item.anime_title }']

Solution

Look at the raw HTML of the page: you will see that it does indeed contain the literal text ${ item.anime_title }.

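That indicates that the content is generated via JavaScript, and ${ item.anime_title } is a template placeholder that has not been filled in yet. There is no easy general solution for that: you usually have to look at the XHR requests the page makes and see whether any of them returns something meaningful.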

Here, however, if you look closely at the HTML, you will see that the data is contained in a big JSON string in the table's data-items attribute.
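The idea can be tried offline before touching the live site. The following is a minimal sketch using only the standard library instead of Scrapy. SAMPLE_HTML is a hypothetical, heavily trimmed stand-in for the real page, and DataItemsParser is an illustrative name, not part of any library. It assumes the JSON is stored HTML-escaped in the attribute, which is how JSON normally has to be embedded in an HTML attribute.

```python
import json
from html.parser import HTMLParser

# Hypothetical, trimmed stand-in for the real page: the JSON sits
# HTML-escaped in the data-items attribute, while the table body only
# holds the unrendered template placeholder.
SAMPLE_HTML = (
    '<table class="list-table" data-items="[{&quot;anime_title&quot;:'
    '&quot;.hack//Roots&quot;,&quot;score&quot;:0}]">'
    '<tbody><tr><td>${ item.anime_title }</td></tr></tbody></table>'
)


class DataItemsParser(HTMLParser):
    """Collects the data-items attribute of the first table.list-table."""

    def __init__(self):
        super().__init__()
        self.data_items = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        classes = (attributes.get("class") or "").split()
        if tag == "table" and "list-table" in classes and self.data_items is None:
            # HTMLParser has already unescaped &quot; in attribute values.
            self.data_items = attributes.get("data-items")


parser = DataItemsParser()
parser.feed(SAMPLE_HTML)
items = json.loads(parser.data_items)
print(items[0]["anime_title"], items[0]["score"])
```

Scrapy's selectors unescape attribute values in the same way, so the string returned by the XPath query can be passed straight to json.loads.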

Try this in the scrapy shell:

fetch('https://myanimelist.net/animelist/MoonlessMidnite?status=7')
import json
json.loads(response.xpath('//table[@class="list-table"]/@data-items').extract_first())

This outputs something like this:

{'status': 2,
  'score': 0,
  'tags': '',
  'is_rewatching': 0,
  'num_watched_episodes': 1,
  'anime_title': 'Hidan no Aria Special',
  'anime_num_episodes': 1,
  'anime_airing_status': 2,
  'anime_id': 10604,
  'anime_studios': None,
  'anime_licensors': None,
  'anime_season': None,
  'has_episode_video': False,
  'has_promotion_video': True,
  'has_video': True,
  'video_url': '/anime/10604/Hidan_no_Aria_Special/video',
  'anime_url': '/anime/10604/Hidan_no_Aria_Special',
  'anime_image_path': 'https://cdn.myanimelist.net/r/96x136/images/anime/2/29138.jpg?s=90cb8381c58c92d39862ac700c43f7b5',
  'is_added_to_list': False,
  'anime_media_type_string': 'Special',
  'anime_mpaa_rating_string': 'PG-13',
  'start_date_string': None,
  'finish_date_string': None,
  'anime_start_date_string': '12-21-11',
  'anime_end_date_string': '12-21-11',
  'days_string': None,
  'storage_string': '',
  'priority_string': 'Low'},
 {'status': 6,
  'score': 0,
  'tags': '',
  'is_rewatching': 0,
  'num_watched_episodes': 0,
  'anime_title': '.hack//Roots',
  'anime_num_episodes': 26,
  'anime_airing_status': 2,
  'anime_id': 873,
  'anime_studios': None,
  'anime_licensors': None,
  'anime_season': None,
  'has_episode_video': False,
  'has_promotion_video': True,
  'has_video': True,
  'video_url': '/anime/873/hack__Roots/video',
  'anime_url': '/anime/873/hack__Roots',
  'anime_image_path': 'https://cdn.myanimelist.net/r/96x136/images/anime/3/13050.jpg?s=db9ff70bf19742172f1d0140c95c4a65',
  'is_added_to_list': False,
  'anime_media_type_string': 'TV',
  'anime_mpaa_rating_string': 'PG-13',
  'start_date_string': None,
  'finish_date_string': None,
  'anime_start_date_string': '04-06-06',
  'anime_end_date_string': '09-28-06',
  'days_string': None,
  'storage_string': '',
  'priority_string': 'Low'}

The result is a list of dicts, one per anime, so you then just have to read the fields you need (for example anime_title and score) from each entry.
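As a rough sketch of that last step, the function below reshapes the parsed entries into the structure asked for in the question. Two assumptions: build_records is a hypothetical helper name, and the result is a list of small dicts rather than a single dict, since one dict cannot repeat the "user" key. The sample data reuses the two entries and the field names from the output above, trimmed to the relevant keys.

```python
import json


def build_records(data_items_json, user):
    """Turn the data-items JSON string into a list of user/anime/score dicts."""
    items = json.loads(data_items_json)
    return [
        {"user": user, "anime": item["anime_title"], "score": item["score"]}
        for item in items
    ]


# Trimmed sample with the two entries shown in the output above.
sample = json.dumps([
    {"status": 2, "score": 0, "anime_title": "Hidan no Aria Special"},
    {"status": 6, "score": 0, "anime_title": ".hack//Roots"},
])

records = build_records(sample, "MoonlessMidnite")
for record in records:
    print(record)
```

Inside a spider callback, the first argument would be the string returned by response.xpath('//table[@class="list-table"]/@data-items').get(), and each record could be yielded as an item.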

Answered By - Guillaume

This answer was collected from Stack Overflow and tested by PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
