Issue
I was trying to scrape data from a miHoYo website and ran into a problem. I wanted to read the 'data-src' attribute of the first 'div' tag, but I couldn't.
<a href="[not important]" target="_blank" class="collection-avatar__item" data-v-51c84696="">
<div class="collection-avatar__icon" data-v-51c84696="" data-src="[link to a png image that I need]"
lazy="loaded" style="[not important]">
<div class="red-point" data-v-51c84696="">
<!---->
</div>
</div>
</a>
My code was:
url = "https://bbs.mihoyo.com/bh3/wiki/channel/map/17/18?bbs_presentation_style=no_header"
result = requests.get(url).text
soup = BeautifulSoup(result, 'lxml')
a = soup.find('a' , class_ = 'collection-avatar__item')
b = a.find('div' , class_ = 'collection-avatar__icon')['data-src']
print(b)
It didn't print out anything.
It turned out the problem was in the 'div' tag. When I printed the whole 'div' tag, I got:
<div class="collection-avatar__icon" data-v-51c84696=""><div class="red-point" data-v-51c84696=""><!-- --></div></div>
with the code:
print(a.find('div' , class_ = 'collection-avatar__icon'))
You can see that the 'data-src', 'style' and 'lazy' attributes of the 'div' tag are all gone. It seems like 'data-v-51c84696' blocks everything after it, but I don't know whether that's true.
How can I get the 'data-src'?
P.S. If you want to try it yourself:
- Go to this website, right-click on this character, and click "Inspect". This opens the HTML and takes you to the 'div' tag in question.
- Use this exact code. It will print out the 'div' tag:
import requests
from bs4 import BeautifulSoup
url = "https://bbs.mihoyo.com/bh3/wiki/channel/map/17/18?bbs_presentation_style=no_header"
result = requests.get(url).text
soup = BeautifulSoup(result, 'lxml')
a = soup.find('a' , class_ = 'collection-avatar__item')
b = a.find('div' , class_ = 'collection-avatar__icon')
print(b)
Solution
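The 'data-src' attribute is most likely added by JavaScript after the page loads, so it never appears in the HTML that requests downloads. All of the icon URLs are, however, present in a script tag, so we can search for them with a small regular expression. Unfortunately, this will also match other, unneeded icons, which require additional filtering.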
import requests
import codecs
import re
url = "https://bbs.mihoyo.com/bh3/wiki/channel/map/17/18?bbs_presentation_style=no_header"
result = requests.get(url).text
# Decode backslash escape sequences in the page text before matching.
urls = re.findall(r'icon:"https://uploadstatic.mihoyo.com/bh3-wiki/[\w/\-?=%.]+\.[\w/\-&?=%.]+\"',
                  codecs.decode(result, 'unicode-escape'))
for icon in urls:
    # Each match looks like icon:"<url>"; keep only the part between the quotes.
    print(icon.split("\"")[1])
OUTPUT:
...
https://uploadstatic.mihoyo.com/bh3-wiki/2021/09/30/50494840/15dea92b7fb5cd87ad6627875a266385_9916730472908248.png
https://uploadstatic.mihoyo.com/bh3-wiki/2021/09/30/50494840/5945dff0282af271e16e57c7ad15a4ae_7519164841930847555.png
https://uploadstatic.mihoyo.com/bh3-wiki/2021/09/30/50494840/fcacefd801349fa71e5273e96cc054ba_2447454218864254521.png
https://uploadstatic.mihoyo.com/bh3-wiki/2021/09/30/50494840/7a6aeae8cdb4c91f02d771b741c6904a_845801527787009961.png
https://uploadstatic.mihoyo.com/bh3-wiki/2021/09/30/50494840/d4bd24ff7816eab1d1bd86499eb4c2c7_9095240068801747016.png
https://uploadstatic.mihoyo.com/bh3-wiki/2021/10/02/50494840/07af6da5cccbed3086f815cfb69cdcdb_5033364458283548140.png
https://uploadstatic.mihoyo.com/bh3-wiki/2021/10/02/50494840/9e99b98bc5f6ace0254fd87e59cb084e_4185942565707793730.png
https://uploadstatic.mihoyo.com/bh3-wiki/2021/10/02/50494840/881a3e55c864ea2e5b54831882efc41b_6066979385842423374.png
https://uploadstatic.mihoyo.com/bh3-wiki/2021/09/17/91006211/cb701147b58fa90ce2d1c697a71db069_1970544194740941078.png
https://uploadstatic.mihoyo.com/bh3-wiki/2021/09/17/91006211/421f0e3c83f15f60a082c06a829fd993_8194132893785979350.png
https://uploadstatic.mihoyo.com/bh3-wiki/2021/09/17/91006211/bb2cfcd7fcf39c6895b3e7df38d9e248_5099264011974935251.png
https://uploadstatic.mihoyo.com/bh3-wiki/2021/09/17/91006211/65334dde046add34464b650b7e9e8bba_6688145594601648953.png
https://uploadstatic.mihoyo.com/bh3-wiki/2021/09/17/91006211/0fbbf8a476c36f28fa89ff4d42dabf26_3569844437201314583.png
https://uploadstatic.mihoyo.com/bh3-wiki/2021/09/17/91006211/d5208b782e51b3b3f8a0bfa5b23eeefe_7243168838449411667.png
https://uploadstatic.mihoyo.com/bh3-wiki/2021/09/17/91006211/2a482ec9445fbf87d2d094f02c2b60bb_6707519352437890350.png
https://uploadstatic.mihoyo.com/bh3-wiki/2021/09/17/91006211/3a8e1937b4345b0e3ebf72e18aaca621_3484536741467014552.png
https://uploadstatic.mihoyo.com/bh3-wiki/2021/09/17/91006211/008c7f4a5dc30b3c7b1939ce3bf1f4ac_4846612495464831821.png
https://uploadstatic.mihoyo.com/bh3-wiki/2021/09/16/91006211/0e58c3e59fa446455981d085bf1ea025_8229117182978558973.png
https://uploadstatic.mihoyo.com/bh3-wiki/2021/09/16/91006211/021c1c9b6eaacd70d65afe7b6a70410f_334979533936426827.png
https://uploadstatic.mihoyo.com/bh3-wiki/2021/09/16/91006211/9fbe7444ed947d553b832d349954decf_1841039999904029926.png
https://uploadstatic.mihoyo.com/bh3-wiki/2021/09/17/91006211/a7a0b9f6e047a0697911f3fc53d6ee35_6693951999225202842.png
https://uploadstatic.mihoyo.com/bh3-wiki/2021/09/17/91006211/c85dd72ee95e8a4f67f107ae1cc4927a_3991707354687390950.png
...
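To try the matching step without hitting the live site, here is a minimal offline sketch. The SAMPLE string, its file names (one.png, two.png, cover.jpg) and the \u002F escaping are assumptions made up for illustration, not the real page payload, and extract_icon_urls is a hypothetical helper name. It uses a capture group instead of split, and drops repeated URLs while keeping their order.

```python
import codecs
import re

# Hypothetical stand-in for the page's inline script data (assumed shape,
# made-up file names); the real payload is much larger.
SAMPLE = (
    'list:[{icon:"https:\\u002F\\u002Fuploadstatic.mihoyo.com\\u002Fbh3-wiki\\u002F2021\\u002F09\\u002F30\\u002Fone.png"},'
    '{icon:"https:\\u002F\\u002Fuploadstatic.mihoyo.com\\u002Fbh3-wiki\\u002F2021\\u002F09\\u002F30\\u002Ftwo.png"},'
    '{icon:"https:\\u002F\\u002Fuploadstatic.mihoyo.com\\u002Fbh3-wiki\\u002F2021\\u002F09\\u002F30\\u002Fone.png"},'
    '{cover:"https:\\u002F\\u002Fuploadstatic.mihoyo.com\\u002Fbh3-wiki\\u002F2021\\u002F09\\u002F30\\u002Fcover.jpg"}]'
)

# The capture group returns only the URL, so no splitting on quotes is needed.
ICON_RE = re.compile(r'icon:"(https://uploadstatic\.mihoyo\.com/bh3-wiki/[^"]+)"')


def extract_icon_urls(page_text):
    """Return the unique icon URLs found in page_text, in first-seen order."""
    decoded = codecs.decode(page_text, "unicode-escape")
    # dict.fromkeys removes duplicates while preserving insertion order.
    return list(dict.fromkeys(ICON_RE.findall(decoded)))


for icon_url in extract_icon_urls(SAMPLE):
    print(icon_url)
```

Against the real page, the same function could be called on requests.get(url).text. Narrowing the result to only the character icons would still need an extra condition that depends on how the page's data is laid out.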
Answered By - Sergey K