Issue
I am learning web scraping with Beautiful Soup and tried to fetch a page's HTML, but the result contains only a small part of the page's content and I don't know why. Please help.
import requests
from bs4 import BeautifulSoup

# Fetch the raw HTML the server sends; no JavaScript is executed here.
response = requests.get('https://comic.naver.com/webtoon')
soup = BeautifulSoup(response.text, 'lxml')
print(soup.prettify())
Is it impossible to get this content with Python and Beautiful Soup? If so, what should I use instead?
I searched Google extensively but couldn't find an answer.
Solution
$ curl -i -s 'https://comic.naver.com/webtoon' | grep -E '^<script'
<script type="text/javascript" src="/runtime-2adfe8d0e350c84f0a29.js"></script>
<script type="text/javascript" src="/vendor-react-d37d9c657a271200d9cf.js"></script>
<script type="text/javascript" src="/vendor-react-common-39f644b98f3af612d766.js"></script>
<script type="text/javascript" src="/vendor-common-4c04532899aecf03d14c.js"></script>
<script type="text/javascript" src="/vendor-log-feb99cf7b041c7e3b64d.js"></script>
<script type="text/javascript" src="/router-b4aa7bff56fd79446adc.js"></script>
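This is a React single-page application (SPA) that relies heavily on JavaScript. requests downloads only the initial HTML; it never fetches or executes the router script or any of the others. The content you are looking for is produced by that JavaScript modifying the DOM after the page loads, so it is absent from response.text.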
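A quick way to check for this situation from Python is to count the script tags and the visible text in the HTML you received. The following is a minimal sketch using only the standard library; the names ShellProbe and probe and the SAMPLE markup (including the /runtime.js and /router.js paths and the root div) are made up for illustration and are not the real page source.

```python
from html.parser import HTMLParser


class ShellProbe(HTMLParser):
    """Count <script> tags and visible text characters in an HTML document."""

    def __init__(self):
        super().__init__()
        self.scripts = 0
        self.text_chars = 0
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.scripts += 1
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Only text outside <script>/<style> counts as visible content.
        if not self._skip_depth:
            self.text_chars += len(data.strip())


def probe(html):
    """Return (number of script tags, number of visible text characters)."""
    parser = ShellProbe()
    parser.feed(html)
    parser.close()
    return parser.scripts, parser.text_chars


# Hypothetical stand-in for a JS-rendered page: an empty mount point plus scripts.
SAMPLE = (
    '<html><body><div id="root"></div>'
    '<script type="text/javascript" src="/runtime.js"></script>'
    '<script type="text/javascript" src="/router.js"></script>'
    '</body></html>'
)

print(probe(SAMPLE))
```

Passing response.text from the question to probe would be the real check: several scripts and almost no visible text suggest the page is assembled in the browser.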
Consider using Selenium to crawl this page: it drives a real browser, so the necessary JavaScript is evaluated before you read the DOM.
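A rough sketch of that approach, assuming Selenium 4 and a locally installed recent Chrome. The function name fetch_rendered_html and its ready_selector parameter are hypothetical; you would need to pick a CSS selector for an element that only appears once the content has rendered, for example by inspecting the page in the browser's developer tools.

```python
def fetch_rendered_html(url, ready_selector, timeout=15):
    """Return the page's HTML after the browser has run its JavaScript.

    ready_selector is a CSS selector for an element that exists only once
    the content has rendered; the caller has to supply it.
    """
    # Imported inside the function so this sketch can be loaded
    # even on a machine where Selenium is not installed yet.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = webdriver.ChromeOptions()
    # Run without a visible window (assumption: a recent Chrome version).
    options.add_argument("--headless=new")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Block until the chosen element is present in the DOM, or time out.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, ready_selector))
        )
        return driver.page_source
    finally:
        # Always close the browser, even if the wait times out.
        driver.quit()
```

The returned string can be handed to BeautifulSoup(html, 'lxml') exactly as in the question, so the rest of the parsing code does not need to change.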
Answered By - J_H