Issue
I am trying to speed up web scraping by removing blocking I/O, so I decided to switch from the requests package to aiohttp.
Unfortunately, after switching to aiohttp, websites built with Angular give me responses without the dynamic content.
So I have the following two questions:
- Why does the requests module give me properly rendered content, even though it doesn't run JavaScript like Selenium, while aiohttp does not?
- How can I fix the code to get the proper content with aiohttp?
import aiohttp
import asyncio
import requests

URL = 'https://justjoin.it/'

async def fetch_async(url):
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as resp:
            return await resp.text()

async def main():
    content_async = await fetch_async(URL)
    content_requests = requests.get(URL).text
    print('Are equal: ', content_async == content_requests)

loop = asyncio.get_event_loop()
try:
    loop.run_until_complete(main())
finally:
    loop.close()
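For context, the two libraries identify themselves with different default User-Agent strings, which turns out to matter here. A quick, offline way to see what each client sends when no header is set explicitly (a sketch, assuming both packages are installed):

```python
import aiohttp.http
import requests.utils

# Default User-Agent strings each library sends when none is set explicitly.
print('requests default UA:', requests.utils.default_user_agent())  # e.g. python-requests/2.x.y
print('aiohttp default UA: ', aiohttp.http.SERVER_SOFTWARE)         # e.g. Python/3.x aiohttp/3.x.y
```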
Solution
I've solved my problem. Iain suggested that I investigate the headers sent to the server, and after playing with the headers I discovered that the returned content depends on the User-Agent.
When I sent an aiohttp request with 'User-Agent': 'python-requests/2.22.0' I got rendered content, and the same for 'Google Bot', but if the User-Agent was set to 'Python/3.6 aiohttp/3.6.2' or 'Firefox' I got non-rendered content.
So for some user agents the server performs server-side rendering.
Solution:
async def fetch_async(url):
    async with aiohttp.ClientSession() as session:
        async with session.get(url, headers={'User-Agent': 'python-requests/2.22.0'}) as resp:
            print('AIOHTTP headers: ', dict(resp.request_info.headers))
            return await resp.text()
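A variant of the same fix, shown here as a sketch: aiohttp also accepts a headers dict on the ClientSession itself, so the User-Agent can be set once and inherited by every request made through that session, instead of being repeated per request.

```python
import asyncio
import aiohttp

# Sketch: set the User-Agent once at the session level; every request
# made through this session inherits the header automatically.
async def fetch_async(url):
    headers = {'User-Agent': 'python-requests/2.22.0'}
    async with aiohttp.ClientSession(headers=headers) as session:
        async with session.get(url) as resp:
            return await resp.text()
```

This keeps the per-request calls clean when the same spoofed User-Agent is needed for many URLs.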
Answered By - Bartosz