Issue
I've been scraping protected pages on a website (www.cardsphere.com) with requests, using a session, like so:
import requests

payload = {
    'email': <enter-email-here>,
    'password': <enter-site-password-here>
}

with requests.Session() as request:
    request.get(<site-login-page>)
    request.post(<site-login-here>, data=payload)
    request.get(<site-protected-page1>)
    # save-stuff-from-page1
    request.get(<site-protected-page2>)
    # save-stuff-from-page2
    # ...
    request.get(<site-protected-pageN>)
    # save-stuff-from-pageN
Since there are quite a few pages, I wanted to speed this up with aiohttp + asyncio, but I'm missing something. I've more or less managed to use it to scrape unprotected pages, like so:
import asyncio
import aiohttp

async def get_cards(url):
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as resp:
            data = await resp.text()
            # <do-stuff-with-data>

urls = [
    'https://www.<url1>.com',
    'https://www.<url2>.com',
    # ...
    'https://www.<urlN>.com'
]

loop = asyncio.get_event_loop()
loop.run_until_complete(
    asyncio.gather(
        *(get_cards(url) for url in urls)
    )
)
That gave some results, but how do I do it for pages that require login? I tried adding session.post(<login-url>, data=payload) inside the async function, but that obviously didn't work out well: it just keeps logging in on every call. Is there a way to "set up" an aiohttp ClientSession before the loop function? I need to log in first and then, on the same session, get data from a bunch of protected links with asyncio + aiohttp.
I'm still rather new to Python, and to async even more so, so I'm missing some key concept here. If anybody could point me in the right direction, I'd greatly appreciate it.
Solution
This is the simplest I can come up with. The key is to create a single ClientSession, log in once, and pass that same session to every task; the semaphore limits how many requests run at once. Depending on what you do in <do-stuff-with-data>, you may run into other concurrency troubles, and down the rabbit hole you go... Just kidding. It's a little more complicated to wrap your head around coroutines, futures, and tasks, but once you get it, it's as simple as sequential programming.
import asyncio
import aiohttp

async def get_cards(url, session, sem):
    # The semaphore caps how many requests are in flight at once.
    async with sem, session.get(url) as resp:
        data = await resp.text()
        # <do-stuff-with-data>
        return data

urls = [
    'https://www.<url1>.com',
    'https://www.<url2>.com',
    'https://www.<urlN>.com'
]

async def main():
    sem = asyncio.Semaphore(100)
    # One session for everything: cookies set at login are reused by later requests.
    async with aiohttp.ClientSession() as session:
        await session.get('auth_url')
        # 'auth_url' and the None values are placeholders for the real login URL and credentials.
        await session.post('auth_url', data={'user': None, 'pass': None})
        tasks = [asyncio.create_task(get_cards(url, session, sem)) for url in urls]
        results = await asyncio.gather(*tasks)
        return results

asyncio.run(main())
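To see what the semaphore and gather do without touching the network, here is a minimal sketch using only the standard library. It is not part of the original answer: fake_fetch, crawl, the stats dict, and the example.invalid URLs are all hypothetical stand-ins, and asyncio.sleep takes the place of session.get(url).

```python
import asyncio

async def fake_fetch(url, sem, stats):
    # Hypothetical stand-in for session.get(url): sleeps instead of doing network I/O.
    async with sem:
        stats["active"] += 1
        stats["peak"] = max(stats["peak"], stats["active"])
        await asyncio.sleep(0.01)
        stats["active"] -= 1
        return f"body of {url}"

async def crawl(urls, limit):
    # Create the semaphore inside the running event loop.
    sem = asyncio.Semaphore(limit)
    stats = {"active": 0, "peak": 0}
    tasks = [asyncio.create_task(fake_fetch(u, sem, stats)) for u in urls]
    # gather returns results in the same order as the tasks were passed in.
    results = await asyncio.gather(*tasks)
    return results, stats["peak"]

# Made-up URLs, used only as labels here.
urls = [f"https://example.invalid/page{i}" for i in range(10)]
results, peak = asyncio.run(crawl(urls, limit=3))
```

All ten tasks are started right away, but the number running inside the `async with sem` block at the same time never exceeds the limit, and results line up with urls by position. With a real aiohttp session the shape should be the same: swap the sleep for the request and pass the logged-in session in alongside the semaphore.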
Answered By - Dalvenjia