Issue
----- EDIT ---- Rewrote the topic + content based on previous findings
I am scraping using a proxy service that rotates my IP. In order to obtain a new IP, the connection to the proxy service needs to be closed, and a new one opened with the next request.
For instance, if I go to http://ipinfo.io/ip with Chrome through my proxy service, refreshing the page gives me the same IP, while closing Chrome, reopening it and sending a new request rotates the IP.
Similarly, sending several curl commands provides a new IP each time, as the connection is closed between them. For instance, sending several consecutive curl -x proxy_address:proxy_port ipinfo.io/ip commands gave me:
38.15.135.170
144.168.222.130
45.72.34.109
With Scrapy now, I don't understand how to forcefully close/reopen the session between requests.
Let's say I am using the following spider, which sends http and curl requests to ipinfo.io/ip (the proxy is set up in the middleware):
import scrapy


class IpSpider(scrapy.Spider):
    name = "ip"
    use_proxy = True
    http = 0
    curl = 0

    def start_requests(self):
        yield scrapy.Request(
            "http://ipinfo.io/ip",
            callback=self.parse_ip,
            dont_filter=True
        )
        yield scrapy.Request.from_curl(
            "curl ipinfo.io/ip",
            callback=self.parse_curl_ip,
            dont_filter=True
        )

    def parse_ip(self, response):
        self.logger.info(f"http {response.body}")
        if self.http < 9:
            self.http += 1
            yield scrapy.Request(
                "http://ipinfo.io/ip",
                callback=self.parse_ip,
                dont_filter=True
            )

    def parse_curl_ip(self, response):
        self.logger.info(f"curl {response.body}")
        if self.curl < 9:
            self.curl += 1
            yield scrapy.Request.from_curl(
                "curl ipinfo.io/ip",
                callback=self.parse_curl_ip,
                dont_filter=True
            )
I would expect a different IP on each request, but I get:
2022-09-04 13:07:10 [ip] INFO: curl b'143.137.164.56'
2022-09-04 13:07:10 [ip] INFO: http b'161.0.28.170'
2022-09-04 13:07:10 [ip] INFO: curl b'143.137.164.56'
2022-09-04 13:07:11 [ip] INFO: curl b'143.137.164.56'
2022-09-04 13:07:11 [ip] INFO: http b'161.0.28.170'
2022-09-04 13:07:11 [ip] INFO: http b'161.0.28.170'
2022-09-04 13:07:12 [ip] INFO: curl b'143.137.164.56'
This is very similar to the kind of results I would get if I were using requests.Session: as a Session is persistent, I would need to create a new one for each request (which is not so straightforward, but easily doable).
The thing is that Scrapy does not seem to implement anything like requests.Session, hence I can't find how to renew the session.
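For comparison, here is a minimal sketch of what I mean with requests (the proxy address and loop count are placeholders): creating a brand-new Session, and therefore a brand-new connection, for every call is what makes the proxy rotate the IP.

import requests

PROXY = "http://proxy_address:proxy_port"  # placeholder, same proxy as above

def fetch_ip_with_fresh_session():
    # A new Session means a new connection pool, so the proxy sees a new
    # connection and should assign a new exit IP.
    with requests.Session() as session:
        session.proxies = {"http": PROXY, "https": PROXY}
        return session.get("http://ipinfo.io/ip", timeout=10).text.strip()

for _ in range(3):
    print(fetch_ip_with_fresh_session())  # expected: a different IP each time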
-------- EDIT --------
After testing @gangabass's answer, I tried to call several other websites before going back to the IP check. While it did not work, it produced some surprising results: I misspelled one of the websites, ending in a 404 response. What's surprising is that while this killed the http loop, the curl loop, which was not modified, "recycled" the http context: from that moment on, the curl requests alternately returned the first IP it got and the IP from the first http request that went through...
2022-09-04 19:40:28 [ip] INFO: curl b'209.127.104.51'
2022-09-04 19:40:28 [ip] INFO: http b'141.193.20.232'
2022-09-04 19:40:28 [ip] INFO: curl b'209.127.104.51'
2022-09-04 19:40:29 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <404 http://www.lefigaro.fr>: HTTP status code is not handled or not allowed
2022-09-04 19:40:29 [ip] INFO: curl b'209.127.104.51'
2022-09-04 19:40:29 [ip] INFO: curl b'141.193.20.232'
2022-09-04 19:40:30 [ip] INFO: curl b'209.127.104.51'
----- EDIT 2 -----
From the documentation, I understand (wrongly?) that the cookiejar may help:
"There is support for keeping multiple cookie sessions per spider by using the cookiejar Request meta key. By default it uses a single cookie jar (session), but you can pass an identifier to use different ones."
So I tried including meta={'cookiejar': self.http} in my yield, in order to create a cookie jar per request, but with no luck.
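For reference, this is a sketch of that attempt in parse_ip (not a working fix): each request gets a distinct cookiejar identifier, but the underlying TCP connection is still reused, so the IP does not change.

    def parse_ip(self, response):
        self.logger.info(f"http {response.body}")
        if self.http < 9:
            self.http += 1
            yield scrapy.Request(
                "http://ipinfo.io/ip",
                callback=self.parse_ip,
                dont_filter=True,
                # A distinct cookie jar per request isolates cookies only;
                # it does not force a new connection to the proxy.
                meta={"cookiejar": self.http},
            )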
----- EDIT 3 -----
What looks like a promising way is using https://github.com/ThomasAitken/scrapy-sessions (though it is not maintained anymore). This library includes a specific middleware and should extend the spider with a .sessions.clear() method (among others).
Though I guess I do not understand how to use it, as I get an error message:
AttributeError: 'IpSpider' object has no attribute 'sessions'
I guess I should reference it somewhere else than just adding the middleware, but I do not understand how or where.
Solution
---- Solution 1 ----
scrapy-playwright offers the ability to ask for a new browser context from the request. I had actually tested it before, but I got confused by how my middleware was injecting the proxy into the request: the injection worked with standard requests but not with scrapy-playwright (and I did not recognize my own IP...).
The meta should be:
meta={
    "playwright": True,
    "playwright_context": "some_unique_new_string",
    "playwright_context_kwargs": {
        "proxy": {
            "server": "http://proxy_ip:proxy_port",
            "username": "user",
            "password": "pass",
        },
    },
},
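Put together, a spider using this per-request context trick might look like the sketch below (assuming scrapy-playwright is already configured in the project settings; the uuid-based context naming and the proxy credentials are placeholders, not part of the original setup):

import uuid

import scrapy


class IpSpider(scrapy.Spider):
    name = "ip"

    def start_requests(self):
        for _ in range(10):
            yield scrapy.Request(
                "http://ipinfo.io/ip",
                callback=self.parse_ip,
                dont_filter=True,
                meta={
                    "playwright": True,
                    # A fresh, unique context name per request makes
                    # scrapy-playwright open a new browser context, and
                    # therefore a new connection through the proxy.
                    "playwright_context": str(uuid.uuid4()),
                    "playwright_context_kwargs": {
                        "proxy": {
                            "server": "http://proxy_ip:proxy_port",  # placeholder
                            "username": "user",
                            "password": "pass",
                        },
                    },
                },
            )

    def parse_ip(self, response):
        self.logger.info(f"http {response.body}")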
The main drawback of this solution is that I am forced to use Playwright. While it's a great tool and offers JS interpretation, it's way heavier than the standard Scrapy downloader (I have cases where a page takes a minute to fetch with Playwright and under 5 seconds with the normal downloader).
---- Solution 2 ----
The second solution is a bit more complex. The way the default downloader opens connections is set in scrapy.core.downloader.handlers.http11.py. More specifically, sessions are managed through the HTTPConnectionPool (which itself comes from twisted.web.client), and it is set by default, with no possibility of parametrization, to persistent=True.
Hence it is possible to create a custom downloader:
- copy the whole http11.py
- paste it into a handler.py (the name is for clarity, but you can choose whatever you wish) that you'll put in the same folder as middlewares.py & settings.py
- change the HTTP11DownloadHandler class to some custom name, such as CustomDownloadHandler
- change self._pool = HTTPConnectionPool(reactor, persistent=True) to self._pool = HTTPConnectionPool(reactor, persistent=False)
- reference it in the DOWNLOAD_HANDLERS of the settings.py file
It should look like:
DOWNLOAD_HANDLERS = {
    "http": "[scraper_name].handler.CustomDownloadHandler",
    "https": "[scraper_name].handler.CustomDownloadHandler",
}
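If you prefer not to copy the whole file, a lighter variant (a sketch of my own, not the step-by-step above) is to subclass HTTP11DownloadHandler and only swap the pool, assuming a Scrapy version where the handler keeps its connection pool in self._pool:

# handler.py -- alternative sketch: subclass instead of copying http11.py
from scrapy.core.downloader.handlers.http11 import HTTP11DownloadHandler
from twisted.internet import reactor
from twisted.web.client import HTTPConnectionPool


class CustomDownloadHandler(HTTP11DownloadHandler):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Replace the default persistent pool: with persistent=False every
        # request opens (and closes) its own connection, so the rotating
        # proxy should hand out a new IP each time.
        self._pool = HTTPConnectionPool(reactor, persistent=False)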
There are two main issues here:
- DOWNLOAD_HANDLERS can only accept one handler per scheme (http, https, ...). If you want to use playwright on top of this custom handler, well, you can't, at least until Scrapy does what is suggested here.
- Every new request will be sent over a new connection. While that matches the expectation for a rotating proxy, it can be problematic if you'd like to maintain the connection across some consecutive requests. I tried to include a specific parameter in meta to call the close() method on demand, but it does not actually reset the pool of connections (and anyway, it would close all connections and not only the current one).
Answered By - samuel guedon