Issue
How do I make sure I get a new IP on each Scrapy request? I tried both Stormproxies and Smartproxies, but the IP they give is the same for a whole session.
The IP does change on each run; within a single run, however, it stays the same.
My code below:
import json
import uuid

import scrapy
from scrapy.crawler import CrawlerProcess


class IpTest(scrapy.Spider):
    name = 'IP_test'
    previous_ip = ''
    count = 1
    ip_url = 'https://ifconfig.me/all.json'

    def start_requests(self):
        yield scrapy.Request(
            self.ip_url,
            dont_filter=True,
            meta={
                'cookiejar': uuid.uuid4().hex,
                'proxy': MY_ROTATING_PROXY  # either stormproxy or smartproxy
            }
        )

    def parse(self, response):
        ip_address = json.loads(response.text)['ip_addr']
        self.logger.info(f"IP: {ip_address}")
        if self.count < 10:
            self.count += 1
            yield from self.start_requests()
settings = {
    'DOWNLOAD_DELAY': 1,
    'CONCURRENT_REQUESTS': 1,
}

process = CrawlerProcess(settings)
process.crawl(IpTest)
process.start()
Output logs:
2020-12-27 21:15:52 [scrapy.core.engine] INFO: Spider opened
2020-12-27 21:15:52 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-12-27 21:15:52 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-12-27 21:15:55 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://ifconfig.me/all.json> (referer: None)
2020-12-27 21:15:55 [IP_test] INFO: IP: 190.239.69.94
2020-12-27 21:15:56 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://ifconfig.me/all.json> (referer: https://ifconfig.me/all.json)
2020-12-27 21:15:56 [IP_test] INFO: IP: 190.239.69.94
2020-12-27 21:15:57 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://ifconfig.me/all.json> (referer: https://ifconfig.me/all.json)
2020-12-27 21:15:57 [IP_test] INFO: IP: 190.239.69.94
2020-12-27 21:15:59 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://ifconfig.me/all.json> (referer: https://ifconfig.me/all.json)
2020-12-27 21:15:59 [IP_test] INFO: IP: 190.239.69.94
2020-12-27 21:16:00 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://ifconfig.me/all.json> (referer: https://ifconfig.me/all.json)
2020-12-27 21:16:00 [IP_test] INFO: IP: 190.239.69.94
2020-12-27 21:16:01 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://ifconfig.me/all.json> (referer: https://ifconfig.me/all.json)
2020-12-27 21:16:01 [IP_test] INFO: IP: 190.239.69.94
2020-12-27 21:16:03 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://ifconfig.me/all.json> (referer: https://ifconfig.me/all.json)
2020-12-27 21:16:03 [IP_test] INFO: IP: 190.239.69.94
2020-12-27 21:16:04 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://ifconfig.me/all.json> (referer: https://ifconfig.me/all.json)
2020-12-27 21:16:04 [IP_test] INFO: IP: 190.239.69.94
2020-12-27 21:16:05 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://ifconfig.me/all.json> (referer: https://ifconfig.me/all.json)
2020-12-27 21:16:06 [IP_test] INFO: IP: 190.239.69.94
2020-12-27 21:16:07 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://ifconfig.me/all.json> (referer: https://ifconfig.me/all.json)
2020-12-27 21:16:07 [IP_test] INFO: IP: 190.239.69.94
2020-12-27 21:16:07 [scrapy.core.engine] INFO: Closing spider (finished)
What am I doing wrong here?
I even tried disabling cookies (COOKIES_ENABLED = False) and removing cookiejar from request.meta, but no luck.
Solution
It was hard, but I found the answer. For Storm you need to pass the header 'Connection': 'close'. With it, you get a new proxy IP for each request. For example:
HEADERS = {'Connection': 'close'}
yield Request(url=url, callback=self.parse, body=body, headers=HEADERS)
In this case Storm closes the connection and gives you a new IP per request.
Answered By - Aleks Kisel