Issue
I'm using Scrapy to crawl a website. This is how I maintain the cookie jar after login:
def start_requests(self):
    return [scrapy.Request("https://www.address.com",
                           meta={'cookiejar': 1},
                           callback=self.post_login)]

def post_login(self, response):
    print('Preparing login')
    return [FormRequest.from_response(response,  # "http://www.zhihu.com/login",
        meta={'cookiejar': response.meta['cookiejar']},
        headers=self.headers,
        formdata={
            'username': 'user',
            'password': 'pass123'
        },
        callback=self.after_login,
    )]
Then, for each subsequent request, I need to pass the cookie jar along:

yield scrapy.Request(curr, meta={'cookiejar': response.meta['cookiejar']}, callback=self.parse_detail)
Everything goes well until I need to crawl images from the site. To open the image URL I would need urllib.request.urlretrieve(), Scrapy's ImagesPipeline, or a similar tool.
But how can I pass my cookie jar along with it? Otherwise the request will be redirected to the login page.
Or is there a way to download the image directly with a Scrapy request?
Thank you, eLRuLL, for solving the problem for me. Note that the code needs a small modification for Python 3:
use from io import BytesIO instead of from StringIO import StringIO, and then use BytesIO in the code below.
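For reference, a minimal sketch of why that change is needed: in Scrapy, response.body is bytes, and BytesIO wraps bytes in a file-like object (just as StringIO wrapped str in Python 2) before handing it to a consumer such as PIL's Image.open. The payload below is just the PNG file signature standing in for a real response body:

```python
from io import BytesIO

# Stand-in for response.body: the 8-byte PNG file signature.
payload = b'\x89PNG\r\n\x1a\n'
buf = BytesIO(payload)

# BytesIO supports the file protocol (read, seek, tell), which is
# what file-consuming APIs like PIL's Image.open expect.
print(buf.read(4) == b'\x89PNG')  # True
buf.seek(0)
print(buf.read() == payload)      # True
```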
Solution
response.body already contains the raw bytes you need; you can parse it later into whatever it actually is.
I am not entirely sure this will work for every image file type, but you can check response.headers['Content-Type']
to find out which file type it really is and use the appropriate Python module to handle it:
from PIL import Image
from StringIO import StringIO  # Python 2; on Python 3 use: from io import BytesIO

...

def parse_image(self, response):
    # Wrap the raw response bytes in a file-like object for PIL
    i = Image.open(StringIO(response.body))
    i.save("imagefile.png")

...
With that you made a Scrapy request (which carries your cookie jar like any other request) and saved the image (it is saved in the same directory as your project).
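If you would rather not hard-code the .png extension, the Content-Type header can be mapped to a file extension with the standard-library mimetypes module. A small sketch (the header value is shown as the bytes Scrapy returns; the helper name is illustrative):

```python
import mimetypes

def extension_for(content_type):
    # Scrapy header values are bytes, e.g. b'image/png; charset=binary'.
    # Strip any parameters after ';' and look up a matching extension.
    mime = content_type.decode('ascii').split(';')[0].strip()
    return mimetypes.guess_extension(mime) or '.bin'

print(extension_for(b'image/png'))  # .png
```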
Install PIL with pip install Pillow.
Answered By - eLRuLL