Issue
I'm using Scrapy to crawl a website. This is how I maintain the cookie jar after login:
def start_requests(self):
    return [scrapy.Request("https://www.address.com",
                           meta={'cookiejar': 1},
                           callback=self.post_login)]

def post_login(self, response):
    print('Preparing login')
    return [FormRequest.from_response(response,  # "http://www.zhihu.com/login",
        meta={'cookiejar': response.meta['cookiejar']},
        headers=self.headers,
        formdata={
            'username': 'user',
            'password': 'pass123'
        },
        callback=self.after_login,
    )]
Then, for each subsequent request, I need to pass the cookie jar along:

yield scrapy.Request(curr, meta={'cookiejar': response.meta['cookiejar']}, callback=self.parse_detail)
Everything goes well until I need to crawl images from the site. To open the image URL I would need urllib.request.urlretrieve(), Scrapy's ImagesPipeline, or a similar tool.
But how can I pass my cookie jar along with it? Otherwise the request will be redirected to the login page.
Or is there a way to download the image directly with a Scrapy request?
Thank you, eLRuLL, for solving the problem for me. Note that the code needs a small modification for Python 3:
use from io import BytesIO instead of from StringIO import StringIO, and then use BytesIO in the code below.
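For reference, a minimal sketch of why that change is needed: in Scrapy, response.body is bytes, and BytesIO wraps bytes in a file-like object (just as StringIO wrapped str in Python 2) before handing it to a consumer such as PIL's Image.open. The payload below is just the PNG file signature standing in for a real response body:

```python
from io import BytesIO

# Stand-in for response.body: the 8-byte PNG file signature.
payload = b'\x89PNG\r\n\x1a\n'
buf = BytesIO(payload)

# BytesIO supports the file protocol (read, seek, tell), which is
# what file-consuming APIs like PIL's Image.open expect.
print(buf.read(4) == b'\x89PNG')  # True
buf.seek(0)
print(buf.read() == payload)      # True
```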
Solution
response.body already contains the raw bytes you need; you can parse it later into whatever it actually is.
I am not entirely sure this will work for every image file type, but you can check response.headers['Content-Type']
to find out which file type it really is and use the appropriate Python module to handle it:
from PIL import Image
from StringIO import StringIO  # Python 2; on Python 3 use: from io import BytesIO

...

def parse_image(self, response):
    # Wrap the raw response bytes in a file-like object for PIL
    i = Image.open(StringIO(response.body))
    i.save("imagefile.png")

...
With that you made a Scrapy request (which carries your cookie jar like any other request) and saved the image (it is saved in the same directory as your project).
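If you would rather not hard-code the .png extension, the Content-Type header can be mapped to a file extension with the standard-library mimetypes module. A small sketch (the header value is shown as the bytes Scrapy returns; the helper name is illustrative):

```python
import mimetypes

def extension_for(content_type):
    # Scrapy header values are bytes, e.g. b'image/png; charset=binary'.
    # Strip any parameters after ';' and look up a matching extension.
    mime = content_type.decode('ascii').split(';')[0].strip()
    return mimetypes.guess_extension(mime) or '.bin'

print(extension_for(b'image/png'))  # .png
```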
Install PIL with pip install Pillow.
Answered By - eLRuLL