Issue
So, I'm developing a program to download some images from websites, and I have to somehow get the "src" attribute of the img tag. I was able to do this with Selenium, but I had to adapt the code and now I'm using BeautifulSoup4 and lxml. I currently have the whole source code of the page in a variable mystr, and I wanted to give an XPath and find it inside that variable. Is that possible? (Probably.) The reason I'm posting this question is that I can't seem to parse the variable with lxml and use its .xpath() function.
--READ FOR MORE CONTEXT OF THE PROBLEM-- I'm reading some data from an Excel file (reference values and URLs). I want to open each URL, download the product image, and rename it to its reference. I can already do this when the page has multiple images, but when the URL only has one image I wanted to use an XPath to download it, and I didn't want to use Selenium again.
Thanks in advance. I think this is the part of the code that matters for this question.
try:  # Extract the HTML
    fp = urllib.request.urlopen(links[i])
    mybytes = fp.read()
    mystr = mybytes.decode("utf8")
    fp.close()
except Exception as ex:  # HTML exception
    print("Could not extract the HTML from this URL")
    erros.append(i)
    continue

try:  # Pass it to Beautiful Soup 4
    soup = BeautifulSoup(mystr, "lxml")
    #print(mystr, file=open("teste.txt", "a"))
except Exception as ex:  # Beautiful Soup 4 exception
    print("Could not convert the HTML to bs4\n\n" + str(ex))
    erros.append(i)
    continue

try:  # Navigate to the DIV inside the extracted HTML
    main_div = soup.find_all("div", {"id": div_id})
    if len(main_div) == 0:
        parser = etree.HTMLParser()
        tree = etree.parse(mybytes, parser)
        #print(tree, file=open("tree.txt", "a"))
        #image = tree.xpath('//*[@id="image"]')
        image = tree.xpath("/html/body/div[1]/div/div/div/div[1]/div[1]/div[1]/a/img")
        print(image[0].tag)
        #input("--------------------------------------------------")
except Exception as ex:  # Exception if there is no DIV with the given ID in the extracted HTML
    print("No DIV with the given ID exists\n\n" + str(ex))
    erros.append(i)
    continue
Solution
One BeautifulSoup way:

img_src = soup.find("img")["src"]
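Note that soup.find("img") returns None when the page has no img tag at all, so a slightly more defensive sketch of the same idea (the variable names here are only illustrative) could be:

img_tag = soup.find("img")                     # first <img> in the document, or None
if img_tag is not None and img_tag.has_attr("src"):
    img_src = img_tag["src"]
else:
    img_src = None                             # no usable image on this page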
One lxml etree way:

img_src = tree.xpath('//img')[0].attrib.get('src')
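This assumes tree was parsed successfully. The question's etree.parse(mybytes, parser) call is the likely failure point, since etree.parse expects a filename or file-like object rather than a bytes value. A minimal sketch of building the tree from the already-decoded mystr and then saving the image under its reference; the reference variable and the .jpg extension are assumptions, standing in for the value read from the Excel file:

import urllib.parse
import urllib.request
from lxml import etree

# Parse the HTML held in the string; etree.HTML returns the root element,
# which exposes .xpath() directly.
tree = etree.HTML(mystr)

imgs = tree.xpath("//img")  # or the full positional XPath from the question
if imgs:
    img_src = imgs[0].attrib.get("src")
    # The src is often relative, so resolve it against the page URL first.
    absolute_src = urllib.parse.urljoin(links[i], img_src)
    # "reference" is a placeholder for the product reference from the Excel file.
    urllib.request.urlretrieve(absolute_src, reference + ".jpg")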
Answered By - 0buz