Issue
So, I'm developing a program to download some images from websites, and I have to somehow get the "src" attribute of the img tag. I was able to do this with Selenium, but I had to adapt the code and now I'm using BeautifulSoup4 and lxml. I currently have the whole source code of the page in a variable mystr, and I wanted to give an XPath and find it inside that variable. Is that possible? (Probably.) The reason I'm posting this question is that I can't seem to parse the variable with lxml and use its .xpath() function.
--READ FOR MORE CONTEXT OF THE PROBLEM-- I'm reading some data from an Excel file (reference values and URLs). I want to open each URL, download the product image, and rename it to its reference. I can already do this when the page has multiple images, but when the URL only has one image I wanted to use an XPath to download it, and I didn't want to use Selenium again.
Thanks in advance. I think this is the part of the code that matters for this question.
try:  # Extract the HTML
    fp = urllib.request.urlopen(links[i])
    mybytes = fp.read()
    mystr = mybytes.decode("utf8")
    fp.close()
except Exception as ex:  # HTML exception
    print("Could not extract the HTML from this URL")
    erros.append(i)
    continue

try:  # Pass it to Beautiful Soup 4
    soup = BeautifulSoup(mystr, "lxml")
    #print(mystr, file=open("teste.txt", "a"))
except Exception as ex:  # Beautiful Soup 4 exception
    print("Could not convert the HTML to bs4\n\n" + str(ex))
    erros.append(i)
    continue

try:  # Navigate to the DIV inside the extracted HTML
    main_div = soup.find_all("div", {"id": div_id})
    if len(main_div) == 0:
        parser = etree.HTMLParser()
        tree = etree.parse(mybytes, parser)
        #print(tree, file=open("tree.txt", "a"))
        #image = tree.xpath('//*[@id="image"]')
        image = tree.xpath("/html/body/div[1]/div/div/div/div[1]/div[1]/div[1]/a/img")
        print(image[0].tag)
        #input("--------------------------------------------------")
except Exception as ex:  # Exception if there is no DIV with the given ID in the extracted HTML
    print("No DIV with the given ID exists\n\n" + str(ex))
    erros.append(i)
    continue
Solution
One BeautifulSoup way:

img_src = soup.find("img")["src"]
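Note that soup.find("img") returns None when the page has no img tag at all, so a slightly more defensive sketch of the same idea (the variable names here are only illustrative) could be:

img_tag = soup.find("img")                     # first <img> in the document, or None
if img_tag is not None and img_tag.has_attr("src"):
    img_src = img_tag["src"]
else:
    img_src = None                             # no usable image on this page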
One lxml etree way:

img_src = tree.xpath('//img')[0].attrib.get('src')
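This assumes tree was parsed successfully. The question's etree.parse(mybytes, parser) call is the likely failure point, since etree.parse expects a filename or file-like object rather than a bytes value. A minimal sketch of building the tree from the already-decoded mystr and then saving the image under its reference; the reference variable and the .jpg extension are assumptions, standing in for the value read from the Excel file:

import urllib.parse
import urllib.request
from lxml import etree

# Parse the HTML held in the string; etree.HTML returns the root element,
# which exposes .xpath() directly.
tree = etree.HTML(mystr)

imgs = tree.xpath("//img")  # or the full positional XPath from the question
if imgs:
    img_src = imgs[0].attrib.get("src")
    # The src is often relative, so resolve it against the page URL first.
    absolute_src = urllib.parse.urljoin(links[i], img_src)
    # "reference" is a placeholder for the product reference from the Excel file.
    urllib.request.urlretrieve(absolute_src, reference + ".jpg")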
Answered By - 0buz