Issue
import requests
from bs4 import BeautifulSoup
import pandas as pd
import re
start_url = 'https://www.example.com'
downloaded_html = requests.get(start_url)
soup = BeautifulSoup(downloaded_html.text, "lxml")
full_header = soup.select('div.reference-image')
full_header
The output of the above code is:
[<div class="reference-image"><img src="Content/image/all/reference/c101.jpg"/></div>,
<div class="reference-image"><img src="Content/image/all/reference/c102.jpg"/></div>,
<div class="reference-image"><img src="Content/image/all/reference/c102.jpg"/></div>]
I would like to extract the img src values, as below:
["Content/image/all/reference/c101.jpg",
"Content/image/all/reference/c102.jpg",
"Content/image/all/reference/c102.jpg"]
How can I extract it?
Solution
To get that, iterate over the result and read the src attribute of the img tag inside each div:
img_srcs = []
for i in full_header:
    img_srcs.append(i.find('img')['src'])
This gives:
['Content/image/all/reference/c101.jpg', 'Content/image/all/reference/c102.jpg', 'Content/image/all/reference/c102.jpg']
Here is a one-liner for this:
img_srcs = [i.find('img')['src'] for i in full_header]
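A variation that is not part of the original answer: if some div.reference-image has no img inside it, i.find('img') returns None and the subscript raises a TypeError. A minimal sketch that avoids this selects the img tags directly with a CSS selector. The sample HTML is copied from the question with one empty div added for illustration, "html.parser" is used in place of "lxml" only so the snippet runs without an extra dependency, and joining against the site root is an assumption about how the relative paths resolve.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Sample markup based on the output shown in the question; the last,
# empty div is an added assumption to demonstrate the missing-img case.
html = """
<div class="reference-image"><img src="Content/image/all/reference/c101.jpg"/></div>
<div class="reference-image"><img src="Content/image/all/reference/c102.jpg"/></div>
<div class="reference-image"></div>
"""
soup = BeautifulSoup(html, "html.parser")

# Match only <img> tags that carry a src attribute inside the target divs,
# so divs without an image are skipped instead of raising an error.
img_srcs = [img["src"] for img in soup.select("div.reference-image img[src]")]

# Optionally resolve the relative paths against the page URL
# (assumed here to be the site root from the question).
base_url = "https://www.example.com/"
absolute_srcs = [urljoin(base_url, src) for src in img_srcs]

print(img_srcs)
print(absolute_srcs)
```

The selector does the filtering, so no explicit None check is needed. The urljoin step may be skipped if only the relative paths are wanted.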
Answered By - Joshua