Issue
I tried searching for this here but couldn't find an answer to be honest as this should be fairly easy to do with Selenium but since performance is an important factor, I was thinking on doing this with Beautifulsoup instead.
Scenario: I need to scrape the prices of different items which are generated in a random fashion depending on user input, see code below:
<div class="sk-expander-content" style="display: block;">
<ul>
<li>
<span>Third Party Liability</span>
<span>€756.62</span>
</li>
<li>
<span>Fire & Theft</span>
<span>€15.59</span>
</li>
</ul>
</div>
If these options were static and would always be displayed in the same position within the html, it would be easy to scrape the prices but since these could be placed anywhere within the div sk-expander-content
, I'm not sure how to find these in a dynamic way.
The best approach would be to write a method to pass in the text of the span we are looking for and return the value in Euro. The structure of the span tags is always the same, the first span is always the name of the item and the second one is always the price.
The first thing that came to mind is the following code, but I'm not sure if this is even robust enough or if it makes sense:
html = driver.page_source
soup = BeautifulSoup(html, "html.parser")
div_i_need = soup.find_all("div", class_="sk-expander-content")[1]
def price_scraper(text_to_find):
for el in div_i_need.find_all(['ul', 'li', 'span']):
if el.name == 'span':
if el[0].text == text_to_find:
return(el[1].text)
Your help will be much appreciated.
Solution
Use regular expression.
import re
html='''<div class="sk-expander-content" style="display: block;">
<ul>
<li>
<span>Third Party Liability</span>
<span>€756.62</span>
</li>
<li>
<span>Fire & Theft</span>
<span>€15.59</span>
</li>
</ul>
</div>
<div class="sk-expander-content" style="display: block;">
<ul>
<li>
<span>Fire & Theft</span>
<span>€756.62</span>
</li>
<li>
<span>Third Party Liability</span>
<span>€15.59</span>
</li>
</ul>
</div>'''
soup = BeautifulSoup(html, "html.parser")
for item in soup.find_all(class_="sk-expander-content"):
for span in item.find_all('span',text=re.compile("€(\d+).(\d+)")):
print(span.find_previous_sibling('span').text)
print(span.text)
Output:
Third Party Liability
€756.62
Fire & Theft
€15.59
Fire & Theft
€756.62
Third Party Liability
€15.59
UPDATE:
If you want to get first node value.Then use find()
instead of find_all()
.
import re
html='''<div class="sk-expander-content" style="display: block;">
<ul>
<li>
<span>Third Party Liability</span>
<span>€756.62</span>
</li>
<li>
<span>Fire & Theft</span>
<span>€15.59</span>
</li>
</ul>
</div>
<div class="sk-expander-content" style="display: block;">
<ul>
<li>
<span>Fire & Theft</span>
<span>€756.62</span>
</li>
<li>
<span>Third Party Liability</span>
<span>€15.59</span>
</li>
</ul>
</div>'''
soup = BeautifulSoup(html, "html.parser")
for span in soup.find(class_="sk-expander-content").find_all('span',text=re.compile("€(\d+).(\d+)")):
print(span.find_previous_sibling('span').text)
print(span.text)
Answered By - KunduK
0 comments:
Post a Comment
Note: Only a member of this blog may post a comment.