Issue
In a private project (learning Python scripting), I needed to retrieve only the RPM package from the scraped page. I noticed that all package links (.msi, .deb, .rpm) have an attribute called data-link inside the 'a' tag.
I also tailored my own regex (https://regexr.com/6rqd2) to match only the package I need.
According to the Beautiful Soup documentation, this kind of attribute (data-*) has a name that cannot be used as a keyword argument in a search.
So I tried passing the attrs argument to find_all() instead, but with no success.
Unsuccessful code below:
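#!/usr/bin/env python3
import re
import requests  # missing in the original script; requests.get() is used below
from bs4 import BeautifulSoup
url = "https://www.splunk.com/en_us/download/splunk-enterprise.html"
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
# Beautiful Soup searches this regex against the attribute value alone, so the
# lookbehind for data-link=" can never match and nothing is found
pattern = re.compile("(?<=data-link=\")[^ ]+rpm")
package = soup.find_all(attrs={"data-link": pattern})
print(package)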
Thank you in advance for your help
Solution
Another solution, using CSS selectors:
import requests
from bs4 import BeautifulSoup
url = "https://www.splunk.com/en_us/download/splunk-enterprise.html"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
for a in soup.select('a[data-link$=".rpm"]'):
    print(a["data-link"])
Prints:
https://download.splunk.com/products/splunk/releases/9.0.0.1/linux/splunk-9.0.0.1-9e907cedecb1-linux-2.6-x86_64.rpm
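For comparison, the find_all() approach from the question can probably be made to work too, as long as the regex describes only the attribute value and not the surrounding data-link="..." markup. The following is a minimal sketch under that assumption. The HTML snippet and the example.com URLs are made up for illustration, so it runs without fetching the real page.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical stand-in for the download page (not the real Splunk markup)
html = """
<a data-link="https://example.com/pkg-1.0-x86_64.msi">Windows</a>
<a data-link="https://example.com/pkg-1.0-amd64.deb">Debian</a>
<a data-link="https://example.com/pkg-1.0-x86_64.rpm">RPM</a>
"""

soup = BeautifulSoup(html, "html.parser")

# The pattern is searched within the attribute value only,
# so it must not include the data-link=" prefix
pattern = re.compile(r"\.rpm$")

# Collect the data-link value of every <a> tag whose value ends in .rpm
rpm_links = [a["data-link"] for a in soup.find_all("a", attrs={"data-link": pattern})]
print(rpm_links)
```

On the real page, the same pattern would be passed to find_all() on the soup built from requests.get(url).content.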
Answered By - Andrej Kesely