Issue
I'm new to scraping. I'm working on a scraping project and trying to get a value from the HTML below:
<div class="buttons_zoom"><div class="full_prod"><a href="javascript:void(0)" onclick="js:getProdID('https://www.XXXXXXX.co.il','{31F93B1D-449F-4AD7-BFB0-97A0A8E068F6}','379104')" title="לחם אחיד פרוס אנג'ל 750 גרם - פרטים נוספים"><img alt="פרטים נוספים" border="0" src="template/images/new_site/icon-view-prod-cartpage.png"/></a></div></div>
I want to get the value 379104, which is located in the onclick attribute. I'm using BeautifulSoup. The code:
for i in page_content.find_all('div', attrs={'class': 'prodPrice'}):
    temp = i.parent.parent.contents[0]
temp holds the HTML above. Can someone help me extract this ID? Thanks!
Edit: Wow guys, thanks for the amazing explanations! But I have two issues. 1. The retry mechanism isn't working. I set timeout=1 in order to make it fail, but once it fails it returns:
requests.exceptions.RetryError: HTTPSConnectionPool(host='www.XXXXX.il', port=443): Max retries exceeded with url: /default.asp?catid=%7B2234C62C-BD68-4641-ABF4-3C225D7E3D81%7D (Caused by ResponseError('too many redirects',))
Can you please help me with the retry mechanism? 2. Performance issues: without the retry mechanism, with timeout=6, scraping 8000 items takes 15 minutes. How can I improve this code's performance? Code below:
def get_items(self, dict):
    itemdict = {}
    for k, v in dict.items():
        boolean = True
        # here, we fetch the content from the url, using the requests library
        while (boolean):
            try:
                a = requests.Session()
                retries = Retry(total=3, backoff_factor=0.1, status_forcelist=[301, 500, 502, 503, 504])
                a.mount(('https://'), HTTPAdapter(max_retries=retries))
                page_response = a.get('https://www.XXXXXXX.il' + v, timeout=1)
            except requests.exceptions.Timeout:
                print("Timeout occurred")
                logging.basicConfig(level=logging.DEBUG)
            else:
                boolean = False
        # we use the html parser to parse the url content and store it in a variable.
        page_content = BeautifulSoup(page_response.content, "html.parser")
        for i in page_content.find_all('div', attrs={'class': 'prodPrice'}):
            parent = i.parent.parent.contents[0]
            getparentfunc = parent.find("a", attrs={"href": "javascript:void(0)"})
            itemid = re.search(".*'(\d+)'.*", getparentfunc.attrs['onclick']).groups()[0]
            itemName = re.sub(r'\W+', ' ', i.parent.contents[0].text)
            priceitem = re.sub(r'[\D.]+ ', ' ', i.text)
            itemdict[itemid] = [itemName, priceitem]
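For the two follow-up issues, here is a possible sketch (the host name and helper names below are made up, since the real URL is masked). Retrying status 301 treats every redirect as a failure, which is what raises the "too many redirects" RetryError, so 301 should be dropped from status_forcelist. And creating a new Session per item prevents connection reuse; building one shared session and fetching pages concurrently should cut the runtime considerably.

```python
import logging
from concurrent.futures import ThreadPoolExecutor

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


def make_session():
    # One session reused for every request keeps TCP/TLS connections
    # alive instead of reconnecting for each of the 8000 items.
    session = requests.Session()
    # 301 removed from status_forcelist: a redirect is not a failure,
    # and retrying it is what triggers the "too many redirects" error.
    retries = Retry(total=3, backoff_factor=0.1,
                    status_forcelist=[500, 502, 503, 504])
    session.mount('https://', HTTPAdapter(max_retries=retries))
    return session


def fetch(session, url):
    # Hypothetical helper: returns the page body, or None on failure,
    # instead of looping forever on a repeatedly timing-out URL.
    try:
        return session.get(url, timeout=6).content
    except requests.exceptions.RequestException as exc:
        logging.warning("request failed for %s: %s", url, exc)
        return None


def fetch_all(urls, workers=10):
    # Fetch pages concurrently; BeautifulSoup parsing can then run
    # over the downloaded bodies.
    session = make_session()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda u: fetch(session, u), urls))
```

The worker count of 10 is a starting guess; the right value depends on how many concurrent requests the site tolerates.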
Solution
from bs4 import BeautifulSoup as bs
import re
txt = """<div class="buttons_zoom"><div class="full_prod"><a href="javascript:void(0)" onclick="js:getProdID('https://www.XXXXXXX.co.il','{31F93B1D-449F-4AD7-BFB0-97A0A8E068F6}','379104')" title="לחם אחיד פרוס אנג'ל 750 גרם - פרטים נוספים"><img alt="פרטים נוספים" border="0" src="template/images/new_site/icon-view-prod-cartpage.png"/></a></div></div>"""
soup = bs(txt,'html.parser')
a = soup.find("a", attrs={"href":"javascript:void(0)"})
data = a.attrs['onclick']
r = re.search(r".*'(\d+)'.*", data).groups()[0]
print(r) # will print '379104'
Edit
Replaced ".*\}.*,.*'(\d+)'\).*" with ".*'(\d+)'.*". They produce the same result, but the latter is much cleaner.
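To check the equivalence, both patterns can be run against a sample onclick string (a made-up example, with a placeholder GUID):

```python
import re

# A made-up onclick value in the same shape as the one from the question.
s = "js:getProdID('https://www.example.co.il','{GUID}','379104')"

old = re.search(r".*\}.*,.*'(\d+)'\).*", s).groups()[0]
new = re.search(r".*'(\d+)'.*", s).groups()[0]

print(old, new)  # both patterns capture '379104'
```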
Explanation : Soup
find locates the (first) element with an a tag whose "href" attribute has "javascript:void(0)" as its value. More about Beautiful Soup keyword arguments here.
a = soup.find("a", attrs={"href":"javascript:void(0)"})
This is equivalent to
a = soup.find("a", href="javascript:void(0)")
In older versions of Beautiful Soup, which don’t have the class_ shortcut, you can use the attrs trick mentioned above. Create a dictionary whose value for “class” is the string (or regular expression, or whatever) you want to search for. -- see beautiful soup documentation about "attrs"
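As the quoted documentation mentions, the attrs value can also be a regular expression rather than an exact string. A small sketch (with made-up HTML):

```python
import re
from bs4 import BeautifulSoup

html = '<a href="javascript:void(0)">view</a><a href="/cart">cart</a>'
soup = BeautifulSoup(html, 'html.parser')

# Exact-string match, as used above.
exact = soup.find('a', attrs={'href': 'javascript:void(0)'})

# Regular-expression match: any href starting with "javascript:".
fuzzy = soup.find('a', attrs={'href': re.compile(r'^javascript:')})

print(exact['href'], fuzzy['href'])
```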
a points to an element of type <class 'bs4.element.Tag'>. We can access the tag's attributes as we would a dictionary's, via the property a.attrs (more about that at Beautiful Soup attributes). That's what we do in the following statement.
a_tag_attributes = a.attrs # that's the dictionary of attributes in question...
The dictionary keys are named after the tag's attributes. Here we have the following keys/attribute names: 'title', 'href' and 'onclick'.
We can check that out for ourselves by printing them.
print(a_tag_attributes.keys()) # equivalent to print(a.attrs.keys())
This will output
dict_keys(['title', 'href', 'onclick']) # those are the attributes names (the keys to our dictionary)
From here, we need to get the data we are interested in. The key to our data is "onclick" (it's named after the HTML attribute where the data we seek lies).
data = a_tag_attributes["onclick"] # equivalent to data = a.attrs["onclick"]
data now holds the following string.
"js:getProdID('https://www.XXXXXXX.co.il','{31F93B1D-449F-4AD7-BFB0-97A0A8E068F6}','379104')"
Explanation : Regex
Now that we have isolated the piece that contains the data we want, we're going to extract just the portion we need.
We'll do so by using a regular expression (this site is an excellent resource if you want to know more about Regex, good stuff).
To use regular expressions in Python we must import the re module. More about the "re" module here, good good stuff.
import re
Regex lets us search a string that matches a pattern.
Here the string is our data, and the pattern is ".*'(\d+)'.*" (which is also a string, as you can tell by the use of the double quotes).
You can think of regular expressions as wildcards on steroids. You are probably familiar with wildcard notations such as *.txt to find all text files in a file manager. The regex equivalent is ^.*\.txt$.
Best you read about regular expressions to further understand what it is about. Here's a quick start, good good good stuff.
Here we search for a string. We describe the string as having zero or more characters, followed by some digits (at least one) enclosed in single quotes, followed by some more characters.
The parentheses are used to extract a group (that's called capturing in regex); we capture just the part that's a number.
By placing part of a regular expression inside round brackets or parentheses, you can group that part of the regular expression together. This allows you to apply a quantifier to the entire group or to restrict alternations to part of the regex.
Only parentheses can be used for grouping. Square brackets define a character class, and curly braces are used by a quantifier with specific limits. -- Use Parentheses for Grouping and Capturing
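A quick illustration of grouping and capturing, on a throwaway string:

```python
import re

# The parentheses capture only the digits; the quotes around them
# are part of the match but not part of the captured group.
m = re.search(r"'(\d+)'", "id is '379104' here")

print(m.group(0))   # the whole match, quotes included: '379104' with quotes
print(m.groups())   # only what the parentheses captured: ('379104',)
```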
r = re.search(r".*'(\d+)'.*", data)
Defining the symbols :
.* matches any character (except for line terminators), * means there can be none or infinite amount
' matches the character '
\d+ matches at least one digit (\d is equal to [0-9]); that's the part we capture
(\d+) Capturing Group; this means capture the part of the string where a digit is repeated at least once
() are used for capturing; the parts that match the pattern within the parentheses are saved.
The captured part (if any) can later be accessed by calling r.groups() on the result of re.search (r refers to the match object returned by the re.search call; note that re.search itself returns None when nothing matches). r.groups() returns a tuple containing what was captured.
In our case the first (and only) item of the tuple is the digits...
captured_group = r.groups()[0] # the first (and only) captured group of our tuple
We can now print our data (we only captured one group).
print(captured_group) # this will print out '379104'
Answered By - Remy J