Issue
I'm working on a new project and am trying to crawl a link.
What I did
First of all, I tried to get some information in my Scrapy shell to work things out correctly.
Code I ran in the shell: response.xpath('//div[@class="product-wrapper col-xs-6 col-md-4"]/text()').get()
With this code I just want to print the title of the product, but I get some very weird output.
My first problem had something to do with robots.txt, so I changed the user agent in settings.py and now it works. I guess the error comes from that change, right? Correct me if I'm wrong.
After a bit of research I found out that this comes from wrong formatting and that you can work around it with something like this:
response.xpath('normalize-space(//div[@class="product-wrapper col-xs-6 col-md-4"]/text())')
But this didn't help me at all.
What can I do now?
Solution
You may want to double-check your XPath. Here's my take on it:
import requests
from lxml import html
html.fromstring(requests.get("https://www.karton.eu/einwellig-ab-100-mm").content).xpath("//*[@class='title']/a/text()")
The code fetches the HTML content of the requested page, parses it into an element tree, and applies an XPath selector that finds every element whose class attribute is title, steps down to its anchor tag a, and extracts the text value.
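The same approach can be tried offline, without hitting the site. The sketch below is only an illustration: SAMPLE_PAGE is hypothetical markup modelled on the selectors in this post, and extract_titles is a made-up helper name, so the real page may be structured differently.

```python
from lxml import html

# Hypothetical markup modelled on the selectors in this post;
# the real page's structure may differ.
SAMPLE_PAGE = """
<html><body>
  <div class="product-wrapper col-xs-6 col-md-4">
    <div class="title"><a href="/p/1">113x113x100 mm einwellige Kartons</a></div>
  </div>
  <div class="product-wrapper col-xs-6 col-md-4">
    <div class="title"><a href="/p/2">140x140x100 mm einwellige Kartons</a></div>
  </div>
</body></html>
"""


def extract_titles(page_source):
    """Return the link text under every element whose class is exactly 'title'."""
    tree = html.fromstring(page_source)  # parse the markup into an element tree
    # Strip surrounding whitespace in case the link text is padded.
    return [text.strip() for text in tree.xpath("//*[@class='title']/a/text()")]


titles = extract_titles(SAMPLE_PAGE)
print(titles)
```

In a Scrapy shell the same expression should work as response.xpath("//*[@class='title']/a/text()").getall(), assuming the response contains the same markup.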
The code above outputs:
['113x113x100 mm einwellige Kartons', '140x140x100 mm einwellige Kartons', '150x100x80 mm einwellige Kartons', '150x150x150 mm einwellige Kartons', '170x150x100 mm einwellige Kartons', '190x180x100 mm einwellige Kartons']
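As for the odd output in the question, one plausible explanation (an assumption, since the output itself is not shown) is that the product-wrapper div holds only child elements, so its direct text nodes are nothing but the whitespace between tags. In XPath 1.0, normalize-space() applied to a node-set uses only the first node, which would explain why wrapping the same selector in it did not help. The sketch below uses hypothetical markup to illustrate this.

```python
from lxml import html

# Hypothetical markup; the real page may be structured differently.
SNIPPET = """
<html><body>
  <div class="product-wrapper col-xs-6 col-md-4">
    <div class="title"><a href="/p/1">113x113x100 mm einwellige Kartons</a></div>
  </div>
</body></html>
"""

wrapper_xpath = '//div[@class="product-wrapper col-xs-6 col-md-4"]'
tree = html.fromstring(SNIPPET)

# Direct text children of the wrapper: at most the whitespace between tags.
direct_text = tree.xpath(wrapper_xpath + "/text()")

# normalize-space() on a node-set only looks at its first node,
# so a whitespace-only text node collapses to an empty string.
normalized_first = tree.xpath("normalize-space(" + wrapper_xpath + "/text())")

# Passing the element itself uses the text of all its descendants instead.
normalized_all = tree.xpath("normalize-space(" + wrapper_xpath + ")")

print(repr(direct_text))
print(repr(normalized_first))
print(repr(normalized_all))
```

Selecting the title element directly, as in the answer, avoids the problem because the text node being extracted then belongs to the anchor itself.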
Answered By - baduker