Issue
I have been trying to scrape a site that is not ideally structured. Information within one set of tags is required to understand information in another set of tags, but the second set is not nested within the first, nor is there a containing tag that groups each set of data. Instead, the tags are placed sequentially, and some pages have many tables of data that need to be matched to their headers, which sit in separate list tags.
Hence a simple for loop of the kind:
for myitem in response.xpath('*'):
does not quite do the trick.
If I do the following in the Scrapy shell:
>>> products = response.xpath('//*[@class="r"]//*')
>>> len(products)
271
>>> products[5]
<Selector xpath='//*[@class="r"]//*' data='<ul class="list">\n ...'>
>>> products[5].xpath('@class')
I get:
[<Selector xpath='@class' data='list'>]
If I then try an if statement:
>>> if products[5].xpath('@class') == 'list': "it works!"
...
>>>
It doesn't work: nothing is printed. Similarly, the following does not work either, and I have tried many other things:
>>> if products[5].xpath('@class') == '<Selector xpath='@class' data='list'>': "it works!"
File "<console>", line 1
if products[5].xpath('@class') == '<Selector xpath='@class' data='list'>': "it works!"
^^^^^
SyntaxError: invalid syntax
>>>
What I want to do is loop through products and, for each HTML tag, test it against an if clause: for example, if it is a div with class='a' then do this, if it is a table with id='b' then do that. But I have been unable to. Any help would be appreciated, thanks.
Solution
To query the class, you almost had it: all you need to do is add .get() to the XPath query so that it returns the attribute value as a string, which you can then compare the way you want. To query the HTML tag, the only way I know of is to use the Selector.re method to extract the tag name.
Here is an example that demonstrates both.
In [1]: fetch("https://quotes.toscrape.com")
2023-07-30 21:57:21 [scrapy.core.engine] INFO: Spider opened
2023-07-30 21:57:21 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://quotes.toscrape.com> (referer: None)
In [2]: for element in response.xpath("//*"):
   ...:     print("element = ", element)
   ...:     print("element class = ", element.xpath("./@class").get())
   ...:     print("element tag = ", element.re(r"^<(\w+?)\s"))
   ...:     print()
   ...:
element = <Selector query='//*' data='<html lang="en">\n<head>\n\t<meta charse...'>
element class = None
element tag = ['html']
element = <Selector query='//*' data='<head>\n\t<meta charset="UTF-8">\n\t<titl...'>
element class = None
element tag = []
element = <Selector query='//*' data='<meta charset="UTF-8">'>
element class = None
element tag = ['meta']
element = <Selector query='//*' data='<title>Quotes to Scrape</title>'>
element class = None
element tag = []
element = <Selector query='//*' data='<link rel="stylesheet" href="/static/...'>
element class = None
element tag = ['link']
element = <Selector query='//*' data='<link rel="stylesheet" href="/static/...'>
element class = None
element tag = ['link']
element = <Selector query='//*' data='<body>\n <div class="container">\n ...'>
element class = None
element tag = []
element = <Selector query='//*' data='<div class="container">\n <div ...'>
element class = container
element tag = ['div']
element = <Selector query='//*' data='<div class="row header-box">\n ...'>
element class = row header-box
element tag = ['div']
element = <Selector query='//*' data='<div class="col-md-8">\n ...'>
element class = col-md-8
element tag = ['div']
element = <Selector query='//*' data='<h1>\n <a href="/" ...'>
element class = None
element tag = []
element = <Selector query='//*' data='<a href="/" style="text-decoration: n...'>
element class = None
element tag = ['a']
element = <Selector query='//*' data='<div class="col-md-4">\n ...'>
element class = col-md-4
element tag = ['div']
element = <Selector query='//*' data='<p>\n \n ...'>
element class = None
element tag = []
element = <Selector query='//*' data='<a href="/login">Login</a>'>
element class = None
element tag = ['a']
element = <Selector query='//*' data='<div class="row">\n <div class="col...'>
element class = row
element tag = ['div']
element = <Selector query='//*' data='<div class="col-md-8">\n\n <div clas...'>
element class = col-md-8
element tag = ['div']
element = <Selector query='//*' data='<div class="quote" itemscope itemtype...'>
element class = quote
element tag = ['div']
element = <Selector query='//*' data='<span class="text" itemprop="text">“T...'>
element class = text
element tag = ['span']
element = <Selector query='//*' data='<span>by <small class="author" itempr...'>
element class = None
element tag = []
element = <Selector query='//*' data='<small class="author" itemprop="autho...'>
element class = author
element tag = ['small']
element = <Selector query='//*' data='<a href="/author/Albert-Einstein">(ab...'>
element class = None
element tag = ['a']
element = <Selector query='//*' data='<div class="tags">\n Tags:\n...'>
element class = tags
element tag = ['div']
element = <Selector query='//*' data='<meta class="keywords" itemprop="keyw...'>
element class = keywords
element tag = ['meta']
element = <Selector query='//*' data='<a class="tag" href="/tag/change/page...'>
element class = tag
element tag = ['a']
element = <Selector query='//*' data='<a class="tag" href="/tag/deep-though...'>
element class = tag
element tag = ['a']
element = <Selector query='//*' data='<a class="tag" href="/tag/thinking/pa...'>
element class = tag
element tag = ['a']
element = <Selector query='//*' data='<a class="tag" href="/tag/world/page/...'>
element class = tag
element tag = ['a']
element = <Selector query='//*' data='<div class="quote" itemscope itemtype...'>
element class = quote
element tag = ['div']
element = <Selector query='//*' data='<span class="text" itemprop="text">“I...'>
element class = text
element tag = ['span']
element = <Selector query='//*' data='<span>by <small class="author" itempr...'>
element class = None
element tag = []
element = <Selector query='//*' data='<small class="author" itemprop="autho...'>
element class = author
element tag = ['small']
element = <Selector query='//*' data='<a href="/author/J-K-Rowling">(about)...'>
element class = None
element tag = ['a']
element = <Selector query='//*' data='<div class="tags">\n Tags:\n...'>
element class = tags
element tag = ['div']
element = <Selector query='//*' data='<meta class="keywords" itemprop="keyw...'>
element class = keywords
element tag = ['meta']
element = <Selector query='//*' data='<a class="tag" href="/tag/abilities/p...'>
element class = tag
element tag = ['a']
element = <Selector query='//*' data='<a class="tag" href="/tag/choices/pag...'>
element class = tag
element tag = ['a']
element = <Selector query='//*' data='<div class="quote" itemscope itemtype...'>
element class = quote
element tag = ['div']
element = <Selector query='//*' data='<span class="text" itemprop="text">“T...'>
element class = text
element tag = ['span']
element = <Selector query='//*' data='<span>by <small class="author" itempr...'>
element class = None
element tag = []
element = <Selector query='//*' data='<small class="author" itemprop="autho...'>
element class = author
element tag = ['small']
element = <Selector query='//*' data='<a href="/author/Albert-Einstein">(ab...'>
element class = None
element tag = ['a']
element = <Selector query='//*' data='<div class="tags">\n Tags:\n...'>
element class = tags
element tag = ['div']
element = <Selector query='//*' data='<meta class="keywords" itemprop="keyw...'>
element class = keywords
element tag = ['meta']
element = <Selector query='//*' data='<a class="tag" href="/tag/inspiration...'>
element class = tag
element tag = ['a']
element = <Selector query='//*' data='<a class="tag" href="/tag/life/page/1...'>
element class = tag
element tag = ['a']
element = <Selector query='//*' data='<a class="tag" href="/tag/live/page/1...'>
element class = tag
element tag = ['a']
element = <Selector query='//*' data='<a class="tag" href="/tag/miracle/pag...'>
element class = tag
element tag = ['a']
element = <Selector query='//*' data='<a class="tag" href="/tag/miracles/pa...'>
element class = tag
element tag = ['a']
element = <Selector query='//*' data='<div class="quote" itemscope itemtype...'>
element class = quote
element tag = ['div']
element = <Selector query='//*' data='<span class="text" itemprop="text">“T...'>
element class = text
element tag = ['span']
element = <Selector query='//*' data='<span>by <small class="author" itempr...'>
element class = None
element tag = []
element = <Selector query='//*' data='<small class="author" itemprop="autho...'>
element class = author
element tag = ['small']
element = <Selector query='//*' data='<a href="/author/Jane-Austen">(about)...'>
element class = None
element tag = ['a']
element = <Selector query='//*' data='<div class="tags">\n Tags:\n...'>
element class = tags
element tag = ['div']
element = <Selector query='//*' data='<meta class="keywords" itemprop="keyw...'>
element class = keywords
element tag = ['meta']
element = <Selector query='//*' data='<a class="tag" href="/tag/aliteracy/p...'>
element class = tag
element tag = ['a']
element = <Selector query='//*' data='<a class="tag" href="/tag/books/page/...'>
element class = tag
element tag = ['a']
element = <Selector query='//*' data='<a class="tag" href="/tag/classic/pag...'>
element class = tag
element tag = ['a']
element = <Selector query='//*' data='<a class="tag" href="/tag/humor/page/...'>
element class = tag
element tag = ['a']
element = <Selector query='//*' data='<div class="quote" itemscope itemtype...'>
element class = quote
element tag = ['div']
element = <Selector query='//*' data='<span class="text" itemprop="text">“I...'>
element class = text
element tag = ['span']
element = <Selector query='//*' data='<span>by <small class="author" itempr...'>
element class = None
element tag = []
element = <Selector query='//*' data='<small class="author" itemprop="autho...'>
element class = author
element tag = ['small']
element = <Selector query='//*' data='<a href="/author/Marilyn-Monroe">(abo...'>
element class = None
element tag = ['a']
element = <Selector query='//*' data='<div class="tags">\n Tags:\n...'>
element class = tags
element tag = ['div']
element = <Selector query='//*' data='<meta class="keywords" itemprop="keyw...'>
element class = keywords
element tag = ['meta']
element = <Selector query='//*' data='<a class="tag" href="/tag/be-yourself...'>
element class = tag
element tag = ['a']
element = <Selector query='//*' data='<a class="tag" href="/tag/inspiration...'>
element class = tag
element tag = ['a']
element = <Selector query='//*' data='<div class="quote" itemscope itemtype...'>
element class = quote
element tag = ['div']
element = <Selector query='//*' data='<span class="text" itemprop="text">“T...'>
element class = text
element tag = ['span']
element = <Selector query='//*' data='<span>by <small class="author" itempr...'>
element class = None
element tag = []
element = <Selector query='//*' data='<small class="author" itemprop="autho...'>
element class = author
element tag = ['small']
element = <Selector query='//*' data='<a href="/author/Albert-Einstein">(ab...'>
element class = None
element tag = ['a']
element = <Selector query='//*' data='<div class="tags">\n Tags:\n...'>
element class = tags
element tag = ['div']
element = <Selector query='//*' data='<meta class="keywords" itemprop="keyw...'>
element class = keywords
element tag = ['meta']
element = <Selector query='//*' data='<a class="tag" href="/tag/adulthood/p...'>
element class = tag
element tag = ['a']
element = <Selector query='//*' data='<a class="tag" href="/tag/success/pag...'>
element class = tag
element tag = ['a']
element = <Selector query='//*' data='<a class="tag" href="/tag/value/page/...'>
element class = tag
element tag = ['a']
element = <Selector query='//*' data='<div class="quote" itemscope itemtype...'>
element class = quote
element tag = ['div']
element = <Selector query='//*' data='<span class="text" itemprop="text">“I...'>
element class = text
element tag = ['span']
element = <Selector query='//*' data='<span>by <small class="author" itempr...'>
element class = None
element tag = []
element = <Selector query='//*' data='<small class="author" itemprop="autho...'>
element class = author
element tag = ['small']
element = <Selector query='//*' data='<a href="/author/Andre-Gide">(about)</a>'>
element class = None
element tag = ['a']
element = <Selector query='//*' data='<div class="tags">\n Tags:\n...'>
element class = tags
element tag = ['div']
element = <Selector query='//*' data='<meta class="keywords" itemprop="keyw...'>
element class = keywords
element tag = ['meta']
element = <Selector query='//*' data='<a class="tag" href="/tag/life/page/1...'>
element class = tag
element tag = ['a']
element = <Selector query='//*' data='<a class="tag" href="/tag/love/page/1...'>
element class = tag
element tag = ['a']
element = <Selector query='//*' data='<div class="quote" itemscope itemtype...'>
element class = quote
element tag = ['div']
element = <Selector query='//*' data='<span class="text" itemprop="text">“I...'>
element class = text
element tag = ['span']
element = <Selector query='//*' data='<span>by <small class="author" itempr...'>
element class = None
element tag = []
element = <Selector query='//*' data='<small class="author" itemprop="autho...'>
element class = author
element tag = ['small']
element = <Selector query='//*' data='<a href="/author/Thomas-A-Edison">(ab...'>
element class = None
element tag = ['a']
element = <Selector query='//*' data='<div class="tags">\n Tags:\n...'>
element class = tags
element tag = ['div']
element = <Selector query='//*' data='<meta class="keywords" itemprop="keyw...'>
element class = keywords
element tag = ['meta']
element = <Selector query='//*' data='<a class="tag" href="/tag/edison/page...'>
element class = tag
element tag = ['a']
element = <Selector query='//*' data='<a class="tag" href="/tag/failure/pag...'>
element class = tag
element tag = ['a']
element = <Selector query='//*' data='<a class="tag" href="/tag/inspiration...'>
element class = tag
element tag = ['a']
element = <Selector query='//*' data='<a class="tag" href="/tag/paraphrased...'>
element class = tag
element tag = ['a']
element = <Selector query='//*' data='<div class="quote" itemscope itemtype...'>
element class = quote
element tag = ['div']
element = <Selector query='//*' data='<span class="text" itemprop="text">“A...'>
element class = text
element tag = ['span']
element = <Selector query='//*' data='<span>by <small class="author" itempr...'>
element class = None
element tag = []
element = <Selector query='//*' data='<small class="author" itemprop="autho...'>
element class = author
element tag = ['small']
element = <Selector query='//*' data='<a href="/author/Eleanor-Roosevelt">(...'>
element class = None
element tag = ['a']
element = <Selector query='//*' data='<div class="tags">\n Tags:\n...'>
element class = tags
element tag = ['div']
element = <Selector query='//*' data='<meta class="keywords" itemprop="keyw...'>
element class = keywords
element tag = ['meta']
element = <Selector query='//*' data='<a class="tag" href="/tag/misattribut...'>
element class = tag
element tag = ['a']
element = <Selector query='//*' data='<div class="quote" itemscope itemtype...'>
element class = quote
element tag = ['div']
element = <Selector query='//*' data='<span class="text" itemprop="text">“A...'>
element class = text
element tag = ['span']
element = <Selector query='//*' data='<span>by <small class="author" itempr...'>
element class = None
element tag = []
element = <Selector query='//*' data='<small class="author" itemprop="autho...'>
element class = author
element tag = ['small']
element = <Selector query='//*' data='<a href="/author/Steve-Martin">(about...'>
element class = None
element tag = ['a']
element = <Selector query='//*' data='<div class="tags">\n Tags:\n...'>
element class = tags
element tag = ['div']
element = <Selector query='//*' data='<meta class="keywords" itemprop="keyw...'>
element class = keywords
element tag = ['meta']
element = <Selector query='//*' data='<a class="tag" href="/tag/humor/page/...'>
element class = tag
element tag = ['a']
element = <Selector query='//*' data='<a class="tag" href="/tag/obvious/pag...'>
element class = tag
element tag = ['a']
element = <Selector query='//*' data='<a class="tag" href="/tag/simile/page...'>
element class = tag
element tag = ['a']
element = <Selector query='//*' data='<nav>\n <ul class="pager">\n ...'>
element class = None
element tag = []
element = <Selector query='//*' data='<ul class="pager">\n \n ...'>
element class = pager
element tag = ['ul']
element = <Selector query='//*' data='<li class="next">\n <a ...'>
element class = next
element tag = ['li']
element = <Selector query='//*' data='<a href="/page/2/">Next <span aria-hi...'>
element class = None
element tag = ['a']
element = <Selector query='//*' data='<span aria-hidden="true">→</span>'>
element class = None
element tag = ['span']
element = <Selector query='//*' data='<div class="col-md-4 tags-box">\n ...'>
element class = col-md-4 tags-box
element tag = ['div']
element = <Selector query='//*' data='<h2>Top Ten tags</h2>'>
element class = None
element tag = []
element = <Selector query='//*' data='<span class="tag-item">\n <...'>
element class = tag-item
element tag = ['span']
element = <Selector query='//*' data='<a class="tag" style="font-size: 28px...'>
element class = tag
element tag = ['a']
element = <Selector query='//*' data='<span class="tag-item">\n <...'>
element class = tag-item
element tag = ['span']
element = <Selector query='//*' data='<a class="tag" style="font-size: 26px...'>
element class = tag
element tag = ['a']
element = <Selector query='//*' data='<span class="tag-item">\n <...'>
element class = tag-item
element tag = ['span']
element = <Selector query='//*' data='<a class="tag" style="font-size: 26px...'>
element class = tag
element tag = ['a']
element = <Selector query='//*' data='<span class="tag-item">\n <...'>
element class = tag-item
element tag = ['span']
element = <Selector query='//*' data='<a class="tag" style="font-size: 24px...'>
element class = tag
element tag = ['a']
element = <Selector query='//*' data='<span class="tag-item">\n <...'>
element class = tag-item
element tag = ['span']
element = <Selector query='//*' data='<a class="tag" style="font-size: 22px...'>
element class = tag
element tag = ['a']
element = <Selector query='//*' data='<span class="tag-item">\n <...'>
element class = tag-item
element tag = ['span']
element = <Selector query='//*' data='<a class="tag" style="font-size: 14px...'>
element class = tag
element tag = ['a']
element = <Selector query='//*' data='<span class="tag-item">\n <...'>
element class = tag-item
element tag = ['span']
element = <Selector query='//*' data='<a class="tag" style="font-size: 10px...'>
element class = tag
element tag = ['a']
element = <Selector query='//*' data='<span class="tag-item">\n <...'>
element class = tag-item
element tag = ['span']
element = <Selector query='//*' data='<a class="tag" style="font-size: 8px"...'>
element class = tag
element tag = ['a']
element = <Selector query='//*' data='<span class="tag-item">\n <...'>
element class = tag-item
element tag = ['span']
element = <Selector query='//*' data='<a class="tag" style="font-size: 8px"...'>
element class = tag
element tag = ['a']
element = <Selector query='//*' data='<span class="tag-item">\n <...'>
element class = tag-item
element tag = ['span']
element = <Selector query='//*' data='<a class="tag" style="font-size: 6px"...'>
element class = tag
element tag = ['a']
element = <Selector query='//*' data='<footer class="footer">\n <div ...'>
element class = footer
element tag = ['footer']
element = <Selector query='//*' data='<div class="container">\n <...'>
element class = container
element tag = ['div']
element = <Selector query='//*' data='<p class="text-muted">\n ...'>
element class = text-muted
element tag = ['p']
element = <Selector query='//*' data='<a href="https://www.goodreads.com/qu...'>
element class = None
element tag = ['a']
element = <Selector query='//*' data='<p class="copyright">\n ...'>
element class = copyright
element tag = ['p']
element = <Selector query='//*' data='<span class="zyte">❤</span>'>
element class = zyte
element tag = ['span']
element = <Selector query='//*' data='<a class="zyte" href="https://www.zyt...'>
element class = zyte
element tag = ['a']
Answered By - Alexander