Issue
I have pages like this:
<?xml version="1.0" encoding="utf-8"?>\r\n<HTMLReturn xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://gccwebapps/PROWWS/">\r\n <Result>OK</Result>\r\n <ErrorMessageNewLine>\n</ErrorMessageNewLine>\r\n <ErrorMessage />\r\n <ID />\r\n <HTML><div id=\'DivPROWContainer\' class=\'PROWContainer\'>\n<div id=\'DivTableGCCDocsHolder\' class=\'TableGCCDocsHolder\'>\n<table id=\'TableDisplayTable\' class=\'DisplayTable DisplayGCCDocsTable HtmlDataTable\'>\n<tbody>\n<tr class=\'DisplayTableHeaderRow HtmlDataTableHeaderRow DisplayTableTopRow\'>\n<th colspan=\'5\'>Documents available for the planning Application</th>\n</tr>\n<tr class=\'DisplayTableHeaderRow HtmlDataTableHeaderRow\'>\n<th>Application Number</th>\n<th>Plan number</th>\n<th>Document type</th>\n<th>Description</th>\n<th>Date Entered</th>\n</tr>\n<tr class=\'DisplayTableDataRow HtmlDataTableRow ResultRowAlternative\'>\n<td><a id=\'AFormLink_APP_NO\' class=\'FormHyperLink\' href=\'https://ww3.gloucestershire.gov.uk/PROW/PROWWS.asmx/GetFileGCCContents?Filename=images%2f22_0001_NONMAT_DEC_LET.PDF\' data-DisableMeWhenSomethingChanged=\'1\' target=\'_blank\' rel=\'noopener noreferrer\'>22/0001/NONMAT\n</a></td>\n<td></td>\n<td>Text</td>\n<td><a id=\'AFormLink_DESCRIPTION\' class=\'FormHyperLink\' href=\'https://ww3.gloucestershire.gov.uk/PROW/PROWWS.asmx/GetFileGCCContents?Filename=images%2f22_0001_NONMAT_DEC_LET.PDF\' data-DisableMeWhenSomethingChanged=\'1\' target=\'_blank\' rel=\'noopener noreferrer\'>Decision Letter\n</a></td>\n<td>26/01/2022</td>\n</tr>\n<tr class=\'DisplayTableDataRow HtmlDataTableRow ResultRowAlternative\'>\n<td><a id=\'AFormLink_APP_NO\' class=\'FormHyperLink\' href=\'https://ww3.gloucestershire.gov.uk/PROW/PROWWS.asmx/GetFileGCCContents?Filename=images%2f22_0001_NONMAT_APP_FORM_RED.PDF\' data-DisableMeWhenSomethingChanged=\'1\' target=\'_blank\' rel=\'noopener noreferrer\'>22/0001/NONMAT\n</a></td>\n<td></td>\n<td>Plan</td>\n<td><a 
id=\'AFormLink_DESCRIPTION\' class=\'FormHyperLink\' href=\'https://ww3.gloucestershire.gov.uk/PROW/PROWWS.asmx/GetFileGCCContents?Filename=images%2f22_0001_NONMAT_APP_FORM_RED.PDF\' data-DisableMeWhenSomethingChanged=\'1\' target=\'_blank\' rel=\'noopener noreferrer\'>Application Form 9Redacted)\n</a></td>\n<td>10/01/2022</td>\n</tr>\n<tr class=\'DisplayTableDataRow HtmlDataTableRow ResultRowAlternative\'>\n<td><a id=\'AFormLink_APP_NO\' class=\'FormHyperLink\' href=\'https://ww3.gloucestershire.gov.uk/PROW/PROWWS.asmx/GetFileGCCContents?Filename=images%2f22_0001_NONMAT_LAND_PLAN_P20_2956_05D.PDF\' data-DisableMeWhenSomethingChanged=\'1\' target=\'_blank\' rel=\'noopener noreferrer\'>22/0001/NONMAT\n</a></td>\n<td>P20_2956_05D</td>\n<td>Text</td>\n<td><a id=\'AFormLink_DESCRIPTION\' class=\'FormHyperLink\' href=\'https://ww3.gloucestershire.gov.uk/PROW/PROWWS.asmx/GetFileGCCContents?Filename=images%2f22_0001_NONMAT_LAND_PLAN_P20_2956_05D.PDF\' data-DisableMeWhenSomethingChanged=\'1\' target=\'_blank\' rel=\'noopener noreferrer\'>Landscape MasterPlan 04.01.22\n</a></td>\n<td>10/01/2022</td>\n</tr>\n<tr class=\'DisplayTableDataRow HtmlDataTableRow ResultRowAlternative\'>\n<td><a id=\'AFormLink_APP_NO\' class=\'FormHyperLink\' href=\'https://ww3.gloucestershire.gov.uk/PROW/PROWWS.asmx/GetFileGCCContents?Filename=images%2f22_0001_NONMAT_ELEC_SERV_190123_SC_XX_XX_DR_E_600.PDF\' data-DisableMeWhenSomethingChanged=\'1\' target=\'_blank\' rel=\'noopener noreferrer\'>22/0001/NONMAT\n</a></td>\n<td>190123_SC_XX_XX_DR_E_600</td>\n<td>Plan</td>\n<td><a id=\'AFormLink_DESCRIPTION\' class=\'FormHyperLink\' href=\'https://ww3.gloucestershire.gov.uk/PROW/PROWWS.asmx/GetFileGCCContents?Filename=images%2f22_0001_NONMAT_ELEC_SERV_190123_SC_XX_XX_DR_E_600.PDF\' data-DisableMeWhenSomethingChanged=\'1\' target=\'_blank\' rel=\'noopener noreferrer\'>Electrical Services Site Wide\n</a></td>\n<td>10/01/2022</td>\n</tr>\n</tbody>\n\n</table>\n</div>\n<div class=\'PROWDefaultFooter\'>\n<div 
class=\'PROWFooter1\'>© 2014-21 Gloucestershire County Council, Shire Hall, Westgate Street, Gloucester GL1 2TG.\n</div>\n<div class=\'PROWFooter2\'><STRONG>Telephone:</STRONG>+44(0)1452 425000 - <STRONG> Out of hours:</STRONG> +44(0)845 6677788\n</div>\n<div class=\'PROWFooter2\'>\n<a id=\'AGCCLink\' class=\'GCCFooterLink\' href=\'http://www.gloucestershire.gov.uk\' data-DisableMeWhenSomethingChanged=\'1\'>www.gloucestershire.gov.uk\n</a>\n</div>\n</div>\n</div>\n</HTML>\r\n <Script>gcc_docs_startScreenSetup();</Script>\r\n</HTMLReturn>
I need to find elements in it using XPath (without namespaces). I tried several variants, but every parse gives me only a handful of top-level elements (len() returns 5 or 6) and every XPath query comes back empty.
These are the variants I tried. As you can see, none of them works:
import lxml.html as html
res = html.fromstring(sec_response.body)
len(res)
5
res.xpath('//div')
[]
import xml.etree.ElementTree as ET
xhtml = ET.fromstring(sec_response.text)
len(xhtml)
6
xhtml.xpath('//div')
*** AttributeError: 'xml.etree.ElementTree.Element' object has no attribute 'xpath'
from lxml import etree
xslt_root = etree.XML(sec_response.body)
len(xslt_root)
6
xslt_root.xpath('//div')
[]
sec_response.selector.remove_namespaces()
sec_response.xpath('//td')
[]
sec_response.xpath('//tr')
[]
Please show how to transform the response so that XPath can be used on it. I need to query //tr, //td or //a elements and actually get matches.
Solution
scrapy shell file:///....../temp.xml # your page's code
In [1]: response.xpath('//div')
Out[1]: []
In [2]: import html
In [3]: from scrapy.selector import Selector
In [4]: response.selector.remove_namespaces()
In [5]: text = html.unescape(response.text)
In [6]: sel = Selector(text=text)
In [7]: sel.xpath('//div')
Out[7]:
[<Selector xpath='//div' data='<div id="\\\'DivPROWContainer\\\'" class=...'>,
<Selector xpath='//div' data='<div id="\\\'DivTableGCCDocsHolder\\\'" c...'>,
<Selector xpath='//div' data='<div class="\\\'PROWDefaultFooter\\\'">\\n...'>,
<Selector xpath='//div' data='<div class="\\\'PROWFooter1\\\'">© 2014-2...'>,
<Selector xpath='//div' data='<div class="\\\'PROWFooter2\\\'"><strong>...'>,
<Selector xpath='//div' data='<div class="\\\'PROWFooter2\\\'">\\n<a id=...'>]
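The session above works because html.unescape turns the payload back into real markup before it is parsed. Presumably the raw response carries the content of the HTML element as entity-escaped text (&lt;div ...&gt;), so a parser sees one text node there and no div, tr or td elements. The same idea can be sketched with the standard library alone. Everything in this sketch is an assumption made for illustration: RAW is a shortened, hypothetical stand-in for the response body, extract_rows is a made-up helper name, the example.invalid URL is a placeholder, and the inner markup is assumed to be well-formed XML. For markup that is not well-formed, a lenient parser such as Selector(text=...) from the session above or lxml.html.fromstring would be the safer choice.

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened stand-in for the response body. The <HTML> element
# holds its markup as entity-escaped text, not as child elements.
RAW = (
    '<HTMLReturn xmlns="http://gccwebapps/PROWWS/">'
    "<Result>OK</Result>"
    "<HTML>&lt;div id='DivPROWContainer'&gt;&lt;table&gt;&lt;tbody&gt;"
    "&lt;tr&gt;&lt;td&gt;&lt;a href='https://example.invalid/a.pdf'&gt;"
    "Decision Letter&lt;/a&gt;&lt;/td&gt;&lt;td&gt;26/01/2022&lt;/td&gt;&lt;/tr&gt;"
    "&lt;/tbody&gt;&lt;/table&gt;&lt;/div&gt;</HTML>"
    "</HTMLReturn>"
)

# The outer document declares a default namespace, so lookups must name it.
NS = {"p": "http://gccwebapps/PROWWS/"}


def extract_rows(raw_xml):
    """Return (cell texts, link hrefs) for every table row in the payload."""
    outer = ET.fromstring(raw_xml)
    # Reading .text already decodes the entities, yielding the inner markup.
    payload = outer.find("p:HTML", NS).text
    # Second parse: the decoded payload now contains real elements.
    # Assumption: the payload is well-formed XML with a single root element.
    inner = ET.fromstring(payload)
    rows = []
    for tr in inner.iter("tr"):
        cells = ["".join(td.itertext()).strip() for td in tr.iter("td")]
        links = [a.get("href") for a in tr.iter("a")]
        rows.append((cells, links))
    return rows


rows = extract_rows(RAW)
print(rows)
```

The two-step parse keeps the namespace handling confined to the outer document: the inner markup has no namespace, so plain tag names such as tr, td and a match without any prefix.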
Answered By - SuperUser