Issue
Scrapy has started behaving unexpectedly.
Months ago I wrote a simple function that returns a list of the items matching a given XPath.
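import time
from scrapy.selector import Selector

def get_html(response, path):
    # response is assumed to expose the rendered HTML as .page_source
    sel = Selector(text=response.page_source)
    time.sleep(.2)
    items = sel.xpath(path).getall()  # one serialized string per matching node
    return items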
Usage Example:
<body>
<div id="1">Some Text</div>
<div id="2">Different Text</div>
<a href="#">Some link</a>
</body>
If I wanted to get all of the div elements, I would write this:
get_html(response,'//div')
I expect, and have previously received, this output:
['<div id="1">Some Text</div>',
'<div id="2">Different Text</div>']
However, when I call this function now, I receive this output:
['<div id="1">Some Text</div><div id="2">Different Text</div><a href="#">Some link</a></body>',
'<div id="2">Different Text</div><a href="#">Some link</a></body>']
The problem isn't due to a change in the webpage I was scraping: I saved the source code when I originally scraped it, and it is identical to the source code I see on the webpage today. The problem also occurs across multiple websites I've tried to scrape. I'm not sure what the cause is or how to fix it. I need either to fix it or to replace the function with one that behaves identically.
I understand that I could split the strings and remove the unwanted data. However, I have used this function in 100+ modules and do not want to risk breaking them by hardcoding a workaround like that. I need to understand why the output of the function has changed even though nothing about the source code has changed.
Edit:
Per the comments below, here is exactly what I enter into the console to produce this result. Please let me know how I can begin to diagnose why this is happening if it's not reproducible for others. I am using Spyder 4.2.5, Python 3.8.5, and Scrapy 2.4.1.
In[1]: from scrapy.selector import Selector
In[2]: text = """<body>
<div id="1">Some Text</div>
<div id="2">Different Text</div>
<a href="#">Some link</a>
</body>"""
In[3]: sel = Selector(text=text)
In[4]: items = sel.xpath('//div').getall()
In[5]: items
Out[5]:
['<div id="1">Some Text</div>\n <div id="2">Different Text</div>\n <a href="#">Some link</a>\n </body></html>\n',
'<div id="2">Different Text</div>\n <a href="#">Some link</a>\n </body></html>\n']
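To check whether a given environment shows the same behavior, the serialized strings can be tested directly: each result for '//div' should end with its own closing tag and nothing after it. The sketch below is not part of the original code. The helper name `serialized_alone` is hypothetical, and it only inspects strings, so it runs without Scrapy installed. The two sample lists are copied from the expected output and the console session above.

```python
def serialized_alone(items, tag):
    """Return True if every string in items ends with </tag> and nothing after it.

    Hypothetical helper: a coarse check that only looks at how each string ends.
    """
    closing = "</{}>".format(tag)
    return all(item.rstrip().endswith(closing) for item in items)


# Output the question expects from get_html(response, '//div')
expected = ['<div id="1">Some Text</div>',
            '<div id="2">Different Text</div>']

# Output actually observed in the console session
observed = ['<div id="1">Some Text</div>\n <div id="2">Different Text</div>\n <a href="#">Some link</a>\n </body></html>\n',
            '<div id="2">Different Text</div>\n <a href="#">Some link</a>\n </body></html>\n']

print(serialized_alone(expected, "div"))  # True
print(serialized_alone(observed, "div"))  # False
```

Running the same check on `sel.xpath('//div').getall()` in an affected and an unaffected environment would show whether the difference comes from the environment rather than from the page.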
Solution
The problem appears to be fixed after a fresh install of Anaconda. I'm not sure what caused it in the first place; here's hoping it doesn't happen again.
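Since a reinstall made the problem go away, the cause may have been in the installed packages rather than in the function itself, although the post does not confirm this. Scrapy's Selector is built on parsel and lxml, so recording their versions before and after a reinstall is one way to narrow things down if the behavior returns. The sketch below is only a diagnostic aid under that assumption. The function names and the default package list are illustrative, not something from the original post.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages=("scrapy", "parsel", "lxml", "w3lib")):
    """Map each package name to its installed version, or None if absent."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


def libxml2_versions():
    """Return (runtime, compiled) libxml2 versions seen by lxml, or None without lxml."""
    try:
        from lxml import etree
    except ImportError:
        return None
    return etree.LIBXML_VERSION, etree.LIBXML_COMPILED_VERSION


if __name__ == "__main__":
    for name, ver in installed_versions().items():
        print(name, ver or "not installed")
    print("libxml2 (runtime, compiled):", libxml2_versions())
```

Scrapy's own `scrapy version -v` command prints similar information. Comparing the output from a working and a broken environment would show which component differs.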
Answered By - Madison Ashbach