Monday, March 14, 2022

[FIXED] Python: Scrapy returning all html following element instead of just html of element

Issue

I am having an issue where Scrapy is behaving unexpectedly.

Months ago I wrote a simple function that returns a list of the items matching a given XPath.

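import time
from scrapy.selector import Selector

def get_html(response, path):
    # response must expose .page_source (e.g. a Selenium WebDriver)
    sel = Selector(text=response.page_source)
    time.sleep(.2)  # brief pause between calls
    items = sel.xpath(path).getall()  # serialized HTML of each matching node
    return items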

Usage Example:

<body>
    <div id="1">Some Text</div>
    <div id="2">Different Text</div>
    <a href="#">Some link</a>
</body>

If I wanted to get all of the div elements, I would write this:

get_html(response,'//div')

I expect, and have previously received, this output:

['<div id="1">Some Text</div>',
 '<div id="2">Different Text</div>']

However, when I call this function now, I receive this output:

['<div id="1">Some Text</div><div id="2">Different Text</div><a href="#">Some link</a></body>',
 '<div id="2">Different Text</div><a href="#">Some link</a></body>']

The problem isn't due to a change in the webpage I was scraping. I saved the source code when I originally scraped it, and it is identical to the source code I see on the webpage today. The problem also occurs on multiple websites I've tried to scrape. I'm not sure what the cause is or how to fix it. I need either to fix the problem or to replace the function with another one that behaves identically.

I understand that I could split the strings and remove the unwanted data. However, I have used this function in 100+ modules and do not want to risk breaking them by hardcoding a workaround like that. I need to understand why the output of the function has changed even though nothing about the source code has changed.

Edit:

Per the comments below, here is exactly what I enter into the console to produce this result. Please let me know how I can begin to diagnose why this is happening if it's not reproducible for others. I am using Spyder 4.2.5, Python 3.8.5 and Scrapy 2.4.1.

In[1]: from scrapy.selector import Selector

In[2]: text = """<body>
        <div id="1">Some Text</div>
        <div id="2">Different Text</div>
        <a href="#">Some link</a>
    </body>"""

In[3]: sel = Selector(text=text)

In[4]: items = sel.xpath('//div').getall()

In[5]: items
Out[5]: 
['<div id="1">Some Text</div>\n        <div id="2">Different Text</div>\n        <a href="#">Some link</a>\n    </body></html>\n',
 '<div id="2">Different Text</div>\n        <a href="#">Some link</a>\n    </body></html>\n']
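When a result like this cannot be reproduced elsewhere, one way to begin diagnosing it is to compare the versions of the parsing stack on the affected machine against a machine that behaves correctly. The sketch below is only a starting point: the list of packages to inspect and the helper names `installed_versions` and `libxml2_versions` are assumptions for illustration, and nothing in this post confirms which component, if any, was at fault.

```python
from importlib import metadata


def installed_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def libxml2_versions():
    """Return the libxml2 versions lxml reports, or None if lxml is missing."""
    try:
        from lxml import etree
    except ImportError:
        return None
    return {
        "runtime": etree.LIBXML_VERSION,            # libxml2 loaded right now
        "compiled": etree.LIBXML_COMPILED_VERSION,  # libxml2 lxml was built against
    }


if __name__ == "__main__":
    # Hypothetical list of packages worth comparing; adjust as needed.
    packages = ["Scrapy", "parsel", "lxml", "w3lib", "cssselect"]
    for name, version in installed_versions(packages).items():
        print(f"{name}: {version or 'not installed'}")
    print("libxml2:", libxml2_versions())
```

Running this in both environments and comparing the output line by line shows whether the two setups actually share the same libraries, which the Scrapy version number alone does not guarantee.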

Solution

The problem appears to be fixed after a fresh install of Anaconda. I'm not sure what caused it in the first place; here's hoping it doesn't happen again.
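Since the root cause was never identified, a cheap check can at least make a recurrence visible. The sketch below is a heuristic, not part of the original answer: the hypothetical helper `has_trailing_markup` flags a serialized element whose string does not end with its own closing tag, which is the symptom shown in the question. It does not modify the output of `get_html`, and it misses the case where the trailing content happens to end with the same tag name.

```python
import re


def has_trailing_markup(fragment):
    """Return True if a serialized element does not end with its own closing tag."""
    fragment = fragment.strip()
    match = re.match(r"<([A-Za-z][\w:-]*)", fragment)
    if match is None:
        return False  # not an element (e.g. a text node), nothing to check
    if fragment.count("<") == 1:
        return False  # a lone void element such as <br> has no closing tag
    tag = match.group(1).lower()
    return not fragment.lower().endswith(f"</{tag}>")


# Outputs copied from the question above.
expected = ['<div id="1">Some Text</div>',
            '<div id="2">Different Text</div>']
broken = ['<div id="1">Some Text</div>\n        <div id="2">Different Text</div>\n        <a href="#">Some link</a>\n    </body></html>\n',
          '<div id="2">Different Text</div>\n        <a href="#">Some link</a>\n    </body></html>\n']

print(any(has_trailing_markup(item) for item in expected))  # False
print(all(has_trailing_markup(item) for item in broken))    # True
```

A check like this could run once at start-up against a small known document, so that a broken environment is reported before any of the scraping modules run.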

Answered By - Madison Ashbach

This answer was collected from Stack Overflow and tested by PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
