Issue
I'm trying to build a scraper to retrieve translations from Wiktionary.
I'm calling the function below, which should produce a list of all the translations of the argument word, but it gives an empty list.
The command response.css('ol').re(r'(?<=>)\w+(?=<)')
works in the Scrapy shell, though.
The word I'm using as a test is "Hallo".
def scrape_translation(word):
    url = "https://en.wiktionary.org/wiki/" + word
    response = HtmlResponse(url=url)
    translation_list = response.css('ol').re(r'(?<=>)\w+(?=<)')
    print(translation_list)
I'm using Python 3.6.4
Solution
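HtmlResponse does not download anything. It only wraps HTML you already have in a response object, so when it is built from a URL alone its body is empty and the selector finds nothing. Fetch the page yourself and pass the HTML as the body argument: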
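import requests
from scrapy.http import HtmlResponse

def scrape_translation(word):
    url = "https://en.wiktionary.org/wiki/" + word
    # Download the page; HtmlResponse will not do this for us.
    r = requests.get(url)
    # Wrap the downloaded bytes so Scrapy's selectors can be used on them.
    response = HtmlResponse(url=url, body=r.content)
    translation_list = response.css('ol').re(r'(?<=>)\w+(?=<)')
    print(translation_list)

scrape_translation('Hallo')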
I used the requests library, but other Python modules can also download the HTML from a URL.
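As one possible alternative, here is a minimal sketch that downloads the page with the standard library's urllib.request instead of requests. The helper name fetch_html, the User-Agent string, and the 10-second timeout are assumptions made for illustration; none of them come from the original answer.

```python
from urllib.request import Request, urlopen

# Hypothetical User-Agent string for illustration; replace it with your own.
USER_AGENT = "translation-scraper-example/0.1"


def fetch_html(url, timeout=10):
    """Download the resource at `url` and return its raw bytes."""
    request = Request(url, headers={"User-Agent": USER_AGENT})
    # urlopen returns a file-like object; read() yields the body as bytes.
    with urlopen(request, timeout=timeout) as reply:
        return reply.read()
```

The bytes returned by fetch_html(url) can then be passed as the body argument of HtmlResponse in the same way as r.content.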
Answered By - ands