Issue
This is my simple Google search result crawler using Scrapy.
class GoogleBotsSpider(scrapy.Spider):
    name = 'GoogleScrapyBot'
    allowed_domains = ['google.com']
    start_urls = ['https://www.google.com/search?q=apple&hl=en&rlz=&start=0']

    def parse(self, response):
        titles = response.xpath('//*[@id="main"]/div/div/div/a/h3/div//text()').extract()
        links = response.xpath('//*[@id="main"]/div/div/div/a/@href').extract()
        items = []
        for idx in range(len(titles)):
            item = GoogleScraperItem()
            item['title'] = titles[idx]
            item['link'] = links[idx]
            items.append(item)
        return items
However, some item['link'] values start with "/url?q=", as shown below:
{'link': '/url?q=https://www.apple.com/&sa=U&ved=2ahUKEwj398Kv177xAhUFUKwKHZ_qAKkQFjAAegQICBAB&usg=AOvVaw1rYEJO8-kDCh7A5C3AggNG', 'title': 'Apple Inc. - Wikipedia'}
I'd like to remove "/url?q=" using .lstrip("/url?q="), but I don't know where to put this.
Solution
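You can call .lstrip on the link at the point where you assign it to the item. Be aware that .lstrip removes any leading characters that appear in the given set ("/", "u", "r", "l", "?", "q", "="), not the literal prefix; it works here only because the target URL starts with "h" (as in "https"), which is not in that set. It also leaves the trailing "&sa=...&ved=...&usg=..." parameters in place. The change looks like the following: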
for idx in range(len(titles)):
    item = GoogleScraperItem()
    item['title'] = titles[idx]
    item['link'] = links[idx].lstrip("/url?q=")
    items.append(item)
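If you want something less dependent on what the link happens to start with, one option is to parse the href and read the q parameter directly. This is only a sketch, not part of the original answer: clean_google_link is a hypothetical helper name, and it assumes the redirect links always have the path "/url" with the target in q, as in the sample output from the question.

```python
from urllib.parse import urlparse, parse_qs


def clean_google_link(href):
    """Return the target URL of a Google redirect href, or href unchanged.

    Hypothetical helper: assumes redirect links look like
    "/url?q=<target>&sa=...&ved=...&usg=...".
    """
    parsed = urlparse(href)
    if parsed.path == "/url":
        # parse_qs returns a dict of lists, e.g. {"q": ["https://..."], ...}
        target = parse_qs(parsed.query).get("q")
        if target:
            return target[0]
    # Not a redirect link (or no q parameter): leave it untouched.
    return href


# Sample link taken from the question.
raw = ("/url?q=https://www.apple.com/&sa=U"
       "&ved=2ahUKEwj398Kv177xAhUFUKwKHZ_qAKkQFjAAegQICBAB"
       "&usg=AOvVaw1rYEJO8-kDCh7A5C3AggNG")
print(clean_google_link(raw))

# For comparison: lstrip treats its argument as a set of characters,
# so it also eats the leading "/" of a link that is not a redirect.
print("/search?q=apple".lstrip("/url?q="))
```

In the spider you would then write item['link'] = clean_google_link(links[idx]) in place of the .lstrip call. On Python 3.9 or later, links[idx].removeprefix("/url?q=") is another way to drop the literal prefix, though it keeps the trailing tracking parameters.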
Answered By - Hassan Ibraheem