Issue
I am trying to scrape some search results from this company register, but when I try to scrape the company name, my results don't come back properly. It looks as if the company name is split into two HTML items around the search keyword.
Is there a way to join these together? This is my spider:
import scrapy


class QuotesSpider(scrapy.Spider):
    name = 'gov2'
    start_urls = ['https://beta.companieshouse.gov.uk/search/companies?q=a']

    def parse(self, response):
        for i in response.css('ul.results-list'):
            yield {
                'company_name': i.css('li.type-company h3 a::text').extract(),
                'address': i.css('li.type-company p::text').extract(),
            }
As you can see, my results are missing some parts.
I hope one of you can see what's going on. Thank you!
Solution
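As I see it, you want to fetch all the text within the a and p tags, and there are child tags nested within these tags. The selector a::text returns only the text nodes that sit directly inside the tag, whereas a ::text (with a space before ::text) also returns the text of every descendant.
Try this, and remove the unnecessary whitespace with a regex: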
import re

import scrapy


class QuotesSpider(scrapy.Spider):
    name = 'gov2'
    start_urls = ['https://beta.companieshouse.gov.uk/search/companies?q=a']

    def parse(self, response):
        # Loop over each result so every company yields its own item.
        for i in response.css('ul.results-list li.type-company'):
            yield {
                # ' ::text' (note the space) also collects text from nested tags;
                # re.sub collapses each run of whitespace into a single space.
                'company_name': re.sub(r'\s+', ' ', ''.join(i.css('h3 a ::text').extract())).strip(),
                'address': re.sub(r'\s+', ' ', ''.join(i.css('p ::text').extract())).strip(),
            }
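To see why the name comes back in pieces, here is a minimal sketch that reproduces the effect without Scrapy, using only the standard library. The sample markup, the strong tag around the keyword, and the helper name clean_text are assumptions made for illustration; the real page may be structured differently. Reading only the text nodes directly inside the link drops the highlighted keyword, while walking all descendants recovers the full name.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical markup: the search keyword is wrapped in its own child tag.
SAMPLE = "<a href='/company/0'>\n  <strong>A</strong> EXAMPLE\n  LTD\n</a>"


def clean_text(fragments):
    """Join text fragments and collapse each whitespace run to one space."""
    return re.sub(r"\s+", " ", "".join(fragments)).strip()


link = ET.fromstring(SAMPLE)

# Text nodes directly inside <a> only (roughly what 'a::text' selects).
direct_fragments = [link.text or ""] + [child.tail or "" for child in link]
direct_only = clean_text(direct_fragments)

# Text of <a> and all of its descendants (roughly what 'a ::text' selects).
all_text = clean_text(link.itertext())

print(repr(direct_only))
print(repr(all_text))
```

The same join-then-normalize step is what the re.sub call in the spider performs on the list returned by extract().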
Answered By - Pankaj