Issue
I created the following spider, which shows two issues when I run it:
- The headline is cut off, probably because of the <em> tag inside it.
- The location contains surrounding spaces and \n characters.
I am currently struggling to find a solution for these two remaining issues.
import scrapy


class GitHubSpider(scrapy.Spider):
    name = "github"
    start_urls = [
        "https://github.com/search?p=1&q=React+Django&type=Users",
    ]

    def parse(self, response):
        # Each ".Box-row" element is one user in the search results.
        for github in response.css(".Box-row"):
            yield {
                "github_link": github.css(".mr-1::attr(href)").get(),
                "name": github.css(".mr-1::text").get(),
                "headline": github.css(".mb-1::text").get(),
                "location": github.css(".mr-3:nth-child(1)::text").get(),
            }
Expected result
# 2021-08-07 11:59:29 [scrapy.core.scraper] DEBUG: Scraped from <200 https://github.com/search?p=1&q=React+Django&type=Users>
{
'github_link': '/djangofan',
'name': 'Jon Austen',
'headline': 'Software Engineer interested in Java, Python, Ruby, Groovy, Bash, Clojure, React-Native, and Docker. Focus: Testing, CI, and Micro-Services.',
'location': 'Portland, OR'
}
# 2021-08-07 11:59:29 [scrapy.core.scraper] DEBUG: Scraped from <200 https://github.com/search?p=1&q=React+Django&type=Users>
{
'github_link': '/django-wong',
'name': ' Wong',
'headline': 'PHP / Node.js / Dart (Flutter) / React Native / Scala',
'location': 'China'
}
[...]
Actual result
# 2021-08-07 11:59:29 [scrapy.core.scraper] DEBUG: Scraped from <200 https://github.com/search?p=1&q=React+Django&type=Users>
{
'github_link': '/djangofan',
'name': 'Jon Austen',
'headline': 'Software Engineer interested in Java, Python, Ruby, Groovy, Bash, Clojure, ',
'location': '\n Portland, OR\n '
}
# 2021-08-07 11:59:29 [scrapy.core.scraper] DEBUG: Scraped from <200 https://github.com/search?p=1&q=React+Django&type=Users>
{
'github_link': '/django-wong',
'name': ' Wong',
'headline': 'PHP / Node.js / Dart (Flutter) / ',
'location': '\n China\n '
}
[...]
Solution
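The first problem can be fixed with XPath and string(). The ::text selector combined with .get() returns only the first text node, which ends where the <em> tag begins, whereas string() concatenates the text of the element and all of its descendants.
The second problem can be fixed with strip(), which removes the leading and trailing whitespace, including the \n characters.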
import scrapy


class GitHubSpider(scrapy.Spider):
    name = "github"
    start_urls = [
        "https://github.com/search?p=1&q=React+Django&type=Users",
    ]

    def strip_string(self, string):
        # Strip surrounding whitespace; None (no match) is passed through unchanged.
        if string is not None:
            return string.strip()

    def parse(self, response):
        for github in response.css(".Box-row"):
            github_link = self.strip_string(github.css(".mr-1::attr(href)").get())
            name = self.strip_string(github.css(".mr-1::text").get())
            # string() joins all text inside the <p>, including the <em> parts.
            # The leading "." keeps the XPath relative to the current row; a bare
            # "//p" would match the first <p> of the whole page for every row.
            headline = self.strip_string(github.xpath('string(.//p[@class="mb-1"])').get())
            location = self.strip_string(github.css(".mr-3:nth-child(1)::text").get())
            yield {
                "github_link": github_link,
                "name": name,
                "headline": headline,
                "location": location,
            }
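To see why the headline gets cut without running the spider, here is a small sketch using only the standard library. It is not Scrapy code: the HTML fragment is made up to imitate one result row (the real GitHub markup may differ), and ElementTree's `.text` and `itertext()` stand in for "first text node" and XPath `string()` respectively.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment imitating one search result row (assumed markup).
ROW = """
<div class="Box-row">
  <p class="mb-1">Clojure, <em>React</em>-Native, and Docker.</p>
  <div class="mr-3">
    Portland, OR
  </div>
</div>
""".strip()

row = ET.fromstring(ROW)
p = row.find("./p[@class='mb-1']")
div = row.find("./div[@class='mr-3']")

# Only the text before the first child element, comparable to taking
# the first result of a ::text selection.
first_text_node = p.text

# All text of the element and its descendants, comparable to XPath string().
full_text = "".join(p.itertext())

# strip() removes the surrounding newlines and spaces.
location = div.text.strip()

print(repr(first_text_node))  # 'Clojure, '
print(repr(full_text))        # 'Clojure, React-Native, and Docker.'
print(repr(location))         # 'Portland, OR'
```

The first value stops right before the <em> tag, which matches the truncated headlines in the actual result above.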
Answered By - SuperUser