Issue
I am a beginner with the Scrapy framework and I have two questions/problems:
- I made a scrapy.Spider for a website, but it stops after 960 retrieved elements. How can I raise this limit? I need to retrieve about 1600 elements.
- Is it possible to run Scrapy indefinitely by adding a waiting time to each scrapy.Spider run, as sketched below?
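What I mean by a waiting time is something like Scrapy's DOWNLOAD_DELAY setting, if that is the right tool for this (the 2-second value is arbitrary):

    class Pathfinder2Spider(scrapy.Spider):
        name = "Pathfinder2"
        # Wait about 2 seconds between consecutive requests; with
        # RANDOMIZE_DOWNLOAD_DELAY (on by default) the actual delay
        # varies between 0.5x and 1.5x this value.
        custom_settings = {"DOWNLOAD_DELAY": 2}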
UPDATED
import scrapy

# RE_LEVEL, RE_COMPONENTS and RE_RESISTANCE are compiled regular
# expressions (their definitions are omitted here).

class Spell(scrapy.Item):
    name = scrapy.Field()
    level = scrapy.Field()
    components = scrapy.Field()
    resistance = scrapy.Field()

class Pathfinder2Spider(scrapy.Spider):
    name = "Pathfinder2"
    allowed_domains = ["d20pfsrd.com"]
    start_urls = ["https://www.d20pfsrd.com/magic/spell-lists-and-domains/spell-lists-sorcerer-and-wizard/"]

    def parse(self, response):
        # Recover all of the wizard's spell links
        spells_links = response.xpath('//div/table/tbody/tr/td/a[has-class("spell")]')
        print("len(spells_links) : ", len(spells_links))
        for spell_link in spells_links:
            url = spell_link.xpath('@href').get().strip()
            # Recover all of the spell's information
            yield response.follow(url, self.parse_spell)

    def parse_spell(self, response):
        # Get the full content of the spell page
        article = response.xpath('//article[has-class("magic")]')
        contents = article.xpath('.//div[has-class("article-content")]')
        # Extract the useful information
        all_names = article.xpath("h1/text()").getall()
        all_contents = contents.get()
        all_levels = RE_LEVEL.findall(all_contents)
        all_components = RE_COMPONENTS.findall(all_contents)
        all_resistances = RE_RESISTANCE.findall(all_contents)
        for name, level, components, resistance in zip(all_names, all_levels, all_components, all_resistances):
            # Treatment here ...
            yield Spell(
                name=name,
                level=level,
                components=components,
                resistance=resistance,
            )
There are about 1600 links in total:
len(spells_links) : 1565
BUT only 953 were scraped:
'httperror/response_ignored_count': 2,
'httperror/response_ignored_status_count/404': 2,
'item_scraped_count': 953,
I run the spider with this command:
scrapy crawl Pathfinder2 -O XXX.json
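The same crawl can also be started from a plain Python script, which would make it easier to schedule repeated runs externally, e.g. with cron (a sketch, assuming Scrapy >= 2.4 for the overwrite feed option):

    from scrapy.crawler import CrawlerProcess

    process = CrawlerProcess(settings={
        # Same effect as "-O XXX.json": JSON feed, overwriting any previous file
        "FEEDS": {"XXX.json": {"format": "json", "overwrite": True}},
    })
    process.crawl(Pathfinder2Spider)
    process.start()  # blocks until the crawl finishes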
Thank you in advance!
Solution
First, check the number of URLs:
In [3]: len(response.xpath("//span[@id='ctl00_MainContent_DataListTypes_ctl00_LabelName']/b/a"))
Out[3]: 1073
So you have 1073 URLs; each one of them is a "spell" page, which gives a total of 1073 spells, not 2000.
After running your code I get this:
'downloader/request_count': 1074,
'downloader/request_method_count/GET': 1074,
'downloader/response_bytes': 11368517,
'downloader/response_count': 1074,
'downloader/response_status_count/200': 1074,
'elapsed_time_seconds': 31.657692,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2021, 9, 29, 7, 17, 2, 877042),
'httpcompression/response_bytes': 31520000,
'httpcompression/response_count': 1074,
'item_scraped_count': 1073,
It scraped 1073 items, so there is no problem with the spider itself.
BUT I removed this part:
all_levels = RE_LEVEL.findall(all_contents)
all_components = RE_COMPONENTS.findall(all_contents)
all_resistances = RE_RESISTANCE.findall(all_contents)
If you get errors, check this part again.
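A plausible reason this part costs you items: zip() stops at its shortest argument, so any page where one of the patterns finds nothing produces no item at all. A sketch of a more tolerant parse_spell, assuming each of your compiled patterns has a single capture group:

    def parse_spell(self, response):
        article = response.xpath('//article[has-class("magic")]')
        content = article.xpath('.//div[has-class("article-content")]').get() or ""

        def first_match(pattern, text, default=None):
            # Return the first capture of the pattern, or a default
            # instead of silently dropping the whole item.
            m = pattern.search(text)
            return m.group(1) if m else default

        yield Spell(
            name=article.xpath("h1/text()").get(),
            level=first_match(RE_LEVEL, content),
            components=first_match(RE_COMPONENTS, content),
            resistance=first_match(RE_RESISTANCE, content),
        )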
EDIT:
Some of the links appear more than once, so the number of links is bigger than the number of items. Scrapy's scheduler filters out duplicate requests by default, which is why the number of scraped items is lower than the number of links on the page.
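You can confirm this in scrapy shell after fetching the spell-list page, by comparing the total number of hrefs with the number of unique ones (using the same XPath as the spider); the two counts will differ if there are duplicates:

In [4]: links = response.xpath('//div/table/tbody/tr/td/a[has-class("spell")]/@href').getall()
In [5]: len(links), len(set(links))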
Answered By - SuperUser