Issue
I am using Scrapy to crawl a website and extract data from it. Scrapy uses regex-based rules to decide whether a page should be parsed or a link should be followed.
I am implementing a resume feature for my spider so that it can continue crawling from the last visited page. To do this, I read the last followed link from a database when the spider is launched.
My site's URLs look like http://foobar.com/page1.html, so the rule's regex to follow every such link would usually be something like /page\d+\.html.
But how can I write a regex that matches only, for example, page 15 and above? Also, since I don't know the starting point in advance, how can I generate this regex at runtime?
Solution
Try this:
import re

def digit_match_greater(n):
    digits = str(n)
    variations = []
    # Anything with more than len(digits) digits is a match:
    variations.append(r"\d{%d,}" % (len(digits)+1))
    # Now match numbers with len(digits) digits.
    # (Generate, e.g., for 15, "1[6-9]", "[2-9]\d")
    # 9s can be skipped -- e.g. for >19 we only need [2-9]\d.
    for i, d in enumerate(digits):
        if d != "9":
            pattern = list(digits)
            pattern[i] = "[%d-9]" % (int(d) + 1)
            for j in range(i+1, len(digits)):
                pattern[j] = r"\d"
            variations.append("".join(pattern))
    return "(?:%s)" % "|".join("(?:%s)" % v for v in variations)
It turned out to be easier to match numbers strictly greater than the parameter, so if you give it 15, it returns a pattern that matches 16 and above, specifically:
(?:(?:\d{3,})|(?:[2-9]\d)|(?:1[6-9]))
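One way to gain confidence in the generated pattern is to compare it against a plain integer comparison over a range of values. The sketch below repeats the function so that it runs on its own; the helper name matches_greater and the tested ranges are illustrative assumptions, not part of the original answer.

```python
import re


def digit_match_greater(n):
    # Same function as in the answer, repeated so this snippet is self-contained.
    digits = str(n)
    variations = [r"\d{%d,}" % (len(digits) + 1)]
    for i, d in enumerate(digits):
        if d != "9":
            pattern = list(digits)
            pattern[i] = "[%d-9]" % (int(d) + 1)
            for j in range(i + 1, len(digits)):
                pattern[j] = r"\d"
            variations.append("".join(pattern))
    return "(?:%s)" % "|".join("(?:%s)" % v for v in variations)


def matches_greater(n, candidate):
    # Hypothetical helper: True if the decimal string of `candidate`
    # fully matches the pattern generated for `n`.
    return re.fullmatch(digit_match_greater(n), str(candidate)) is not None


# Brute-force comparison against ordinary integer comparison.
for n in range(0, 120):
    for candidate in range(0, 1300):
        assert matches_greater(n, candidate) == (candidate > n), (n, candidate)

print(digit_match_greater(15))
```

One caveat: the check above only uses numbers without leading zeros. A zero-padded string such as "007" would match the \d{3,} branch generated for 15, so the pattern is only reliable if the site does not pad its page numbers.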
You can then substitute this into your expression in place of \d+, like so:
exp = re.compile(r"page%s\.html" % digit_match_greater(last_page_visited))
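To make the substitution concrete, here is a small sketch that filters a list of URLs with the compiled expression. The URL list and the value of last_page_visited are made up for illustration (the question reads the starting point from a database), and the function is repeated so that the snippet runs on its own.

```python
import re


def digit_match_greater(n):
    # Same function as in the answer, repeated so this snippet is self-contained.
    digits = str(n)
    variations = [r"\d{%d,}" % (len(digits) + 1)]
    for i, d in enumerate(digits):
        if d != "9":
            pattern = list(digits)
            pattern[i] = "[%d-9]" % (int(d) + 1)
            for j in range(i + 1, len(digits)):
                pattern[j] = r"\d"
            variations.append("".join(pattern))
    return "(?:%s)" % "|".join("(?:%s)" % v for v in variations)


# Illustrative value; in the question this comes from a database.
last_page_visited = 15

exp = re.compile(r"page%s\.html" % digit_match_greater(last_page_visited))

# Hypothetical URLs in the format described in the question.
urls = [
    "http://foobar.com/page1.html",
    "http://foobar.com/page15.html",
    "http://foobar.com/page16.html",
    "http://foobar.com/page20.html",
    "http://foobar.com/page123.html",
]

# Keep only the pages numbered above the last visited one.
to_follow = [u for u in urls if exp.search(u)]
print(to_follow)
```

Because the number sits between the literal "page" and "\.html", the expression cannot match just part of a page number. If the last visited page itself should be crawled again, passing last_page_visited - 1 is one option, assuming the value is at least 1.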
Answered By - Martin Stone