Issue
I have some local HTML files and need to extract elements from them. I am used to writing Scrapy spiders and extracting elements with its built-in selectors: xpath(), css(), .extract(), and .extract_first(). Is there a library that can do this? I have checked BeautifulSoup and lxml, but their syntax is different from Scrapy's.
For example, I'd like to do something like this:
sample_file = "../raw_html_text/sample.html"
with open(sample_file, 'r', encoding='utf-8-sig', newline='') as f:
    page = f.read()
html_object = # convert string to html or something
print(html_object.css("h2 ::text").extract_first())
Solution
I usually import Scrapy's selectors into other projects since I like them so much. Just import the Selector class and pass it a string, and it will work just like it does in Scrapy.
from scrapy import Selector

sample_file = "../raw_html_text/sample.html"
with open(sample_file, 'r', encoding='utf-8-sig', newline='') as f:
    page = f.read()

data = Selector(text=page)
title = data.css('h2::text').get()
# used to be data.css('h2::text').extract_first()
Answered By - ThePyGuy