Issue
I am looking for a framework in Python to extract key information from thousands of different websites, such as "office location", "CEO", etc. Ideally, the script would read in a website URL, identify some "key terms" such as "location", "offices", "team members", etc., and print the corresponding values.
My only related experience with Scrapy is extracting information that follows a pattern on one specific webpage (e.g. extracting tables from Wikipedia), and I am not sure whether Scrapy or BeautifulSoup would work for this sort of project. I was wondering if Scrapy would be my best bet and, if so, what the correct syntax would be for this type of project. I've already tried some variations of
import scrapy
from bs4 import BeautifulSoup


class OurfirstbotSpider(scrapy.Spider):
    name = 'ourfirstbot'
    start_urls = [
        'https://en.wikipedia.org/wiki/List_of_common_misconceptions',
    ]

    def parse(self, response):
        # Pair each section heading with a <ul> block by position.
        headings = response.css('.mw-headline').extract()
        datas = response.css('ul').extract()
        for item in zip(headings, datas):
            all_items = {
                # Name the parser explicitly so bs4 does not have to guess one.
                'headings': BeautifulSoup(item[0], 'html.parser').text,
                'datas': BeautifulSoup(item[1], 'html.parser').text,
            }
            yield all_items
to no avail, because each site has a different layout and none of them follows a common pattern. Any help would be appreciated.
Solution
I believe your definition of "key information" is not precise enough. How would you evaluate the performance of such a program or model? Once you have scraped the text from a webpage, you can pass it to a fine-tuned NER (named entity recognition) model to extract this kind of information.
For example, given a text, this NER model identifies four entity types in it with ~95% accuracy: location (LOC), organization (ORG), person (PER), and miscellaneous (MISC).
This model by the Stanford NLP group appears to recognize more types of entities:
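As a rough illustration of the "scrape the text, then run NER" idea, here is a minimal sketch. It is not the specific model mentioned above: `html_to_text`, `group_entities`, `extract_key_info` and `spacy_ner` are hypothetical helper names, and the `ner` argument is assumed to be any callable that maps text to `(entity_text, label)` pairs. The spaCy backend is only one possible choice and assumes spaCy and its `en_core_web_sm` model are installed.

```python
from collections import defaultdict
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html):
    """Return the text of an HTML page as one space-joined string."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def group_entities(entities):
    """Group (text, label) pairs into {label: [unique texts, in order seen]}."""
    grouped = defaultdict(list)
    for text, label in entities:
        if text not in grouped[label]:
            grouped[label].append(text)
    return dict(grouped)


def extract_key_info(html, ner):
    """Run `ner` (a callable: text -> iterable of (text, label)) on the page text."""
    return group_entities(ner(html_to_text(html)))


def spacy_ner(text):
    """One possible backend (assumption: spaCy and en_core_web_sm are installed)."""
    import spacy  # imported lazily so the rest of the sketch has no dependencies

    nlp = spacy.load("en_core_web_sm")
    return [(ent.text, ent.label_) for ent in nlp(text).ents]
```

Inside a Scrapy `parse` callback this could be called as `extract_key_info(response.text, spacy_ner)`. In practice you would load the model once rather than on every call, and the label names (for example `ORG` or `PERSON`) depend on whichever model you plug in.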
Adding the regexner annotator and using the supplied RegexNER pattern files adds support for the fine-grained and additional entity classes EMAIL, URL, CITY, STATE_OR_PROVINCE, COUNTRY, NATIONALITY, RELIGION, (job) TITLE, IDEOLOGY, CRIMINAL_CHARGE, CAUSE_OF_DEATH, (Twitter, etc.) HANDLE (12 classes) for a total of 24 classes.
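To give a feel for what rule-based classes such as EMAIL and URL involve, here is a small sketch using only Python's `re` module. It is not CoreNLP's RegexNER and does not use its pattern files; the `PATTERNS` table and `tag_with_rules` function are hypothetical, and the two regular expressions are deliberately simplistic.

```python
import re

# Hypothetical, simplified patterns; real email and URL matching is messier.
PATTERNS = [
    ("EMAIL", re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")),
    ("URL", re.compile(r"https?://[^\s<>\"']+")),
]


def tag_with_rules(text):
    """Return (matched_text, label) pairs for every pattern hit in `text`."""
    found = []
    for label, pattern in PATTERNS:
        for match in pattern.finditer(text):
            found.append((match.group(0), label))
    return found
```

Rules like these can complement a statistical NER model: the model handles open-ended classes such as people and organizations, while patterns cover entities with a predictable surface form.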
Answered By - tozCSS