Sunday, December 31, 2023

[FIXED] Framework in Python for extracting key information from websites

Labels: beautifulsoup, python, scrapy, web-scraping

Issue

I am looking for a framework in Python to extract key information from thousands of different websites, such as "office location", "CEO", etc. Ideally, the script would read in a website URL, identify some "key terms" such as "location", "offices", "team members", etc., and print the corresponding values.

My only related experience is using Scrapy to extract information that follows a pattern on one specific webpage (e.g. extracting tables from Wikipedia), and I am not sure whether Scrapy or BeautifulSoup would work for this sort of project. I was wondering if Scrapy would be my best bet and, if so, what the correct syntax would be for this type of project. I've already tried some variations of

import scrapy
from bs4 import BeautifulSoup


class OurfirstbotSpider(scrapy.Spider):
    name = 'ourfirstbot'
    start_urls = [
        'https://en.wikipedia.org/wiki/List_of_common_misconceptions',
    ]

    def parse(self, response):
        # Raw HTML of every section heading and every <ul> on the page
        headings = response.css('.mw-headline').extract()
        datas = response.css('ul').extract()

        # Pair headings with lists by position and strip the markup
        for heading, data in zip(headings, datas):
            yield {
                'headings': BeautifulSoup(heading, 'html.parser').text,
                'datas': BeautifulSoup(data, 'html.parser').text,
            }

to no avail, because each site has a different layout and none of them follow a specific pattern. Any help would be appreciated.

Solution

I believe your definition of "key information" is not precise enough yet. How would you evaluate the performance of such a program or model? One workable approach: once you have scraped the text from a webpage, you can pass it to a fine-tuned named-entity recognition (NER) model to extract this kind of information.
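The scrape-then-tag idea can be sketched roughly as follows. This is a minimal illustration, not a definitive implementation: it uses the standard library's html.parser instead of Scrapy or BeautifulSoup so that it runs without extra packages, and the names VisibleTextParser, html_to_text, group_entities and the tagger argument are all hypothetical. The tagger is assumed to be any callable that wraps the NER model of your choice and returns (entity_text, label) pairs.

```python
from collections import defaultdict
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside skipped tags
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html):
    """Return the visible text of an HTML string, layout-independent."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def group_entities(text, tagger):
    """Group entities by label.

    `tagger` is a hypothetical callable wrapping an NER model; it is
    assumed to return an iterable of (entity_text, label) pairs.
    """
    grouped = defaultdict(list)
    for entity, label in tagger(text):
        if entity not in grouped[label]:  # drop repeated mentions
            grouped[label].append(entity)
    return dict(grouped)
```

Because the text is extracted without relying on any site-specific selectors, the same two functions can be applied to every page; only the quality of the tagger determines what comes out.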

For example, given a text, this NER model recognizes four entity types in it with ~95% accuracy: location (LOC), organization (ORG), person (PER), and miscellaneous (MISC).

This model by the Stanford NLP group appears to recognize more types of entities:

Adding the regexner annotator and using the supplied RegexNER pattern files adds support for the fine-grained and additional entity classes EMAIL, URL, CITY, STATE_OR_PROVINCE, COUNTRY, NATIONALITY, RELIGION, (job) TITLE, IDEOLOGY, CRIMINAL_CHARGE, CAUSE_OF_DEATH, (Twitter, etc.) HANDLE (12 classes) for a total of 24 classes.
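To give a feel for what a pattern-based pass adds on top of a statistical tagger, here is a deliberately simplified stand-in written with Python's re module. It is not Stanford's RegexNER and does not use its pattern files; the PATTERNS table and the tag_with_patterns function are hypothetical names, and the two regular expressions are rough approximations that cover only the EMAIL and URL classes.

```python
import re

# Rough, illustrative patterns; real-world emails and URLs are messier.
PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "URL": re.compile(r"https?://[^\s<>\"']+"),
}


def tag_with_patterns(text):
    """Return (entity_text, label) pairs in order of appearance."""
    found = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            found.append((match.start(), match.group(0), label))
    found.sort()  # order by position in the text
    # Note: a URL directly followed by punctuation will include it.
    return [(entity, label) for _, entity, label in found]


if __name__ == "__main__":
    sample = "Contact info@example.com or visit https://example.com/about for details."
    for entity, label in tag_with_patterns(sample):
        print(label, entity)
```

Rules like these are cheap and predictable for well-structured classes such as emails and URLs, while classes like "CEO" or "office location" usually need a trained model or much richer patterns.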

Answered By - tozCSS

This answer was collected from Stack Overflow and tested by PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
