Issue
Python newbie here, working on my first web scraping/word frequency analysis using BeautifulSoup and NLTK.
I'm scraping the Texas Department of Criminal Justice archive of offenders' last statements.
I've gotten to the point where I can extract the text I want to analyze from each offender's page and tokenize the words of every paragraph, but this gives me a separate list of tokenized words per paragraph. I want to combine those lists into a single list of tokenized words per offender.
I initially thought .join would solve the problem, but I'm still getting one list per paragraph. I've also tried itertools, with no luck.
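To make the goal concrete, here's the shape of the problem on made-up data (not the real scraped text):

# What I get now: one token list per paragraph (made-up data for illustration)
per_paragraph = [['spoken', 'words'], ['more', 'words'], ['final', 'words']]

# What I want: one flat list of tokens for the whole statement,
# e.g. ['spoken', 'words', 'more', 'words', 'final', 'words']
combined = [token for paragraph in per_paragraph for token in paragraph]
print(combined)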
Here's all of the code to find the most common word in an offender's statement, but it is returning the most common word from each paragraph. Any help would be greatly appreciated!
from bs4 import BeautifulSoup
import urllib.request
import re
import nltk
from nltk import FreqDist
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords

resp = urllib.request.urlopen("https://www.tdcj.state.tx.us/death_row/dr_executed_offenders.html")
soup = BeautifulSoup(resp, "lxml", from_encoding=resp.info().get_param('charset'))

for link in soup.find_all('a', href=re.compile('last'))[1:2]:
    lastlist = 'https://www.tdcj.state.tx.us/death_row/' + link['href']
    resp2 = urllib.request.urlopen(lastlist)
    soup2 = BeautifulSoup(resp2, "lxml", from_encoding=resp2.info().get_param('charset'))
    body = soup2.body

    for paragraph in body.find_all('p')[4:5]:
        name = paragraph.text
        print(name)

    for paragraph in body.find_all('p')[6:]:
        tokens = word_tokenize(paragraph.text)
        addWords = ['I', ',', 'Yes', '.', "'m", "n't", '?', ':', 'None', 'To', 'would', "y'all", ')', 'Last', "'s"]
        stopWords = set(stopwords.words('english') + addWords)
        wordsFiltered = []
        for w in tokens:
            if w not in stopWords:
                wordsFiltered.append(w)
        fdist1 = FreqDist(wordsFiltered)
        common = fdist1.most_common(1)
        print(common)
Solution
from bs4 import BeautifulSoup
import urllib.request
import re
import nltk
from nltk import FreqDist
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords

resp = urllib.request.urlopen("https://www.tdcj.state.tx.us/death_row/dr_executed_offenders.html")
soup = BeautifulSoup(resp, "lxml", from_encoding=resp.info().get_param('charset'))

# Declare the accumulator once, outside the loops, so tokens from every
# paragraph end up in the same list
wordsFiltered = []
stopwords_list = stopwords.words('english')

for link in soup.find_all('a', href=re.compile('last'))[1:2]:
    lastlist = 'https://www.tdcj.state.tx.us/death_row/' + link['href']
    resp2 = urllib.request.urlopen(lastlist)
    soup2 = BeautifulSoup(resp2, "lxml", from_encoding=resp2.info().get_param('charset'))
    body = soup2.body

    # This paragraph holds the offender's name
    for paragraph in body.find_all('p')[4:5]:
        name = paragraph.text
        print(name)

    # The remaining paragraphs hold the last statement
    for paragraph in body.find_all('p')[6:]:
        tokens = word_tokenize(paragraph.text)
        addWords = ['I', ',', 'Yes', '.', "'m", "n't", '?', ':', 'None', 'To', 'would', "y'all", ')', 'Last', "'s"]
        stopWords = set(stopwords_list + addWords)
        for w in tokens:
            if w not in stopWords:
                wordsFiltered.append(w)

    # Compute the frequency distribution once per statement, after all
    # paragraphs have been processed
    fdist1 = FreqDist(wordsFiltered)
    common = fdist1.most_common(1)
    print(common)
I have edited your code to get the most common word per statement. Feel free to comment if you don't understand something. Also, keep in mind that if you are appending to a list across iterations, you should not declare it inside the loop, because that resets it on every pass.
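As a small, made-up example of that last point:

# Re-declared inside the loop: the list is reset on every pass,
# so only the last batch survives
for batch in [['a', 'b'], ['c']]:
    collected = []
    collected.extend(batch)
print(collected)  # ['c']

# Declared once before the loop: results accumulate across passes
collected = []
for batch in [['a', 'b'], ['c']]:
    collected.extend(batch)
print(collected)  # ['a', 'b', 'c']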
Answered By - Devaraj Phukan