Issue
Python newbie here, working on my first web scraping/word frequency analysis using BeautifulSoup and NLTK.
I'm scraping the Texas Department of Criminal Justice archive of offenders' last statements.
I've gotten to the point where I can extract the text I want to analyze from each offender's page and tokenize the words of every paragraph, but this gives me a separate list of tokenized words per paragraph. I want to combine those lists into a single list of tokenized words per offender.
I initially thought .join would solve the problem, but I'm still getting one list per paragraph. I've also tried itertools, with no luck.
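To make the goal concrete, here's the shape of the problem on made-up data (not the real scraped text):

# What I get now: one token list per paragraph (made-up data for illustration)
per_paragraph = [['spoken', 'words'], ['more', 'words'], ['final', 'words']]

# What I want: one flat list of tokens for the whole statement,
# e.g. ['spoken', 'words', 'more', 'words', 'final', 'words']
combined = [token for paragraph in per_paragraph for token in paragraph]
print(combined)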
Here's all of the code to find the most common word in an offender's statement, but it is returning the most common word from each paragraph. Any help would be greatly appreciated!
from bs4 import BeautifulSoup
import urllib.request
import re
import nltk
from nltk import FreqDist
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords

resp = urllib.request.urlopen("https://www.tdcj.state.tx.us/death_row/dr_executed_offenders.html")
soup = BeautifulSoup(resp, "lxml", from_encoding=resp.info().get_param('charset'))

for link in soup.find_all('a', href=re.compile('last'))[1:2]:
    lastlist = 'https://www.tdcj.state.tx.us/death_row/' + link['href']
    resp2 = urllib.request.urlopen(lastlist)
    soup2 = BeautifulSoup(resp2, "lxml", from_encoding=resp2.info().get_param('charset'))
    body = soup2.body

    for paragraph in body.find_all('p')[4:5]:
        name = paragraph.text
        print(name)

    for paragraph in body.find_all('p')[6:]:
        tokens = word_tokenize(paragraph.text)
        addWords = ['I', ',', 'Yes', '.', "'m", "n't", '?', ':', 'None', 'To', 'would', "y'all", ')', 'Last', "'s"]
        stopWords = set(stopwords.words('english') + addWords)
        wordsFiltered = []
        for w in tokens:
            if w not in stopWords:
                wordsFiltered.append(w)
        fdist1 = FreqDist(wordsFiltered)
        common = fdist1.most_common(1)
        print(common)
Solution
from bs4 import BeautifulSoup
import urllib.request
import re
import nltk
from nltk import FreqDist
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords

resp = urllib.request.urlopen("https://www.tdcj.state.tx.us/death_row/dr_executed_offenders.html")
soup = BeautifulSoup(resp, "lxml", from_encoding=resp.info().get_param('charset'))

# Declare the accumulator once, outside the loops, so tokens from every
# paragraph end up in the same list
wordsFiltered = []
stopwords_list = stopwords.words('english')

for link in soup.find_all('a', href=re.compile('last'))[1:2]:
    lastlist = 'https://www.tdcj.state.tx.us/death_row/' + link['href']
    resp2 = urllib.request.urlopen(lastlist)
    soup2 = BeautifulSoup(resp2, "lxml", from_encoding=resp2.info().get_param('charset'))
    body = soup2.body

    # This paragraph holds the offender's name
    for paragraph in body.find_all('p')[4:5]:
        name = paragraph.text
        print(name)

    # The remaining paragraphs hold the last statement
    for paragraph in body.find_all('p')[6:]:
        tokens = word_tokenize(paragraph.text)
        addWords = ['I', ',', 'Yes', '.', "'m", "n't", '?', ':', 'None', 'To', 'would', "y'all", ')', 'Last', "'s"]
        stopWords = set(stopwords_list + addWords)
        for w in tokens:
            if w not in stopWords:
                wordsFiltered.append(w)

    # Compute the frequency distribution once per statement, after all
    # paragraphs have been processed
    fdist1 = FreqDist(wordsFiltered)
    common = fdist1.most_common(1)
    print(common)
I have edited your code to get the most common word per statement. Feel free to comment if you don't understand something. Also, keep in mind that if you are appending to a list across iterations, you should not declare it inside the loop, because that resets it on every pass.
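As a small, made-up example of that last point:

# Re-declared inside the loop: the list is reset on every pass,
# so only the last batch survives
for batch in [['a', 'b'], ['c']]:
    collected = []
    collected.extend(batch)
print(collected)  # ['c']

# Declared once before the loop: results accumulate across passes
collected = []
for batch in [['a', 'b'], ['c']]:
    collected.extend(batch)
print(collected)  # ['a', 'b', 'c']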
Answered By - Devaraj Phukan