Issue
The idea is to check the last 3 pages of a German medical news site. Each of these pages contains 5 links to separate articles. The program checks whether the "href" of each link already exists in data.csv. If not, it adds the "href" to data.csv, follows the link and saves the content to an .html file.
The content of each article page is:
<html>
..
..
<div class="newstext">
<p> article-piece 1</p>
<p> article-piece 2</p>
<p> article-piece 3</p>
<div class="URLkastenWrapper">
<div class="newsKasten URLkasten newsKastenLinks">
<p> not wanted stuff</p>
</div>
</div>
<p> article-piece 4</p>
<p> article-piece 5</p>
</div>
I want to save the article pieces to the HTML file and exclude the "not wanted stuff".
I tried to use recursive=False
as shown in my code.
As far as my research goes, this is the way to go to reach my goal, right?
But for some reason, it does not work :(
import requests
from bs4 import BeautifulSoup
import mechanicalsoup

# this requests the first 3 news pages; each of them contains 5 articles
scan_med_news = ['https://www.aerzteblatt.de/nachrichten/Medizin?page=1', 'https://www.aerzteblatt.de/nachrichten/Medizin?page=2', 'https://www.aerzteblatt.de/nachrichten/Medizin?page=3']

# This function is meant to create an html file with the article pieces of the website.
def article_html_create(title, url):
    with open(title+'.html', 'a+') as article:
        article.write('<h1>'+title+'</h1>\n\n')
        subpage = BeautifulSoup(requests.get(url).text, 'html5lib')
        for line in subpage.select('.newstext p', recursive=False):
            # this recursive=False is not working as I wish
            article.write(line.text+'<br><br>')

# this piece of code reads the URLs of already saved articles from the .csv and puts them in a list
contentlist = []
with open('data.csv', "r") as file:
    for line in file:
        for item in line.strip().split(','):
            contentlist.append(item)

# for every article on these pages, it checks if the url is in the contentlist created from data.csv
with open('data.csv', 'a') as file:
    for page in scan_med_news:
        doc = requests.get(page)
        doc.encoding = 'utf-8'
        soup = BeautifulSoup(doc.text, 'html5lib')
        for h2 in soup.find_all('h2'):
            for a in h2.find_all('a'):
                if a['href'] in contentlist:
                    # if the url is already in the list, it prints "Already existing"
                    print('Already existing')
                else:
                    # if the url is not in the list, it adds the url to data.csv and calls article_html_create to save the content of the article
                    file.write(a['href']+',')
                    article_html_create(a.text, 'https://www.aerzteblatt.de'+a['href'])
                    print('Added to the file!')
Solution
First, a note on why your attempt fails: recursive=False is an argument of find()/find_all(); select() takes a CSS selector and does not honor it, so it has no effect there.
Instead, you can select the parent div
node of the not-wanted p
node and set its string
property to an empty string, which removes that parent's children from the soup. Then you can make your selections in the regular manner.
Example:
In [17]: soup = BeautifulSoup(html, 'lxml')
In [18]: soup
Out[18]:
<html><body><div class="newstext">
<p> article-piece 1</p>
<p> article-piece 2</p>
<p> article-piece 3</p>
<div class="URLkastenWrapper">
<div class="newsKasten URLkasten newsKastenLinks">
<p> not wanted stuff</p>
</div>
</div>
<p> article-piece 4</p>
<p> article-piece 5</p>
</div></body></html>
In [19]: soup.select_one('.URLkastenWrapper').string = ''
In [20]: soup
Out[20]:
<html><body><div class="newstext">
<p> article-piece 1</p>
<p> article-piece 2</p>
<p> article-piece 3</p>
<div class="URLkastenWrapper"></div>
<p> article-piece 4</p>
<p> article-piece 5</p>
</div></body></html>
In [21]: soup.select('.newstext p')
Out[21]:
[<p> article-piece 1</p>,
<p> article-piece 2</p>,
<p> article-piece 3</p>,
<p> article-piece 4</p>,
<p> article-piece 5</p>]
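As an alternative sketch (not part of the original answer), Tag.decompose() removes the unwanted subtree from the tree entirely instead of just emptying it, which leaves no stray empty div behind; the html string below is just the markup from the question:

```python
from bs4 import BeautifulSoup

html = '''
<div class="newstext">
<p> article-piece 1</p>
<p> article-piece 2</p>
<p> article-piece 3</p>
<div class="URLkastenWrapper">
<div class="newsKasten URLkasten newsKastenLinks">
<p> not wanted stuff</p>
</div>
</div>
<p> article-piece 4</p>
<p> article-piece 5</p>
</div>
'''

soup = BeautifulSoup(html, 'html.parser')

# decompose() detaches and destroys the unwanted subtree
soup.select_one('.URLkastenWrapper').decompose()

# a plain select now only finds the article paragraphs
pieces = [p.text.strip() for p in soup.select('.newstext p')]
print(pieces)
# ['article-piece 1', 'article-piece 2', 'article-piece 3', 'article-piece 4', 'article-piece 5']
```

In article_html_create you would call decompose() on the subpage soup right after parsing, before looping over subpage.select('.newstext p').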
Answered By - heemayl