Issue
I have some HTML text taken from Telegram. The HTML looks exactly like this and must not be edited:
🧹 <b>Процесс химчистки сидений:</b>
<b>1</b> - <i>Тщательно пропылесосить</i>
<b>2</b> - <i>Нанести химию и немного подождать, пока она поработает</i>
<i>
</i><b>3</b> - <i>Пройтись щеткой, чтобы поднять загрязнения</i>
<b>4</b> - <i>Тщательно прополоскать сидение экстрактором с подачей воды</i>
<i>
</i>#химчисткасалона #химчисткасидений
I need to turn it into the text below with BeautifulSoup. BeautifulSoup is optional, though; any other library is fine:
🧹 Процесс химчистки сидений:
1 - Тщательно пропылесосить
2 - Нанести химию и немного подождать, пока она поработает

3 - Пройтись щеткой, чтобы поднять загрязнения
4 - Тщательно прополоскать сидение экстрактором с подачей воды

#химчисткасалона #химчисткасидений
Here is some code I sketched out. I tried a lot of things, but none of them work, and I have run out of ideas.
from bs4 import BeautifulSoup as bs

def _html_to_text(html) -> str:
    hrefs = []
    soup = bs(html, 'lxml')
    for x in soup.find_all():
        if len(x.get_text(strip=True)) == 0:
            x.extract()
    a_tags = soup.find_all('a', href=True)
    for a_tag in a_tags:
        href = a_tag.get('href')
        if href in hrefs:
            continue
        hrefs.append(href)
        a_tag.append(f' ({href})')
    return soup.get_text()
The output of this function is this:
🧹 Процесс химчистки сидений:
1 - Тщательно пропылесосить
2 - Нанести химию и немного подождать, пока она поработает
3 - Пройтись щеткой, чтобы поднять загрязнения
4 - Тщательно прополоскать сидение экстрактором с подачей воды
#химчисткасалона #химчисткасидений
But some of the line breaks get lost. Does anyone know how to solve this?
Solution
I figured it out :)
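from bs4 import BeautifulSoup as bs

def _html_to_text(html) -> str:
    hrefs = []
    # Swap newlines for a placeholder so that elements holding only a
    # line break (such as <i>\n</i>) are not treated as empty below.
    # Assumes '||' never occurs in the input text itself.
    html = html.replace('\n', '||')
    soup = bs(html, 'lxml')
    # Remove elements that contain no text at all.
    for x in soup.find_all():
        if len(x.get_text(strip=True)) == 0:
            x.extract()
    # Append each distinct link target once, after the link text.
    a_tags = soup.find_all('a', href=True)
    for a_tag in a_tags:
        href = a_tag.get('href')
        if href in hrefs:
            continue
        hrefs.append(href)
        a_tag.append(f' ({href})')
    text = soup.get_text()
    # Restore the original line breaks.
    return text.replace('||', '\n')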
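As far as I can tell, the first version lost the line breaks because get_text(strip=True) returns an empty string for an element that holds only a newline, so the cleanup loop extracted that element together with its line break. Since any library is allowed, here is a minimal alternative sketch that uses only the standard library's html.parser and keeps every text node verbatim, so no placeholder is needed. The names _TextExtractor and html_to_text_stdlib are made up for this example. Unlike the function above, it never removes empty elements and it skips links with an empty href, so treat it as an illustration rather than a drop-in replacement.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Hypothetical helper: collects text nodes verbatim, whitespace included."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []            # pieces of output text, in document order
        self.seen_hrefs = set()    # link targets that were already appended
        self._open_links = []      # one entry per currently open <a> tag

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            href = dict(attrs).get('href')
            if href and href not in self.seen_hrefs:
                self.seen_hrefs.add(href)
                self._open_links.append(href)
            else:
                # Duplicate or missing href: nothing to append on close.
                self._open_links.append(None)

    def handle_endtag(self, tag):
        if tag == 'a' and self._open_links:
            href = self._open_links.pop()
            if href:
                self.parts.append(f' ({href})')

    def handle_data(self, data):
        # Keep the data untouched so newlines inside tags survive.
        self.parts.append(data)


def html_to_text_stdlib(html: str) -> str:
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return ''.join(parser.parts)
```

For example, html_to_text_stdlib('<b>1</b> - <i>one</i>\n<i>\n</i><b>2</b> - <i>two</i>') returns '1 - one\n\n2 - two', with the blank line intact.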
Answered By - nnekkitt