Issue
I'm parsing a text in which every word is wrapped in a link. The problem is that the punctuation marks are not part of the content of the <a> tags; they sit outside the tags, so I don't know how to capture the punctuation marks as well.
<table>
<tbody>
<tr>
<td>
<a href="#">Lorem</a>
", "
<a href="#">Ipsum</a>
": "
<a href="#">dolor</a>
"."
</td>
<td>...</td>
</tr>
<tr>
<td>
<a href="#">sit</a>
"? '"
<a href="#">amet</a>
"' "
<a href="#">consectetur</a>
"..."
</td>
<td>...</td>
</tr>
<tr>
<td>
<a href="#">adipisicing</a>
"-"
<a href="#">elit</a>
"; "
<a href="#">Molestias</a>
"!"
</td>
<td>...</td>
</tr>
</tbody>
</table>
Here's the parser:
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome(executable_path="...")
driver.get(url=...)
soup = BeautifulSoup(driver.page_source, 'html.parser')

words = []
for tableRows in soup.select("table > tbody > tr"):
    for word in tableRows.find("td").select("a"):
        words.append(word.text)
print(words)
Solution
The text between the a elements belongs to the parent td element itself.
You can therefore read the text directly from the enclosing element. The code below takes it from each table row, which includes the td cells:
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome(executable_path="...")
driver.get(url=...)
soup = BeautifulSoup(driver.page_source, 'html.parser')

words = []
for tableRow in soup.select("table > tbody > tr"):
    # .text joins every text node under the row (all of its cells),
    # so the punctuation between the links is included.
    words.append(tableRow.text)
print(words)
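To illustrate that the punctuation lives in text nodes directly under td, here is a minimal sketch that walks the children of one cell in document order. It assumes a static HTML string modelled on the first row of the question in place of driver.page_source, so no browser is needed; the names html, cell and tokens are illustrative only.

```python
from bs4 import BeautifulSoup, NavigableString

# Static sample mirroring one row from the question (assumption: the real
# page has the same shape; in the original code this comes from Selenium).
html = (
    '<table><tbody><tr><td>'
    '<a href="#">Lorem</a>, <a href="#">Ipsum</a>: <a href="#">dolor</a>.'
    '</td><td>...</td></tr></tbody></table>'
)

soup = BeautifulSoup(html, "html.parser")
cell = soup.select_one("table > tbody > tr > td")  # first cell of the row

tokens = []
for child in cell.children:
    if isinstance(child, NavigableString):
        # Bare text node sitting directly inside <td>: the punctuation.
        text = child.strip()
        if text:
            tokens.append(text)
    else:
        # An <a> element: its text is the word itself.
        tokens.append(child.get_text())

print(tokens)
# ['Lorem', ',', 'Ipsum', ':', 'dolor', '.']
```

Walking the children keeps words and punctuation as separate items without any string splitting, at the cost of handling one cell at a time.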
UPD
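If you want the pieces as separate items, you can split the table row text on spaces. The following code does that and strips leading and trailing whitespace from each piece. Note that punctuation written directly next to a word, with no space in between (for example "Lorem,"), stays attached to that word.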
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome(executable_path="...")
driver.get(url=...)
soup = BeautifulSoup(driver.page_source, 'html.parser')

words = []
for tableRow in soup.select("table > tbody > tr"):
    tableRowtext = tableRow.text
    rowTexts = [x.strip() for x in tableRowtext.split(' ')]
    words.append(rowTexts)
print(words)
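Splitting on spaces leaves a punctuation mark glued to its neighbouring word whenever no space separates them. If the punctuation really has to come out as separate items, one possible approach (not part of the original answer) is a regular expression that matches either a run of word characters or a run of punctuation. The helper name split_words_and_punctuation and the sample string are hypothetical.

```python
import re


def split_words_and_punctuation(text):
    """Return words and punctuation runs as separate tokens, in order."""
    # \w+ matches a word; [^\w\s]+ matches a run of characters that are
    # neither word characters nor whitespace, i.e. punctuation.
    return re.findall(r"\w+|[^\w\s]+", text)


# Sample text shaped like the second row of the question.
row_text = "sit? 'amet' consectetur..."
print(split_words_and_punctuation(row_text))
# ['sit', '?', "'", 'amet', "'", 'consectetur', '...']
```

Consecutive punctuation characters such as "..." come out as a single token; using [^\w\s] without the + would yield one token per character instead.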
Answered By - Prophet