Issue
I'm using beautifulsoup4 for some web scraping project. However, the code return more stuff than intended. For example the below are also return. How to I eliminate them? I have already tried simple housekeeping methods.
As seen from my code I have already added in 'style' to be removed, but it still appear under the results. I'm not sure why is this so.
# Remove unwanted tag elements:
cleaned_text = ''
blacklist = [
'[document]',
'noscript',
'header',
'html',
'meta',
'head',
'input',
'script',
'style',]
# Then we will loop over every item in the extract text and make sure that the beautifulsoup4 tag
# is NOT in the blacklist
for item in text:
if item.parent.name not in blacklist:
cleaned_text += '{} '.format(item)
# Remove any tab separation and strip the text:
cleaned_text = cleaned_text.replace('\t', '')
return cleaned_text.strip()
below are unnecessary results that need to be removed.
[if !mso]>
<style>
v\:* {behavior:url(#default#VML);}
</style>
<![endif] [if gte mso 9]><xml>
<o:OfficeDocumentSettings>
<o:RelyOnVML/>
<o:AllowPNG/>
</o:OfficeDocumentSettings>
</xml><![endif] [if gte mso 9]><xml>
<w:WordDocument>
<w:View>Normal</w:View>
<w:Zoom>0</w:Zoom>
<w:TrackMoves/>
<w:TrackFormatting/>
<w:PunctuationKerning/>
<w:ValidateAgainstSchemas/>
<w:SaveIfXMLInvalid>false</w:SaveIfXMLInvalid>
<w:IgnoreMixedContent>false</w:IgnoreMixedContent>
<w:AlwaysShowPlaceholderText>false</w:AlwaysShowPlaceholderText>
<w:DoNotPromoteQF/>
<w:LidThemeOther>EN-US</w:LidThemeOther>
<w:LidThemeAsian>X-NONE</w:LidThemeAsian>
<w:LidThemeComplexScript>X-NONE</w:LidThemeComplexScript>
<w:Compatibility>
<w:DontVertAlignInTxbx/>
<w:Word11KerningPairs/>
<w:CachedColBalance/>
</w:Compatibility>
<m:mathPr>
<m:mathFont m:val="Cambria Math"/>
<m:brkBin m:val="before"/>
<m:brkBinSub m:val="--"/>
<m:smallFrac m:val="off"/>
<m:naryLim m:val="undOvr"/>
</m:mathPr></w:WordDocument>
</xml><![endif] [if gte mso 9]><xml>
<w:LatentStyles DefLockedState="false" DefUnhideWhenUsed="true"
DefSemiHidden="true" DefQFormat="false" DefPriority="99"
LatentStyleCount="267">
<w:LsdException Locked="false" Priority="0" SemiHidden="false"
UnhideWhenUsed="false" QFormat="true" Name="Normal"/>
<w:LsdException Locked="false" Priority="9" SemiHidden="false"
UnhideWhenUsed="false" QFormat="true" Name="heading 1"/>
</w:LatentStyles>
</xml><![endif] [if gte mso 10]>
<style>
/* Style Definitions */
table.MsoNormalTable
{mso-style-name:"Table Normal";
mso-para-margin-right:0in;
mso-para-margin-bottom:10.0pt;
mso-para-margin-left:0in;
line-height:115%;
mso-pagination:widow-orphan;
font-size:11.0pt;
font-family:"Calibri","sans-serif";
mso-ascii-font-family:Calibri;
</style>
<![endif]
Solution
To resolve this issue, I print out the item.parent.name and manually inspect each value and remove accordingly. Similar to what diggusbickus mentioned.
Answered By - sirimiri
0 comments:
Post a Comment
Note: Only a member of this blog may post a comment.