Issue
I have the following document and would like to extract all of its category tags.
Input: a variable named doc that holds unstructured text.
doc = """Like APC , <category="Modifier">APC2</category> regulates the formation of active betacatenin-Tcf
complexes , as demonstrated using transient transcriptional activation assays in APC - / -
<category="Modifier">colon carcinoma</category> cells. Human APC2 maps to chromosome 19p13 . 3 .
APC and APC2 may therefore have comparable functions in development
and <category="SpecificDisease">cancer</category>"""
Output: Should be as follows:
{
'Modifier': ['APC2', 'colon carcinoma'],
'SpecificDisease': ['cancer']
}
This should be automated so that it can extract all category tags across a corpus.
I tried the following code:
from bs4 import BeautifulSoup
soup = BeautifulSoup(doc, "html.parser")
contents = soup.find_all('category')
But I didn't know how to extract each tag.
Solution
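BeautifulSoup cannot parse this type of document: a tag written as <category="Modifier"> is not well-formed markup, because the quoted value has no attribute name. As a workaround, you can use the re
module, for example: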
import re
doc = """Like APC , <category="Modifier">APC2</category> regulates the formation of active betacatenin-Tcf
complexes , as demonstrated using transient transcriptional activation assays in APC - / -
<category="Modifier">colon carcinoma</category> cells. Human APC2 maps to chromosome 19p13 . 3 .
APC and APC2 may therefore have comparable functions in development
and <category="SpecificDisease">cancer</category>"""
# Group each tagged span under its category name
out = {}
for c, t in re.findall(r'<category="(.*?)">(.*?)</category>', doc):
    out.setdefault(c, []).append(t)
print(out)
Prints:
{'Modifier': ['APC2', 'colon carcinoma'], 'SpecificDisease': ['cancer']}
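Since the question asks for something that works across a corpus, the same idea could be wrapped in a function. The following is only a sketch under assumptions: the name extract_categories, the CATEGORY_RE constant, and the sample corpus list are hypothetical, and the pattern differs slightly from the one above (it uses [^"]* for the category name and re.DOTALL so that a tagged span may contain a line break).

```python
import re
from collections import defaultdict

# Matches <category="NAME">TEXT</category>.
# re.DOTALL lets TEXT span line breaks (an assumption about the corpus).
CATEGORY_RE = re.compile(r'<category="([^"]*)">(.*?)</category>', re.DOTALL)


def extract_categories(docs):
    """Collect tagged spans from an iterable of strings, grouped by category."""
    out = defaultdict(list)
    for doc in docs:
        for category, text in CATEGORY_RE.findall(doc):
            # Collapse runs of whitespace so spans split across lines stay readable.
            out[category].append(" ".join(text.split()))
    return dict(out)


# Hypothetical corpus built from fragments of the example document.
corpus = [
    'Like APC , <category="Modifier">APC2</category> regulates the formation',
    'in APC - / - <category="Modifier">colon\ncarcinoma</category> cells',
    'development and <category="SpecificDisease">cancer</category>',
]
print(extract_categories(corpus))
# {'Modifier': ['APC2', 'colon carcinoma'], 'SpecificDisease': ['cancer']}
```

Duplicates are kept in order of appearance. If only unique mentions are wanted, the lists could be replaced with sets.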
Answered By - Andrej Kesely