Issue
I am extracting information about chemical elements from Wikipedia. It contains sentences, and I want each sentence to be added as follows:
Molecule | Sentence1 | Sentence1 and sentence2 | All_sentence |
---|---|---|---|
MgO | this is s1. | this is s1. this is s2. | all_sentence |
CaO | this is s1. | this is s1. this is s2. | all_sentence |
What I've achieved so far
import spacy
import pandas as pd
import wikipediaapi
import csv
wiki_wiki = wikipediaapi.Wikipedia('en')
chemical = input("Write the name of molecule: ")
page_py = wiki_wiki.page(chemical)
sumary = page_py.summary[0:]
nlp = spacy.load('en_core_web_sm')
text_sentences = nlp(sumary)
sent_list = []
for sentence in text_sentences.sents:
sent_list.append(sentence.text)
#print(sent_list)
df = pd.DataFrame(
{'Molecule': chemical,
'Description': sent_list})
print(df.head())
The output looks like:
Molecule | Description |
---|---|
MgO | All sentences are here |
Mgo |
The Molecule columns are shown repeatedly for each line of sentence which is not correct. Please suggest some solution
Solution
It's not clear why you would want to repeat all sentences in each column but you can get to the form you want with pivot
:
import spacy
import pandas as pd
import wikipediaapi
import csv
wiki_wiki = wikipediaapi.Wikipedia('en')
chemical = input("Write the name of molecule: ")
page_py = wiki_wiki.page(chemical)
sumary = page_py.summary[0:]
nlp = spacy.load('en_core_web_sm')
sent_list = [sent.text for sent in nlp(sumary).sents]
#cumul_sent_list = [' '.join(sent_list[:i]) for i in range(1, len(sent_list)+1)] # uncomment to cumulate sentences in columns
df = pd.DataFrame(
{'Molecule': chemical,
'Description': sent_list}) # replace sent_list with cumul_sent_list if cumulating
df["Sentences"] = pd.Series([f"Sentence{i + 1}" for i in range(len(df))]) # replace "Sentence{i+1}" with "Sentence1-{i+1}" if cumulating
df = df.pivot(index="Molecule", columns="Sentences", values="Description")
print(df)
sent_list
can be created using a list comprehension. Create cumul_sent_list
if you want your sentences to be repeated in columns.
Output:
Sentences Sentence1 ... Sentence9
Molecule ...
MgO Magnesium oxide (MgO), or magnesia, is a white... ... According to evolutionary crystal structure pr...
Answered By - Tranbi
0 comments:
Post a Comment
Note: Only a member of this blog may post a comment.