Issue
I tagged this question with both excel and python because I want to explore either approach: copying the content from the website and pasting it into Excel in the right format, or using a Python library to extract the content and convert it to a dataframe.
I need to copy the table at https://ndcpartnership.org/climate-tools/ndcs into Excel. The format I need in the Excel file is one row per record and 3 columns, "Country", "Latest Submission" and "Latest submission date", as they appear in the table on that page.
However, when I select all, copy the entire table and paste it into an Excel cell, I only get some of the records, and everything lands in a single column. I inspected the page and tried bs4, but I could not find all the information I need (the 3 columns mentioned above) in the HTML structure, so I have no code to post in this question. Since this is a one-time effort, I just want a way to get the content into Excel in the desired format.
Any suggestions and advice are highly appreciated.
Solution
The problem is that the table on the above website is not an ordinary table but a dynamic web element that updates and refreshes automatically.
There is no way to copy and paste all the desired rows and columns into Excel at once, or to scrape the 3 desired columns with bs4. For a one-time task, the most efficient approach is therefore a semi-automated one that mixes Excel and Python:
Step 1: Copy and paste 10 records/rows at a time into Excel, since pasting more than 11 records does not work in this case, then save the Excel file. Note that after pasting, each record is transposed from one row and five columns into 5 rows and 1 column.
Step 2: Use the code below to get the dataframe.
import pandas as pd

# The pasted data sits in a single column with the header 'column'
ndc = pd.read_excel("ndc.xlsx")

ndc_cont = []  # country: 1st row of each 5-row record
for i in range(0, 984, 5):
    cont = ndc.loc[i, 'column']
    ndc_cont.append(cont)
print(len(ndc_cont))

ndc_value = []  # latest submission: 3rd row of each record
for i in range(2, 984, 5):
    sub_val = ndc.loc[i, 'column']
    ndc_value.append(sub_val)
print(len(ndc_value))

ndc_date = []  # latest submission date: 4th row of each record
for i in range(3, 984, 5):
    dt = ndc.loc[i, 'column']
    ndc_date.append(dt)
print(len(ndc_date))

# Combine the three lists into one dataframe, one row per country
df_final = pd.DataFrame({'country': ndc_cont, 'value': ndc_value,
                         'year': ndc_date})
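The three loops above all pick every fifth row starting at a different offset, so the same reshaping can probably be done with slices instead. The sketch below is only an illustration under the same assumptions as the loops (5 rows per record, with country, submission and date at offsets 0, 2 and 3). The function name reshape_records and the sample values are made up for the demonstration and are not part of the original answer or the real data.

```python
import pandas as pd

def reshape_records(col, rows_per_record=5):
    """Hypothetical helper: turn a one-field-per-row column into one row per record."""
    s = pd.Series(list(col))
    # Take every rows_per_record-th value, starting at the offset of each field
    country = s.iloc[0::rows_per_record].reset_index(drop=True)
    value = s.iloc[2::rows_per_record].reset_index(drop=True)
    date = s.iloc[3::rows_per_record].reset_index(drop=True)
    return pd.DataFrame({"country": country, "value": value, "year": date})

# Made-up placeholder rows standing in for the pasted Excel column
sample = [
    "Country A", "filler", "First NDC", "2021-01-01", "filler",
    "Country B", "filler", "Updated NDC", "2022-06-30", "filler",
]
df_demo = reshape_records(sample)
print(df_demo)
```

With the real file, the call would be something like reshape_records(ndc['column']), which avoids hard-coding the row count of 984.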
Answered By - user032020