Issue
This is my first time using reticulate. I have 20 multi-page pdf tables I'm pulling data from using camelot in python (they're not simple tables, so I need the more powerful table reader). It creates a list of tables (one table for each page) and makes a TableList object. I'm able to loop over the list and convert the tables to pandas dataframes. Example of doing this with one of the pdfs:
tables2001 = camelot.read_pdf('2001.pdf', flavor='stream', pages='1-end')
df2001 = list()
for t in tables2001:
    df = t.df
    df2001.append(df)
I can then return to r, and rdf2001 <- py$df2001 gives me a list of r data.frames.
However, if I instead put the python list of dataframes into either a nested list or a dictionary containing lists, the r conversion no longer works, and the resulting nested list still contains pandas data.frames. An attempt to manually convert one of the dfs understandably gives this:
Error in as.data.frame.default(rdf2001_nested[[1]]) :
cannot coerce class ‘c("pandas.core.frame.DataFrame", "pandas.core.generic.NDFrame", ’ to a data.frame
If I pull a single list from a nested list into r, e.g. df2001_a <- py$df2001[1], that converts to a single list of r data.frames. I can't do the same for a dictionary, since the conversion keeps the key as a list so the nesting still exists.
The idea of using a dictionary was to get a named list in r identifying each year, since the tables themselves do not contain that information. I can work around it, but going from a dictionary to a named list would be, to me, the clearest way to do this, assuming it worked. Trying nested lists was to figure out if the conversion issue only happened with dictionaries, which it doesn't; it's with any kind of nesting.
I'm trying to understand why this is happening. Can reticulate only convert a single level of a list? Is there an underlying reason for this, or is it just that the ability hasn't been added but in theory could be?
Update with full code:
Pdf tables are here. I extracted the pages covering criminal caseloads for each year, which is why pages are listed as 1-end; each has 14 pages. Python code, run with repl_python(), works and gives the outcome I intend for both the list and dictionary:
import camelot
import pandas
# Lists
tables2001 = camelot.read_pdf('2001.pdf', flavor='stream', pages='1-end')
tables2002 = camelot.read_pdf('2002.pdf', flavor='stream', pages='1-end')
tables2003 = camelot.read_pdf('2003.pdf', flavor='stream', pages='1-end')
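dflist = list()
# tables2004 is never read above, so only the three loaded years are listed
tablelist = [tables2001, tables2002, tables2003]
for tables in tablelist:
    # each element is a TableList; the dataframe is on each table's .df
    for t in tables:
        dflist.append(t.df)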
# Dictionary - I got help with this from someone who knows python well
# range(1, 4) covers 2001-2003, matching the three years described below
tables = { f'20{str(n).zfill(2)}': camelot.read_pdf(f'20{str(n).zfill(2)}.pdf',
           flavor='stream', pages='1-end', table_regions=['50,580,780,50']) for n in range(1, 4)}
dfdict = { k: [df.df for df in v] for k, v in tables.items() }
R code:
library(reticulate)
# List
rdflist <- py$dflist
# Dictionary
rdfdict <- py$dfdict
rdflist is a list of data.frames. rdfdict is a named nested list, containing 3 lists (2001, 2002, 2003), each with 14 pandas dataframes, i.e. not usable in r.
class(rdflist[[1]])
[1] "data.frame"
class(rdfdict[[1]][[1]])
[1] "pandas.core.frame.DataFrame" "pandas.core.generic.NDFrame"
[3] "pandas.core.base.PandasObject" "pandas.core.base.StringMixin"
[5] "pandas.core.accessor.DirNamesMixin" "pandas.core.base.SelectionMixin"
[7] "python.builtin.object"
Attempt to coerce a single df to data.frame:
as.data.frame(rdfdict[[1]][[1]])
Error in as.data.frame.default(rdfdict[[1]][[1]]) :
cannot coerce class ‘c("pandas.core.frame.DataFrame", "pandas.core.generic.NDFrame", ’ to a data.frame
Solution
Comparing both versions, the dictionary version differs in a couple of ways: it passes an additional argument, table_regions, and it has an extra nested loop in the dictionary comprehension, [df.df for df in v] (which, interestingly, did not raise an error in Python).
Consider making the two versions consistent so the returned values are comparable. By the way, in Python you can also use a list comprehension, similar to the dict comprehension.
Python
import camelot
import pandas as pd
# LIST COMPREHENSION
pydf_list = [
    [tbl.df for tbl in camelot.read_pdf(f'{yr}.pdf', flavor='stream', pages='1-end')]
    for yr in range(2001, 2004)
]
# DICT COMPREHENSION
pydf_dict = {
    str(yr): [tbl.df for tbl in camelot.read_pdf(f'{yr}.pdf', flavor='stream', pages='1-end')]
    for yr in range(2001, 2004)
}
R
library(reticulate)
reticulate::source_python("myscript.py")
# NESTED LIST
rdf_list <- reticulate::py$pydf_list
# NESTED NAMED LIST
rdf_dict <- reticulate::py$pydf_dict
However, as you indicate, I can reproduce the problematic conversion of a dict to a named list with a reproducible example. After I reported this issue, one suggestion from the maintainer was to use py_to_r:
rdf_dict2 <- lapply(rdf_dict, function(lst) lapply(lst, py_to_r))
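As an alternative to converting element by element on the R side, the nesting could be removed in Python first, since the question reports that a flat list of dataframes converts cleanly. Below is a minimal sketch under that assumption. flatten_tables and the "2001_p1" label format are hypothetical names introduced here for illustration, and the demo uses stand-in strings where the real script would hold the dataframes from pydf_dict:

```python
def flatten_tables(nested):
    """Flatten {year: [table, ...]} into two parallel flat lists.

    Returns (labels, tables): labels[i] identifies the year and page
    position of tables[i]. The label format is an illustrative choice.
    """
    labels, tables = [], []
    for year, pages in nested.items():
        for page_no, table in enumerate(pages, start=1):
            labels.append(f"{year}_p{page_no}")
            tables.append(table)
    return labels, tables


# Stand-in values; in the real script these would be the DataFrames
# held in pydf_dict, e.g. flatten_tables(pydf_dict).
demo = {"2001": ["a", "b"], "2002": ["c"]}
flat_labels, flat_tables = flatten_tables(demo)
print(flat_labels)  # ['2001_p1', '2001_p2', '2002_p1']
```

In R, py$flat_tables would then be a single-level list, and names(rdf) <- py$flat_labels could attach the year labels. This has not been checked against the pdfs in the question.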
Answered By - Parfait