Issue
I have a pandas.DataFrame df and would like to add a new column col with one single value "hello". I would like this column to be of dtype category with the single category "hello". I can do the following.
df["col"] = "hello"
df["col"] = df["col"].astype("category")
- Do I really need to write df["col"] three times in order to achieve this?
- After the first line, I am worried that the intermediate dataframe df might take up a lot of space before the new column is converted to categorical. (The dataframe is rather large, with millions of rows, and the value "hello" is actually a much longer string.)
Are there any other straightforward, "short and snappy" ways of achieving this while avoiding the above issues?
An alternative solution is
df["col"] = pd.Categorical(itertools.repeat("hello", len(df)))
but it requires itertools and the use of len(df), and I am not sure how memory usage is under the hood.
Solution
We can explicitly build the Series with the correct size and dtype, instead of implicitly creating it via __setitem__ and then converting:
df['col'] = pd.Series('hello', index=df.index, dtype='category')
Sample Program:
import pandas as pd
df = pd.DataFrame({'a': [1, 2, 3]})
df['col'] = pd.Series('hello', index=df.index, dtype='category')
print(df)
print(df.dtypes)
print(df['col'].cat.categories)
Output:
   a    col
0  1  hello
1  2  hello
2  3  hello
a         int64
col    category
dtype: object
Index(['hello'], dtype='object')
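To get a rough feel for the memory concern raised in the question, the following sketch compares the size of a plain object column with a categorical column built the same way as above. The row count `n_rows` and the stand-in string `long_value` are made up for illustration, and the numbers only describe the finished columns, not the peak memory used while they are being constructed.

```python
import pandas as pd

n_rows = 100_000            # hypothetical size; the real frame has millions of rows
long_value = "hello" * 20   # hypothetical stand-in for the "much longer string"

df = pd.DataFrame({"a": range(n_rows)})

# Same constant value, stored once as plain Python objects and once as a categorical.
object_col = pd.Series(long_value, index=df.index, dtype=object)
category_col = pd.Series(long_value, index=df.index, dtype="category")

# deep=True asks pandas to include the string payload, not just the pointers.
object_bytes = object_col.memory_usage(index=False, deep=True)
category_bytes = category_col.memory_usage(index=False, deep=True)

print(f"object column:   {object_bytes:>12,} bytes")
print(f"category column: {category_bytes:>12,} bytes")
```

The categorical column stores one small integer code per row plus a single copy of the string, so it should come out far smaller than the object column; exact figures depend on the pandas version and platform.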
Answered By - Henry Ecker