Issue
I want to plot two distributions of data as weighted histograms with weighted kernel density estimate (KDE) plots, side by side.
The data (length of DNA fragments, split by the categorical variable regions) are integers in the (0, 1e8) interval. I can plot the default, unweighted, histograms and KDE without a problem, using the Python code below. The code plots histograms for the tiny example of the input data in the testdata variable. See the unweighted (default) histograms below.
I want to produce a different plot, where the data in the histograms are weighted by length (= the X axis numeric variable). I used the weights option (seaborn.histplot — seaborn documentation):

weights : vector or key in data
    If provided, weight the contribution of the corresponding data points towards the count in each bin by these factors.
The histograms changed as expected (see weighted histograms plots below). But the KDE (kernel density estimate) lines did not change.
Question: How can I change the kernel density estimate (KDE) to reflect the fact that I am using weighted histograms?
Unweighted (default) histograms:
Weighted histograms:
Code with the minimal reproducible example:
import io
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

def plot_restriction_digest(df, out_file_base, weights):
    # Prevent the python icon from showing in the dock when the script is
    # running:
    matplotlib.use('Agg')
    sns.set_theme(style='ticks')
    f, ax = plt.subplots(figsize=(7, 5))
    sns.despine(f)
    hist = sns.histplot(data=df,
                        x='length',
                        hue='regions',
                        weights=weights,
                        # Normalize such that the total area of the histogram
                        # equals 1:
                        stat='density',
                        # stat='count',
                        # Make all histograms visible, otherwise the 'captured'
                        # regions histogram is much smaller than the 'all'
                        # regions one:
                        common_norm=False,
                        # The default plots too many very thin bins, which are
                        # poorly visible in pdf format (OK in png). Note that
                        # 10 bins is too crude, and 1000 bins makes too many
                        # thin bins:
                        bins=100,
                        # X axis log scale:
                        log_scale=True,
                        # Compute a kernel density estimate to smooth the
                        # distribution and show it on the plot as lines:
                        kde=True,
                        )
    sns.move_legend(hist, 'upper left')
    plt.savefig(f'{out_file_base}.pdf')
    return
testdata="""
1 all
1 all
2 all
2 all
2 all
3 all
4 captured
4 captured
5 captured
5 captured
5 captured
8 captured
"""
# Default histograms:
df = pd.read_csv(io.StringIO(testdata), sep=r'\s+', header=None,
                 names='length regions'.split())
plot_restriction_digest(df, 'test_tiny', None)

# Weighted histograms:
df = pd.read_csv(io.StringIO(testdata), sep=r'\s+', header=None,
                 names='length regions'.split())
plot_restriction_digest(df, 'test_tiny_weighted', 'length')

print('Done.')
print('Done.')
Notes:
- The two distributions of data are DNA fragment lengths for two types of genomic regions: "all" and "captured", but this is irrelevant to this specific question.
- The minimal reproducible example illustrates the question. The real data frame has tens of millions of rows, so the histograms and KDE plots are much smoother and more meaningful. The actual data need the X axis to be log-transformed to better tell the two broad distributions apart.
- I am using these packages and versions:
Python 3.11.6
matplotlib-base 3.8.2 py311hfdba5f6_0 conda-forge
numpy 1.26.3 py311h7125741_0 conda-forge
pandas 2.2.0 py311hfbe21a1_0 conda-forge
seaborn 0.13.1 hd8ed1ab_0 conda-forge
seaborn-base 0.13.1 pyhd8ed1ab_0 conda-forge
Solution
Histograms and KDE plots with a very small sample size often aren't a good indicator of how things behave with more suitable sample sizes. In this case, the default bandwidth seems too wide for this particular dataset. The bandwidth can be scaled via the bw_adjust parameter of kdeplot. When creating a histplot, parameters ("keywords") for the KDE can be provided via the kde_kws dictionary.
In my test, to exercise both the weights and a log scale, I used a few x values and gave each a weight of 1, except for two of them, which got a really high weight.
A log scale can be mimicked by taking the log of the x values. The axis will show the log values, which will be less intuitive than the original values.
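As a quick sanity check of that idea (with made-up values, not the post's data), equal ratios on the original scale become equal distances after np.log:

```python
import numpy as np

# Hypothetical x values; on a mimicked log axis each point is plotted at
# its natural-log position, so the axis shows log values.
x = np.array([1, 10, 20, 40])
log_x = np.log(x)

print(log_x[0])  # log(1) is 0.0
# Each doubling (10 -> 20, 20 -> 40) covers the same distance on the log axis:
print(np.log(20) - np.log(10))
print(np.log(40) - np.log(20))
```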
For small datasets and integer weights, np.repeat can create a dataset with every value repeated according to its weight. This doesn't work for very large datasets due to memory constraints.
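A minimal sketch of that trick (tiny made-up weights): np.repeat duplicates each value as many times as its integer weight, so an unweighted plot of the expanded data behaves like a weighted plot of the original:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': [1, 10, 50],
                   'w': [1, 3, 2]})

# Duplicate each x value w times to build the weighted-equivalent dataset:
expanded = np.repeat(df['x'].to_numpy(), df['w'].to_numpy())
print(expanded)                        # [ 1 10 10 10 50 50]
print(len(expanded) == df['w'].sum())  # True: one row per unit of weight
```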
To quickly test settings and behavior for the real dataset, waiting time can be reduced by taking a random sample (e.g. df.sample(10000)).
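For example (with a synthetic frame standing in for the real tens-of-millions-row data), passing random_state makes such a quick-look sample reproducible between runs:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the large fragment-length table:
rng = np.random.default_rng(0)
n = 1_000_000
df = pd.DataFrame({'length': rng.integers(1, 10**8, size=n),
                   'regions': rng.choice(['all', 'captured'], size=n)})

# Tune bins / bw_adjust on a small, reproducible sample first:
sample = df.sample(10_000, random_state=42)
print(len(sample))  # 10000
```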
The test code below indicates that both the weights and the log scale work as expected. One difference is that the default bandwidth differs between the two cases; a different bw_adjust is used to compensate.
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np

df = pd.DataFrame({'x': [1, 10, 20, 30, 40, 50, 60],
                   'w': [1, 10, 1, 1, 1, 10, 1]})

fig, (ax1, ax2) = plt.subplots(nrows=2, figsize=(9, 6))

sns.histplot(df, x='x', bins=10, weights='w', kde=True, log_scale=True,
             stat='density', kde_kws={'bw_adjust': 0.3}, ax=ax1)
ax1.set_xticks(df['x'].to_numpy(), df['x'].to_numpy())
ax1.set_title('sns.histplot with log scale, weights and kde')

sns.histplot(x=np.log(np.repeat(df['x'], df['w'])), bins=10, kde=True,
             stat='density', kde_kws={'bw_adjust': 0.55}, ax=ax2)
ax2.set_title('mimicking log scale and weights')

plt.tight_layout()
plt.show()
Answered By - JohanC