1 year ago
#377473
W. Walter
Pandas multiindex filter if duplicated, but keep first duplicated
Assume I have a multiindex Pandas data frame with three indices (Date, Product, Country)
arrays = [['2021-12-30', '2021-12-30', '2021-12-30', '2021-12-31', '2021-12-31', '2021-12-31', '2022-01-01', '2022-01-01', '2022-01-02', '2022-01-02' ],
['A', 'A', 'B', 'A', 'A', 'B', 'B', 'B', 'B', 'B'], ['DE', 'IT', 'DE', 'DE', 'IT', 'DE', 'DE', 'IT', 'DE', 'IT']]
index = pd.MultiIndex.from_arrays(arrays, names = ('Date', 'Product', 'Country'))
df = pd.DataFrame(np.zeros(10), index = index)
df
0
Date Product Country
2021-12-30 A DE 0.0
IT 0.0
B DE 0.0
2021-12-31 A DE 0.0
IT 0.0
B DE 0.0
2022-01-01 B DE 0.0
IT 0.0
2022-01-02 B DE 0.0
IT 0.0
The data tells me which product (A and B) exists at what date in which country (DE and IT). I want to filter the data such that if Date & Product is duplicated (meaning that a Product at a certain time exists in both countries), then the row with DE is removed. I found an intuitive way to do this:
df1 = df.index.to_frame()
df = df[~df1.duplicated(subset=['Date','Product'], keep=False) |
df1['Country'].eq('IT')]
df
0
Date Product Country
2021-12-30 A IT 0.0
B DE 0.0
2021-12-31 A IT 0.0
B DE 0.0
2022-01-01 B IT 0.0
2022-01-02 B IT 0.0
I have however one further constraint that I am so far unable to consider. It tells me to keep both duplicated entries when it is duplicated for the first time (30.12. excluded). For example, on 30.12. and 31.12. product B is not duplicated, while it is the first time as of 01.01. Here '2022-01-01 B DE' should be kept, while it will be removed the day after.
Does anybody have an idea how to do this? I would appreciate any help, many thanks in advance!
python
pandas
multi-index
0 Answers
Your Answer