Pandas multiindex filter if duplicated, but keep first duplicated

2 years ago

#377473

W. Walter

Assume I have a multiindex Pandas data frame with three indices (Date, Product, Country)

arrays = [['2021-12-30', '2021-12-30', '2021-12-30', '2021-12-31', '2021-12-31', '2021-12-31', '2022-01-01', '2022-01-01', '2022-01-02', '2022-01-02' ],
          ['A', 'A', 'B', 'A', 'A', 'B', 'B', 'B', 'B', 'B'], ['DE', 'IT', 'DE', 'DE', 'IT', 'DE', 'DE', 'IT', 'DE', 'IT']]

index = pd.MultiIndex.from_arrays(arrays, names = ('Date', 'Product', 'Country'))
df = pd.DataFrame(np.zeros(10), index = index)
df

                             0
Date       Product Country     
2021-12-30 A       DE       0.0
                   IT       0.0
           B       DE       0.0
2021-12-31 A       DE       0.0
                   IT       0.0
           B       DE       0.0
2022-01-01 B       DE       0.0
                   IT       0.0
2022-01-02 B       DE       0.0
                   IT       0.0

The data tells me which product (A and B) exists at what date in which country (DE and IT). I want to filter the data such that if Date & Product is duplicated (meaning that a Product at a certain time exists in both countries), then the row with DE is removed. I found an intuitive way to do this:

df1 = df.index.to_frame()

df = df[~df1.duplicated(subset=['Date','Product'], keep=False) | 
         df1['Country'].eq('IT')]

df

                              0
Date       Product Country     
2021-12-30 A       IT       0.0
           B       DE       0.0
2021-12-31 A       IT       0.0
           B       DE       0.0
2022-01-01 B       IT       0.0
2022-01-02 B       IT       0.0

I have however one further constraint that I am so far unable to consider. It tells me to keep both duplicated entries when it is duplicated for the first time (30.12. excluded). For example, on 30.12. and 31.12. product B is not duplicated, while it is the first time as of 01.01. Here '2022-01-01 B DE' should be kept, while it will be removed the day after.

Does anybody have an idea how to do this? I would appreciate any help, many thanks in advance!

python

pandas

multi-index

0 Answers

Your Answer

Posts

Questions

Blogs