1 year ago
#349855
Rivie Cham
Computing similarity between duplicated variables using unique identifier
I have a data set that looks like this, where id is supposed to be the unique identifier. There are duplicates: for example, rows 1 and 3 share the same id and year, but rows 1 and 6 or 3 and 6 do not count as duplicates because the year differs. The variable dupfreq gives the number of rows in the data set with the same id and year, including the row itself.
row | id | year | tlabor | rev | dupfreq |
---|---|---|---|---|---|
1 | 1419 | 2005 | 5 | 1072 | 2 |
2 | 1425 | 2005 | 42 | 2945 | 1 |
3 | 1419 | 2005 | 4 | 950 | 2 |
4 | 1443 | 2006 | 18 | 3900 | 1 |
5 | 1485 | 2006 | 118 | 35034 | 1 |
6 | 1419 | 2006 | 6 | 1851 | 1 |
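For reproducibility, the example data can be rebuilt in R roughly as follows. This is a minimal sketch: the data frame name `df` is an assumption, and `dupfreq` is recomputed with base R's `ave()` as the number of rows sharing the same id and year, which matches the values in the table.

```r
# Rebuild the example rows from the table (the name `df` is an assumption)
df <- data.frame(
  id     = c(1419, 1425, 1419, 1443, 1485, 1419),
  year   = c(2005, 2005, 2005, 2006, 2006, 2006),
  tlabor = c(5, 42, 4, 18, 118, 6),
  rev    = c(1072, 2945, 950, 3900, 35034, 1851)
)

# dupfreq: number of rows with the same id and year, including the row itself
df$dupfreq <- ave(seq_along(df$id), df$id, df$year, FUN = length)

print(df)
```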
I want to compute the similarity between rows (based on tlabor and rev) for those with dupfreq > 1, grouped by id and year.
I was thinking of something similar to this:
row | id | year | sim |
---|---|---|---|
1 | 1419 | 2005 | 0.83 |
Note that dupfreq can be greater than 2, but if I can only generate the new table using rows with dupfreq == 2, I am OK with that too.
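To make the shape of the desired result concrete, here is a rough sketch of the dupfreq == 2 case in base R. The similarity measure is not specified in the question, so `pair_sim()` is a hypothetical placeholder (the mean, over tlabor and rev, of the smaller value divided by the larger one); it is not meant to reproduce the 0.83 shown above, and any other measure could be swapped in.

```r
# Example data rebuilt from the table (the name `df` is an assumption)
df <- data.frame(
  id     = c(1419, 1425, 1419, 1443, 1485, 1419),
  year   = c(2005, 2005, 2005, 2006, 2006, 2006),
  tlabor = c(5, 42, 4, 18, 118, 6),
  rev    = c(1072, 2945, 950, 3900, 35034, 1851)
)
df$dupfreq <- ave(seq_along(df$id), df$id, df$year, FUN = length)

# Hypothetical placeholder measure: for each column, min / max within the
# group, then averaged across columns. Assumes strictly positive values.
pair_sim <- function(d) {
  ratios <- sapply(d[c("tlabor", "rev")], function(v) min(v) / max(v))
  mean(ratios)
}

# Keep only the duplicated pairs and split them by id and year
dups   <- df[df$dupfreq == 2, ]
groups <- split(dups, list(dups$id, dups$year), drop = TRUE)

# One output row per (id, year) group
sim <- do.call(rbind, lapply(groups, function(d) {
  data.frame(id = d$id[1], year = d$year[1], sim = pair_sim(d))
}))
rownames(sim) <- NULL

print(sim)
```

With the example data this yields a single row for id 1419 in 2005. Because `pair_sim()` uses the within-group minimum and maximum, it would also run on groups with more than two rows, although whether that is a sensible measure there is an open question.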
Any advice is greatly appreciated! Thanks in advance!
r
dataframe
group-by
similarity
0 Answers