1 year ago

#349855

Rivie Cham

Computing similarity between duplicated variables using unique identifier

I have a data set that looks like this, where id is supposed to be the unique identifier. There are duplicates: for example, rows 1 and 3 share the same id and year. Rows 1 and 6, or 3 and 6, are not duplicates because the year differs. The variable dupfreq gives the number of rows in the dataset with the same id and year, including the row itself.

    id year tlabor   rev dupfreq
1 1419 2005      5  1072       2
2 1425 2005     42  2945       1
3 1419 2005      4   950       2
4 1443 2006     18  3900       1
5 1485 2006    118 35034       1
6 1419 2006      6  1851       1
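
For reproducibility, the table above can be rebuilt in R roughly as follows. This is a minimal sketch: the values are copied from the table, and deriving dupfreq with `ave()` is only an assumption about how that column was originally computed (counting rows that share the same id and year).

```r
# Example data copied from the table in the question
df <- data.frame(
  id     = c(1419, 1425, 1419, 1443, 1485, 1419),
  year   = c(2005, 2005, 2005, 2006, 2006, 2006),
  tlabor = c(5, 42, 4, 18, 118, 6),
  rev    = c(1072, 2945, 950, 3900, 35034, 1851)
)

# Assumed definition of dupfreq: number of rows sharing the same
# id and year, counting the row itself
df$dupfreq <- ave(seq_along(df$id), df$id, df$year, FUN = length)

print(df)
```

Under that assumption, only rows 1 and 3 get dupfreq 2, which matches the table.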

For rows with dupfreq > 1, I want to compute a similarity score based on tlabor and rev, grouped by id and year.

I was thinking of something similar to this:

    id year  sim
1 1419 2005 0.83

Note that dupfreq can be greater than 2, but I am also fine with a solution that only builds the new table from rows with dupfreq == 2.
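
One possible way to build such a table in base R is sketched below. The question does not say which similarity measure is intended, so the definition here is an assumption: for each variable, take the relative spread within the group, (max - min) / mean, then average over tlabor and rev and subtract from 1. For two rows this reduces to 1 minus the mean relative difference, and it also works for groups with dupfreq > 2. The helper name `rel_spread` is hypothetical. On the example data this happens to give about 0.83 for id 1419 in 2005, though that may be a coincidence.

```r
# Example data copied from the question
df <- data.frame(
  id      = c(1419, 1425, 1419, 1443, 1485, 1419),
  year    = c(2005, 2005, 2005, 2006, 2006, 2006),
  tlabor  = c(5, 42, 4, 18, 118, 6),
  rev     = c(1072, 2945, 950, 3900, 35034, 1851),
  dupfreq = c(2, 1, 2, 1, 1, 1)
)

# Hypothetical helper: relative spread of one variable within a group.
# For exactly two values a and b this equals |a - b| / mean(a, b).
rel_spread <- function(x) (max(x) - min(x)) / mean(x)

# Keep only rows that have at least one duplicate
dups <- df[df$dupfreq > 1, ]

# Relative spread of tlabor and rev for each (id, year) group
spread <- aggregate(cbind(tlabor, rev) ~ id + year, data = dups, FUN = rel_spread)

# Assumed similarity: 1 minus the average relative spread of the two variables
spread$sim <- round(1 - rowMeans(spread[, c("tlabor", "rev")]), 2)

result <- spread[, c("id", "year", "sim")]
print(result)
```

Note that this measure is undefined when a variable has a group mean of zero, and it can go negative for very dissimilar rows. Another measure (for example a correlation or a distance on scaled values) could be swapped in by replacing `rel_spread` and the `sim` line.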

Any advice is greatly appreciated! Thanks in advance!

r

dataframe

group-by

similarity

0 Answers