
1 year ago

#349855

Rivie Cham

Computing similarity between duplicated variables using unique identifier

I have a data set that looks like this, where id is supposed to be the unique identifier. There are duplicates: for example, rows 1 and 3 share the same id and year. Rows 1 and 6, or 3 and 6, are not duplicates because the year differs. The variable dupfreq gives the number of rows in the dataset with the same id and year, counting the row itself.

	id	year	tlabor	rev	dupfreq
1	1419	2005	5	1072	2
2	1425	2005	42	2945	1
3	1419	2005	4	950	2
4	1443	2006	18	3900	1
5	1485	2006	118	35034	1
6	1419	2006	6	1851	1
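
For reference, here is a minimal sketch of how a column like dupfreq could be reproduced, assuming the data is held in a pandas DataFrame. The name `df` and the choice of pandas are assumptions; the post does not say which tool is in use.

```python
import pandas as pd

# Sample data copied from the table above; the index mirrors the row labels 1..6.
df = pd.DataFrame(
    {
        "id": [1419, 1425, 1419, 1443, 1485, 1419],
        "year": [2005, 2005, 2005, 2006, 2006, 2006],
        "tlabor": [5, 42, 4, 18, 118, 6],
        "rev": [1072, 2945, 950, 3900, 35034, 1851],
    },
    index=range(1, 7),
)

# Count how many rows share each (id, year) pair and broadcast that count
# back onto every row of the group, so the row itself is included.
df["dupfreq"] = df.groupby(["id", "year"])["id"].transform("size")

print(df)
```

With this sample, rows 1 and 3 get dupfreq 2 and every other row gets 1, which matches the table.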

I want to measure how similar the duplicated rows are in tlabor and rev, using only rows with dupfreq > 1 and grouping by id and year.

I was thinking of something similar to this:

	id	year	sim
1	1419	2005	0.83

Note that dupfreq can be greater than 2, but I am also fine with a table built only from rows where dupfreq == 2.
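
One possible way to build such a table is sketched below in pandas. The similarity measure is an assumption, since the post does not define how sim is computed: for each (id, year) group it takes the ratio of the smallest to the largest value in each of tlabor and rev, then averages the two ratios. The names `df`, `dups` and `sim` are hypothetical.

```python
import pandas as pd

# Sample data copied from the question, including the given dupfreq column.
df = pd.DataFrame(
    {
        "id": [1419, 1425, 1419, 1443, 1485, 1419],
        "year": [2005, 2005, 2005, 2006, 2006, 2006],
        "tlabor": [5, 42, 4, 18, 118, 6],
        "rev": [1072, 2945, 950, 3900, 35034, 1851],
        "dupfreq": [2, 1, 2, 1, 1, 1],
    },
    index=range(1, 7),
)

# Keep only rows that have at least one duplicate on (id, year).
dups = df[df["dupfreq"] > 1]

# Assumed metric: per column, min / max within the group (1.0 means identical),
# averaged over tlabor and rev. Assumes strictly positive values;
# a group whose maximum is 0 would yield NaN.
grouped = dups.groupby(["id", "year"])[["tlabor", "rev"]]
ratios = grouped.min() / grouped.max()
sim = ratios.mean(axis=1).rename("sim").reset_index()

print(sim.round(2))
```

Because min and max are taken over the whole group, the same code also handles groups with dupfreq > 2. On the sample data it returns a single row for id 1419 and year 2005 with sim of about 0.84. That differs from the 0.83 shown above because the metric here is only a guess; a different measure, such as cosine similarity or a scaled distance, could be substituted in the two lines that compute `ratios` and `sim`.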

Any advice is greatly appreciated! Thanks in advance!

dataframe

group-by

similarity

0 Answers
