1 year ago

#386678


Dhiraj

Hive GROUP BY optimization based on cardinality

Logically, the cardinality of the columns should matter when performing a GROUP BY. When we write Hive queries involving GROUP BY, we are usually familiar with the data being queried, so we have an idea of the cardinality of each column involved. But Hive has no such knowledge. Say the Hive query in question is:

SELECT Col1,Col2,Col3,Col4,Col5,COUNT(*) FROM MyTable GROUP BY Col1,Col2,Col3,Col4,Col5

I know the cardinality of all five columns here. But Hive doesn't, so it will probably perform poorly.

Say this is what I know about the columns, listed from lowest to highest cardinality, with an example of the values each contains:

  • Col5 = it contains country name
  • Col4 = it contains state name
  • Col3 = it contains city name
  • Col2 = it contains postal code
  • Col1 = it contains email address
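The assumed ordering could be checked against the data itself. The following is only a minimal sketch using the table and column names from the question; the result aliases are made up for illustration. Exact distinct counts can be expensive on a large table. Hive can also collect column-level statistics with ANALYZE TABLE, but whether the planner uses them for a given GROUP BY depends on the Hive version and settings, so treat that part as an assumption to verify.

```sql
-- Exact number of distinct values per column (one full pass over the table).
SELECT
  COUNT(DISTINCT Col5) AS col5_distinct,  -- country name
  COUNT(DISTINCT Col4) AS col4_distinct,  -- state name
  COUNT(DISTINCT Col3) AS col3_distinct,  -- city name
  COUNT(DISTINCT Col2) AS col2_distinct,  -- postal code
  COUNT(DISTINCT Col1) AS col1_distinct   -- email address
FROM MyTable;

-- Optionally ask Hive to gather column statistics for the same columns,
-- then inspect what it recorded for one of them.
ANALYZE TABLE MyTable COMPUTE STATISTICS FOR COLUMNS Col1, Col2, Col3, Col4, Col5;
DESCRIBE FORMATTED MyTable Col1;
```

If the counts come back in the expected order (Col5 smallest, Col1 largest), the ordering in the list above holds for this data.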

As things stand, Hive will treat all of these columns the same. Wouldn't it be beneficial if Hive knew the underlying cardinality and could exploit it when computing the unique groups? Given that, if I explicitly arrange the columns in the GROUP BY clause in order of cardinality, lowest first, will the query be more efficient, as in the following example?

SELECT Col1,Col2,Col3,Col4,Col5,COUNT(*) FROM MyTable GROUP BY Col5,Col4,Col3,Col2,Col1

Or will Hive ignore this order and treat all the columns equally, regardless of how they are listed?

group-by

hive

0 Answers
