1 year ago
#386678
Dhiraj
Hive GROUP BY optimization based on cardinality
Logically, the cardinality of the columns should matter when performing a GROUP BY operation. When we write Hive queries involving GROUP BY, we are familiar with the data being queried, so we have an idea of the cardinality of the individual columns involved in the GROUP BY. But Hive has no idea about this. Say the Hive query in question is:
SELECT Col1,Col2,Col3,Col4,Col5,COUNT(*) FROM MyTable GROUP BY Col1,Col2,Col3,Col4,Col5
I know the relative cardinality of all five columns here, but Hive doesn't, so it will probably perform at its worst.
Say the cardinality information I have about these columns is as follows, ordered from lowest to highest cardinality, with an example of the values each column contains:
- Col5 = country name
- Col4 = state name
- Col3 = city name
- Col2 = postal code
- Col1 = email address
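The ordering in the list above could be checked against the actual data with a query along these lines. This is only a rough sketch: the column aliases (n_country and so on) are names made up for illustration, and exact distinct counts can be expensive on a large table.

```sql
-- Sketch only: count the distinct values in each GROUP BY column of MyTable.
-- The aliases are hypothetical names chosen to match the list above.
-- If the assumed ordering holds, the counts should grow from left to right.
SELECT
  COUNT(DISTINCT Col5) AS n_country,
  COUNT(DISTINCT Col4) AS n_state,
  COUNT(DISTINCT Col3) AS n_city,
  COUNT(DISTINCT Col2) AS n_postal_code,
  COUNT(DISTINCT Col1) AS n_email
FROM MyTable;
```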
Hive will treat all of these columns the same. Wouldn't it be beneficial if Hive knew the underlying cardinality information, so that it could exploit it when computing the unique groups? In that case, if I explicitly arrange the columns in the GROUP BY clause in order of cardinality, as in the following example, will the query be more efficient?
SELECT Col1,Col2,Col3,Col4,Col5,COUNT(*) FROM MyTable GROUP BY Col5,Col4,Col3,Col2,Col1
Or will Hive ignore this order and treat all the columns equally, regardless of how they are listed?
group-by
hive
0 Answers