1 year ago

#382714

Khan Saab

Parquet writes boolean values as null

I am trying to write a DataFrame to HDFS in Parquet format. The DataFrame contains all of its values before the write, but once it is written to disk the boolean values (true, false) come out as null.

Here is the code:

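    import org.apache.spark.sql.functions.col
    import org.apache.spark.sql.types.StringType
    // Cast every column to StringType, then lower-case the column names.
    val finalDF = dataFrame.select(dataFrame.columns.map(c => col(c).cast(StringType)): _*).select(dataFrame.columns.map(x => col(x).as(x.toLowerCase)): _*)
    print("final DF before writing")
    finalDF.show(4)
    // Append uncompressed Parquet files, partitioned by the Hive partition columns.
    finalDF.write.partitionBy(pushStreamInstance.getPartitionColsForHive.map(name => name): _*).option("compression", "none").mode("append").parquet(pushStreamInstance.getHiveOutputPath)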

When I show the first 4 rows of the DataFrame, I get the following data:

final DF before writing+-------+-----------+------+-----------+-----------+--------+---------------------------+----------+---------------------------+--------------------------+---------------------+------------+------------+---------+-------------+-------------+----------+------------+-----------------+--------------------+-------+----------------+------------+--------+-----------+---------+--------------+------------------+-------------+------------------+------------+------------+----------------+------------------------+----------------+----------------+------------------------+--------------------+-------------+--------------------+----------+--------------------------+-------------+---------------------------+--------------------+-----------------------+-----------+--------+---------------+-------------------------+-------------+-------+----------------+--------------------+----------+-------------+
|hd_conn|stb_profile|action|action_type|call_letter|category|channel_external_identifier|channel_id|client_software_version_app|client_software_version_mw|company_location_name|company_site| device_name|event_cat|event_counter|   event_time|event_type|eventcounter|hdmi_connectivity|household_identifier|hw_type|invocation_point|  ip_address| item_id| item_title|item_type|language_audio|language_interface|language_shop|language_subtitles|         mac| mac_address|oauth2_client_id|operation_system_version|program_category|program_duration|program_reference_number|program_sub_category|program_title|recommendations_flag|resolution|scheduled_trail_identifier|   start_time|stream_adaptivity_indicator|  stream_content_url|stream_control_protocol|stream_mode|sub_type|subscriber_type|targeted_advertising_flag|time_position|   type|viewing_duration|  viewing_identifier|event_date|logstash_date|
+-------+-----------+------+-----------+-----------+--------+---------------------------+----------+---------------------------+--------------------------+---------------------+------------+------------+---------+-------------+-------------+----------+------------+-----------------+--------------------+-------+----------------+------------+--------+-----------+---------+--------------+------------------+-------------+------------------+------------+------------+----------------+------------------------+----------------+----------------+------------------------+--------------------+-------------+--------------------+----------+--------------------------+-------------+---------------------------+--------------------+-----------------------+-----------+--------+---------------+-------------------------+-------------+-------+----------------+--------------------+----------+-------------+
|   true|      HD TV|  PLAY|      START|        BX1|   EVENT|                   UID50075|  UID50075|                     4.81.1|             PXM-SW-3.80.0|               02BRA0|     BEPBXL1|F83B1D937298|   LINEAR|        278.0|1638173907037|   VIEWING|       278.0|             true|             4727286|     v7|      CHANNEL-UP|192.168.1.64|43421543|Archiurbain|   SINGLE|            FR|                FR|           FR|               OFF|F83B1D937298|F83B1D937298|            NONE|               Android 9|        Magazine|       1080000.0|                43421543|        Architecture|  Archiurbain|                true|        SD|              202111203342|1638173520000|                      FIXED|rtp://239.255.1.1...|                   IGMP|  MULTICAST|  LINEAR|              R|                     true|     387031.0|VIEWING|             0.0|F83B1D937298_1638...|2021-11-29|   2021-11-29|
|   true|      HD TV|  PLAY|      START|        BX1|   EVENT|                   UID50075|  UID50075|                     4.81.1|             PXM-SW-3.80.0|               02BRA0|     BEPBXL1|F83B1D937298|   LINEAR|        278.0|1638173907037|   VIEWING|       278.0|             true|             4727286|     v7|      CHANNEL-UP|192.168.1.64|43421543|Archiurbain|   SINGLE|            FR|                FR|           FR|               OFF|F83B1D937298|F83B1D937298|            NONE|               Android 9|        Magazine|       1080000.0|                43421543|        Architecture|  Archiurbain|                true|        SD|              202111203342|1638173520000|                      FIXED|rtp://239.255.1.1...|                   IGMP|  MULTICAST|  LINEAR|              R|                     true|     387031.0|VIEWING|             0.0|F83B1D937298_1638...|2021-11-29|   2021-11-29|
|   true|      HD TV|  PLAY|      START|        BX1|   EVENT|                   UID50075|  UID50075|                     4.81.1|             PXM-SW-3.80.0|               02BRA0|     BEPBXL1|F83B1D937298|   LINEAR|        278.0|1638173907037|   VIEWING|       278.0|             true|             4727286|     v7|      CHANNEL-UP|192.168.1.64|43421543|Archiurbain|   SINGLE|            FR|                FR|           FR|               OFF|F83B1D937298|F83B1D937298|            NONE|               Android 9|        Magazine|       1080000.0|                43421543|        Architecture|  Archiurbain|                true|        SD|              202111203342|1638173520000|                      FIXED|rtp://239.255.1.1...|                   IGMP|  MULTICAST|  LINEAR|              R|                     true|     387031.0|VIEWING|             0.0|F83B1D937298_1638...|2021-11-29|   2021-11-29|
|   true|      HD TV|  PLAY|      START|        BX1|   EVENT|                   UID50075|  UID50075|                     4.81.1|             PXM-SW-3.80.0|               02BRA0|     BEPBXL1|F83B1D937298|   LINEAR|        278.0|1638173907037|   VIEWING|       278.0|             true|             4727286|     v7|      CHANNEL-UP|192.168.1.64|43421543|Archiurbain|   SINGLE|            FR|                FR|           FR|               OFF|F83B1D937298|F83B1D937298|            NONE|               Android 9|        Magazine|       1080000.0|                43421543|        Architecture|  Archiurbain|                true|        SD|              202111203342|1638173520000|                      FIXED|rtp://239.255.1.1...|                   IGMP|  MULTICAST|  LINEAR|              R|                     true|     387031.0|VIEWING|             0.0|F83B1D937298_1638...|2021-11-29|   2021-11-29|
+-------+-----------+------+-----------+-----------+--------+---------------------------+----------+---------------------------+--------------------------+---------------------+------------+------------+---------+-------------+-------------+----------+------------+-----------------+--------------------+-------+----------------+------------+--------+-----------+---------+--------------+------------------+-------------+------------------+------------+------------+----------------+------------------------+----------------+----------------+------------------------+--------------------+-------------+--------------------+----------+--------------------------+-------------+---------------------------+--------------------+-----------------------+-----------+--------+---------------+-------------------------+-------------+-------+----------------+--------------------+----------+-------------+
only showing top 4 rows

Every column that holds boolean values (for example hd_conn) ends up as null, even though all columns have been cast to StringType. In other words, the type of hd_conn is StringType at the time of the write.

After writing to disk, I get only nulls for these columns. An example of the output is shown below:

{"hd_conn":"null","stb_profile":"HD TV","action":"EXITED","action_type":"STOP","advertisement_reference":"null","app_duration":"4.5691844E7","app_identifier":"F83B1D16ED51_1649207767570","boot_reason":"null","call_letter":"null","category":"EVENT","channel_external_identifier":"null","channel_id":"null","channel_inactivity_threshold":"null","client_software_version":"null","client_software_version_app":"4.92.2","client_software_version_mw":"PXM-SW-3.84.0","company_location_name":"03KAP0","company_site":"BEPVLA2","connected_deflect_state_action":"null","consecutive_playout_indicator":"null","currency_unit":"null","device_identifier":"null","device_name":"F83B1D16ED51","episode_number":"null","episode_title":"null","error_code":"null","error_key":"null","error_message":"null","error_type":"null","event_cat":"Proximus Pickx","event_counter":"50574.0","event_time":"1649253459414","eventcounter":"50574.0","exception_message":"null","exception_name":"null","external_id":"null","first_sign_of_life":"null","genre":"null","group_title":"null","hdmi_connectivity":"null","household_identifier":"1457100","hw_type":"v7","into_standby":"null","into_standby_elapsed_time":"null","invocation_point":"null","ip_address":"192.168.1.65","isolation_state_action":"null","item_id":"be.px.stbtvclient","item_title":"null","item_type":"null","language_audio":"NL","language_interface":"NL","language_shop":"NL","language_subtitles":"OFF","last_activity_timestamp":"null","last_known_heart_beat":"null","mac":"F83B1D16ED51","mac_address":"F83B1D16ED51","oauth2_client_id":"NONE","operation_system_version":"Android 
9","operational_mode":"null","out_of_standby":"null","out_standby_elapsed_time":"null","period_of_inactivity":"null","playout_cat":"null","playout_url":"null","post_cycle_state":"null","power_cycle_type":"null","pre_cycle_state":"null","preview":"null","previous_state_uptime":"null","program_category":"null","program_duration":"null","program_reference_number":"null","program_sub_category":"null","program_title":"null","recommendations_flag":"true","remote_connected":"false","rental_cost":"null","resolution":"null"}

I suspect that Parquet is applying some optimization that causes this issue, but I have not been able to confirm it.

Any help would be appreciated.

I am using HDP version 2.6.5 & Spark 2.3.
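
To separate the Parquet write itself from anything that happens downstream, a minimal local round trip can help. The sketch below is not the production job: it assumes a local SparkSession, a hypothetical two-column DataFrame, and a temporary output directory, and it reuses only the cast, lower-casing, and write options from the code above. It writes string-cast booleans to Parquet, reads the files back directly with Spark, and counts the nulls in hd_conn.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.StringType

object ParquetBooleanRoundTrip {
  def main(args: Array[String]): Unit = {
    // Local session for the experiment only (assumption: not the cluster setup).
    val spark = SparkSession.builder().master("local[1]").appName("roundtrip").getOrCreate()
    import spark.implicits._

    // Hypothetical sample data with a boolean column and mixed-case names.
    val df = Seq((true, "HD TV"), (false, "SD TV")).toDF("HD_Conn", "STB_Profile")

    // Same transformation as in the question: cast to string, lower-case the names.
    val asStrings = df.select(df.columns.map(c => col(c).cast(StringType).as(c.toLowerCase)): _*)

    // Temporary output location (assumption: local file system, not HDFS).
    val path = java.nio.file.Files.createTempDirectory("roundtrip").resolve("out").toString
    asStrings.write.option("compression", "none").mode("append").parquet(path)

    // Read the files back directly, bypassing any Hive table definition.
    val readBack = spark.read.parquet(path)
    readBack.printSchema()
    val nulls = readBack.filter(col("hd_conn").isNull).count()
    println(s"null hd_conn after read-back: $nulls")

    spark.stop()
  }
}
```

If the values survive this round trip, the nulls are more likely introduced by how the files are read afterwards (for example a table schema that differs from the files under the append path) than by the Parquet writer; this is a guess to be checked, not a confirmed cause.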

apache-spark

hdfs

bigdata

parquet

hdp

0 Answers
