Tuesday, October 27, 2015

Inspecting an Apache Parquet File

We had a customer issue where a customer was trying to query a Parquet file from Hive and the query was failing. We later found that problems in the Parquet file itself were causing the error. This post describes how to examine a Parquet file.

We use Apache Parquet Tools (parquet-tools) to inspect the Parquet file. You can download parquet-tools-1.6.0.jar from the link below.




1)  Getting the metadata information from the parquet file

Command: hadoop jar <parquet-tools-x.jar> meta <parquetFile>

[hdfs@rvm sqoop]$ hadoop jar /opt/nish/parquet-tools-1.6.0.jar meta /user/biadmin/Par/15ebcea5-50d3-441a-a79f-7314d691585f.parquet

WARNING: Use "yarn jar" to launch YARN applications.
file: hdfs://rvm.test.com:8020/user/biadmin/Par/15ebcea5-50d3-441a-a79f-7314d691585f.parquet
creator: parquet-mr (build 27f71a18579ebac6db2b0e9ac758d64288b6dbff)
extra: avro.schema = {"type":"record","name":"employeet","namespace":"bigsql","doc":"bigsql.employeet","fields":[{"name":"ID","type":["null","int"],"default":null,"columnName":"ID","sqlType":"4"},{"name":"NAME","type":["null","string"],"default":null,"columnName":"NAME","sqlType":"12"},{"name":"JOBROLE","type":["null","string"],"default":null,"columnName":"JOBROLE","sqlType":"12"}],"tableName":"bigsql.employeet"}

file schema: bigsql.employeet
--------------------------------------------------------------------------------
ID: OPTIONAL INT32 R:0 D:1
NAME: OPTIONAL BINARY O:UTF8 R:0 D:1
JOBROLE: OPTIONAL BINARY O:UTF8 R:0 D:1

row group 1: RC:4 TS:163 OFFSET:4
--------------------------------------------------------------------------------
ID: INT32 SNAPPY DO:0 FPO:4 SZ:41/39/0.95 VC:4 ENC:PLAIN,BIT_PACKED,RLE
NAME: BINARY SNAPPY DO:0 FPO:45 SZ:56/70/1.25 VC:4 ENC:PLAIN,BIT_PACKED,RLE
JOBROLE: BINARY SNAPPY DO:0 FPO:101 SZ:48/54/1.13 VC:4 ENC:PLAIN,BIT_PACKED,RLE
[hdfs@rvm sqoop]$

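In this output:
RC is the row count of the row group and VC is the value count of a column.
TS is the total byte size of the row group. Here it equals the sum of the uncompressed column sizes (39 + 70 + 54 = 163).
SZ:{x}/{y}/{z} gives sizes in bytes: x = compressed total, y = uncompressed total, z = y:x ratio (for NAME, 70/56 = 1.25).
Each column line also shows the compression codec (SNAPPY here) and the encodings used (ENC).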
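
When a file has many columns, it can help to pull these fields out programmatically. The following is a minimal sketch, assuming the column lines look exactly like the ones printed above. The function name parse_column_line and the dictionary keys are illustrative choices, not part of parquet-tools.

```python
import re

# Matches one column line from the row-group section of the `meta` output, e.g.
# "NAME: BINARY SNAPPY DO:0 FPO:45 SZ:56/70/1.25 VC:4 ENC:PLAIN,BIT_PACKED,RLE"
COLUMN_LINE = re.compile(
    r"^(?P<name>\S+):\s+(?P<type>\S+)\s+(?P<codec>\S+)\s+"
    r"DO:(?P<do>\d+)\s+FPO:(?P<fpo>\d+)\s+"
    r"SZ:(?P<compressed>\d+)/(?P<uncompressed>\d+)/(?P<ratio>[\d.]+)\s+"
    r"VC:(?P<vc>\d+)\s+ENC:(?P<enc>\S+)$"
)


def parse_column_line(line):
    """Return the fields of one column line as a dict, or None if it does not match."""
    m = COLUMN_LINE.match(line.strip())
    if m is None:
        return None
    compressed = int(m.group("compressed"))
    uncompressed = int(m.group("uncompressed"))
    return {
        "name": m.group("name"),
        "type": m.group("type"),
        "codec": m.group("codec"),
        "compressed_bytes": compressed,
        "uncompressed_bytes": uncompressed,
        # z in SZ:{x}/{y}/{z} is y:x, i.e. uncompressed divided by compressed
        "ratio": uncompressed / compressed if compressed else None,
        "reported_ratio": float(m.group("ratio")),
        "value_count": int(m.group("vc")),
        "encodings": m.group("enc").split(","),
    }


# Column lines copied from the `meta` output above
meta_lines = [
    "ID: INT32 SNAPPY DO:0 FPO:4 SZ:41/39/0.95 VC:4 ENC:PLAIN,BIT_PACKED,RLE",
    "NAME: BINARY SNAPPY DO:0 FPO:45 SZ:56/70/1.25 VC:4 ENC:PLAIN,BIT_PACKED,RLE",
    "JOBROLE: BINARY SNAPPY DO:0 FPO:101 SZ:48/54/1.13 VC:4 ENC:PLAIN,BIT_PACKED,RLE",
]

columns = [parse_column_line(line) for line in meta_lines]
for col in columns:
    print(col["name"], col["codec"], col["compressed_bytes"],
          col["uncompressed_bytes"], col["value_count"])

# Sum of the uncompressed sizes; for this file it matches TS:163
total_uncompressed = sum(col["uncompressed_bytes"] for col in columns)
print("total uncompressed bytes:", total_uncompressed)
```

A line that does not follow this layout returns None, so headers and log lines can be passed through the same function and filtered out.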

The Parquet file format is described at https://parquet.apache.org/documentation/latest/



[Figure: Parquet file format. Graphic sourced: http://tinyurl.com/o22gtck]

2) Getting the schema from the parquet file

Command: hadoop jar <parquet-tools-x.jar> schema <parquetFile>

[hdfs@rvm sqoop]$ hadoop jar /opt/nish/parquet-tools-1.6.0.jar schema /user/biadmin/Par/15ebcea5-50d3-441a-a79f-7314d691585f.parquet
WARNING: Use "yarn jar" to launch YARN applications.
message bigsql.employeet {
optional int32 ID;
optional binary NAME (UTF8);
optional binary JOBROLE (UTF8);
}

[hdfs@rvm sqoop]$
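
When debugging a failing Hive query, one useful check is whether the column names and types in the file match what the table expects. Below is a rough sketch, assuming a flat schema of primitive fields like the one above. Nested groups and parameterized annotations are not handled, and parse_schema and expected_types are hypothetical names introduced only for illustration.

```python
import re

# Matches one primitive field line of the `schema` output, e.g.
# "optional binary NAME (UTF8);"
FIELD_LINE = re.compile(
    r"^(?P<repetition>optional|required|repeated)\s+"
    r"(?P<type>\w+)\s+(?P<name>\w+)"
    r"(?:\s+\((?P<annotation>\w+)\))?;$"
)


def parse_schema(text):
    """Return one dict per primitive field found in the schema text."""
    fields = []
    for raw in text.splitlines():
        m = FIELD_LINE.match(raw.strip())
        if m:
            fields.append(m.groupdict())
    return fields


# Output copied from the `schema` command above
schema_output = """\
WARNING: Use "yarn jar" to launch YARN applications.
message bigsql.employeet {
optional int32 ID;
optional binary NAME (UTF8);
optional binary JOBROLE (UTF8);
}
"""

fields = parse_schema(schema_output)
for field in fields:
    print(field)

# Hypothetical expectation, e.g. derived by hand from the Hive table definition
expected_types = {"ID": "int32", "NAME": "binary", "JOBROLE": "binary"}
actual_types = {f["name"]: f["type"] for f in fields}
mismatches = {
    name: (expected, actual_types.get(name))
    for name, expected in expected_types.items()
    if actual_types.get(name) != expected
}
print("mismatches:", mismatches)
```

An empty mismatches dictionary means every expected column is present with the expected physical type. A missing column shows up with None as its actual type.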



3) Displaying the content of the parquet file

Command: hadoop jar <parquet-tools-x.jar> cat <parquetFile>

[hdfs@rvm sqoop]$ hadoop jar /opt/nish/parquet-tools-1.6.0.jar cat /user/biadmin/Par/15ebcea5-50d3-441a-a79f-7314d691585f.parquet
WARNING: Use "yarn jar" to launch YARN applications.
15/10/27 18:49:43 INFO compress.CodecPool: Got brand-new decompressor [.snappy]
ID = 3
NAME = nisanth2
JOBROLE = dev2

ID = 1
NAME = nisanth
JOBROLE = dev

ID = 4
NAME = nisanth3
JOBROLE = dev3

ID = 2
NAME = nisanth1
JOBROLE = dev1

[hdfs@rvm sqoop]$
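
The cat output can be turned into records and cross-checked against the metadata, for example to confirm that the number of records matches RC. This is a small sketch, assuming flat records printed as "KEY = value" lines separated by blank lines, as shown above. parse_records is an illustrative helper, not something parquet-tools provides, and all values are kept as strings.

```python
import re

# A record field is printed as "<name> = <value>"
FIELD = re.compile(r"^(\w+) = (.*)$")


def parse_records(text):
    """Group "KEY = value" lines into dicts; a blank line ends a record."""
    records, current = [], {}
    for raw in text.splitlines():
        line = raw.strip()
        m = FIELD.match(line)
        if m:
            current[m.group(1)] = m.group(2)
        elif not line and current:
            records.append(current)
            current = {}
        # any other line (warnings, log messages, prompts) is ignored
    if current:
        records.append(current)
    return records


# Output copied from the `cat` command above
cat_output = """\
WARNING: Use "yarn jar" to launch YARN applications.
15/10/27 18:49:43 INFO compress.CodecPool: Got brand-new decompressor [.snappy]
ID = 3
NAME = nisanth2
JOBROLE = dev2

ID = 1
NAME = nisanth
JOBROLE = dev

ID = 4
NAME = nisanth3
JOBROLE = dev3

ID = 2
NAME = nisanth1
JOBROLE = dev1
"""

records = parse_records(cat_output)
print("record count:", len(records))
for record in records:
    print(record)
```

For this file the record count can be compared with RC:4 from the meta output. A difference would suggest that the file is truncated or that the output was cut off.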



4) Getting the first few records

Command: hadoop jar <parquet-tools-x.jar> head -n <noOfRecords> <parquetFile>

[hdfs@rvm sqoop]$ hadoop jar /opt/nish/parquet-tools-1.6.0.jar head -n 2 /user/biadmin/Par/15ebcea5-50d3-441a-a79f-7314d691585f.parquet
WARNING: Use "yarn jar" to launch YARN applications.
15/10/27 18:50:54 INFO compress.CodecPool: Got brand-new decompressor [.snappy]
ID = 3
NAME = nisanth2
JOBROLE = dev2

ID = 1
NAME = nisanth
JOBROLE = dev

[hdfs@rvm sqoop]$


These inspections help you determine the number of records, the compression used, and the column metadata when debugging issues related to Parquet files.
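
If these checks are run often, the four commands can be wrapped in a small script. The sketch below only assembles the same command lines used in this post. parquet_tools_command and run_parquet_tools are hypothetical helper names, and the sketch assumes hadoop is on the PATH of the machine where run_parquet_tools is called.

```python
import subprocess


def parquet_tools_command(jar, subcommand, parquet_file, *options):
    """Build: hadoop jar <jar> <subcommand> [options] <parquetFile>."""
    return ["hadoop", "jar", jar, subcommand, *options, parquet_file]


def run_parquet_tools(jar, subcommand, parquet_file, *options):
    """Run the command and return its standard output as text.

    Assumption: warnings and log lines may appear on stdout or stderr
    depending on the cluster's logging setup, so callers should be ready
    to skip them.
    """
    cmd = parquet_tools_command(jar, subcommand, parquet_file, *options)
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout


# Paths taken from the examples in this post
jar = "/opt/nish/parquet-tools-1.6.0.jar"
parquet_file = "/user/biadmin/Par/15ebcea5-50d3-441a-a79f-7314d691585f.parquet"

# Print the four command lines without executing them
for subcommand, options in [("meta", ()), ("schema", ()), ("cat", ()), ("head", ("-n", "2"))]:
    print(" ".join(parquet_tools_command(jar, subcommand, parquet_file, *options)))
```

The loop only prints the commands, so the script can be reviewed safely on a machine without Hadoop. Calling run_parquet_tools on a cluster node is what actually executes them.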