We had a customer issue where a query against a Parquet file from Hive was failing. We later found that problems within the Parquet file itself were causing the error. This blog covers how to examine a Parquet file.
We use the Apache parquet-tools utility to inspect the Parquet file. You can download parquet-tools-1.6.0.jar from the link below.
1) Getting the metadata of the parquet file
Command: hadoop jar <parquet-tools-x.jar> meta <parquetFile>
[hdfs@rvm sqoop]$ hadoop jar
/opt/nish/parquet-tools-1.6.0.jar meta
/user/biadmin/Par/15ebcea5-50d3-441a-a79f-7314d691585f.parquet
WARNING: Use "yarn jar" to
launch YARN applications.
file:
hdfs://rvm.test.com:8020/user/biadmin/Par/15ebcea5-50d3-441a-a79f-7314d691585f.parquet
creator: parquet-mr (build
27f71a18579ebac6db2b0e9ac758d64288b6dbff)
extra: avro.schema =
{"type":"record","name":"employeet","namespace":"bigsql","doc":"bigsql.employeet","fields":[{"name":"ID","type":["null","int"],"default":null,"columnName":"ID","sqlType":"4"},{"name":"NAME","type":["null","string"],"default":null,"columnName":"NAME","sqlType":"12"},{"name":"JOBROLE","type":["null","string"],"default":null,"columnName":"JOBROLE","sqlType":"12"}],"tableName":"bigsql.employeet"}
file schema: bigsql.employeet
--------------------------------------------------------------------------------
ID:      OPTIONAL INT32 R:0 D:1
NAME:    OPTIONAL BINARY O:UTF8 R:0 D:1
JOBROLE: OPTIONAL BINARY O:UTF8 R:0 D:1
row group 1: RC:4 TS:163 OFFSET:4
--------------------------------------------------------------------------------
ID:      INT32 SNAPPY DO:0 FPO:4 SZ:41/39/0.95 VC:4 ENC:PLAIN,BIT_PACKED,RLE
NAME:    BINARY SNAPPY DO:0 FPO:45 SZ:56/70/1.25 VC:4 ENC:PLAIN,BIT_PACKED,RLE
JOBROLE: BINARY SNAPPY DO:0 FPO:101 SZ:48/54/1.13 VC:4 ENC:PLAIN,BIT_PACKED,RLE
[hdfs@rvm sqoop]$
Here, RC refers to Row Count and VC refers to Value Count.
SZ:{x}/{y}/{z} - x = compressed total, y = uncompressed total, z = the y:x ratio
The metadata also shows the compression codec (SNAPPY here) and the encodings used for each column.
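As a sanity check on the SZ values above, the ratio z is just the uncompressed total divided by the compressed total. A minimal Python sketch, using the sizes taken from the meta output above (the exact rounding of ties may differ slightly from parquet-tools):

```python
# SZ:{x}/{y}/{z} from the meta output: x = compressed, y = uncompressed, z = y/x
columns = {
    "ID":      (41, 39),   # compressed > uncompressed: Snappy added overhead here
    "NAME":    (56, 70),
    "JOBROLE": (48, 54),
}

for name, (compressed, uncompressed) in columns.items():
    ratio = round(uncompressed / compressed, 2)
    print(f"{name}: SZ:{compressed}/{uncompressed}/{ratio}")
```

Note that a ratio below 1.0 (as for ID) means compression actually grew the column, which can happen for small chunks of already-compact data.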
You can learn more about the Parquet file format at https://parquet.apache.org/documentation/latest/
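Per the format documentation linked above, a valid Parquet file begins and ends with the 4-byte magic PAR1, so a quick first check on a file that parquet-tools refuses to read is whether those bytes are intact. A minimal stdlib-only sketch (the helper name is my own):

```python
def looks_like_parquet(path):
    """Check the 4-byte 'PAR1' magic at both the start and the end of the file."""
    with open(path, "rb") as f:
        header = f.read(4)
        f.seek(-4, 2)          # seek to 4 bytes before end-of-file
        footer = f.read(4)
    return header == b"PAR1" and footer == b"PAR1"
```

A missing trailing PAR1 usually indicates a truncated write, which is one way a file ends up unreadable from Hive.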
2) Getting the schema from the parquet file
Command: hadoop jar <parquet-tools-x.jar> schema <parquetFile>
[hdfs@rvm sqoop]$ hadoop jar
/opt/nish/parquet-tools-1.6.0.jar schema
/user/biadmin/Par/15ebcea5-50d3-441a-a79f-7314d691585f.parquet
WARNING: Use "yarn jar" to
launch YARN applications.
message bigsql.employeet {
optional int32 ID;
optional binary NAME (UTF8);
optional binary JOBROLE (UTF8);
}
[hdfs@rvm sqoop]$
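The same schema information is also embedded in the avro.schema extra metadata shown by the meta command earlier. For scripting, you can pull the column names and nullability out of that JSON with the standard library. A minimal sketch using an abridged copy of this file's schema string:

```python
import json

# Abridged avro.schema extra metadata, as shown by `parquet-tools meta` above
avro_schema = json.loads("""
{"type": "record", "name": "employeet", "namespace": "bigsql",
 "fields": [{"name": "ID",      "type": ["null", "int"]},
            {"name": "NAME",    "type": ["null", "string"]},
            {"name": "JOBROLE", "type": ["null", "string"]}]}
""")

for field in avro_schema["fields"]:
    # An Avro union type ["null", T] means the column is nullable,
    # which maps to OPTIONAL in the Parquet schema printed above
    nullable = "null" in field["type"]
    print(field["name"], field["type"][1], "optional" if nullable else "required")
```

This matches the schema output: all three columns are optional, with ID an int and the other two strings.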
3) Displaying the contents of the parquet file
Command: hadoop jar <parquet-tools-x.jar> cat <parquetFile>
[hdfs@rvm sqoop]$ hadoop jar
/opt/nish/parquet-tools-1.6.0.jar cat
/user/biadmin/Par/15ebcea5-50d3-441a-a79f-7314d691585f.parquet
WARNING: Use "yarn jar" to
launch YARN applications.
15/10/27 18:49:43 INFO
compress.CodecPool: Got brand-new decompressor [.snappy]
ID = 3
NAME = nisanth2
JOBROLE = dev2
ID = 1
NAME = nisanth
JOBROLE = dev
ID = 4
NAME = nisanth3
JOBROLE = dev3
ID = 2
NAME = nisanth1
JOBROLE = dev1
[hdfs@rvm sqoop]$
4) Getting the first few records
Command: hadoop jar <parquet-tools-x.jar> head -n <noOfRecords> <parquetFile>
[hdfs@rvm sqoop]$ hadoop jar
/opt/nish/parquet-tools-1.6.0.jar head -n 2
/user/biadmin/Par/15ebcea5-50d3-441a-a79f-7314d691585f.parquet
WARNING: Use "yarn jar" to
launch YARN applications.
15/10/27 18:50:54 INFO
compress.CodecPool: Got brand-new decompressor [.snappy]
ID = 3
NAME = nisanth2
JOBROLE = dev2
ID = 1
NAME = nisanth
JOBROLE = dev
[hdfs@rvm sqoop]$
These inspections help you understand the number of records, the compression used, the column metadata, and so on, which is useful when debugging Parquet-related issues.