Using Parquet file format with Impala and Hive tables on MapR cluster and Talend

Environment:
Talend Open Studio for Data Integration Version: 6.3.1
Java: 1.8
OS: Windows 8

This article does not cover the installation and setup of Talend Open Studio; the assumption is that Talend is already installed and working correctly. For details on how to install and configure Talend Open Studio, see this post.

In the previous post we played with file operations on MapR FS and then loaded data into Hive tables. Today we will use the Parquet file format as the storage type for those tables. If you do not know about Parquet, go through the official documentation. In one sentence: it stores nested data structures in a flat columnar format and can be used across the Hadoop ecosystem, including Hive, Impala, Pig, and Spark.

One thing I would like to share: I faced a lot of issues while working with Parquet, which led me to copy a lot of jars here and there. In the end I did a reset, removed all the jars I had added, and restarted the services. After that, everything started working. This is the MapR Community thread where I started the discussion on those issues.

Some of the errors that I came across:

Caused by: org.apache.hive.com.esotericsoftware.kryo.KryoException: Unable to find class: parquet.hive.DeprecatedParquetInputFormat
Serialization trace:

Caused by: java.lang.NoSuchFieldError: doubleTypeInfo
at org.apache.hadoop.hive.ql.io.parquet.serde.ArrayWritableObjectInspector.getObjectInspector(ArrayWritableObjectInspector.java:71)
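
If you hit the DeprecatedParquetInputFormat error above, it usually means something on the classpath (or in the table metadata) still points at the old parquet.hive.* classes instead of Hive's built-in Parquet support. A quick way to check what a table actually references, before copying any jars around (output abbreviated; the exact class names depend on your Hive version):

hive> describe formatted stockdata_par;
...
InputFormat:  org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat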

For demonstration, I have created the below tables in Hive and loaded data into one of them from a CSV file using the Hive shell. We could do the same with tHiveRow as well, but to keep things quick I took this approach.

create table stockdata(
sno string,
symboltype string,
open float,
high float,
low float,
close float,
volume bigint
) row format delimited
fields terminated by ','
lines terminated by '\n'
stored as textfile;

create table stockdata_par(
sno string,
symboltype string,
open float,
high float,
low float,
close float,
volume bigint
) stored as parquet;

To load data into the stockdata table, use the below command, provided the file exists in a local directory on the node.

load data local inpath '/home/mapr/stochexchange.txt' into table stockdata;
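
A quick sanity check after the load (the count will of course reflect your own file):

hive> select count(*) from stockdata;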

These tables can now be reverse-engineered in Talend. Go to the Hive connection and click Retrieve Schema, then drag the tables into a Talend job. For stockdata select tELTHiveInput and for stockdata_par select tELTHiveOutput, with a tELTHiveMap in between, and set up the mapping properly. The job will look like this.
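
Under the hood, the tELTHive* components simply generate a Hive INSERT ... SELECT statement. The job above boils down to something like the following (a sketch of the kind of query these components produce, not Talend's exact output):

insert into table stockdata_par
select sno, symboltype, open, high, low, close, volume
from stockdata;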

That's it. Now execute the job and verify that the data landed in the Parquet-backed table.

hive> select * from stockdata_par;
OK
22222222 A 23.7 21.7 67.3 76.5 3456
25222222 A 27.7 21.7 67.3 76.5 3656
22722222 A 25.7 21.7 67.3 76.5 3476
24222222 A 24.7 21.7 67.3 76.5 3459
22288222 A 22.7 21.7 67.3 76.5 3156
NULL NULL NULL NULL NULL NULL
Time taken: 0.281 seconds, Fetched: 6 row(s)
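
Note the all-NULL row at the end: that is most likely a trailing empty line in the source file rather than a real record. You can either clean the file before loading or filter such rows out, for example:

hive> select * from stockdata_par where volume is not null;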

OK, let's try the same with an Impala table. In the MapR sandbox, Impala is not installed by default; you have to install it manually. Here is the link for your reference.

Let's create a table in the Impala shell.

create table stockdata_par_imapala(
sno string,
symboltype string,
open float,
high float,
low float,
close float,
volume bigint
) stored as parquet;

Log in to the Impala shell and execute the above statement.

[root@maprdemo tmp]# impala-shell
Starting Impala Shell without Kerberos authentication
Connected to maprdemo:21000
Server version: impalad version 2.7.0 RELEASE (build a535b583202c4a81080098a10f952d377af1949d)
***********************************************************************************
Welcome to the Impala shell.
(Impala Shell v2.7.0 (a535b58) built on Wed Mar 29 21:13:56 UTC 2017)

Want to know what version of Impala you're connected to? Run the VERSION command to
find out!
***********************************************************************************
[maprdemo:21000] > connect 192.168.2.119
> ;
Connected to 192.168.2.119:21000
Server version: impalad version 2.7.0 RELEASE (build a535b583202c4a81080098a10f952d377af1949d)
[192.168.2.119:21000] >

[192.168.2.119:21000] > show tables;
Query: show tables
+-----------------------+
| name                  |
+-----------------------+
| customers             |
| orders                |
| orders_subset         |
| sample_07             |
| sample_08             |
| stockdata             |
| stockdata_par         |
| stockdata_par_imapala |
| web_logs              |
+-----------------------+
Fetched 14 row(s) in 0.18s
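
A side note: tables created through Hive do not always show up in Impala right away, because Impala caches the catalog. If a Hive-created table such as stockdata_par is missing from the list, refresh the metadata first:

[192.168.2.119:21000] > invalidate metadata;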

[192.168.2.119:21000] > desc stockdata_par_imapala;
Query: describe stockdata_par_imapala
+--------+--------+---------+
| name   | type   | comment |
+--------+--------+---------+
| sno    | string |         |
| symbol | string |         |
| open   | float  |         |
| high   | float  |         |
| low    | float  |         |
| close  | float  |         |
| volume | bigint |         |
+--------+--------+---------+
Fetched 7 row(s) in 0.33s

Similarly, create a new connection for Impala under Metadata > Db Connections.

Now create a job with tImpalaRow as shown below.
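
The query inside tImpalaRow is again a plain INSERT ... SELECT; something along these lines should do (a sketch, adjust the column list to your schema):

insert into stockdata_par_imapala
select * from stockdata;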

That's it. Now execute the job and verify that the data is available.

[192.168.2.119:21000] > select * from stockdata_par_imapala;
Query: select * from stockdata_par_imapala
Query submitted at: 2017-09-02 13:37:28 (Coordinator: http://maprdemo:25000)
Query progress can be monitored at: http://maprdemo:25000/query_plan?query_id=2d433f078e07a33f:7cca79e100000000
+----------+--------+-------------------+-------------------+-------------------+-------+--------+
| sno      | symbol | open              | high              | low               | close | volume |
+----------+--------+-------------------+-------------------+-------------------+-------+--------+
| 22222222 | A      | 23.70000076293945 | 21.70000076293945 | 67.30000305175781 | 76.5  | 3456   |
| 25222222 | A      | 27.70000076293945 | 21.70000076293945 | 67.30000305175781 | 76.5  | 3656   |
| 22722222 | A      | 25.70000076293945 | 21.70000076293945 | 67.30000305175781 | 76.5  | 3476   |
| 24222222 | A      | 24.70000076293945 | 21.70000076293945 | 67.30000305175781 | 76.5  | 3459   |
| 22288222 | A      | 22.70000076293945 | 21.70000076293945 | 67.30000305175781 | 76.5  | 3156   |
|          | NULL   | NULL              | NULL              | NULL              | NULL  | NULL   |
+----------+--------+-------------------+-------------------+-------------------+-------+--------+
Fetched 6 row(s) in 0.57s
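
One last observation: the long decimals (23.70000076293945 instead of 23.7) are not corruption, just the usual artifact of single-precision FLOAT columns. If you need clean display values, round in the query or declare the columns as DECIMAL instead, for example:

[192.168.2.119:21000] > select sno, round(open, 2) as open, round(high, 2) as high from stockdata_par_imapala;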

OK guys, that's it for today. Do let me know if you get stuck at any of the steps. See you in my next post. By the way, I forgot to show you both of my laptops used for these demonstrations. 😀 😀
