Working with MapR-FS and Hive using Talend

Talend Open Studio for Data Integration Version: 6.3.1
Java: 1.8
OS: Windows 8

This article does not cover the installation and setup of Talend Open Studio. It assumes Talend is already installed and working correctly. For details on how to install and configure Talend Open Studio, see this post.

In our previous post we discussed MapR cluster configuration in Talend and verified that we can connect to the name node and resource manager without any issues. Today we will use the same connection to connect to MapR-FS and perform some file operations. Then we will create a Hive table and load data into it from another table.

Right-click the cluster and create a new HDFS connection and a new Hive connection, as shown below.

This is what my metadata looks like.

Now we will create a job and use the connection we created above. Let's use the tHDFSPut component and fill in the required parameters. This copies a file from the local system to HDFS. If you don't know the HDFS directory structure, run hadoop fs -ls / on the cluster:

[mapr@maprdemo ~]$ hadoop fs -ls /
Found 10 items
drwxr-xr-x - mapr mapr 0 2017-04-21 10:44 /apps
drwxr-xr-x - mapr mapr 6 2017-04-21 18:55 /data
drwxr-xr-x - root root 4 2017-04-21 18:55 /drill-beta-demo
drwxr-xr-x - mapr mapr 0 2017-04-21 10:43 /hbase
drwxr-xr-x - mapr mapr 1 2017-09-02 08:42 /home
drwxr-xr-x - mapr mapr 0 2017-04-21 10:47 /opt
drwxr-xr-x - root root 3 2017-04-21 18:57 /tables
drwxrwxrwx - mapr mapr 0 2017-09-02 08:22 /tmp
drwxr-xr-x - mapr mapr 9 2017-09-02 10:59 /user
drwxr-xr-x - mapr mapr 1 2017-04-21 10:44 /var

In the screenshot above, I am copying the local file D:/Thrash/employee.txt into an HDFS directory and renaming it to emp_hdfs.txt.
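For readers who think in terms of the command line, here is a minimal sketch of how the tHDFSPut fields (local directory, file name, HDFS directory, new name) would map onto an equivalent hadoop fs -put call. The class and method names are made up for illustration, and the target directory /user/bhabani is taken from the directory listing later in this article. The sketch only builds the command; it does not run it.

```java
import java.util.Arrays;
import java.util.List;

/** Sketch: maps tHDFSPut-style fields onto an equivalent `hadoop fs -put` command. */
final class HdfsPutSketch {

    /** Joins an HDFS directory and a file name with exactly one '/'. */
    static String hdfsTarget(String hdfsDir, String newName) {
        String dir = hdfsDir.endsWith("/")
                ? hdfsDir.substring(0, hdfsDir.length() - 1)
                : hdfsDir;
        return dir + "/" + newName;
    }

    /** Builds the argument list for: hadoop fs -put <local file> <HDFS target>. */
    static List<String> putCommand(String localDir, String fileName,
                                   String hdfsDir, String newName) {
        String local = localDir.endsWith("/")
                ? localDir + fileName
                : localDir + "/" + fileName;
        return Arrays.asList("hadoop", "fs", "-put", local, hdfsTarget(hdfsDir, newName));
    }

    public static void main(String[] args) {
        // Values from the article; /user/bhabani is the assumed target directory.
        List<String> cmd = putCommand("D:/Thrash", "employee.txt",
                                      "/user/bhabani", "emp_hdfs.txt");
        System.out.println(String.join(" ", cmd));
    }
}
```

The point is simply that the "new name" field becomes the last path segment of the HDFS target, which is why the file appears in HDFS as emp_hdfs.txt rather than employee.txt.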

Now let's execute it. Uh-oh, it failed.

[ERROR]: org.apache.hadoop.util.Shell - Failed to locate the winutils binary in the hadoop binary path

Remember that in the previous post I mentioned we would configure something at execution time? That is what we will do now. Go to the job's Run properties, open the Advanced settings, and give the path to the winutils location.
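Some background on why this helps: on Windows, Hadoop's Shell class looks for bin\winutils.exe under the folder named by the hadoop.home.dir system property (or the HADOOP_HOME environment variable). In a Talend job this is typically supplied as a JVM argument of the form -Dhadoop.home.dir=<folder>; treat the exact field name in your Talend version as an assumption. The sketch below is not part of the job itself. It is a small standalone check, with a hypothetical D:/hadoop folder as the default, that a folder has the layout Hadoop expects before you point the job at it.

```java
import java.io.File;

/** Sketch: checks that a folder has the bin/winutils.exe layout Hadoop expects. */
final class WinutilsCheck {

    /** Returns true if <hadoopHome>/bin/winutils.exe exists as a regular file. */
    static boolean hasWinutils(File hadoopHome) {
        return new File(new File(hadoopHome, "bin"), "winutils.exe").isFile();
    }

    /** Sets hadoop.home.dir only when the folder layout is valid. */
    static boolean configure(File hadoopHome) {
        if (!hasWinutils(hadoopHome)) {
            return false;
        }
        System.setProperty("hadoop.home.dir", hadoopHome.getAbsolutePath());
        return true;
    }

    public static void main(String[] args) {
        // D:/hadoop is a hypothetical location; pass your own folder as the first argument.
        File home = new File(args.length > 0 ? args[0] : "D:/hadoop");
        if (configure(home)) {
            System.out.println("hadoop.home.dir=" + System.getProperty("hadoop.home.dir"));
        } else {
            System.out.println("winutils.exe not found under " + new File(home, "bin"));
        }
    }
}
```

Note that the property must point at the parent of the bin folder, not at bin itself or at winutils.exe.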

Now we are good to go. After execution you can see the file emp_hdfs.txt in HDFS.

[mapr@maprdemo ~]$ hadoop fs -ls /user/bhabani
Found 3 items
drwxr-xr-x - mapr mapr 0 2017-09-02 13:33 /user/bhabani/destination
-rwxr-xr-x 1 root root 17256960 2017-09-02 23:07 /user/bhabani/emp_hdfs.txt
-rwxrwxrwx 1 mapr mapr 53 2017-09-01 23:28 /user/bhabani/testfile.txt
[mapr@maprdemo ~]$

Now let's do something with Hive. We will use tHiveRow to create a table and load some data into it. Drag two tHiveRow components onto the job and connect them. Change the property type to Repository and select the Hive connection we created earlier. This fills in the connection parameters automatically.

Now let's add some queries to the components. I already have a table called orders_subset that contains some rows. We will load this data into the newly created table.
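The exact queries are not reproduced in this post, so here is a minimal sketch of what the two tHiveRow query fields could contain. The query field in Talend holds a Java string expression, which is why the statements are written as Java strings. The target table name orders_copy is hypothetical, and CREATE TABLE ... LIKE is used here only so that no column definitions for orders_subset have to be guessed; your own job may well list the columns explicitly.

```java
/** Sketch: HiveQL statements for the two tHiveRow components, as Java strings. */
final class HiveRowQueries {

    // orders_subset comes from the article; orders_copy is a hypothetical target name.
    static final String SOURCE = "orders_subset";
    static final String TARGET = "orders_copy";

    /** First tHiveRow: create an empty table with the same schema as the source. */
    static String createTable() {
        return "CREATE TABLE IF NOT EXISTS " + TARGET + " LIKE " + SOURCE;
    }

    /** Second tHiveRow: copy every row from the source into the new table. */
    static String loadTable() {
        return "INSERT INTO TABLE " + TARGET + " SELECT * FROM " + SOURCE;
    }

    public static void main(String[] args) {
        System.out.println(createTable());
        System.out.println(loadTable());
    }
}
```

Splitting the work across two components keeps the create step and the load step separate, so a failure in the load does not leave you wondering whether the table was created.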

Now execute the job and verify that the data has been loaded.

That's all for today. In my next post we will load data into tables that use Parquet as the storage type. As always, if you face any issues, do let me know. I will be happy to assist you.


About the author

Bhabani has 10 years of experience in data warehousing and analytics projects spanning multiple domains, including the call centre, banking and financial, and betting and gaming industries. He focuses on designing data warehouses and integrating them with cloud vendors such as AWS or GCP. He has rich expertise in Oracle Data Integrator, Talend Open Studio for Big Data, and Pervasive Data Integrator, and in reporting tools such as QlikView and OBIEE. He has excellent knowledge of Redshift, BigQuery, Python, Apache Airflow, and Kafka for ETL pipelines, and of the Hadoop ecosystem, including HDFS, MapReduce, Hive, Sqoop, Drill, and Impala, on Amazon and Google Cloud.
