Exporting a local file to a Hive table in ODI:
In this post we will discuss loading a local file into Hive using ODI's built-in KMs. Before you start, I assume you have fair knowledge of HDFS, Hive, Linux, and MapReduce. You don't need to be an expert, but having an idea of these will make you more comfortable. In this post I am using the Oracle Big Data Lite VM 4.1, which comes with most of the required software and Hadoop ecosystem components pre-installed. Click on this link to download it if you have not done so already.
You also need VirtualBox to run this VM. Check here to find the software for your OS.
Once you are logged into the VM, first create a file with some contents.
[oracle@bigdatalite ~]$ mkdir /tmp/src_files
[oracle@bigdatalite ~]$ chmod 775 /tmp/src_files
[oracle@bigdatalite ~]$ chmod 775 /tmp/src_files -R
[oracle@bigdatalite src_files]$ pwd
/tmp/src_files
[oracle@bigdatalite src_files]$ cat emp
empid|empname|empsal|emploc
11111|alex|20000|bangalore
22222|swati|34000|delhi
12312|joseph|23000|chennai
11456|unmesh|29000|pune
23890|sristi|41200|mumbai
24980|simon|44444|kolkata
[oracle@bigdatalite src_files]$
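Before wiring anything up in ODI, it can help to sanity-check the delimited file from the shell. This is an optional sketch; the path and contents match the session above:

```shell
# Recreate the sample file and verify it parses cleanly on the '|' delimiter
mkdir -p /tmp/src_files
cat > /tmp/src_files/emp <<'EOF'
empid|empname|empsal|emploc
11111|alex|20000|bangalore
22222|swati|34000|delhi
12312|joseph|23000|chennai
11456|unmesh|29000|pune
23890|sristi|41200|mumbai
24980|simon|44444|kolkata
EOF
# Count data rows, excluding the header line (should be 6)
awk 'NR > 1' /tmp/src_files/emp | wc -l
# Flag any row that does not have exactly four '|'-separated fields
awk -F'|' 'NF != 4 {print "bad row: " $0}' /tmp/src_files/emp
```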
Once this is done, we have to create a table in Hive. Go to the hive prompt and write a script similar to the structure below. You can see the DDL of my table.
[oracle@bigdatalite ~]$ hive
15/06/14 16:47:13 WARN conf.HiveConf: DEPRECATED: Configuration property hive.metastore.local no longer has any effect. Make sure to provide a valid value for hive.metastore.uris if you are connecting to a remote metastore.
Logging initialized using configuration in jar:file:/usr/lib/hive/lib/hive-common-0.13.1-cdh5.3.0.jar!/hive-log4j.properties
hive> show create table empinfo;
OK
CREATE TABLE `empinfo`(
  `id` int,
  `name` string,
  `salary` string,
  `location` string)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '|'
  LINES TERMINATED BY '\n'
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  'hdfs://bigdatalite.localdomain:8020/user/hive/warehouse/empinfo'
TBLPROPERTIES (
  'COLUMN_STATS_ACCURATE'='false',
  'numFiles'='1',
  'numRows'='-1',
  'rawDataSize'='-1',
  'skip.header.line.count'='1',
  'totalSize'='182',
  'transient_lastDdlTime'='1434313299')
Time taken: 0.902 seconds, Fetched: 22 row(s)
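If you prefer to type the DDL yourself rather than read it back from show create table, the essential statement boils down to the following sketch (Hive fills in the input/output format classes, location, and most table properties shown above automatically):

```sql
-- Minimal HiveQL to create the target table for the '|'-delimited file
CREATE TABLE empinfo (
  id       INT,
  name     STRING,
  salary   STRING,
  location STRING)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '|'
  LINES TERMINATED BY '\n'
STORED AS TEXTFILE
TBLPROPERTIES ('skip.header.line.count'='1');
```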
Since we have a file as the source, follow the same procedure we used earlier to process delimited files: create a data server, physical schema, logical schema, and a model to reverse-engineer the file.
Initially you might see the hive2 JDBC driver used for Hive data servers, but I have to go with weblogic.jdbc.hive.HiveDriver as I have applied the recent patch released for the ODI 12c Advanced Big Data Option. If you don't use the latter, you will see an error like the one below for all existing Hive-based mappings.
ODI-1228: Task Create target table -IKM File to Hive (LOAD DATA)- fails on the target connection Hive.
Caused By: java.sql.SQLException: Method not supported
If you have not applied the patch, go with the hive2 JDBC driver.
Once the Hive physical and logical schemas are created, you can follow the standard reverse-engineering process to reverse all Hive tables.
Drag the file data store, which will be our source, and the Hive data store, which will be the target, into the new mapping.
Select the LKM File to Hive LOAD DATA Direct. Note that in the Direct case we don't need an IKM. To learn more about this, refer to the ODI 12c Big Data book. Also, since our file is not in HDFS, we have to uncheck the FILE_IS_LOCAL option as highlighted below.
Don't forget to validate the mapping by clicking the green tick. This will show you if there are any issues with your mapping.
As you can see, all the records got loaded into our Hive table. No need to write HiveQL and no need to be a master of MapReduce. Isn't it easy? But wait a second! Don't you see the header record as part of the data? Though we set the number of header lines to one in the file data store, the load still picks it up.
No worries, the product team has given us an option to deal with this. In such a scenario, we have to override the row format of the target table so that the first record is skipped.
Go to the OVERRIDE_ROW_FORMAT option and add the text below.
row format delimited
fields terminated by '|'
lines terminated by '\n'
stored as textfile
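For context, the same effect can also be achieved manually in Hive through the skip.header.line.count table property, which you can see in the DDL earlier in this post. The snippet below is a sketch of how you would set it by hand at the hive prompt:

```sql
-- Tell Hive to ignore the first line of each file when reading this table
ALTER TABLE empinfo SET TBLPROPERTIES ('skip.header.line.count'='1');
```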
Now let's drop the target table at the hive prompt and execute the mapping again.
hive> drop table empinfo;
OK
Time taken: 0.178 seconds
hive>
There you go. The header record is no longer in the target data store. 🙂
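If you want to double-check from the hive prompt, a quick query like the sketch below will do (with the sample file above, the header row should be gone and six data rows should remain):

```sql
SELECT * FROM empinfo LIMIT 3;   -- the header row should not appear
SELECT COUNT(*) FROM empinfo;    -- expect 6 for the sample file
```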
We can also use LKM SQL Multi-Connect together with IKM File to Hive (LOAD DATA) to achieve the same result as above.
Performance evaluation for 6,120,000 records in standalone mode:
KM Combination                                       | Start Time      | Stop Time       | Total Time (sec)
LKM SQL Multi-Connect, IKM File to Hive (LOAD DATA)  | 6/21/2015 15:09 | 6/21/2015 15:10 | 56
LKM File to Hive LOAD DATA Direct                    | 6/21/2015 15:43 | 6/21/2015 15:43 | 3
As you can see, LOAD DATA Direct is roughly 18 times faster (56 seconds vs. 3 seconds) than the LKM and IKM combination.
That's it for today. In the coming days I will also post on how to load from RDBMS to HDFS, RDBMS to HBase, Hive to RDBMS, and a lot more. Also, don't forget to check the fantastic post by Jerome on the recent ODI Big Data patch release.