Exporting local file to hive table in ODI

In this post we will discuss loading a local file into Hive using the ODI built-in KMs. Before you start, I am assuming you have fair knowledge of HDFS, Hive, Linux, and MapReduce. You don't need to be an expert, but having an idea of these will make you more comfortable. In this post I am using the Oracle Big Data Lite VM 4.1, which comes with most of the required software and Hadoop ecosystem components pre-installed. Click on this link to download it if you have not done so yet.

You also need Oracle VirtualBox to run this VM. Check here to find the software for your OS.

Once you are logged into the VM, first create a directory and a sample file with some content.

[oracle@bigdatalite ~]$ mkdir /tmp/src_files
[oracle@bigdatalite ~]$ chmod 775 /tmp/src_files
[oracle@bigdatalite ~]$ chmod -R 775 /tmp/src_files
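In case you want to reproduce the exact sample data, here is one way to create the file (a minimal sketch; any text editor works just as well, and the contents match the cat output below):

[oracle@bigdatalite ~]$ cd /tmp/src_files
[oracle@bigdatalite src_files]$ cat > emp << 'EOF'
empid|empname|empsal|emploc
11111|alex|20000|bangalore
22222|swati|34000|delhi
12312|joseph|23000|chennai
11456|unmesh|29000|pune
23890|sristi|41200|mumbai
24980|simon|44444|kolkata
EOF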

[oracle@bigdatalite src_files]$ pwd
/tmp/src_files
[oracle@bigdatalite src_files]$ cat emp
empid|empname|empsal|emploc
11111|alex|20000|bangalore
22222|swati|34000|delhi
12312|joseph|23000|chennai
11456|unmesh|29000|pune
23890|sristi|41200|mumbai
24980|simon|44444|kolkata
[oracle@bigdatalite src_files]$

Once this is done, we have to create a table in Hive. Go to the hive prompt and write a script similar to the structure below. You can see the DDL of my table.

[oracle@bigdatalite ~]$ hive
15/06/14 16:47:13 WARN conf.HiveConf: DEPRECATED: Configuration property hive.metastore.local no longer has any effect. Make sure to provide a valid value for hive.metastore.uris if you are connecting to a remote metastore.

Logging initialized using configuration in jar:file:/usr/lib/hive/lib/hive-common-0.13.1-cdh5.3.0.jar!/hive-log4j.properties
hive> show create table empinfo;
OK
CREATE TABLE `empinfo`(
`id` int,
`name` string,
`salary` string,
`location` string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
LINES TERMINATED BY '\n'
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
'hdfs://bigdatalite.localdomain:8020/user/hive/warehouse/empinfo'
TBLPROPERTIES (
'COLUMN_STATS_ACCURATE'='false',
'numFiles'='1',
'numRows'='-1',
'rawDataSize'='-1',
'skip.header.line.count'='1',
'totalSize'='182',
'transient_lastDdlTime'='1434313299')
Time taken: 0.902 seconds, Fetched: 22 row(s)
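If you want to create the same table yourself, a statement like the one below should work (a sketch derived from the DDL above; the skip.header.line.count entry you see in the output comes from the header fix discussed later in this post, so it is left out here):

hive> CREATE TABLE empinfo (
        id INT,
        name STRING,
        salary STRING,
        location STRING)
      ROW FORMAT DELIMITED
        FIELDS TERMINATED BY '|'
        LINES TERMINATED BY '\n'
      STORED AS TEXTFILE;

STORED AS TEXTFILE expands to the TextInputFormat/HiveIgnoreKeyTextOutputFormat pair you see in the show create table output.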

Since we have a file as the source, follow the same procedure we used earlier to process delimited files: create the dataserver, physical schema, logical schema, and a model to reverse-engineer the file.

[Screenshot: filehive1]

Initially you might see the hive2 JDBC driver used for Hive dataservers, but I have to go with weblogic.jdbc.hive.HiveDriver as I have applied the recent patch released for the ODI 12c Advanced Big Data Option. If you don't use the latter, you will see an error like the one below for all existing Hive-based mappings.

ODI-1228: Task Create target table -IKM File to Hive (LOAD DATA)- fails on the target connection Hive.
Caused By: java.sql.SQLException: Method not supported

If you have not applied the patch, then go with the hive2 JDBC driver.
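For reference, the Hive dataserver settings on this VM look something like the following (a sketch; the host, port, and database are assumptions based on the Big Data Lite defaults, so adjust them to your environment):

JDBC Driver: weblogic.jdbc.hive.HiveDriver   (with the Advanced Big Data Option patch)
             org.apache.hive.jdbc.HiveDriver (hive2 driver, without the patch)
JDBC URL:    jdbc:hive2://bigdatalite.localdomain:10000/default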

[Screenshot: filehive2]

Once the Hive physical and logical schemas are created, you can follow the standard reverse-engineering process to reverse all Hive tables.

[Screenshot: filehive3]

Drag the file datastore, which will be our source, and the Hive datastore, which will be the target, into the new mapping.

[Screenshot: filehive4]

Select the LKM File to Hive LOAD DATA Direct. Note that with the Direct variant we don't need an IKM; to know more on this, refer to the ODI 12c Big Data book. Also, since our file is on the local file system rather than in HDFS, the FILE_IS_LOCAL option must be checked, as highlighted below.

[Screenshot: filehive5]

Don't forget to validate the mapping by clicking on the green tick. This will show you whether there are any issues with your mapping.
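For reference, with FILE_IS_LOCAL set to true, the KM generates a statement roughly equivalent to the following when the mapping runs (a sketch; the exact code generated by the KM may differ):

hive> LOAD DATA LOCAL INPATH '/tmp/src_files/emp' OVERWRITE INTO TABLE empinfo;

The LOCAL keyword tells Hive to copy the file from the local file system into the table's HDFS location; without it, Hive would expect an HDFS path and move the file instead.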

[Screenshot: filehive6]

Run the mapping, and as you can see, all the records got loaded into our Hive table: no need to write HiveQL and no need to be a master of MapReduce. Isn't it easy? But wait a second! Don't you see the header record as part of the data? Although we set the heading count to 1 in the file datastore, the KM still picks it up.

[Screenshot: filehive7]

No worries, the product team has given us an option to deal with this. In this scenario we have to add the table property below to skip the first record.

TBLPROPERTIES("skip.header.line.count"="1").
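If the table already exists, the same property can also be applied directly at the hive prompt (an alternative sketch, not the route we take here):

hive> ALTER TABLE empinfo SET TBLPROPERTIES("skip.header.line.count"="1");

In the mapping, though, we set it through the KM option instead.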

Go to the OVERRIDE_ROW_FORMAT option and add the text below into it.

row format delimited
fields terminated by '|'
lines terminated by '\n'
stored as textfile
TBLPROPERTIES("skip.header.line.count"="1")

[Screenshot: filehive8]

Now let's drop the target table at the hive prompt and execute the mapping again.

hive> drop table empinfo;
OK
Time taken: 0.178 seconds
hive>

There you go. The header record is no longer in the target datastore. 🙂

[Screenshot: filehive9]
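As a quick sanity check from the hive prompt, the count below should now return 6, one per data row, with the header skipped (assuming the six-row sample file above):

hive> select count(*) from empinfo;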

We can also use LKM SQL Multi-Connect together with IKM File to Hive (LOAD DATA) to perform the same load as above.

[Screenshot: filehive10]

Performance evaluation for 6,120,000 records in standalone mode:

[Screenshot: ho1]

KM Combination                                        | Start Time      | Stop Time       | Total Time (sec)
LKM SQL Multi-Connect + IKM File to Hive (LOAD DATA)  | 6/21/2015 15:09 | 6/21/2015 15:10 | 56
LKM File to Hive LOAD DATA Direct                     | 6/21/2015 15:43 | 6/21/2015 15:43 | 3

 

As you can see, LOAD DATA Direct is roughly 18 times faster than the LKM and IKM combination (56 s / 3 s ≈ 18.7).

 

That's it for today. In the coming days I will also post on how to load from RDBMS to HDFS, RDBMS to HBase, Hive to RDBMS, and a lot more. Also, don't forget to check the fantastic post by Jerome on the recent ODI Big Data patch release.


About the author

Bhabani (http://dwteam.in) - Bhabani is currently working as a Sr. Development Engineer at Harman International. He has good expertise in Oracle, Oracle Data Integrator, Pervasive Data Integrator, MSBI, Talend, and Java. He has also been contributing to the ODI OTN forum for the last 5 years. He is from India. If you want to reach him, please visit the contact us page. If you have any doubts or concerns about the above article, please put your question here. The DW Team will try to respond as soon as possible. Also, don't forget to provide your comments/suggestions/feedback for further improvement. Thanks for your time.


1 comment

cesar hernandez

Hello, I'm doing exactly the same steps but I'm getting the following error. Could you please help me out?

ODI-1590: The execution of the script failed.
Caused By: org.apache.bsf.BSFException: exception from Groovy: org.codehaus.groovy.control.MultipleCompilationErrorsException: startup failed:
_: 4: unable to resolve class org.apache.hadoop.fs.Path
@ line 4, column 1.
import org.apache.hadoop.fs.Path
^

_: 5: unable to resolve class org.apache.hadoop.conf.Configuration
@ line 5, column 1.
import org.apache.hadoop.conf.Configuration
^

_: 3: unable to resolve class org.apache.hadoop.fs.FileSystem
@ line 3, column 1.
import org.apache.hadoop.fs.FileSystem
^

3 errors

at org.codehaus.groovy.bsf.GroovyEngine.exec(GroovyEngine.java:111)
at com.sunopsis.dwg.codeinterpretor.SnpScriptingInterpretor.execInBSFEngine(SnpScriptingInterpretor.java:353)
at com.sunopsis.dwg.codeinterpretor.SnpScriptingInterpretor.exec(SnpScriptingInterpretor.java:216)
at oracle.odi.runtime.agent.execution.interpreter.SessionTaskScriptingInterpretor.scripting(SessionTaskScriptingInterpretor.java:171)
at oracle.odi.runtime.agent.execution.SessionTask.scripting(SessionTask.java:105)
at oracle.odi.runtime.agent.execution.cmd.ScriptingExecutor.execute(ScriptingExecutor.java:49)
at oracle.odi.runtime.agent.execution.cmd.ScriptingExecutor.execute(ScriptingExecutor.java:21)
at oracle.odi.runtime.agent.execution.TaskExecutionHandler.handleTask(TaskExecutionHandler.java:52)
at oracle.odi.runtime.agent.execution.SessionTask.processTask(SessionTask.java:206)
at oracle.odi.runtime.agent.execution.SessionTask.doExecuteTask(SessionTask.java:117)
at oracle.odi.runtime.agent.execution.AbstractSessionTask.execute(AbstractSessionTask.java:886)
at oracle.odi.runtime.agent.execution.SessionExecutor$SerialTrain.runTasks(SessionExecutor.java:2227)
at oracle.odi.runtime.agent.execution.SessionExecutor.executeSession(SessionExecutor.java:607)
at oracle.odi.runtime.agent.processor.TaskExecutorAgentRequestProcessor$1.doAction(TaskExecutorAgentRequestProcessor.java:718)
at oracle.odi.runtime.agent.processor.TaskExecutorAgentRequestProcessor$1.doAction(TaskExecutorAgentRequestProcessor.java:611)
at oracle.odi.core.persistence.dwgobject.DwgObjectTemplate.execute(DwgObjectTemplate.java:203)
at oracle.odi.runtime.agent.processor.TaskExecutorAgentRequestProcessor.doProcessStartAgentTask(TaskExecutorAgentRequestProcessor.java:800)
at oracle.odi.runtime.agent.processor.impl.StartSessRequestProcessor.access$1400(StartSessRequestProcessor.java:74)
at oracle.odi.runtime.agent.processor.impl.StartSessRequestProcessor$StartSessTask.doExecute(StartSessRequestProcessor.java:702)
at oracle.odi.runtime.agent.processor.task.AgentTask.execute(AgentTask.java:180)
at oracle.odi.runtime.agent.support.DefaultAgentTaskExecutor$2.run(DefaultAgentTaskExecutor.java:108)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at oracle.odi.runtime.agent.execution.job.OdiJob.call(OdiJob.java:73)
at oracle.odi.runtime.agent.execution.job.OdiJob.call(OdiJob.java:73)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.codehaus.groovy.control.MultipleCompilationErrorsException: startup failed:
_: 4: unable to resolve class org.apache.hadoop.fs.Path
@ line 4, column 1.
import org.apache.hadoop.fs.Path
^

_: 5: unable to resolve class org.apache.hadoop.conf.Configuration
@ line 5, column 1.
import org.apache.hadoop.conf.Configuration
^

_: 3: unable to resolve class org.apache.hadoop.fs.FileSystem
@ line 3, column 1.
import org.apache.hadoop.fs.FileSystem
^

3 errors

at org.codehaus.groovy.control.ErrorCollector.failIfErrors(ErrorCollector.java:309)
at org.codehaus.groovy.control.CompilationUnit.applyToSourceUnits(CompilationUnit.java:938)
at org.codehaus.groovy.control.CompilationUnit.doPhaseOperation(CompilationUnit.java:585)
at org.codehaus.groovy.control.CompilationUnit.compile(CompilationUnit.java:534)
at groovy.lang.GroovyClassLoader.doParseClass(GroovyClassLoader.java:286)
at groovy.lang.GroovyClassLoader.parseClass(GroovyClassLoader.java:259)
at groovy.lang.GroovyShell.parseClass(GroovyShell.java:674)
at groovy.lang.GroovyShell.parse(GroovyShell.java:686)
at groovy.lang.GroovyShell.evaluate(GroovyShell.java:568)
at groovy.lang.GroovyShell.evaluate(GroovyShell.java:608)
at groovy.lang.GroovyShell.evaluate(GroovyShell.java:589)
at org.codehaus.groovy.bsf.GroovyEngine.exec(GroovyEngine.java:109)
… 27 more
