Implementing Apache Drill in ODI 12C – Part 1

Hello Friends,

In the previous post I explained how to use a JSON file in an ODI mapping and load the data into an Oracle table. We also modified certain properties of the Complex File technology to generate the correct syntax for JSON queries. If you missed it, refer to that post. Today I am going to demonstrate the installation and configuration of Apache Drill. After the configuration we will run some sample queries to read data from CSV, text and Parquet files. In part 2, we will see the Drill implementation in ODI with simple joins between different source systems.

Prerequisite:

  1. You are familiar with the Hadoop ecosystem. To learn more about Drill, refer to the official documentation.
  2. You have VirtualBox installed and the appliance already imported.
  3. The VM is running fine.
  4. Hadoop and Hive are already installed and the required services are up and running.

For this series of blog posts I will be using Oracle BigDataLite, which is based on a Cloudera distribution. The download comes as almost 15 zip files and sums up to about 30 GB. If you want a lighter version, you can download a VM directly from Cloudera, Hortonworks or MapR.

Let's take a look at the definition from the Drill documentation:

Drill is an Apache open-source SQL query engine for Big Data exploration. Drill is designed from the ground up to support high-performance analysis on the semi-structured and rapidly evolving data coming from modern Big Data applications, while still providing the familiarity and ecosystem of ANSI SQL, the industry-standard query language.

So it is quite similar to Cloudera Impala or Facebook Presto. Let's proceed with the installation. If the VM is not up, start it. Once it is up and running, open a terminal and run java -version to verify that Java 7 or 8 is the version in effect.

The output looks something like this:

[oracle@bigdatalite ~]$ java -version
java version "1.8.0_111"
Java(TM) SE Runtime Environment (build 1.8.0_111-b14)
Java HotSpot(TM) 64-Bit Server VM (build 25.111-b14, mixed mode)
[oracle@bigdatalite ~]$

Then run one of the commands below to download Drill (wget and curl are equivalent here; use whichever is available):

[oracle@bigdatalite ~]$ wget http://apache.mirrors.hoobly.com/drill/drill-1.10.0/apache-drill-1.10.0.tar.gz

[oracle@bigdatalite ~]$ curl -o apache-drill-1.10.0.tar.gz http://apache.mirrors.hoobly.com/drill/drill-1.10.0/apache-drill-1.10.0.tar.gz

The file is now downloaded to /home/oracle. Copy it to the directory where you want to install Drill; in my case I will extract it directly to /etc/drill.
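If the /etc/drill directory does not exist yet, create it first. A quick sketch (run as root, which we switch to next anyway):

mkdir -p /etc/drill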

[oracle@bigdatalite ~]$ su
Password:

[root@bigdatalite oracle]# cd /etc/drill

[root@bigdatalite drill]# tar -xvzf /home/oracle/apache-drill-1.10.0.tar.gz
[root@bigdatalite drill]# ls
apache-drill-1.10.0
[root@bigdatalite drill]#

Before using the Drill shell, let's take a look at the drill-override.conf file, just to ensure we have a unique cluster ID and the ZooKeeper port is correctly configured.

[root@bigdatalite apache-drill-1.10.0]# pwd
/etc/drill/apache-drill-1.10.0
[root@bigdatalite apache-drill-1.10.0]# ls
bin   git.properties  KEYS     log     README.md    winutils
conf  jars            LICENSE  NOTICE  sample-data
[root@bigdatalite apache-drill-1.10.0]# cd conf/
[root@bigdatalite conf]# ls
core-site-example.xml  drill-env.sh         drill-override-example.conf
distrib-env.sh         drill-override.conf  logback.xml
[root@bigdatalite conf]# cat drill-override.conf
# Licensed to the Apache Software Foundation (ASF) under one or more
# See 'drill-override-example.conf' for example configurations

drill.exec: {
  cluster-id: "drillbits1",
  zk.connect: "localhost:2181"
}
[root@bigdatalite conf]#
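As an aside, if you ever run Drill in distributed mode on a real cluster, this same file is where you would point zk.connect at your ZooKeeper quorum. A minimal sketch, assuming three hypothetical ZooKeeper hosts:

drill.exec: {
  cluster-id: "drillbits1",
  # comma-separated list of your ZooKeeper nodes (hostnames here are placeholders)
  zk.connect: "zkhost1:2181,zkhost2:2181,zkhost3:2181"
}

For the embedded mode we use in this post, the defaults shown above are sufficient.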

Starting drill shell:

So far everything looks fine. Let's start the Drill shell using:

env -i HOME="$HOME" LC_CTYPE="${LC_ALL:-${LC_CTYPE:-$LANG}}" PATH="$PATH" USER="$USER" /etc/drill/apache-drill-1.10.0/bin/drill-embedded

[root@bigdatalite oracle]# env -i HOME="$HOME" LC_CTYPE="${LC_ALL:-${LC_CTYPE:-$LANG}}" PATH="$PATH" USER="$USER" /etc/drill/apache-drill-1.10.0/bin/drill-embedded
Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=512M; support was removed in 8.0
Apr 02, 2017 3:56:59 PM org.glassfish.jersey.server.ApplicationHandler initialize
INFO: Initiating Jersey application, version Jersey: 2.8 2014-04-29 01:25:26...
apache drill 1.10.0
"json ain't no thang"
0: jdbc:drill:zk=local>

Now open a browser and point it to http://localhost:8047/. You will see the Drill home page.

Then click on Storage and enable hive and dfs if they appear under Disabled Storage Plugins.

Click the Update button next to dfs and update the configuration as shown below. We will do the same for hive.

[Screenshots: DFS configuration in Drill; Hive configuration in Drill]

Change the configuration wherever required and save it. Drill is now ready for use.
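For reference, here is roughly what my dfs plugin configuration looked like after the update. Treat it as a sketch: the namenode address (hdfs://localhost:8020) is the CDH default on BigDataLite, so adjust the connection string and workspaces to your environment.

{
  "type": "file",
  "enabled": true,
  "connection": "hdfs://localhost:8020/",
  "workspaces": {
    "root": { "location": "/", "writable": false, "defaultInputFormat": null }
  },
  "formats": {
    "csv": { "type": "text", "extensions": ["csv"], "delimiter": "," },
    "json": { "type": "json" },
    "parquet": { "type": "parquet" }
  }
}

The hive plugin mainly needs the metastore URI; again, the values below assume the VM defaults:

{
  "type": "hive",
  "enabled": true,
  "configProps": {
    "hive.metastore.uris": "thrift://localhost:9083",
    "fs.default.name": "hdfs://localhost:8020/",
    "hive.metastore.sasl.enabled": "false"
  }
}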

Using JSON files:

Run the show files command to list the files in the current workspace.

0: jdbc:drill:zk=local> show files;

Let's execute the following SELECT statement. I will explain this query in a moment.

SELECT employee_id, full_name, first_name, last_name, position_title FROM cp.`employee.json` LIMIT 5

0: jdbc:drill:zk=local> SELECT employee_id, full_name, first_name, last_name, position_title FROM cp.`employee.json` LIMIT 5;
+--------------+------------------+-------------+------------+-------------------------+
| employee_id  | full_name        | first_name  | last_name  | position_title          |
+--------------+------------------+-------------+------------+-------------------------+
| 1            | Sheri Nowmer     | Sheri       | Nowmer     | President               |
| 2            | Derrick Whelply  | Derrick     | Whelply    | VP Country Manager      |
| 4            | Michael Spence   | Michael     | Spence     | VP Country Manager      |
| 5            | Maya Gutierrez   | Maya        | Gutierrez  | VP Country Manager      |
| 6            | Roberta Damstra  | Roberta     | Damstra    | VP Information Systems  |
+--------------+------------------+-------------+------------+-------------------------+
5 rows selected (0.789 seconds)
0: jdbc:drill:zk=local>

Well, what we queried above was just a JSON file named employee.json that ships inside ./jars/3rdparty/foodmart-data-json-0.4.jar. The cp prefix is nothing but the classpath schema, and the JSON file lives inside that jar file.
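By the way, if you want to see which schemas (storage plugins and their workspaces) are visible to the shell, you can list them; the exact entries depend on which plugins you enabled, but on this VM you should see things like cp.default, dfs entries and hive:

show schemas;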

Using PARQUET files:

To query a Parquet file, we first need to move the sample-data directory from the installation location to HDFS using the commands below.

[root@bigdatalite apache-drill-1.10.0]# hadoop fs -put sample-data /user/bhabani/

[root@bigdatalite apache-drill-1.10.0]# hadoop fs -ls /user/bhabani
Found 4 items
-rwxrwxrwx   1 oracle supergroup   45 2017-04-02 15:06 /user/bhabani/demo.tbl
-rwxrwxrwx   1 oracle supergroup   41 2017-04-02 14:32 /user/bhabani/demo.tsv
-rwxrwxrwx   1 oracle supergroup   53 2017-04-02 15:18 /user/bhabani/demo1.csv
drwxrwxrwx   - oracle supergroup    0 2017-04-02 14:20 /user/bhabani/sample-data

We are now ready to query the Parquet files from the Drill shell.

0: jdbc:drill:zk=local> SELECT * FROM dfs.`/user/bhabani/sample-data/nation.parquet`;
+--------------+------------+--------------+-----------------------+
| N_NATIONKEY  | N_NAME     | N_REGIONKEY  | N_COMMENT             |
+--------------+------------+--------------+-----------------------+
| 0            | ALGERIA    | 0            | haggle. carefully f   |
| 1            | ARGENTINA  | 1            | al foxes promise sly  |
| 2            | BRAZIL     | 1            | y alongside of the p  |
| 3            | CANADA     | 1            | eas hang ironic, sil  |
| 4            | EGYPT      | 4            | y above the carefull  |
| 5            | ETHIOPIA   | 0            | ven packages wake qu  |
| 6            | FRANCE     | 3            | refully final reques  |
| 7            | GERMANY    | 3            | l platelets. regular  |
| 8            | INDIA      | 2            | ss excuses cajole sl  |
+--------------+------------+--------------+-----------------------+
25 rows selected (0.324 seconds)
0: jdbc:drill:zk=local> SELECT * FROM dfs.`/user/bhabani/sample-data/region.parquet`;
+--------------+--------------+-----------------------+
| R_REGIONKEY  | R_NAME       | R_COMMENT             |
+--------------+--------------+-----------------------+
| 0            | AFRICA       | lar deposits. blithe  |
| 1            | AMERICA      | hs use ironic, even   |
| 2            | ASIA         | ges. thinly even pin  |
| 3            | EUROPE       | ly final courts cajo  |
| 4            | MIDDLE EAST  | uickly special accou  |
+--------------+--------------+-----------------------+
5 rows selected (0.275 seconds)
0: jdbc:drill:zk=local>

Using TEXT files:

What about text files? Yes, we can query those as well. Just note one thing here.

Drill supports the following file types:

  1. Plain text files, including:
     - Comma-separated values (CSV, type: text)
     - Tab-separated values (TSV, type: text)
     - Pipe-separated values (PSV, type: text)
  2. Structured data files:
     - Avro (type: avro) (this file type is experimental)
     - JSON (type: json)
     - Parquet (type: parquet)
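If a text file's extension is not mapped in your dfs plugin formats, you can also pass the format options inline using Drill's table function syntax. A sketch against the demo.tsv file from the HDFS listing above (option names follow Drill's text format plugin; adjust if your version differs):

SELECT * FROM table(dfs.`/user/bhabani/demo.tsv`(type => 'text', fieldDelimiter => '\t'));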

So I will create a CSV file locally and then move it to HDFS.

[oracle@bigdatalite ~]$ pwd
/home/oracle
[oracle@bigdatalite ~]$ cat demo1.csv
N_NAME,N_LOCATION
BHA,INDIA
KING,CANADA
SCOTT,BRAZIL
[oracle@bigdatalite ~]$ hadoop fs -cat /user/bhabani/demo1.csv
N_NAME,N_LOCATION
BHA,INDIA
KING,CANADA
SCOTT,BRAZIL
[oracle@bigdatalite ~]$

Now let's query this file from the Drill shell.

0: jdbc:drill:zk=local> SELECT * FROM dfs.`/user/bhabani/demo1.csv`;
+--------------------------+
| columns                  |
+--------------------------+
| ["N_NAME","N_LOCATION"]  |
| ["BHA","INDIA"]          |
| ["KING","CANADA"]        |
| ["SCOTT","BRAZIL"]       |
+--------------------------+
4 rows selected (0.726 seconds)
0: jdbc:drill:zk=local>
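Notice that Drill exposes each line of the CSV as a single array column named columns. To get proper fields you index into that array and alias the results; for example, with the same file:

SELECT columns[0] AS n_name, columns[1] AS n_location
FROM dfs.`/user/bhabani/demo1.csv`;

Keep in mind the header line comes back as an ordinary data row unless you enable extractHeader in the csv format of your dfs plugin.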

Sounds good? Then let's make it more interesting!! 🙂 What if I want to join a Parquet and a CSV file? Can we do that? Let's check it out with the query below.

Using joins between files:

select tbl1.N_NAME, tbl1.N_COMMENT, tbl2.columns[0] as nm from dfs.`/user/bhabani/sample-data/nation.parquet` as tbl1
join
dfs.`/user/bhabani/demo1.csv` as tbl2
on tbl1.N_NAME=tbl2.columns[1];

0: jdbc:drill:zk=local> select tbl1.N_NAME, tbl1.N_COMMENT, tbl2.columns[0] as nm from dfs.`/user/bhabani/sample-data/nation.parquet` as tbl1
. . . . . . . . . . . > join
. . . . . . . . . . . > dfs.`/user/bhabani/demo1.csv` as tbl2
. . . . . . . . . . . > on tbl1.N_NAME=tbl2.columns[1];
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
+---------+-----------------------+--------+
| N_NAME  | N_COMMENT             | nm     |
+---------+-----------------------+--------+
| BRAZIL  | y alongside of the p  | SCOTT  |
| CANADA  | eas hang ironic, sil  | KING   |
| INDIA   | ss excuses cajole sl  | BHA    |
+---------+-----------------------+--------+
3 rows selected (2.488 seconds)
0: jdbc:drill:zk=local>

As you saw, I joined the CSV and Parquet files on the country name, because that is the only field common to both data sets, and then displayed two columns from the Parquet file and one column from the CSV file.

Using Hive:

We will connect to Hive from the Drill shell and execute a SELECT statement as shown below.

use hive;
select cust_id,first_name,last_name from cust limit 3;
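If you are not sure which tables the hive schema exposes, you can list and describe them first; cust here is a table on my VM, so substitute your own table names if yours differ:

use hive;
show tables;
describe cust;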

Using joins between a file and Hive:

The last example for this post joins a CSV file with a Hive table. Use the query below to create a join based on the country and location columns.

select tbl1.cust_id, tbl1.city,tbl2.columns[0] as nam from cust as tbl1
join
dfs.`/user/bhabani/demo2.csv` as tbl2
on tbl1.country=tbl2.columns[1] limit 5;

That's it for today. I would suggest you practice some basic commands and get your hands dirty with all types of joins. Once you are familiar with the commands, we can use them in ODI. ODI will do the work for you, but knowing the working principles will help you accomplish your objective quickly and seamlessly.

Stay tuned. Thank you!!
