Implementing Apache Drill in ODI 12C – Part 1

Hello Friends,

In the previous post I explained how to use a JSON file in an ODI mapping and load the data into an Oracle table. We also modified certain properties of the Complex File technology to generate the correct syntax for JSON queries. If you missed it, please refer to that post. Today I am going to demonstrate the installation and configuration of Apache Drill. After the configuration we will run some sample queries to read data from CSV, text, and Parquet files. In part 2, we will see the Drill implementation in ODI, with simple joins between different source systems. Before we begin, I assume the following:


  1. You are familiar with the Hadoop ecosystem. To learn more about Drill, please refer to the official documentation.
  2. You have VirtualBox and the appliance is already imported.
  3. The VM is running fine.
  4. Hadoop and Hive are already installed and the required services are up and running.

For this series of blog posts I will be using Oracle BigDataLite, which is a Cloudera distribution. It comes as almost 15 zip files, and the total size sums up to 30 GB. If you want a lighter version, you can directly download a VM from Cloudera, Hortonworks, or MapR.

Let's take a look at the definition taken from the Drill documentation:

Drill is an Apache open-source SQL query engine for Big Data exploration. Drill is designed from the ground up to support high-performance analysis on the semi-structured and rapidly evolving data coming from modern Big Data applications, while still providing the familiarity and ecosystem of ANSI SQL, the industry-standard query language.

So it is quite similar to Cloudera Impala or Facebook Presto. Let's proceed with the installation. If the VM is not up, start it. Once it is up and running, open a terminal and run the following command to verify that Java 7 or 8 is the version in effect:

The output looks something like this:

[oracle@bigdatalite ~]$ java -version
java version "1.8.0_111"
Java(TM) SE Runtime Environment (build 1.8.0_111-b14)
Java HotSpot(TM) 64-Bit Server VM (build 25.111-b14, mixed mode)
[oracle@bigdatalite ~]$
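If you want to check the version from a script rather than by eye, the parsing can be automated. This is a minimal sketch; the sample string is hard-coded here (an assumption matching the output above), and in practice you would capture it with `java -version 2>&1 | head -n 1`:

```shell
# Sample `java -version` first line, copied from the output above.
ver_line='java version "1.8.0_111"'

# Extract the quoted version string, e.g. 1.8.0_111
version=$(echo "$ver_line" | sed 's/.*"\(.*\)".*/\1/')

# For 1.x JDKs the major version is the second dotted field
major=$(echo "$version" | cut -d. -f2)

if [ "$major" -ge 7 ]; then
  echo "Java 1.$major is fine for Drill"
else
  echo "Java 1.$major is too old; install JDK 7 or 8"
fi
```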

Then run the below commands to download Drill.

[oracle@bigdatalite ~]$ wget

[oracle@bigdatalite ~]$ curl -o apache-drill-1.10.0.tar.gz

The file is now downloaded to /home/oracle. Copy this file to the directory where you want to install Drill. In my case I will extract it directly to /etc/drill.

[oracle@bigdatalite ~]$ su

[root@bigdatalite oracle]# cd /etc/drill

[root@bigdatalite drill]# tar -xvzf /home/oracle/apache-drill-1.10.0.tar.gz
[root@bigdatalite drill]# ls
[root@bigdatalite drill]#

Before using the Drill shell, let's take a look at the drill-override.conf file, just to ensure we have a unique cluster ID and the ZooKeeper port is correctly configured.

[root@bigdatalite apache-drill-1.10.0]# pwd
[root@bigdatalite apache-drill-1.10.0]# ls
bin KEYS log winutils
conf jars LICENSE NOTICE sample-data
[root@bigdatalite apache-drill-1.10.0]# cd conf/
[root@bigdatalite conf]# ls
core-site-example.xml drill-override-example.conf drill-override.conf logback.xml
[root@bigdatalite conf]# cat drill-override.conf
# Licensed to the Apache Software Foundation (ASF) under one or more
# See 'drill-override-example.conf' for example configurations

drill.exec: {
  cluster-id: "drillbits1",
  zk.connect: "localhost:2181"
}
[root@bigdatalite conf]#

Starting the Drill shell:

So far everything looks fine. Let's start the Drill shell using:

env -i HOME="$HOME" LC_CTYPE="${LC_ALL:-${LC_CTYPE:-$LANG}}" PATH="$PATH" USER="$USER" /etc/drill/apache-drill-1.10.0/bin/drill-embedded

[root@bigdatalite oracle]# env -i HOME="$HOME" LC_CTYPE="${LC_ALL:-${LC_CTYPE:-$LANG}}" PATH="$PATH" USER="$USER" /etc/drill/apache-drill-1.10.0/bin/drill-embedded
Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=512M; support was removed in 8.0
Apr 02, 2017 3:56:59 PM org.glassfish.jersey.server.ApplicationHandler initialize
INFO: Initiating Jersey application, version Jersey: 2.8 2014-04-29 01:25:26...
apache drill 1.10.0
"json ain't no thang"
0: jdbc:drill:zk=local>

Now open a browser and point your URL to http://localhost:8047/. You will see the home page of Drill.

Then click on Storage and enable hive and dfs if they are listed under the disabled storage plugins.

Click the Update button for dfs and update the configuration as shown below. We will do the same for hive.

[Screenshots: DFS configuration in Drill; Hive configuration in Drill]

Change the configuration wherever required and save it. Drill is now ready for use.
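Besides the web UI, Drill also exposes a REST endpoint on the same port (POST /query.json) that accepts a JSON payload. The sketch below only builds the request object and does not send it, so it works even without a live drillbit; the host and port are assumptions based on the default embedded setup:

```python
import json
import urllib.request

# Default Drill web/REST port in embedded mode (assumption: defaults unchanged)
DRILL_URL = "http://localhost:8047/query.json"

# Payload shape accepted by Drill's REST query endpoint
payload = {
    "queryType": "SQL",
    "query": "SELECT employee_id, full_name FROM cp.`employee.json` LIMIT 5",
}

# Build (but do not send) the POST request; call urllib.request.urlopen(req)
# against a running drillbit to actually execute the query.
req = urllib.request.Request(
    DRILL_URL,
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

print(req.get_method())  # POST, since a request body is attached
```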

Using JSON files:

Run the show files command:

0: jdbc:drill:zk=local> show files;

Let's execute the following select statement. I will explain this query later.

SELECT employee_id, full_name,first_name,last_name,position_title FROM cp.`employee.json` LIMIT 5

0: jdbc:drill:zk=local> SELECT employee_id, full_name,first_name,last_name,position_title FROM cp.`employee.json` LIMIT 5;
| employee_id | full_name | first_name | last_name | position_title |
| 1 | Sheri Nowmer | Sheri | Nowmer | President |
| 2 | Derrick Whelply | Derrick | Whelply | VP Country Manager |
| 4 | Michael Spence | Michael | Spence | VP Country Manager |
| 5 | Maya Gutierrez | Maya | Gutierrez | VP Country Manager |
| 6 | Roberta Damstra | Roberta | Damstra | VP Information Systems |
5 rows selected (0.789 seconds)
0: jdbc:drill:zk=local>

Well, whatever we queried above was just a JSON file named employee.json, located inside ./jars/3rdparty/foodmart-data-json.0.4.jar. The cp prefix is nothing but the classpath, and the JSON file is present inside that jar file.
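Drill treats each line of a JSON file as one record, which is why SELECT ... LIMIT 5 maps naturally onto the first few objects in the file. A minimal sketch of that record model in Python, using values copied from the query output above:

```python
import io
import json

# Newline-delimited JSON, one record per line, mimicking employee.json.
# Values are copied from the Drill query output shown above.
ndjson = io.StringIO(
    '{"employee_id": 1, "full_name": "Sheri Nowmer", "position_title": "President"}\n'
    '{"employee_id": 2, "full_name": "Derrick Whelply", "position_title": "VP Country Manager"}\n'
    '{"employee_id": 4, "full_name": "Michael Spence", "position_title": "VP Country Manager"}\n'
)

# SELECT full_name ... LIMIT 2 analog: parse line by line, stop after 2 rows
rows = []
for line in ndjson:
    rows.append(json.loads(line)["full_name"])
    if len(rows) == 2:
        break

print(rows)  # ['Sheri Nowmer', 'Derrick Whelply']
```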

Using PARQUET files:

To query a Parquet file, we need to move the sample-data directory from the installation location to an HDFS location using the below command.

[root@bigdatalite apache-drill-1.10.0]# hadoop fs -put sample-data /user/bhabani/

[root@bigdatalite apache-drill-1.10.0]# hadoop fs -ls /user/bhabani
Found 4 items
-rwxrwxrwx 1 oracle supergroup 45 2017-04-02 15:06 /user/bhabani/demo.tbl
-rwxrwxrwx 1 oracle supergroup 41 2017-04-02 14:32 /user/bhabani/demo.tsv
-rwxrwxrwx 1 oracle supergroup 53 2017-04-02 15:18 /user/bhabani/demo1.csv
drwxrwxrwx - oracle supergroup 0 2017-04-02 14:20 /user/bhabani/sample-data

We are now ready to query the Parquet files in the Drill shell.

0: jdbc:drill:zk=local> SELECT * FROM dfs.`/user/bhabani/sample-data/nation.parquet`;
| 0 | ALGERIA | 0 | haggle. carefully f |
| 1 | ARGENTINA | 1 | al foxes promise sly |
| 2 | BRAZIL | 1 | y alongside of the p |
| 3 | CANADA | 1 | eas hang ironic, sil |
| 4 | EGYPT | 4 | y above the carefull |
| 5 | ETHIOPIA | 0 | ven packages wake qu |
| 6 | FRANCE | 3 | refully final reques |
| 7 | GERMANY | 3 | l platelets. regular |
| 8 | INDIA | 2 | ss excuses cajole sl |
25 rows selected (0.324 seconds)
0: jdbc:drill:zk=local> SELECT * FROM dfs.`/user/bhabani/sample-data/region.parquet`;
| 0 | AFRICA | lar deposits. blithe |
| 1 | AMERICA | hs use ironic, even |
| 2 | ASIA | ges. thinly even pin |
| 3 | EUROPE | ly final courts cajo |
| 4 | MIDDLE EAST | uickly special accou |
5 rows selected (0.275 seconds)
0: jdbc:drill:zk=local>

Using TEXT files:

What about text files? Yes, we can query them as well. Just note one thing here.

Drill supports the following file types:

  - Plain text files, including:
    - Comma-separated values (CSV, type: text)
    - Tab-separated values (TSV, type: text)
    - Pipe-separated values (PSV, type: text)
  - Structured data files:
    - Avro (type: avro) (this file type is experimental)
    - JSON (type: json)
    - Parquet (type: parquet)
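These file types correspond to format entries in the dfs storage plugin you enabled earlier. A hedged sketch of what the relevant part of the dfs plugin JSON can look like; the connection string and extension mappings below are assumptions for the BigDataLite VM, so check them against your own Storage page:

```json
{
  "type": "file",
  "enabled": true,
  "connection": "hdfs://localhost:8020/",
  "formats": {
    "csv":     { "type": "text", "extensions": ["csv"], "delimiter": "," },
    "tsv":     { "type": "text", "extensions": ["tsv"], "delimiter": "\t" },
    "psv":     { "type": "text", "extensions": ["tbl"], "delimiter": "|" },
    "json":    { "type": "json", "extensions": ["json"] },
    "parquet": { "type": "parquet" }
  }
}
```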

So I will create one CSV file locally and then move it to an HDFS location.

[oracle@bigdatalite ~]$ pwd
[oracle@bigdatalite ~]$ cat demo1.csv
[oracle@bigdatalite ~]$ hadoop fs -cat /user/bhabani/demo1.csv
[oracle@bigdatalite ~]$
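The exact file contents were trimmed from the console capture above, but they can be reconstructed from the rows visible in the Drill query output further down. A sketch of creating such a file (the original may have had one more row; the capture shows three):

```shell
# Recreate a small CSV matching the rows visible in the Drill output.
cat > demo1.csv <<'EOF'
BHA,INDIA
KING,CANADA
SCOTT,BRAZIL
EOF

cat demo1.csv

# Then push it to HDFS (requires a running cluster):
# hadoop fs -put demo1.csv /user/bhabani/
```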

Now let's query this file from the Drill shell.

0: jdbc:drill:zk=local> SELECT * FROM dfs.`/user/bhabani/demo1.csv`;
| columns |
| ["BHA","INDIA"] |
| ["KING","CANADA"] |
| ["SCOTT","BRAZIL"] |
4 rows selected (0.726 seconds)
0: jdbc:drill:zk=local>
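Notice that because the CSV has no header, Drill exposes each row as a single columns array that you index with columns[0], columns[1], and so on. The same behavior can be mimicked with Python's csv module:

```python
import csv
import io

# Same rows as demo1.csv; Drill returns each row as one `columns` array
data = io.StringIO("BHA,INDIA\nKING,CANADA\nSCOTT,BRAZIL\n")

rows = list(csv.reader(data))
print(rows)  # [['BHA', 'INDIA'], ['KING', 'CANADA'], ['SCOTT', 'BRAZIL']]

# columns[1] analog: the second field of each row
print([r[1] for r in rows])  # ['INDIA', 'CANADA', 'BRAZIL']
```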

Sounds good? Then let's make it more interesting!! 🙂 What if I would like to join a Parquet and a CSV file? Can we do that? Let's check it out with the below query.

Using joins between files:

select tbl1.N_NAME, tbl1.N_COMMENT, tbl2.columns[0] as nm from dfs.`/user/bhabani/sample-data/nation.parquet` as tbl1
join
dfs.`/user/bhabani/demo1.csv` as tbl2
on tbl1.N_NAME=tbl2.columns[1];

0: jdbc:drill:zk=local> select tbl1.N_NAME, tbl1.N_COMMENT,tbl2.columns[0] as nm from dfs.`/user/bhabani/sample-data/nation.parquet` as tbl1
. . . . . . . . . . . > join
. . . . . . . . . . . > dfs.`/user/bhabani/demo1.csv` as tbl2
. . . . . . . . . . . > on tbl1.N_NAME=tbl2.columns[1];
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See for further details.
| N_NAME | N_COMMENT | nm |
| BRAZIL | y alongside of the p | SCOTT |
| CANADA | eas hang ironic, sil | KING |
| INDIA | ss excuses cajole sl | BHA |
3 rows selected (2.488 seconds)
0: jdbc:drill:zk=local>

As you saw, I joined the CSV and Parquet files based on the location name, because this is the only common column present in both data sets, and then displayed two columns from the Parquet file and one column from the CSV file.
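Under the hood this is an equi-join on N_NAME = columns[1]; conceptually, one side is loaded into a hash table and the other side probes it. A small sketch of that logic, using the rows from the two files above:

```python
# Hash-join sketch: nation.parquet rows (name -> comment) joined with
# demo1.csv rows ([person, country]) on the country name.
nations = {
    "BRAZIL": "y alongside of the p",
    "CANADA": "eas hang ironic, sil",
    "INDIA": "ss excuses cajole sl",
    "ALGERIA": "haggle. carefully f",   # no CSV match, so it drops out
}
csv_rows = [["BHA", "INDIA"], ["KING", "CANADA"], ["SCOTT", "BRAZIL"]]

# Probe the hash table with columns[1] from each CSV row
joined = [
    (country, nations[country], person)
    for person, country in csv_rows
    if country in nations
]

for row in sorted(joined):
    print(row)
```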

Using Hive:

We will connect to Hive from the Drill shell and execute a select statement as shown below.

use hive;
select cust_id,first_name,last_name from cust limit 3;

Using joins between file and hive:

The last example for this post is to join a CSV file with a Hive table. Use the below query to create a join based on the country and location columns.

select tbl1.cust_id, tbl2.columns[0] as nam from cust as tbl1
join
dfs.`/user/bhabani/demo2.csv` as tbl2
on tbl1.country=tbl2.columns[1] limit 5;

That's it for today. I would suggest you practice some basic commands and get your hands dirty on all types of joins. Once you are familiar with the commands, we can then use them in ODI. Though ODI will do the work for you, having knowledge of the working principle will help you accomplish your objective quickly and seamlessly.

Stay tuned. Thank you!!

About Bhabani
Bhabani has 12-plus years of experience in data warehousing and analytics projects spanning multiple domains, including the travel, banking and financial, and betting and gaming industries. His solution areas focus on designing data warehouses and integrating them with cloud platforms like AWS or GCP. He has also been an Elite-level contributor on the OTN forum for more than 9 years. He loves to experiment and build POCs with different integration tools and services. Some of his favorite skills are Redshift, BigQuery, Python, Apache Airflow, Kafka, HDFS, MapReduce, Hive, HBase, Sqoop, Drill, and Impala.