Implementing Apache Drill in ODI 12C – Part 1

Hello Friends,

In the previous post I explained how to use a JSON file in an ODI mapping and load the data into an Oracle table. We also modified certain properties of the Complex File technology to generate the correct syntax for JSON queries. If you missed it, please refer to that post. Today I am going to demonstrate the installation and configuration of Apache Drill. After the configuration we will run some sample queries to read data from CSV, TEXT and PARQUET files. In Part 2, we will see the Drill implementation in ODI with simple joins between different source systems.

Prerequisites:

  1. You are familiar with the Hadoop ecosystem. To know more about Drill, please refer to the official documentation.
  2. You have VirtualBox installed and the appliance is already imported.
  3. The VM is running fine.
  4. Hadoop and Hive are already installed and the required services are up and running.

For this series of blog posts I will be using the Oracle BigDataLite VM, which is based on the Cloudera distribution. It comes as almost 15 zip files and the total size adds up to around 30 GB. If you want a lighter version, you can directly download a VM from Cloudera, Hortonworks or MapR.

Let's take a look at the definition taken from the Drill documentation:

Drill is an Apache open-source SQL query engine for Big Data exploration. Drill is designed from the ground up to support high-performance analysis on the semi-structured and rapidly evolving data coming from modern Big Data applications, while still providing the familiarity and ecosystem of ANSI SQL, the industry-standard query language.

So it is quite similar to Cloudera Impala or Facebook's Presto. Let's proceed with the installation. If the VM is not up, start it. Once it is up and running, open a terminal and run the following command to verify that Java 7 or 8 is the version in effect:
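java -version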

The output should show Java 1.7 or 1.8 as the version in use.

Then run the below set of commands to download Drill.
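Assuming Drill 1.10.0 (the version used later in this post) and the Apache archive as the download source, something like the following should work:

cd /home/oracle
wget https://archive.apache.org/dist/drill/drill-1.10.0/apache-drill-1.10.0.tar.gz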

The file is now downloaded to /home/oracle. Copy this file to the directory where you want to install Drill. In my case I will directly extract it to /etc/drill.
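A minimal sketch of that step, assuming you have the rights to create /etc/drill:

sudo mkdir -p /etc/drill
sudo tar -xzf /home/oracle/apache-drill-1.10.0.tar.gz -C /etc/drill

This leaves the installation under /etc/drill/apache-drill-1.10.0, which is the path used in the commands below.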

Before using the Drill shell, let's take a look at the drill-override-example.conf file just to ensure we have a unique cluster id and the ZooKeeper port is correctly configured.
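On a default installation the file sits under /etc/drill/apache-drill-1.10.0/conf, and the settings to check look roughly like this (the cluster id and ZooKeeper port may differ in your setup):

drill.exec: {
  cluster-id: "drillbits1",
  zk.connect: "localhost:2181"
}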

Starting the Drill shell:

So far everything looks fine. Let's start the Drill shell in embedded mode using:

env -i HOME="$HOME" LC_CTYPE="${LC_ALL:-${LC_CTYPE:-$LANG}}" PATH="$PATH" USER="$USER" /etc/drill/apache-drill-1.10.0/bin/drill-embedded

Now open a browser and point it to http://localhost:8047/. You will see the home page of Drill.

Then click on Storage and enable the hive and dfs plugins if they are listed under Disabled Storage Plugins.

Click the Update button for dfs and update the configuration as shown below. We will do the same for hive.

[Screenshot: DFS configuration in Drill]
[Screenshot: Hive configuration in Drill]
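For reference, here is a rough sketch of what the two plugin configurations can look like on this VM. The HDFS NameNode address (hdfs://localhost:8020) and Hive metastore URI (thrift://localhost:9083) are assumptions based on typical single-node defaults, so adjust them to match your environment.

dfs plugin:

{
  "type": "file",
  "enabled": true,
  "connection": "hdfs://localhost:8020/",
  "workspaces": {
    "root": { "location": "/", "writable": false, "defaultInputFormat": null }
  },
  "formats": {
    "csv": { "type": "text", "extensions": ["csv"], "delimiter": "," },
    "json": { "type": "json" },
    "parquet": { "type": "parquet" }
  }
}

hive plugin:

{
  "type": "hive",
  "enabled": true,
  "configProps": {
    "hive.metastore.uris": "thrift://localhost:9083",
    "fs.default.name": "hdfs://localhost:8020/",
    "hive.metastore.sasl.enabled": "false"
  }
}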

Change the configuration wherever required and save it. Drill is now ready for use.

Using JSON files:

Run the show files command.
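A minimal example, pointing the session at the dfs plugin first (you can USE any enabled plugin or workspace):

USE dfs;
SHOW FILES;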

Let's execute the following select statement. I will explain this query later.

SELECT employee_id, full_name, first_name, last_name, position_title FROM cp.`employee.json` LIMIT 5;

Well, whatever we queried above was just a JSON file named employee.json, located inside ./jars/3rdparty/foodmart-data-json.0.4.jar. The cp prefix is nothing but the classpath, and the JSON file is present inside that jar file.

Using PARQUET files:

To query a parquet file, we need to move the sample-data directory from the installation location to an HDFS location using the below command.
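Something along these lines; the target path under /user/bhabani is simply the location referenced by the queries later in this post, so substitute your own HDFS user directory if needed:

hdfs dfs -mkdir -p /user/bhabani/sample-data
hdfs dfs -put /etc/drill/apache-drill-1.10.0/sample-data/*.parquet /user/bhabani/sample-data/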

We are now ready to query the parquet file in the Drill shell.
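For example:

SELECT * FROM dfs.`/user/bhabani/sample-data/nation.parquet` LIMIT 5;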

Using TEXT files:

What about text files? Yes, we can query those as well. Just note one thing here.

Drill supports the following file types:

  - Plain text files, including:
      - Comma-separated values (CSV, type: text)
      - Tab-separated values (TSV, type: text)
      - Pipe-separated values (PSV, type: text)
  - Structured data files:
      - Avro (type: avro) (this file type is experimental)
      - JSON (type: json)
      - Parquet (type: parquet)

So I will create one CSV file locally and then move it to an HDFS location.
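A sketch of that step. The file name demo1.csv matches the queries later in this post, but the rows themselves are just made-up sample data whose second column holds nation names so that the join further down has something to match on:

cat > /home/oracle/demo1.csv << 'EOF'
Algiers,ALGERIA
Sao Paulo,BRAZIL
Toronto,CANADA
EOF
hdfs dfs -put /home/oracle/demo1.csv /user/bhabani/demo1.csv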

Now let's query this file from the Drill shell.
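For example (Drill exposes the fields of a CSV file through the columns array):

SELECT columns[0], columns[1] FROM dfs.`/user/bhabani/demo1.csv`;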

Sounds good? Then let's make it more interesting!! 🙂 What if I would like to join a PARQUET and a CSV file? Can we do that? Let's check it out with the below query.

Using joins between files:

select tbl1.N_NAME, tbl1.N_COMMENT, tbl2.columns[0] as nm
from dfs.`/user/bhabani/sample-data/nation.parquet` as tbl1
join dfs.`/user/bhabani/demo1.csv` as tbl2
on tbl1.N_NAME = tbl2.columns[1];

As you saw, I joined the CSV and PARQUET files based on the location name, because this is the only common column present in both data sets, and then displayed two columns from the PARQUET file and one column from the CSV file.

Using Hive:

We will connect to Hive from the Drill shell and execute a select statement as shown below.
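A minimal sketch; the table name countries used here is only a placeholder, so substitute any table that actually exists in your Hive metastore:

USE hive;
SHOW TABLES;
SELECT * FROM hive.`default`.`countries` LIMIT 5;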

Using joins between file and hive:

The last example for this post is to join a CSV file with a Hive table. Use the below query to create a join based on the country and location columns.
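Here is a rough sketch of what such a join can look like. The Hive table and column names (countries, country_name) are placeholders again, while the CSV side exposes its fields through the columns array as before:

SELECT h.country_name, t.columns[0] AS location
FROM hive.`default`.`countries` h
JOIN dfs.`/user/bhabani/demo1.csv` t
ON h.country_name = t.columns[1];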

That's it for today. I would suggest you practice some basic commands and get your hands dirty with all types of joins. Once you are familiar with the commands, we can then use them in ODI. Though ODI will do the work for you, having knowledge of the working principle will help you accomplish your objective quickly and seamlessly.

Stay tuned. Thank you!!


About the author

Bhabani (http://dwteam.in) - Bhabani is currently working as Technical Lead, Product Development at Harman International. He has good expertise in Oracle, ODI, Pervasive Data Integrator, MSBI, Talend, Cloud Integration (AWS, GCP, Azure), MapR and Apache Airflow. He has also been contributing to the ODI OTN forum for the last 7 years. He is from India. If you want to reach him, please visit the contact us page. If you have any doubts or concerns about the above article, please put your question here. The DW Team will try to respond to it as soon as possible. Also, don't forget to provide your comments / suggestions / feedback for further improvement. Thanks for your time.
