Stepping into Big Data World with ODI Footprint, Part 1
This post is dedicated for those ODI lovers requesting me in my personal email id to have some post on ODI and Bigdata Integration. I would also like to encourage you to contribute any creative and informative article you have to share here.
For the past few years, the modern IT world has been abuzz with the words “Big Data”, and it would have been impossible for any of us not to have heard about it. Coming across jargon like Mahout, Pig, Spark, Zoo Keeper, Couch DB, Crunch, DataFu, Storm and Oozie might have been annoying to some of us who wouldn’t have known what all these meant. Regular users of LinkedIn and Twitter (like me) would have seen many discussions about this Big Data, and every now and then, you might have seen news about deals between different companies on Big Data, Hadoop, etc.
Since being exposed to this new concept, I have felt quite restless; as though something big was happening that I was not aware of. As a result, I ended up spending days and nights going through many websites to understand the architecture and how it works in today’s world. Today, I will try to shed some light on this and to give you all an idea of what it is, and how it really matters to us. I have divided this discussion into 4 parts as follows:
- -Part 1 will give a picture about Hadoop ecosystem.
- -In Part 2 we will discuss about Hive, ODI Application Adapter for Hadoop (AAH) and different ODI KMs
- -In Part 3 and 4 we will discuss about some basic mappings to load data in and out of HDFS (if you don’t understand this right now, don’t worry).
So what is Big Data?
The best definition of Big Data can be found, where else, on Wikipedia. Below is the part of that definition that I would like to interpret in my way:
“Big Data is the Data that can be defined in terms of Volume, Velocity and Verity.”
Here, Volume can be represented in terms of GB or TB or PB. Velocity is a measure of how fast such large amounts of data can be processed. Verity represents the type of data, i.e., structured or unstructured. Almost 90% of data in big data is unstructured, and it normally comes from mobile devices, sensors, social media, satellites, stock exchanges, nuclear reactors, scanners, server logs etc. We all are aware of RDBMS systems, which work on ACID principles, but unfortunately they do not have the capability and flexibility to process such volumes of unstructured data.
To resolve this, a new framework named Hadoop was introduced by Doug Cutting, and this project was given to Apache for further research and development. This framework enables distributed processing of large datasets across clusters of commodity hardware (servers). Notice I said commodity hardware – this means it is affordable and easy to maintain. So it need not be a Ferrari car, rather it is more on the lines of a Hyundai Eon or a Chevrolet Spark. 🙂 Before getting in to the core of Hadoop, let’s take a look at the Hadoop ecosystem. The diagram below has been taken from hortonworks.com, and you can get similar diagrams from others like “Cloudera”; these are both companies who provide enterprise distributions of Apache Hadoop based software, support and services.
The Hadoop Ecosystem
Each block in the above diagram is used for different purposes. Some of the major terms that we will be focusing on are: HDFS, Hive, Hbase, Sqoop, Oozie, Pig, and ZooKeeper.
- HDFS: This is the Hadoop Distributed File System that stores very large datasets reliably across commodity of servers and helps in streaming the same with high bandwidth from user application. One could say that this is the heart of Hadoop.
- HIVE: It is an SQL-like language to query data that resides in HDFS.
- HBASE: A non-relational distributed column-oriented database on top of HDFS.
- SQOOP: It is used to load from RDBMS to HDFS and vice-versa.
- OOZIE: A workflow scheduler of multiple Hadoop jobs.
- PIG: It is a scripting language which follows a procedural approach (developers that use Python might like this).
- ZOOKEEPER: ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services.
The following types of databases are used in Hadoop, depending on the requirement:
- Column Type: Accumulo, Cassandra, Druid, HBase, Vertica
- Document Type: Clusterpoint, Apache CouchDB, Couchbase, MarkLogic, MongoDB, OrientDB, Qizx
- Key-value Type: CouchDB, Dynamo, FoundationDB, MemcacheDB, Redis, Riak, Aerospike, OrientDB,
- Graph Type: Allegro, Neo4J, InfiniteGraph, OrientDB, Virtuoso, Stardog
- Multi-model Type: OrientDB, FoundationDB, ArangoDB, Alchemy Database, CortexDB
If you have come this far, it means you are interested to know more about this. 🙂 In the above sections, you might have noticed something strange about this framework, namely, how the above components interact with HDFS and deal with such large volumes of data. To understand how this happens, we need to get to another terminology called ‘Map Reduce’. However, before we do that, let us take a quick look at the HDFS architecture.
The diagram below will give you an idea of the architecture:
Here, the upper part describes the HDFS architecture, where we have one master node and many slave nodes. A master node normally stores the required metadata, like how many blocks are there, which block is stored in which data node, how many replications are available for a block, etc. Replication is nothing but a backup of a block.
A Job Tracker is a master process which runs on the master node. It creates multiple jobs and allocates each job to a Task Tracker that runs on each data node. This Task Tracker will send the status or progress of the job to the Job Tracker. Each data node will also send a “heartbeat” at regular configurable intervals to the master node to inform that they are “alive”. If you want to store a 1.2 GB file in HDFS, it would most likely be split into 20 blocks (1.2 GB / 64 MB). These blocks need not necessarily be in one data node, and can span several data nodes.
Okay, so we got to know how all this data is stored, but we are still in the dark as to how such an enormous amount of data that is measured in terabytes and petabytes is processed and analyzed. The answer is Map Reduce. Map Reduce is a programming model to process and generate large datasets in parallel. In common terms, we can say we will be dealing with a “mapper” and a “reducer” to process and analyze the data that is already in HDFS. Remember, we are not writing any program to store data into HDFS. Here is the command to store data into HDFS: “hadoop fs -put localsystem_filename hdfs_directory”. It is as simple as that.
Let me give you an example here. Think about a blog which creates log files for each user activity. As a blog owner, I want to study the user statistics based on their clicks on each link. Since I have 100K of log files each size of 5mb will be really a challenging task for me to analyze all of them. What I would do in this case is to load all of them into HDFS. I will not be worried about the size as I am storing them in commodity hardware. If the size required for them is more than I expect, all I have to do is keep on adding machines horizontally. Here, I will write a Map Reduce program in Java that will be allocated by the Job Tracker to Task Trackers. Each mapper will generate some key values based on word occurrence, and these key values will be passed to the reducers via some kind of sorting/shuffling, and finally the output will be generated.
With this output I can track the interests of each user, and based on that, I can try to recommend other articles that I have written about similar topics (my ultimate goal is to keep him busy in my website 🙂 ). This is just a small example, but in the enterprise world, businesses are hungry for hidden patterns, forecasting, and prediction analysis, since the more data they have, the better and more accurate the decision that is taken.
Similarly, structured data that is being stored in RDBMS systems can now be migrated to and from HDFS via SQOOP. Write the simplest query you know, say “select * from emp”, using Hive and it will create Map Reduce program for you (yes, you read that correctly). Like Hive, there are some other products, like Impala designed by Cloudera that can now be used as an alternative, but both have their own pros and cons.
Hopefully, this should be sufficient (and not too much 🙂 ) for the first part. I encourage all the readers to go through the Hadoop ecosystem to get acquainted with the rapidly expanding new terminology. In the series, we will try to unleash the power of ODI to deal with real world big data challenges. In case you find any errors or issues in the above article, please don’t hesitate to correct me, as I do not claim to be an expert on this. I was a learner and will always be one. 🙂 Well, until next time!