Download pig-0.15.0.tar.gz:
wget http://archive.apache.org/dist/pig/pig-0.15.0/pig-0.15.0.tar.gz
Untar the archive and move it to /usr/local/pig:
tar -xzvf pig-0.15.0.tar.gz
mv pig-0.15.0 pig
sudo mv pig /usr/local
Set the environment variable PIG_HOME to point to the installation directory:
gedit ~/.bashrc
or nano ~/.bashrc
Add the following entries at the end of the file:
export PIG_HOME=/usr/local/pig
export PATH=$PATH:$PIG_HOME/bin
Note:
1. Restart your terminal to apply the .bashrc change, or run:
source ~/.bashrc
2. We don't need to set the Hadoop home path in the Pig configuration, because Pig scans the properties file and sets HADOOP_HOME automatically.
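As an optional sanity check (a minimal sketch, not part of the official setup), you can confirm from a child process that the variables were exported, for example with a short Python snippet:

```python
import os

def missing_pig_vars(env):
    """Return the names of expected Pig-related variables absent from env."""
    expected = ["PIG_HOME", "PATH"]
    return [name for name in expected if name not in env]

# Inspect the environment of the current process.
absent = missing_pig_vars(os.environ)
if absent:
    print("Not exported yet:", ", ".join(absent))
else:
    print("PIG_HOME =", os.environ["PIG_HOME"])
```

If PIG_HOME is reported as missing, the shell that launched the process has not sourced the updated ~/.bashrc yet.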
Run Pig from the CLI:
pig
Verification:
pig -help
Pig supports two execution modes:
pig -x local (points to your local file system and runs under a single JVM)
pig -x mapreduce, or simply pig (works only if Hadoop is running; connects to the Hadoop cluster)
Pig - Modes of execution:
[Figure: pig_modes_of_execution]
The first level of categorization:
A. Interactive mode: using the Grunt shell; the user interacts with the shell each time they want to perform an operation in Pig.
B. Batch mode: the user runs a set of commands together as a batch by saving them to a script file with the extension .pig.
The second level of categorization:
A. Local mode: Local mode builds a simulated MapReduce job running off a local file on disk. It is in theory equivalent to MapReduce, but it is not a "real" MapReduce job; from a user's perspective you shouldn't be able to tell the difference. Local mode is great for development: all scripts run on a single machine without requiring Hadoop MapReduce or HDFS, which is useful for developing and testing Pig logic. If you're using a small data set to develop or test your code, local mode can be faster than going through the MapReduce infrastructure. Local mode doesn't require Hadoop at all: the Pig program runs in the context of a local Java Virtual Machine, and data access goes through the local file system of a single machine. Under the hood, local mode is a local simulation of MapReduce using Hadoop's LocalJobRunner class.
B. MapReduce mode (also known as Hadoop mode): Pig executes on the Hadoop cluster. In this case, the Pig script is converted into a series of MapReduce jobs that are then run on the Hadoop cluster.
[Figure: pig_architecture]
Benefit of local mode:
If you have a terabyte of data you want to operate on and you want to develop a program interactively, you may soon find things slowing down considerably, and your storage needs growing. Local mode lets you work with a subset of your data in a more interactive manner, so you can work out the logic (and the bugs) of your Pig program. After everything is set up the way you want and your operations run smoothly, you can run the script against the full data set using MapReduce mode.
/*
date_plan_percent.pig
- Displays Date : Plan : Percent
- It includes Pig Latin statements, built-in functions, and a User Defined Function (UDF)
*/
--Script Start
/*
REGISTER provides the location of the jar file that contains the UDF; here that jar is myudfs.jar.
To locate the jar file, Pig first checks the classpath. If the jar file can't be found on the classpath, Pig assumes the location is either an absolute path or a path relative to the location from which Pig was invoked.
If the jar file can't be found, an error is printed: java.io.IOException: Can't read jar file: myudfs.jar.
Multiple REGISTER commands can be used in the same script. If the same fully qualified function is present in multiple jars, the first occurrence is used, consistent with Java semantics.
The name of the UDF has to be fully qualified with its package name, or an error is reported.
*/
REGISTER myudfs.jar;
/*
Load functions determine how data goes into Pig.
Pig provides a set of built-in load functions.
@BinStorage - Loads and stores data in machine-readable format. #A = LOAD 'data' USING BinStorage();
@PigStorage - Loads and stores data in UTF-8 format. #PigStorage(field_delimiter) - The default field delimiter is tab ('\t').
@TextLoader - Loads unstructured data in UTF-8 format. #A = LOAD 'data' USING TextLoader();
We use the PigStorage load function with '|' as the field delimiter. PigStorage uses '\n' as the record delimiter.
recharge is a relation.
The first field of each tuple is stored as PHONE, the second as DATE, and so on.
recharge schema - PHONE:long, DATE:chararray, CHANNEL:long, PLAN:int, AMOUNT:double
DESCRIBE - recharge: {PHONE: long,DATE: chararray,CHANNEL: long,PLAN: int,AMOUNT: double}
ILLUSTRATE
------------------------------------------------------------------------------------------------------------
| recharge | PHONE:long | DATE:chararray | CHANNEL:long | PLAN:int | AMOUNT:double |
------------------------------------------------------------------------------------------------------------
| | 7755058797 | 20130313 | 222319 | 13 | 50.0 |
| | 7732937697 | 20130313 | 150750 | 13 | 20.0 |
------------------------------------------------------------------------------------------------------------
*/
recharge = LOAD '/user/hdfs/cdr/recharges.txt' using PigStorage('|') as (PHONE:long, DATE:chararray, CHANNEL:long, PLAN:int, AMOUNT:double);
/*
The GROUP operator groups together tuples that have the same group key (key field); here the key is (DATE, PLAN).
DESCRIBE - grp_date_plan: {group: (DATE: chararray,PLAN: int),recharge: {(PHONE: long,DATE: chararray,CHANNEL: long,PLAN: int,AMOUNT: double)}}
ILLUSTRATE
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| grp_date_plan | group:tuple(DATE:chararray,PLAN:int) | recharge:bag{:tuple(PHONE:long,DATE:chararray,CHANNEL:long,PLAN:int,AMOUNT:double)} |
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| | (DATE, ) | {} |
| | (DATE, ) | {} |
| | (20130211, 11) | {} |
| | (20130211, 11) | {} |
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
*/
grp_date_plan = GROUP recharge BY (DATE, PLAN);
/*
FOREACH - Generates data transformations based on columns of data. Here it iterates over grp_date_plan and generates date_plan_wise.
count is the number of tuples for each unique (DATE, PLAN) pair.
DESCRIBE - date_plan_wise: {grp: (DATE: chararray,PLAN: int),count: int}
ILLUSTRATE
---------------------------------------------------------------------------------------
| date_plan_wise | grp:tuple(DATE:chararray,PLAN:int) | count:int |
---------------------------------------------------------------------------------------
| | (DATE, ) | 0 |
| | (20130211, 11) | 2 |
---------------------------------------------------------------------------------------
*/
date_plan_wise = FOREACH grp_date_plan GENERATE group as grp, COUNT(recharge) as count:int;
/*
FOREACH - Generates data transformations based on columns of data. Here it iterates over date_plan_wise and generates newdata.
count is the number of tuples for each unique (DATE, PLAN) pair.
Note that a field of a group that is a tuple can be accessed using '.', e.g. grp.DATE.
DESCRIBE - newdata: {DATE: chararray,PLAN: int,count: int}
ILLUSTRATE
-------------------------------------------------------------------
| newdata | DATE:chararray | PLAN:int | count:int |
-------------------------------------------------------------------
| | DATE | | 0 |
| | 20130211 | 11 | 2 |
-------------------------------------------------------------------
*/
newdata = FOREACH date_plan_wise GENERATE grp.DATE AS DATE, grp.PLAN AS PLAN:int, count as count:int;
/*
FILTER - Selects tuples from a relation based on some condition.
Here we drop any tuple whose $0 (positional notation, i.e. the field at index 0) equals 'DATE', which removes the header row.
DESCRIBE - newd: {DATE: chararray,PLAN: int,count: int} - no change in schema
ILLUSTRATE
----------------------------------------------------------------
| newd | DATE:chararray | PLAN:int | count:int |
----------------------------------------------------------------
| | 20130211 | 11 | 2 |
----------------------------------------------------------------
*/
newd = FILTER newdata BY $0!='DATE';
/*
The GROUP operator groups together tuples that have the same group key (key field); here the key is DATE.
DESCRIBE - grp_date: {group: chararray,recharge: {(PHONE: long,DATE: chararray,CHANNEL: long,PLAN: int,AMOUNT: double)}}
ILLUSTRATE
--------------------------------------------------------------------------------------------------------------------------------------------------------
| grp_date | group:chararray | recharge:bag{:tuple(PHONE:long,DATE:chararray,CHANNEL:long,PLAN:int,AMOUNT:double)} |
--------------------------------------------------------------------------------------------------------------------------------------------------------
| | 20130313 | {(7755058797, ..., 50.0), (7732937697, ..., 20.0)} |
--------------------------------------------------------------------------------------------------------------------------------------------------------
*/
grp_date = GROUP recharge BY (DATE);
/*
FOREACH - Generates data transformations based on columns of data. Here it iterates over grp_date and generates maxcount.
count1 is the number of tuples for each unique DATE.
DESCRIBE - maxcount: {grp: chararray,count1: int}
ILLUSTRATE
-----------------------------------------------------
| maxcount | grp:chararray | count1:int |
-----------------------------------------------------
| | DATE | 0 |
| | 20130313 | 2 |
-----------------------------------------------------
*/
maxcount = FOREACH grp_date GENERATE group AS grp, COUNT(recharge) AS count1:int;
/*
FILTER - Selects tuples from a relation based on some condition.
Here we drop any tuple whose $0 (positional notation, i.e. the field at index 0) equals 'DATE', which removes the header row.
DESCRIBE - maxcount1: {grp: chararray,count1: int}
ILLUSTRATE
------------------------------------------------------
| maxcount1 | grp:chararray | count1:int |
------------------------------------------------------
| | 20130313 | 2 |
------------------------------------------------------
*/
maxcount1 = FILTER maxcount BY $0!='DATE';
/*
JOIN - Performs an inner equijoin of two or more relations based on common field values.
DESCRIBE - join_on_date: {newd::DATE: chararray,newd::PLAN: int,newd::count: int,maxcount1::grp: chararray,maxcount1::count1: int}
*/
join_on_date = JOIN newd BY ($0), maxcount1 BY ($0);
/*
Relation containing the final result, i.e. Date : Plan : Percent.
DESCRIBE - finaltb: {Date: chararray,Plan: int,per: chararray}
ILLUSTRATE
-----------------------------------------------------------------------
| finaltb     | DATE:chararray      | PLAN:int      | per:chararray   |
-----------------------------------------------------------------------
|             | 20130209            | 3             | 100.0%          |
-----------------------------------------------------------------------
*/
finaltb = FOREACH join_on_date GENERATE $0 AS Date, $1 AS Plan:int, myudfs.Percentage($2,$4) as per:chararray;
/*
DUMP - Writes the relation to the screen.
*/
DUMP finaltb;
--Script End
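To see the data flow of the script without a Hadoop installation, the same grouping-and-percentage logic can be sketched in plain Python. This is a simplified stand-in, not the Pig execution itself, and it assumes the myudfs.Percentage UDF simply formats count/total as a percentage string:

```python
from collections import Counter

def plan_percentages(records):
    """records: iterable of (phone, date, channel, plan, amount) tuples.
    Returns {(date, plan): 'xx.x%'} - each plan's share of that date's
    recharges, mirroring the GROUP/COUNT/JOIN steps of the Pig script."""
    per_date_plan = Counter()   # like grp_date_plan followed by COUNT
    per_date = Counter()        # like grp_date followed by COUNT
    for phone, date, channel, plan, amount in records:
        per_date_plan[(date, plan)] += 1
        per_date[date] += 1
    # like the JOIN on DATE followed by the assumed Percentage UDF
    return {
        (date, plan): "%.1f%%" % (100.0 * n / per_date[date])
        for (date, plan), n in per_date_plan.items()
    }

sample = [
    (7755058797, "20130313", 222319, 13, 50.0),
    (7732937697, "20130313", 150750, 13, 20.0),
    (7700000001, "20130313", 150750, 11, 30.0),
]
print(plan_percentages(sample))
# → {('20130313', 13): '66.7%', ('20130313', 11): '33.3%'}
```

The header-row FILTER steps of the script are unnecessary here because the sample tuples carry no header line.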