Sunday, April 22, 2018

Overview of Pig


  • Apache Pig is a high-level platform for creating programs that run on Apache Hadoop.
  • The language for this platform is called Pig Latin.
  • Pig can execute its Hadoop jobs on MapReduce, Apache Tez, or Apache Spark.
Local mode
  • In local mode, Pig runs in a single JVM and accesses the local file system. This mode is suitable only for small data sets.
  • The execution type is set with the -x (or -exectype) option. To run in local mode, set the option to local:

pig -x local

Load data
  • The LOAD operator takes a URI argument. It can refer to a local file or to an HDFS URI.
  • The 'AS' clause declares the schema. If 'AS' is omitted, the relation has no schema: DESCRIBE reports a message such as "Schema for emp3 is unknown", and fields can only be referenced by position ($0, $1, ...).
emp = LOAD '/home/cloudera/wrt' AS (id:int, sal:int, name:chararray);
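
To see the effect of the AS clause, the schema of a relation can be inspected with DESCRIBE. The following is a minimal sketch that assumes the same input file as above. The alias emp_csv and the comma delimiter in the last statement are assumptions for illustration only (PigStorage splits on tabs by default).

```pig
-- Without AS: no schema, so fields are addressed by position ($0, $1, $2)
emp3 = LOAD '/home/cloudera/wrt';
DESCRIBE emp3;

-- With AS: fields are named and typed
emp = LOAD '/home/cloudera/wrt' AS (id:int, sal:int, name:chararray);
DESCRIBE emp;

-- Assumption: a comma-separated file; the delimiter is passed to PigStorage
emp_csv = LOAD '/home/cloudera/wrt' USING PigStorage(',') AS (id:int, sal:int, name:chararray);
```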

Dump
Relations (bags) are given names, or aliases, by which we refer to them. The relation loaded above is given the alias emp. We can examine the contents of an alias using the DUMP operator.

DUMP emp;
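
DUMP prints every tuple of the relation to the console, which can be a lot for a large input. A minimal sketch of previewing only a few tuples instead, assuming the emp relation loaded earlier; the alias emp_preview and the count 5 are illustrative and not from the original post:

```pig
-- Keep at most 5 tuples of emp (which tuples are kept is not guaranteed without ORDER)
emp_preview = LIMIT emp 5;
DUMP emp_preview;
```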

Filtering data 
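The FILTER operator selects the tuples of a bag that satisfy a condition. The following keeps only the tuples whose id is greater than 102:
grunt> emp = LOAD '/home/cloudera/wrt' AS (id:int, sal:int, name:chararray);
grunt> filtered_emp = FILTER emp BY id > 102;
grunt> DUMP filtered_emp;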
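
Filter conditions can also be combined or applied to string fields. The sketch below assumes the emp relation with the schema shown earlier; the aliases well_paid and a_names, the threshold 5000, and the pattern 'A.*' are hypothetical values chosen for illustration.

```pig
-- Combine comparisons with AND / OR / NOT
well_paid = FILTER emp BY id > 102 AND sal >= 5000;

-- MATCHES takes a Java regular expression that must match the whole string
a_names = FILTER emp BY name MATCHES 'A.*';

DUMP well_paid;
```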

Group By clause
The GROUP operator collects the tuples of a single relation that share the same key value, here sal.
grunt> emp = LOAD '/home/cloudera/wrt' AS (id:int, sal:int, name:chararray);
grunt> grouped_emp = GROUP emp BY sal;
grunt> DUMP grouped_emp;
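
The shape of a grouped relation is easiest to see on a small input. The sketch below assumes three hypothetical tuples in emp (the ids, salaries, and names are invented for illustration) and shows, in comments, what the grouped tuples would look like.

```pig
-- Hypothetical input tuples in emp:
--   (101,5000,anil)
--   (102,5000,bina)
--   (103,7000,chet)
grouped_emp = GROUP emp BY sal;

-- Each output tuple has two fields: 'group' (the key) and a bag named 'emp'
DESCRIBE grouped_emp;

-- DUMP would then show one tuple per distinct sal, shaped like:
--   (5000,{(101,5000,anil),(102,5000,bina)})
--   (7000,{(103,7000,chet)})
-- (tuple order is not guaranteed)
DUMP grouped_emp;
```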


FOREACH...GENERATE
  • Acts on every row in the relation.
  • Can be used to remove fields or to generate new ones.
grunt> max_sal = FOREACH grouped_emp GENERATE group, MAX(emp.sal);
grunt> DUMP max_sal;
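
FOREACH...GENERATE covers both uses listed above: dropping fields and producing new ones. A minimal sketch, assuming the emp and grouped_emp relations from the earlier sections; the aliases emp_names, emp_annual, and sal_stats and the output field names are hypothetical.

```pig
-- Projection: keep only id and name, dropping sal
emp_names = FOREACH emp GENERATE id, name;

-- Derived field: compute a new column and name it with AS
emp_annual = FOREACH emp GENERATE id, sal * 12 AS annual_sal;

-- Per-group aggregates: name the outputs so later statements can refer to them
sal_stats = FOREACH grouped_emp GENERATE group AS sal, COUNT(emp) AS num_emp, MAX(emp.id) AS max_id;
DUMP sal_stats;
```

Naming generated fields with AS is optional, but unnamed fields can only be referenced by position in later statements.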

