Web Snippets: Overview of Pig

Sunday, April 22, 2018

Overview of Pig

Apache Pig is a high-level platform for creating programs that run on Apache Hadoop.
The language for this platform is called Pig Latin.
Pig can execute its Hadoop jobs on MapReduce, Apache Tez, or Apache Spark.

Local mode

In local mode, Pig runs in a single JVM and accesses the local file system. This mode is suitable only for small data sets, not for large ones.
We select the execution type with the -x (or -exectype) option. To run in local mode, set the option to local:

pig -x local

Load data

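The LOAD operator takes a URI argument. The URI can refer to a local file or to a location in HDFS.
The AS clause assigns a schema to the loaded data. If the AS clause is not used, the relation has no schema, and Pig reports a message such as "Schema for emp3 is unknown" when asked to describe it (emp3 being the alias in question).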

emp = LOAD '/home/cloudera/wrt' AS (id:int, sal:int, name:chararray);
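
As a rough sketch of how the AS clause and the loader interact, the snippet below loads a comma-delimited file and prints the declared schema. The file name wrt.csv and the aliases emp_csv, raw, and first_col are assumptions made for illustration and are not part of the original post. PigStorage is Pig's default loader, which splits on tabs unless another delimiter is given.

```pig
-- Hypothetical comma-delimited copy of the data; path and aliases are assumptions.
emp_csv = LOAD '/home/cloudera/wrt.csv' USING PigStorage(',')
          AS (id:int, sal:int, name:chararray);

-- Print the schema declared by the AS clause.
DESCRIBE emp_csv;

-- Without AS the relation has no schema, so fields are addressed by position.
raw = LOAD '/home/cloudera/wrt.csv' USING PigStorage(',');
first_col = FOREACH raw GENERATE $0;
```

Declaring types in AS lets later statements refer to fields by name instead of by position.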

Dump
Relations (bags of tuples) are given names, called aliases, by which we refer to them. The relation loaded above has the alias emp. We can examine the contents of an alias using the DUMP operator.

DUMP emp;

Filtering data
The FILTER operator selects the tuples of a bag that satisfy a condition.

grunt> emp = LOAD '/home/cloudera/wrt' AS (id:int, sal:int, name:chararray);
grunt> filtered_emp = FILTER emp BY id > 102;
grunt> DUMP filtered_emp;
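
FILTER conditions can also be combined with AND, OR, and NOT. The following is a minimal sketch under the same schema as above. The threshold values, the name 'Ravi', and the aliases senior_emp and not_ravi are made up for illustration.

```pig
emp = LOAD '/home/cloudera/wrt' AS (id:int, sal:int, name:chararray);

-- Keep tuples that satisfy both conditions (thresholds are illustrative).
senior_emp = FILTER emp BY id > 102 AND sal >= 5000;

-- Keep tuples whose name differs from a given string (hypothetical value).
not_ravi = FILTER emp BY NOT (name == 'Ravi');

DUMP senior_emp;
DUMP not_ravi;
```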

Group By clause
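GROUP collects the tuples of a single relation that share the same key. Each output tuple has two fields: group, which holds the key, and a bag named after the input relation, which holds the matching tuples.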

grunt> emp = LOAD '/home/cloudera/wrt' AS (id:int, sal:int, name:chararray);
grunt> grouped_emp = GROUP emp BY sal;
grunt> DUMP grouped_emp;
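
To see the nested structure that GROUP produces, it may help to describe the grouped relation rather than dump it. This is a small sketch reusing the relation from above. The alias all_emp is an assumption added for illustration.

```pig
emp = LOAD '/home/cloudera/wrt' AS (id:int, sal:int, name:chararray);
grouped_emp = GROUP emp BY sal;

-- Shows two fields: group (the int sal key) and emp,
-- a bag of the (id, sal, name) tuples that share that key.
DESCRIBE grouped_emp;

-- GROUP ... ALL places every tuple in a single group,
-- which is useful for aggregates over the whole relation.
all_emp = GROUP emp ALL;
DESCRIBE all_emp;
```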


FOREACH ... GENERATE

FOREACH ... GENERATE applies an expression to every tuple in a relation.
It can be used to drop fields or to compute new ones.

grunt> max_sal = FOREACH grouped_emp GENERATE group, MAX(emp.sal);
grunt> DUMP max_sal;
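
FOREACH can also project and derive fields on an ungrouped relation, and name its outputs with AS. The sketch below assumes the same schema as above. The aliases annual and sal_stats, the output field names, and the assumption that sal is a monthly figure are illustrative only.

```pig
emp = LOAD '/home/cloudera/wrt' AS (id:int, sal:int, name:chararray);

-- Drop id and derive a new field from sal (assumes sal is monthly).
annual = FOREACH emp GENERATE name, sal * 12 AS annual_sal;

-- On a grouped relation, aggregate over the bag held in each group.
grouped_emp = GROUP emp BY sal;
sal_stats = FOREACH grouped_emp GENERATE group AS sal,
                                         COUNT(emp) AS headcount,
                                         MAX(emp.id) AS max_id;

DUMP annual;
DUMP sal_stats;
```

Naming the generated fields with AS keeps later statements readable, since unnamed expressions can otherwise only be referenced by position.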

