Start the Hive shell
hive
Create a schema in Hive
hive> create schema hiveschema location '/hivedatabase/';
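A quick way to confirm the schema was created and check its HDFS location:
hive> describe database hiveschema;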
Create a table with a specified location
use hiveschema; create external table employee(emp_id int,emp_name string,emp_phone string) row format delimited fields terminated by '\t' location '/hivedatabase/employee';
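To verify that the table was registered as external and points at the expected directory:
hive> describe formatted employee;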
Create an external table in Hive
Note: The difference between an external and an internal (managed) table: dropping an internal table removes both the data and the metadata, while dropping an external table removes only the metadata and leaves the data untouched.
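For example (a quick check using the external employee table and path created above), dropping the table removes only its metadata and the files under the location path stay in HDFS:
hive> drop table employee;
hive> dfs -ls /hivedatabase/employee;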
External table
When data already exists in HDFS, an external Hive table can be created to describe it. It is called external because the data lives at the path given in the LOCATION clause instead of the default warehouse directory.
- The table name becomes a directory in HDFS
- The schema is stored in the metastore
create table product_ratings(UserId string, ISBN string) partitioned by (ProductRating int) row format delimited fields terminated by '\t';
Load data from HDFS
load data inpath '/Locations/Products_Ratings.csv' into table product_ratings partition (ProductRating=0);
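Note that LOAD DATA INPATH moves the file from its original HDFS location into the partition's directory. To see the file in place afterwards (assuming the default warehouse path, since no location was given for this table):
hive> dfs -ls /user/hive/warehouse/product_ratings/productrating=0/;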
Query data
Now we can select * from the table, or add a condition on the partition column. When the query filters on the partition column, Hive reads only the matching partition directories, so fetching the data is faster.
select * from product_ratings where ProductRating=0;
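To list the partitions that exist, or to check that only the matching partition is scanned, the standard commands below can be used:
hive> show partitions product_ratings;
hive> explain select * from product_ratings where ProductRating=0;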
If the table is external, its data is retained in HDFS after the table is dropped. To access the data again, recreate the table with the same definition; the existing files at the location are picked up automatically.
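A minimal sketch using the external employee table from above: recreating the same definition over the existing location makes the data queryable again without reloading it.
hive> create external table employee(emp_id int, emp_name string, emp_phone string) row format delimited fields terminated by '\t' location '/hivedatabase/employee';
hive> select * from employee limit 5;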
Partitioning of tables
- Hive can organize a table into partitions.
- Partitions divide the table into related parts based on the values of a column (here, ProductRating).
- Queries that filter on the partition column only scan the relevant partitions, which makes querying more efficient (see the sketch below).
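Each partition is stored as its own subdirectory under the table's directory, and partitions can also be added explicitly. A small sketch (assuming the default warehouse path used above; ProductRating=1 is just a hypothetical new partition):
hive> dfs -ls /user/hive/warehouse/product_ratings/;
hive> alter table product_ratings add partition (ProductRating=1);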
sqoop import --connect jdbc:mysql://localhost:3306/securemask \
--username root --password cloudera \
--table contacts \
--fields-terminated-by "," \
--target-dir /user/cloudera/workspace/contactsmask \
--hive-import
In this import we specify the Hive schema and table rather than a target directory:
sqoop import --connect jdbc:mysql://quickstart.cloudera/original \
--username root --password cloudera \
--table employee \
--hive-import \
--hive-table hiveschema.employee -m 1 \
--fields-terminated-by '\t' \
--direct
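To verify that the import landed in the Hive schema (a quick check):
hive> use hiveschema;
hive> show tables;
hive> select * from employee limit 5;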
Add UDF function
hive> ADD JAR /home/cloudera/mask.jar;
hive> CREATE TEMPORARY FUNCTION MASK AS 'hiveudf.PImask';
Note: Before running the masking queries below, make sure the UDF jar has been added and the temporary function created as shown above.
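A quick sanity check that the function resolves (the literal value here is just an example; on Hive 0.13+ a SELECT works without a FROM clause):
hive> select MASK('9876543210');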
Create similar tables in Hive
hive> create table securedaccounts like accounts;
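To confirm the new table copied the column definitions from accounts:
hive> describe securedaccounts;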
DATA MASKING
Insert masked phone number into accounts
insert overwrite table securedaccounts select id,name,MASK(phone) from accounts;
Insert masked phone and email into contacts
insert overwrite table securedcontacts select id,accountid,firstname,
lastname,MASK(phone), MASK(email) from contacts;
Move the tables (securedcontacts, securedaccounts) from Hive to MySQL
sqoop export --connect jdbc:mysql://localhost:3306/securemask \
--username root --password cloudera \
--table contacts \
--export-dir /user/hive/warehouse/securedcontacts
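The securedaccounts table can be exported the same way (a sketch, assuming the securemask database also has an accounts table with matching columns):
sqoop export --connect jdbc:mysql://localhost:3306/securemask \
--username root --password cloudera \
--table accounts \
--export-dir /user/hive/warehouse/securedaccounts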
RETRIEVING INFORMATION
| Function | MySQL | Hive |
|---|---|---|
| Retrieving Information (General) | SELECT from_columns FROM table WHERE conditions; | SELECT from_columns FROM table WHERE conditions; |
| Retrieving All Values | SELECT * FROM table; | SELECT * FROM table; |
| Retrieving Some Values | SELECT * FROM table WHERE rec_name = "value"; | SELECT * FROM table WHERE rec_name = "value"; |
| Retrieving With Multiple Criteria | SELECT * FROM table WHERE rec1 = "value1" AND rec2 = "value2"; | SELECT * FROM table WHERE rec1 = "value1" AND rec2 = "value2"; |
| Retrieving Specific Columns | SELECT column_name FROM table; | SELECT column_name FROM table; |
| Retrieving Unique Output | SELECT DISTINCT column_name FROM table; | SELECT DISTINCT column_name FROM table; |
| Sorting | SELECT col1, col2 FROM table ORDER BY col2; | SELECT col1, col2 FROM table ORDER BY col2; |
| Sorting Reverse | SELECT col1, col2 FROM table ORDER BY col2 DESC; | SELECT col1, col2 FROM table ORDER BY col2 DESC; |
| Counting Rows | SELECT COUNT(*) FROM table; | SELECT COUNT(*) FROM table; |
| Grouping With Counting | SELECT owner, COUNT(*) FROM table GROUP BY owner; | SELECT owner, COUNT(*) FROM table GROUP BY owner; |
| Maximum Value | SELECT MAX(col_name) AS label FROM table; | SELECT MAX(col_name) AS label FROM table; |
| Selecting from multiple tables (join using alias w/ "AS") | SELECT pet.name, comment FROM pet, event WHERE pet.name = event.name; | SELECT pet.name, comment FROM pet JOIN event ON (pet.name = event.name); |
METADATA
| Function | MySQL | Hive |
|---|---|---|
| Selecting a database | USE database; | USE database; |
| Listing databases | SHOW DATABASES; | SHOW DATABASES; |
| Listing tables in a database | SHOW TABLES; | SHOW TABLES; |
| Describing the format of a table | DESCRIBE table; | DESCRIBE (FORMATTED\|EXTENDED) table; |
| Creating a database | CREATE DATABASE db_name; | CREATE DATABASE db_name; |
| Dropping a database | DROP DATABASE db_name; | DROP DATABASE db_name (CASCADE); |
CURRENT SQL COMPATIBILITY
COMMAND LINE
| Function | Hive |
|---|---|
| Run Query | hive -e 'select a.col from tab1 a' |
| Run Query Silent Mode | hive -S -e 'select a.col from tab1 a' |
| Set Hive Config Variables | hive -e 'select a.col from tab1 a' --hiveconf hive.root.logger=DEBUG,console |
| Use Initialization Script | hive -i initialize.sql |
| Run Non-Interactive Script | hive -f script.sql |