Thursday, January 31, 2019

Fast Export vs Fast Load (Teradata)





**************
FAST EXPORT
**************
  • Used to export data from Teradata into flat files
  • Can generate data in report format
  • Data can be extracted from one or more tables using join
  • Deals in block export (Useful for extracting large volumes)
  • Can ship data over multiple session connections simultaneously, thereby leveraging the total connectivity available between the client platform and the database engine. To do this, FastExport spends more resources executing the query, preparing the blocks so that when they are exported over multiple sessions they can be reassembled in the right order by the client without additional sorting or processing of the rows.

rank vs dense_rank vs row_number with partition

  • One of the most obvious and useful set of window functions are ranking functions where rows from your result set are ranked according to a certain scheme. 
  • There are three ranking functions: ROW_NUMBER(), RANK(), and DENSE_RANK() 
DDL Scripts
Create table dup_employees
(
     id int,
     first_name nvarchar(50),
     last_name nvarchar(50),
     gender nvarchar(50),
     salary int
);
Insert into dup_employees values (1, 'Mark', 'Hastings', 'Male', 60000);
Insert into dup_employees values (1, 'Mark', 'Hastings', 'Male', 60000);
Insert into dup_employees values (1, 'Mark', 'Hastings', 'Male', 60000);
Insert into dup_employees values (2, 'Mary', 'Lambeth', 'Female', 30000);
Insert into dup_employees values (2, 'Mary', 'Lambeth', 'Female', 30000);
Insert into dup_employees values (3, 'Ben', 'Hoskins', 'Male', 70000);
Insert into dup_employees values (3, 'Ben', 'Hoskins', 'Male', 70000);
Insert into dup_employees values (3, 'Ben', 'Hoskins', 'Male', 70000);
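The difference between the three functions shows up when ties occur. A runnable sketch on the dup_employees data above, partitioned by gender, using SQLite (window functions need SQLite 3.25+):

```python
import sqlite3

# Compare ROW_NUMBER, RANK, and DENSE_RANK over the dup_employees sample data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dup_employees (id int, first_name text, last_name text, gender text, salary int);
INSERT INTO dup_employees VALUES
 (1,'Mark','Hastings','Male',60000),(1,'Mark','Hastings','Male',60000),
 (1,'Mark','Hastings','Male',60000),(2,'Mary','Lambeth','Female',30000),
 (2,'Mary','Lambeth','Female',30000),(3,'Ben','Hoskins','Male',70000),
 (3,'Ben','Hoskins','Male',70000),(3,'Ben','Hoskins','Male',70000);
""")
rows = conn.execute("""
SELECT first_name, salary,
       ROW_NUMBER() OVER (PARTITION BY gender ORDER BY salary DESC) AS rn,
       RANK()       OVER (PARTITION BY gender ORDER BY salary DESC) AS rnk,
       DENSE_RANK() OVER (PARTITION BY gender ORDER BY salary DESC) AS drnk
FROM dup_employees
ORDER BY gender, salary DESC
""").fetchall()
# On ties: ROW_NUMBER keeps counting (1,2,3,...), RANK leaves gaps after a tie
# (1,1,1,4,...), DENSE_RANK does not leave gaps (1,1,1,2,...).
for row in rows:
    print(row)
```

For the three Mark rows (tied at 60000 behind three Bens at 70000), ROW_NUMBER gives 4, 5, 6; RANK gives 4, 4, 4; DENSE_RANK gives 2, 2, 2.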

SQL Basics


SELF JOIN
  • A self join is a join in which a table is joined with itself
Ex: when we want to get the name of the manager, who is also an employee, the left side of the join condition is the first copy of the table (employee e) and the right side is the second copy (employee m).
Note: It takes manager_id from the employee (e) table and looks it up against employee_id in the employee (m) table

Select e.id, e.employee_name, m.employee_name as manager_name
from employee e
join employee m
on e.manager_id = m.employee_id
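A runnable sketch of the self join above, using SQLite; the sample data is made up for illustration (Alice manages Bob and Carol, Bob manages Dave):

```python
import sqlite3

# Self join: the employee table is joined to itself, aliased once as the
# employee side (e) and once as the manager side (m).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employee (employee_id int, employee_name text, manager_id int);
INSERT INTO employee VALUES
 (1,'Alice',NULL),(2,'Bob',1),(3,'Carol',1),(4,'Dave',2);
""")
rows = conn.execute("""
SELECT e.employee_id, e.employee_name, m.employee_name AS manager_name
FROM employee e
JOIN employee m ON e.manager_id = m.employee_id
ORDER BY e.employee_id
""").fetchall()
for row in rows:
    print(row)
```

Note that Alice has a NULL manager_id, so she drops out of the inner join; a LEFT JOIN would keep her with a NULL manager_name.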


MAPR Products


MapR                              :  Started in stealth mode (2009)
MapR-FS                           :  A Hadoop-compatible file system
MapR-DB                           :  The first in-house DB that ran on the same technology stack
Apache Drill                      :  The first schema-free analytics engine
MapR Streams                      :  Introduced for global event processing
Converged Data Platform           :  Brands all the above products into a single platform.
                                                The only converged data platform in the industry.
                                                Supports all kinds of data
                                                Runs on every cloud, on premises, and at the edge
                                                Highly available by design
                                                Provides a global data fabric
                                                Includes a global database
                                                Includes a global event streaming engine
                                                Operates at unlimited scale
                                                Supports files, tables, documents, and streams
                                                Supports Docker containers to keep the platform highly available
                                                The MapR file system is different from other Hadoop file systems



Aws cloud formation


Is a service that helps you model and set up your Amazon Web Services resources so that you can spend less time managing those resources and more time focusing on the applications that run on AWS

TEMPLATES
We can also create templates in AWS CloudFormation. We can use the CloudFormation Designer to create and save a template.

  •  To create a CloudFormation stack we need a JSON template. It can also be created using the CloudFormation Designer: when we drag a resource onto the canvas, the corresponding JSON is generated.
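A minimal template sketch: a stack that provisions a single S3 bucket. The bucket name here is a placeholder, not a real resource:

```json
{
  "AWSTemplateFormatVersion": "2010-09-09",
  "Description": "Minimal example: a single S3 bucket",
  "Resources": {
    "MyBucket": {
      "Type": "AWS::S3::Bucket",
      "Properties": { "BucketName": "my-example-bucket-name" }
    }
  }
}
```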

Wednesday, January 30, 2019

Apache Accumulo



 Is a robust, scalable data storage and retrieval system based on Google's Bigtable design and built on top of Apache Hadoop, ZooKeeper, and Thrift.
Its improvements on the Bigtable design are:

  • Server-side programming mechanism that can modify key/value pairs
  • Cell based access control

Tuesday, January 29, 2019

Ingestion in GCP



In the figure, the x-axis represents how close an option is to GCP and the y-axis the amount of data.
These are the different approaches we can take for data ingestion into GCP.

TRANSFER SERVICE
Storage Transfer Service allows you to quickly import online data into Cloud Storage. You can also set up a repeating schedule for transferring data, as well as transfer data within Cloud Storage, from one bucket to another.

Wednesday, January 23, 2019

Edge nodes

  • Edge nodes are the interface between the Hadoop cluster and the outside network. For this reason, they’re sometimes referred to as gateway nodes. Most commonly, edge nodes are used to run client applications and cluster administration tools.

Tinyurl design

We have different options for generating a tiny URL. The basic rule of thumb is that we need to convert the long URL into a tiny URL and store it in a database or cache for future retrieval.

1) Sometimes we have restrictions on the tiny URL, e.g., 41 bits
2) The allowed characters in the tiny URL could be uppercase and lowercase alphanumeric characters
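A common design under these constraints: store the long URL, take the auto-increment database id, and encode it in base 62 (0-9, a-z, A-Z). A sketch:

```python
import string

# Base-62 alphabet: 10 digits + 26 lowercase + 26 uppercase = 62 characters.
ALPHABET = string.digits + string.ascii_lowercase + string.ascii_uppercase

def encode_base62(n: int) -> str:
    """Encode a non-negative integer id as a short base-62 code."""
    if n == 0:
        return ALPHABET[0]
    out = []
    while n > 0:
        n, rem = divmod(n, 62)
        out.append(ALPHABET[rem])
    return "".join(reversed(out))

def decode_base62(code: str) -> int:
    """Recover the integer id from a base-62 code."""
    n = 0
    for ch in code:
        n = n * 62 + ALPHABET.index(ch)
    return n

print(encode_base62(125))   # prints "21", since 125 = 2*62 + 1
```

A 7-character code covers 62**7 (about 3.5 trillion) distinct ids, which is why short codes stay short even at large scale.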

Fact Table


  • A fact table is the central table in a star schema of a data warehouse. 
  • A fact table stores quantitative information for analysis and is often denormalized.
  • A fact table works with dimension tables. A fact table holds the data to be analyzed, and a dimension table stores data about the ways in which the data in the fact table can be analyzed. Thus, the fact table consists of two types of columns: the foreign key columns allow joins with dimension tables, and the measure columns contain the data that is being analyzed.

Dimensions


  • Dimension is a collection of reference information about a measurable event. 
  • Dimensions categorize and describe data warehouse facts and measures in ways that support meaningful answers to business questions.  They form the very core of dimensional modeling.  

A data warehouse organizes descriptive attributes as columns in dimension tables. 

A dimension table has a primary key column that uniquely identifies each dimension record (row).  The dimension table is associated with a fact table using this key.  Data in the fact table can be filtered and grouped (“sliced and diced”) by various combinations of attributes. 


Galaxy schema


  • Is a combination of both the star schema and the snowflake schema
  • Has many fact tables and some common dimension tables
  • Can also be viewed as a combination of many data marts
  • It is also known as a fact constellation schema 

Star schema



Star schema is the simplest form of a dimensional model, in which data is organized into facts and dimensions. 

FACT
A fact is an event that is counted or measured, such as a sale or login. 
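A minimal sketch of a star schema and a "slice and dice" query over it; the table and column names here are hypothetical, chosen only to illustrate the fact/dimension split:

```python
import sqlite3

# Star schema sketch: fact_sales holds foreign keys plus a measure (amount);
# dim_product and dim_date describe the ways the facts can be analyzed.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (product_key int PRIMARY KEY, product_name text, category text);
CREATE TABLE dim_date (date_key int PRIMARY KEY, full_date text, year int);
CREATE TABLE fact_sales (product_key int, date_key int, amount int);
INSERT INTO dim_product VALUES (1,'Widget','Hardware'),(2,'Gadget','Hardware');
INSERT INTO dim_date VALUES (20190101,'2019-01-01',2019),(20190102,'2019-01-02',2019);
INSERT INTO fact_sales VALUES (1,20190101,100),(2,20190101,50),(1,20190102,75);
""")
# Slice and dice: join the fact to its dimensions, group by dimension attributes.
rows = conn.execute("""
SELECT d.year, p.category, SUM(f.amount) AS total
FROM fact_sales f
JOIN dim_product p ON f.product_key = p.product_key
JOIN dim_date d    ON f.date_key    = d.date_key
GROUP BY d.year, p.category
""").fetchall()
print(rows)
```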

Snowflake schema



  • In data warehousing, snowflaking is a form of dimensional modeling in which dimensions are stored in multiple related dimension tables.  A snowflake schema is a variation of the star schema.
  • Snowflaking is used to improve the performance of certain queries. 
  • The schema is diagrammed with each fact surrounded by its associated dimensions (as in a star schema), and those dimensions are further related to other dimensions, branching out into a snowflake pattern.

Tuesday, January 15, 2019

SQL question challenge (Delete duplicates)



We will be using the dup_employees table.
Delete duplicate rows in SQL
SQL Script to create dup_employees table

link to github

Create table dup_employees
(
     id int,
     first_name nvarchar(50),
     last_name nvarchar(50),
     gender nvarchar(50),
     salary int
);

Insert into dup_employees values (1, 'Mark', 'Hastings', 'Male', 60000);
Insert into dup_employees values (1, 'Mark', 'Hastings', 'Male', 60000);
Insert into dup_employees values (1, 'Mark', 'Hastings', 'Male', 60000);
Insert into dup_employees values (2, 'Mary', 'Lambeth', 'Female', 30000);
Insert into dup_employees values (2, 'Mary', 'Lambeth', 'Female', 30000);
Insert into dup_employees values (3, 'Ben', 'Hoskins', 'Male', 70000);
Insert into dup_employees values (3, 'Ben', 'Hoskins', 'Male', 70000);
Insert into dup_employees values (3, 'Ben', 'Hoskins', 'Male', 70000);
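One common solution, sketched here in SQLite: keep the first physical row of each duplicate group and delete the rest. SQL Server answers usually do this with a CTE over ROW_NUMBER(); SQLite's implicit rowid achieves the same effect more directly.

```python
import sqlite3

# Delete duplicates, keeping one row per id: any rowid that is not the
# minimum rowid of its group is a duplicate.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dup_employees (id int, first_name text, last_name text, gender text, salary int);
INSERT INTO dup_employees VALUES
 (1,'Mark','Hastings','Male',60000),(1,'Mark','Hastings','Male',60000),
 (1,'Mark','Hastings','Male',60000),(2,'Mary','Lambeth','Female',30000),
 (2,'Mary','Lambeth','Female',30000),(3,'Ben','Hoskins','Male',70000),
 (3,'Ben','Hoskins','Male',70000),(3,'Ben','Hoskins','Male',70000);
""")
conn.execute("""
DELETE FROM dup_employees
WHERE rowid NOT IN (SELECT MIN(rowid) FROM dup_employees GROUP BY id)
""")
remaining = conn.execute("SELECT id, first_name FROM dup_employees ORDER BY id").fetchall()
print(remaining)   # one row left per id
```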

Wednesday, January 9, 2019

SQL question challenge (Messages)


Link to github

Table: date | u1 | u2 | n_msg
n_msg: the number of messages between one unique user pair on a given day.

DDL Scripts
create table messages(
  date varchar(80),
  u1 varchar(80),
  u2 varchar(80),
  n_msg integer
 );
 
 insert into messages values ('10/01/2018','user1','user2',4);
 insert into messages values ('10/01/2018','user2','user1',4);
 insert into messages values ('10/01/2018','user1','user4',2);
 insert into messages values ('10/02/2018','user1','user2',2);
 insert into messages values ('10/02/2018','user2','user1',1);
 insert into messages values ('10/02/2018','user3','user4',10);
 insert into messages values ('10/02/2018','user3','user2',14);
 insert into messages values ('10/02/2018','user1','user2',14);
 insert into messages values ('10/03/2018','user4','user1',4);
 insert into messages values ('10/03/2018','user4','user5',25);
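A typical question on this table: total messages per unordered user pair per day, since ('user1','user2') and ('user2','user1') are the same conversation. A sketch using SQLite's two-argument MIN/MAX to normalize the pair before grouping:

```python
import sqlite3

# Normalize each pair so the smaller user id always comes first, then sum.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE messages (date varchar(80), u1 varchar(80), u2 varchar(80), n_msg integer);
INSERT INTO messages VALUES
 ('10/01/2018','user1','user2',4),('10/01/2018','user2','user1',4),
 ('10/01/2018','user1','user4',2),('10/02/2018','user1','user2',2),
 ('10/02/2018','user2','user1',1),('10/02/2018','user3','user4',10),
 ('10/02/2018','user3','user2',14),('10/02/2018','user1','user2',14),
 ('10/03/2018','user4','user1',4),('10/03/2018','user4','user5',25);
""")
rows = conn.execute("""
SELECT date, MIN(u1, u2) AS a, MAX(u1, u2) AS b, SUM(n_msg) AS total
FROM messages
GROUP BY date, a, b
ORDER BY date, a, b
""").fetchall()
for row in rows:
    print(row)
```

In databases without two-argument MIN/MAX (e.g., SQL Server), the same normalization is usually written with CASE expressions.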

SQL question challenge (Friends)

Link to Github

DDL Scripts


create table table1(time date,user_id int,app varchar(40),event varchar(40));

 insert into table1 values('2018-10-01',1,'facebook','processed');
 insert into table1 values('2018-10-01',1,'facebook','sent');
 insert into table1 values('2018-10-01',1,'facebook','accepted');
 insert into table1 values('2018-10-01',2,'facebook','sent');
 insert into table1 values('2018-10-02',2,'facebook','accepted');
 insert into table1 values('2018-10-01',3,'facebook','sent');

 select * from table1;
 +------------+---------+----------+-----------+
 | time       | user_id | app      | event     |
 +------------+---------+----------+-----------+
 | 2018-10-01 |       1 | facebook | processed |
 | 2018-10-01 |       1 | facebook | sent      |
 | 2018-10-01 |       1 | facebook | accepted  |
 | 2018-10-01 |       2 | facebook | sent      |
 | 2018-10-02 |       2 | facebook | accepted  |
 | 2018-10-01 |       3 | facebook | sent      |
 +------------+---------+----------+-----------+
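One natural question for this table: what fraction of friend requests that were sent were eventually accepted? A sketch of one way to compute it (SQLite boolean expressions evaluate to 0/1, so they can be summed directly):

```python
import sqlite3

# Acceptance rate = accepted events / sent events.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE table1 (time date, user_id int, app varchar(40), event varchar(40));
INSERT INTO table1 VALUES
 ('2018-10-01',1,'facebook','processed'),('2018-10-01',1,'facebook','sent'),
 ('2018-10-01',1,'facebook','accepted'),('2018-10-01',2,'facebook','sent'),
 ('2018-10-02',2,'facebook','accepted'),('2018-10-01',3,'facebook','sent');
""")
rate = conn.execute("""
SELECT CAST(SUM(event = 'accepted') AS REAL) / SUM(event = 'sent')
FROM table1
""").fetchone()[0]
print(rate)   # 2 accepted out of 3 sent
```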

SQL question challenge (Continents)


Link to github

Table with the columns: continents: continent, country, population
CREATE TABLE continents (
   continent          varchar(80),
   country            varchar(80),
   population         int
  );

  INSERT INTO continents VALUES ('Asia','China', 100);
  INSERT INTO continents VALUES ('Asia','India', 100);
  INSERT INTO continents VALUES ('Africa','South Africa', 50);
  INSERT INTO continents VALUES ('Africa','Egypt', 20);
  INSERT INTO continents VALUES ('North America','USA', 50);
  INSERT INTO continents VALUES ('North America','Canada', 50);
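A sketch of a basic aggregate over this table: total population per continent (the figures are the toy values above, not real populations):

```python
import sqlite3

# Group the continents table by continent and sum the population column.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE continents (continent varchar(80), country varchar(80), population int);
INSERT INTO continents VALUES
 ('Asia','China',100),('Asia','India',100),
 ('Africa','South Africa',50),('Africa','Egypt',20),
 ('North America','USA',50),('North America','Canada',50);
""")
rows = conn.execute("""
SELECT continent, SUM(population) AS total_population
FROM continents
GROUP BY continent
ORDER BY continent
""").fetchall()
print(rows)
```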

SQL question challenge (Contents)


Link to github

  Table: content_id | content_type (comment/post) | target_id
           If it is a comment, target_id is the id of the user who posted it.
           If it is a post, target_id is NULL.

DDL Scripts
CREATE TABLE contents(
  content_id int,
  content_type varchar(80),
  target_id varchar(80));

  INSERT INTO contents VALUES (1,'comment','u1');
  INSERT INTO contents VALUES (2,'post',NULL);
  INSERT INTO contents VALUES (3,'comment','u2');
  INSERT INTO contents VALUES (4,'post',NULL);
  INSERT INTO contents VALUES (5,'comment','u1');
  INSERT INTO contents VALUES (6,'comment','u1');
  INSERT INTO contents VALUES (7,'comment','u3');
  INSERT INTO contents VALUES (8,'post',NULL);
  INSERT INTO contents VALUES (9,'video','u1');
  INSERT INTO contents VALUES (10,'video','u3');
  INSERT INTO contents VALUES (11,'post',NULL);
  INSERT INTO contents VALUES (12,'photo','u1');
  INSERT INTO contents VALUES (13,'photo','u3');
  INSERT INTO contents VALUES (14,'photo','u1');
  INSERT INTO contents VALUES (15,'article','u3');
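One question this table supports: how many comments has each user posted? A sketch that filters to content_type = 'comment' before grouping:

```python
import sqlite3

# Count comments per user, busiest commenter first.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE contents (content_id int, content_type varchar(80), target_id varchar(80));
INSERT INTO contents VALUES
 (1,'comment','u1'),(2,'post',NULL),(3,'comment','u2'),(4,'post',NULL),
 (5,'comment','u1'),(6,'comment','u1'),(7,'comment','u3'),(8,'post',NULL),
 (9,'video','u1'),(10,'video','u3'),(11,'post',NULL),(12,'photo','u1'),
 (13,'photo','u3'),(14,'photo','u1'),(15,'article','u3');
""")
rows = conn.execute("""
SELECT target_id, COUNT(*) AS n_comments
FROM contents
WHERE content_type = 'comment'
GROUP BY target_id
ORDER BY n_comments DESC, target_id
""").fetchall()
print(rows)
```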

SQL question challenge (Companies)


Link to github

DDL Scripts
CREATE TABLE companies (
   member_id            int,
   company_name       varchar(80),
   year_start      int
  );

  INSERT INTO companies VALUES (1,'Google', 1990);
  INSERT INTO companies VALUES (1,'Microsoft', 2000);
  INSERT INTO companies VALUES (2,'Microsoft', 2000);
  INSERT INTO companies VALUES (2,'Google', 2001);
  INSERT INTO companies VALUES (3,'Microsoft', 1997);
  INSERT INTO companies VALUES (3,'Google', 1998);
  INSERT INTO companies VALUES (4,'Microsoft', 1997);
  INSERT INTO companies VALUES (4,'LinkedIn', 1998);
  INSERT INTO companies VALUES (4,'Google', 2000);
  +-----------+--------------+------------+
  | member_id | company_name | year_start |
  +-----------+--------------+------------+
  |         1 | Google       |       1990 |
  |         1 | Microsoft    |       2000 |
  |         2 | Microsoft    |       2000 |
  |         2 | Google       |       2001 |
  |         3 | Microsoft    |       1997 |
  |         3 | Google       |       1998 |
  |         4 | Microsoft    |       1997 |
  |         4 | LinkedIn     |       1998 |
  |         4 | Google       |       2000 |
  +-----------+--------------+------------+
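A question this table is commonly used for: which members worked at Microsoft before joining Google? A self-join sketch comparing start years:

```python
import sqlite3

# Join the table to itself: one alias for the Microsoft stint, one for Google,
# keeping members whose Microsoft start year precedes their Google start year.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE companies (member_id int, company_name varchar(80), year_start int);
INSERT INTO companies VALUES
 (1,'Google',1990),(1,'Microsoft',2000),(2,'Microsoft',2000),(2,'Google',2001),
 (3,'Microsoft',1997),(3,'Google',1998),(4,'Microsoft',1997),(4,'LinkedIn',1998),
 (4,'Google',2000);
""")
rows = conn.execute("""
SELECT m.member_id
FROM companies m
JOIN companies g ON m.member_id = g.member_id
WHERE m.company_name = 'Microsoft' AND g.company_name = 'Google'
  AND m.year_start < g.year_start
ORDER BY m.member_id
""").fetchall()
print(rows)   # member 1 went Google -> Microsoft, so is excluded
```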

SQL question challenge (Comments)

Link to Github

DDL Scripts
CREATE TABLE comments (
  name            varchar(80),
  posts           varchar(80),
  comments         int
 );

 INSERT INTO comments VALUES ('u1', 'page1', '90');
 INSERT INTO comments VALUES ('u1', 'page2', '50');
 INSERT INTO comments VALUES ('u1', 'page3', '40');
 INSERT INTO comments VALUES ('u2', 'page2', '55');
 INSERT INTO comments VALUES ('u2', 'page4', '45');
 INSERT INTO comments VALUES ('u4', 'page4', '30');
 INSERT INTO comments VALUES ('u4', 'page3', '40');
 INSERT INTO comments VALUES ('u3', 'page2', '100');

 Comments I
 +------+-------+----------+
 | name | posts | comments |
 +------+-------+----------+
 | u1   | page1 |       90 |
 | u1   | page2 |       50 |
 | u1   | page3 |       40 |
 | u2   | page2 |       55 |
 | u2   | page4 |       45 |
 | u4   | page4 |       30 |
 | u4   | page3 |       40 |
 | u3   | page2 |      100 |
 +------+-------+----------+
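A sketch of one aggregate this table invites: total comments per user across all their posts:

```python
import sqlite3

# Sum the comments column per user.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE comments (name varchar(80), posts varchar(80), comments int);
INSERT INTO comments VALUES
 ('u1','page1',90),('u1','page2',50),('u1','page3',40),
 ('u2','page2',55),('u2','page4',45),('u4','page4',30),
 ('u4','page3',40),('u3','page2',100);
""")
rows = conn.execute("""
SELECT name, SUM(comments) AS total
FROM comments
GROUP BY name
ORDER BY name
""").fetchall()
print(rows)
```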

SQL question challenge (Article views)

Link to github

DDL Scripts
CREATE TABLE article_views (
   date            timestamp,
   viewer_id       int,
   article_id      int,
   author_id       int
  );
  
  INSERT INTO article_views VALUES ('2017-08-01', 123, 456, 789);
  INSERT INTO article_views VALUES ('2017-08-02', 432, 543, 654);
  INSERT INTO article_views VALUES ('2017-08-01', 789, 456, 789);
  INSERT INTO article_views VALUES ('2017-08-03', 567, 780, 432);
  INSERT INTO article_views VALUES ('2017-08-01', 789, 457, 789);
     
  +------------+-----------+------------+-----------+
  | date       | viewer_id | article_id | author_id |
  +------------+-----------+------------+-----------+
  | 2017-08-01 |       123 |        456 |       789 |
  | 2017-08-02 |       432 |        543 |       654 |
  | 2017-08-01 |       789 |        456 |       789 |
  | 2017-08-03 |       567 |        780 |       432 |
  | 2017-08-01 |       789 |        457 |       789 |
  +------------+-----------+------------+-----------+
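A typical question on this table: which authors viewed at least one of their own articles (viewer_id equals author_id)? A sketch:

```python
import sqlite3

# DISTINCT collapses authors who viewed several of their own articles.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE article_views (date timestamp, viewer_id int, article_id int, author_id int);
INSERT INTO article_views VALUES
 ('2017-08-01',123,456,789),('2017-08-02',432,543,654),
 ('2017-08-01',789,456,789),('2017-08-03',567,780,432),
 ('2017-08-01',789,457,789);
""")
rows = conn.execute("""
SELECT DISTINCT viewer_id
FROM article_views
WHERE viewer_id = author_id
ORDER BY viewer_id
""").fetchall()
print(rows)   # author 789 viewed two of their own articles, listed once
```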

SQL question challenge (Ads conversion)


Link to Github
Here we have 2 tables

adv_info: advertiser_id | ad_id | spend: the amount the advertiser pays for this ad

ad_info: ad_id | user_id | price: the amount the user spends through this ad (assume all prices in this column are > 0)

DDL Scripts


  CREATE TABLE adv_info (
  advertiser_id    int,
  ad_id    int,
  spend    float
 );

 INSERT INTO adv_info VALUES (10, 200,300);
 INSERT INTO adv_info VALUES (11, 100,1000);
 INSERT INTO adv_info VALUES (13, 400,3000);
 INSERT INTO adv_info VALUES (14, 500,5000);

 CREATE TABLE ad_info (
  ad_id    int,
  user_id    int,
  price    float
 );

 INSERT INTO ad_info VALUES (200, 4,30);
 INSERT INTO ad_info VALUES (100, 4,100);
 INSERT INTO ad_info VALUES (400, 4,300);
 INSERT INTO ad_info VALUES (200, 10,31);
 INSERT INTO ad_info VALUES (100, 10,110);
 INSERT INTO ad_info VALUES (400, 10,310);

 select * 
  from ad_info;
  +-------+---------+-------+
  | ad_id | user_id | price |
  +-------+---------+-------+
  |   200 |       4 |    30 |
  |   100 |       4 |   100 |
  |   400 |       4 |   300 |
  |   200 |      10 |    31 |
  |   100 |      10 |   110 |
  |   400 |      10 |   310 |
  +-------+---------+-------+
  
 select * 
  from adv_info;
  +---------------+-------+-------+
  | advertiser_id | ad_id | spend |
  +---------------+-------+-------+
  |            10 |   200 |   300 |
  |            11 |   100 |  1000 |
  |            13 |   400 |  3000 |
  |            14 |   500 |  5000 |
  +---------------+-------+-------+
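A sketch of the conversion question these two tables set up: for each advertiser, compare spend against the revenue their ad generated. A LEFT JOIN keeps advertisers whose ads produced no purchases at all:

```python
import sqlite3

# Revenue per advertiser: sum ad_info.price per ad, keeping unmatched ads
# (ad 500) via LEFT JOIN with revenue coalesced to 0.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE adv_info (advertiser_id int, ad_id int, spend float);
INSERT INTO adv_info VALUES (10,200,300),(11,100,1000),(13,400,3000),(14,500,5000);
CREATE TABLE ad_info (ad_id int, user_id int, price float);
INSERT INTO ad_info VALUES
 (200,4,30),(100,4,100),(400,4,300),(200,10,31),(100,10,110),(400,10,310);
""")
rows = conn.execute("""
SELECT a.advertiser_id, a.spend, COALESCE(SUM(i.price), 0) AS revenue
FROM adv_info a
LEFT JOIN ad_info i ON a.ad_id = i.ad_id
GROUP BY a.advertiser_id, a.spend
ORDER BY a.advertiser_id
""").fetchall()
for row in rows:
    print(row)   # advertiser 14 spent 5000 and converted nothing
```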