Thursday, January 31, 2019

Fast Export vs Fast Load (Teradata)





**************
FAST EXPORT
**************
  • Used to export data from Teradata into flat files
  • Can generate data in report format
  • Data can be extracted from one or more tables using join
  • Deals in block export (Useful for extracting large volumes)
  • Can ship data over multiple session connections simultaneously, thereby leveraging the total connectivity available between the client platform and the database engine. To do this, FastExport spends more resources executing the query, preparing the blocks so that when they are exported over multiple sessions they can be reassembled in the right order by the client without additional sorting or processing of the rows.

rank vs dense_rank vs row_number with partition

  • One of the most obvious and useful set of window functions are ranking functions where rows from your result set are ranked according to a certain scheme. 
  • There are three ranking functions: ROW_NUMBER(), RANK(), and DENSE_RANK() 
DDL Scripts
Create table dup_employees
(
     id int,
     first_name nvarchar(50),
     last_name nvarchar(50),
     gender nvarchar(50),
     salary int
);
Insert into dup_employees values (1, 'Mark', 'Hastings', 'Male', 60000);
Insert into dup_employees values (1, 'Mark', 'Hastings', 'Male', 60000);
Insert into dup_employees values (1, 'Mark', 'Hastings', 'Male', 60000);
Insert into dup_employees values (2, 'Mary', 'Lambeth', 'Female', 30000);
Insert into dup_employees values (2, 'Mary', 'Lambeth', 'Female', 30000);
Insert into dup_employees values (3, 'Ben', 'Hoskins', 'Male', 70000);
Insert into dup_employees values (3, 'Ben', 'Hoskins', 'Male', 70000);
Insert into dup_employees values (3, 'Ben', 'Hoskins', 'Male', 70000);
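The difference between the three functions shows up when ties occur. A runnable sketch on the dup_employees data above, partitioned by gender, using SQLite (window functions need SQLite 3.25+):

```python
import sqlite3

# Compare ROW_NUMBER, RANK, and DENSE_RANK over the dup_employees sample data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dup_employees (id int, first_name text, last_name text, gender text, salary int);
INSERT INTO dup_employees VALUES
 (1,'Mark','Hastings','Male',60000),(1,'Mark','Hastings','Male',60000),
 (1,'Mark','Hastings','Male',60000),(2,'Mary','Lambeth','Female',30000),
 (2,'Mary','Lambeth','Female',30000),(3,'Ben','Hoskins','Male',70000),
 (3,'Ben','Hoskins','Male',70000),(3,'Ben','Hoskins','Male',70000);
""")
rows = conn.execute("""
SELECT first_name, salary,
       ROW_NUMBER() OVER (PARTITION BY gender ORDER BY salary DESC) AS rn,
       RANK()       OVER (PARTITION BY gender ORDER BY salary DESC) AS rnk,
       DENSE_RANK() OVER (PARTITION BY gender ORDER BY salary DESC) AS drnk
FROM dup_employees
ORDER BY gender, salary DESC
""").fetchall()
# On ties: ROW_NUMBER keeps counting (1,2,3,...), RANK leaves gaps after a tie
# (1,1,1,4,...), DENSE_RANK does not leave gaps (1,1,1,2,...).
for row in rows:
    print(row)
```

For the three Mark rows (tied at 60000 behind three Bens at 70000), ROW_NUMBER gives 4, 5, 6; RANK gives 4, 4, 4; DENSE_RANK gives 2, 2, 2.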

SQL Basics


SELF JOIN
  • A self join is a join in which a table is joined with itself
Ex: when we want to get the name of the manager, who is also an employee, the left side of the join condition is the first copy of the table (employee e) and the right side is the second copy (employee m).
Note: It takes manager_id from the employee (e) table and looks it up against employee_id in the employee (m) table

Select e.id, e.employee_name, m.employee_name as manager_name
from employee e
join employee m
on e.manager_id = m.employee_id
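A runnable sketch of the self join above, using SQLite; the sample data is made up for illustration (Alice manages Bob and Carol, Bob manages Dave):

```python
import sqlite3

# Self join: the employee table is joined to itself, aliased once as the
# employee side (e) and once as the manager side (m).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employee (employee_id int, employee_name text, manager_id int);
INSERT INTO employee VALUES
 (1,'Alice',NULL),(2,'Bob',1),(3,'Carol',1),(4,'Dave',2);
""")
rows = conn.execute("""
SELECT e.employee_id, e.employee_name, m.employee_name AS manager_name
FROM employee e
JOIN employee m ON e.manager_id = m.employee_id
ORDER BY e.employee_id
""").fetchall()
for row in rows:
    print(row)
```

Note that Alice has a NULL manager_id, so she drops out of the inner join; a LEFT JOIN would keep her with a NULL manager_name.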


MAPR Products


MapR                              :  Started in stealth mode (2009)
MapR-FS                           :  A Hadoop-compatible file system
MapR-DB                           :  The first in-house DB that ran on the same technology stack
Apache Drill                      :  The first schema-free analytics engine
MapR Streams                      :  Introduced for global event processing
Converged Data Platform           :  Brands all the above products into a single platform.
                                                The only converged data platform in the industry.
                                                Supports all kinds of data
                                                Runs on every cloud, on premises, and at the edge
                                                Highly available by design
                                                Provides a global data fabric
                                                Includes a global database
                                                Includes a global event streaming engine
                                                Operates at unlimited scale
                                                Supports files, tables, documents, and streams
                                                Supports Docker containers to keep the platform highly available
                                                The MapR file system is different from other Hadoop file systems



Aws cloud formation


Is a service that helps you model and set up your Amazon Web Services resources so that you can spend less time managing those resources and more time focusing on the applications that run on AWS

TEMPLATES
We can also create templates in AWS CloudFormation. We can use the CloudFormation Designer to create and save a template.

  •  To create a CloudFormation stack we need a JSON template. It can also be created using the CloudFormation Designer: when we drag a resource onto the canvas, the corresponding JSON is generated.
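A minimal template sketch: a stack that provisions a single S3 bucket. The bucket name here is a placeholder, not a real resource:

```json
{
  "AWSTemplateFormatVersion": "2010-09-09",
  "Description": "Minimal example: a single S3 bucket",
  "Resources": {
    "MyBucket": {
      "Type": "AWS::S3::Bucket",
      "Properties": { "BucketName": "my-example-bucket-name" }
    }
  }
}
```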

Wednesday, January 30, 2019

Apache Accumulo



 Is a robust, scalable data storage and retrieval system based on Google's Bigtable design and built on top of Apache Hadoop, ZooKeeper, and Thrift.
Its improvements on the Bigtable design are:

  • Server-side programming mechanism that can modify key/value pairs
  • Cell based access control

Tuesday, January 29, 2019

Ingestion in GCP



In the figure, the x-axis represents how close an option is to GCP and the y-axis the amount of data.
These are the different approaches we can take for data ingestion into GCP.

TRANSFER SERVICE
Storage Transfer Service allows you to quickly import online data into Cloud Storage. You can also set up a repeating schedule for transferring data, as well as transfer data within Cloud Storage, from one bucket to another.

Wednesday, January 23, 2019

Edge nodes

  • Edge nodes are the interface between the Hadoop cluster and the outside network. For this reason, they’re sometimes referred to as gateway nodes. Most commonly, edge nodes are used to run client applications and cluster administration tools.

Tinyurl design

We have different options for generating a tiny URL. The basic rule of thumb is that we need to convert the long URL into a tiny URL and store it in a database or cache for future retrieval.

1) Sometimes we have restrictions on the tiny URL, e.g., 41 bits
2) The allowed characters in the tiny URL could be uppercase and lowercase alphanumeric characters
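A common design under these constraints: store the long URL, take the auto-increment database id, and encode it in base 62 (0-9, a-z, A-Z). A sketch:

```python
import string

# Base-62 alphabet: 10 digits + 26 lowercase + 26 uppercase = 62 characters.
ALPHABET = string.digits + string.ascii_lowercase + string.ascii_uppercase

def encode_base62(n: int) -> str:
    """Encode a non-negative integer id as a short base-62 code."""
    if n == 0:
        return ALPHABET[0]
    out = []
    while n > 0:
        n, rem = divmod(n, 62)
        out.append(ALPHABET[rem])
    return "".join(reversed(out))

def decode_base62(code: str) -> int:
    """Recover the integer id from a base-62 code."""
    n = 0
    for ch in code:
        n = n * 62 + ALPHABET.index(ch)
    return n

print(encode_base62(125))   # prints "21", since 125 = 2*62 + 1
```

A 7-character code covers 62**7 (about 3.5 trillion) distinct ids, which is why short codes stay short even at large scale.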

Fact Table


  • A fact table is the central table in a star schema of a data warehouse. 
  • A fact table stores quantitative information for analysis and is often denormalized.
  • A fact table works with dimension tables. A fact table holds the data to be analyzed, and a dimension table stores data about the ways in which the data in the fact table can be analyzed. Thus, the fact table consists of two types of columns: the foreign key columns allow joins with dimension tables, and the measure columns contain the data that is being analyzed.

Dimensions


  • Dimension is a collection of reference information about a measurable event. 
  • Dimensions categorize and describe data warehouse facts and measures in ways that support meaningful answers to business questions.  They form the very core of dimensional modeling.  

A data warehouse organizes descriptive attributes as columns in dimension tables. 

A dimension table has a primary key column that uniquely identifies each dimension record (row).  The dimension table is associated with a fact table using this key.  Data in the fact table can be filtered and grouped (“sliced and diced”) by various combinations of attributes. 


Galaxy schema


  • Is a combination of both the star schema and the snowflake schema
  • Has many fact tables and some common dimension tables
  • Can also be viewed as a combination of many data marts
  • It is also known as a fact constellation schema 

Star schema



Star schema is the simplest form of a dimensional model, in which data is organized into facts and dimensions. 

FACT
A fact is an event that is counted or measured, such as a sale or login. 
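A minimal sketch of a star schema and a "slice and dice" query over it; the table and column names here are hypothetical, chosen only to illustrate the fact/dimension split:

```python
import sqlite3

# Star schema sketch: fact_sales holds foreign keys plus a measure (amount);
# dim_product and dim_date describe the ways the facts can be analyzed.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (product_key int PRIMARY KEY, product_name text, category text);
CREATE TABLE dim_date (date_key int PRIMARY KEY, full_date text, year int);
CREATE TABLE fact_sales (product_key int, date_key int, amount int);
INSERT INTO dim_product VALUES (1,'Widget','Hardware'),(2,'Gadget','Hardware');
INSERT INTO dim_date VALUES (20190101,'2019-01-01',2019),(20190102,'2019-01-02',2019);
INSERT INTO fact_sales VALUES (1,20190101,100),(2,20190101,50),(1,20190102,75);
""")
# Slice and dice: join the fact to its dimensions, group by dimension attributes.
rows = conn.execute("""
SELECT d.year, p.category, SUM(f.amount) AS total
FROM fact_sales f
JOIN dim_product p ON f.product_key = p.product_key
JOIN dim_date d    ON f.date_key    = d.date_key
GROUP BY d.year, p.category
""").fetchall()
print(rows)
```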

Snowflake schema



  • In data warehousing, snowflaking is a form of dimensional modeling in which dimensions are stored in multiple related dimension tables.  A snowflake schema is a variation of the star schema.
  • Snowflaking is used to improve the performance of certain queries. 
  • The schema is diagrammed with each fact surrounded by its associated dimensions (as in a star schema), and those dimensions are further related to other dimensions, branching out into a snowflake pattern.

Tuesday, January 15, 2019

SQL question challenge (Delete duplicates)



We will be using the dup_employees table.
Delete duplicate rows in SQL
SQL Script to create dup_employees table

link to github

Create table dup_employees
(
     id int,
     first_name nvarchar(50),
     last_name nvarchar(50),
     gender nvarchar(50),
     salary int
);

Insert into dup_employees values (1, 'Mark', 'Hastings', 'Male', 60000);
Insert into dup_employees values (1, 'Mark', 'Hastings', 'Male', 60000);
Insert into dup_employees values (1, 'Mark', 'Hastings', 'Male', 60000);
Insert into dup_employees values (2, 'Mary', 'Lambeth', 'Female', 30000);
Insert into dup_employees values (2, 'Mary', 'Lambeth', 'Female', 30000);
Insert into dup_employees values (3, 'Ben', 'Hoskins', 'Male', 70000);
Insert into dup_employees values (3, 'Ben', 'Hoskins', 'Male', 70000);
Insert into dup_employees values (3, 'Ben', 'Hoskins', 'Male', 70000);
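One common solution, sketched here in SQLite: keep the first physical row of each duplicate group and delete the rest. SQL Server answers usually do this with a CTE over ROW_NUMBER(); SQLite's implicit rowid achieves the same effect more directly.

```python
import sqlite3

# Delete duplicates, keeping one row per id: any rowid that is not the
# minimum rowid of its group is a duplicate.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dup_employees (id int, first_name text, last_name text, gender text, salary int);
INSERT INTO dup_employees VALUES
 (1,'Mark','Hastings','Male',60000),(1,'Mark','Hastings','Male',60000),
 (1,'Mark','Hastings','Male',60000),(2,'Mary','Lambeth','Female',30000),
 (2,'Mary','Lambeth','Female',30000),(3,'Ben','Hoskins','Male',70000),
 (3,'Ben','Hoskins','Male',70000),(3,'Ben','Hoskins','Male',70000);
""")
conn.execute("""
DELETE FROM dup_employees
WHERE rowid NOT IN (SELECT MIN(rowid) FROM dup_employees GROUP BY id)
""")
remaining = conn.execute("SELECT id, first_name FROM dup_employees ORDER BY id").fetchall()
print(remaining)   # one row left per id
```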

Wednesday, January 9, 2019

SQL question challenge (Messages)


Link to github

Table: date | u1 | u2 | n_msg
n_msg: the number of messages between one unique user pair on a given day.

DDL Scripts
create table messages(
  date varchar(80),
  u1 varchar(80),
  u2 varchar(80),
  n_msg integer
 );
 
 insert into messages values ('10/01/2018','user1','user2',4);
 insert into messages values ('10/01/2018','user2','user1',4);
 insert into messages values ('10/01/2018','user1','user4',2);
 insert into messages values ('10/02/2018','user1','user2',2);
 insert into messages values ('10/02/2018','user2','user1',1);
 insert into messages values ('10/02/2018','user3','user4',10);
 insert into messages values ('10/02/2018','user3','user2',14);
 insert into messages values ('10/02/2018','user1','user2',14);
 insert into messages values ('10/03/2018','user4','user1',4);
 insert into messages values ('10/03/2018','user4','user5',25);
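A typical question on this table: total messages per unordered user pair per day, since ('user1','user2') and ('user2','user1') are the same conversation. A sketch using SQLite's two-argument MIN/MAX to normalize the pair before grouping:

```python
import sqlite3

# Normalize each pair so the smaller user id always comes first, then sum.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE messages (date varchar(80), u1 varchar(80), u2 varchar(80), n_msg integer);
INSERT INTO messages VALUES
 ('10/01/2018','user1','user2',4),('10/01/2018','user2','user1',4),
 ('10/01/2018','user1','user4',2),('10/02/2018','user1','user2',2),
 ('10/02/2018','user2','user1',1),('10/02/2018','user3','user4',10),
 ('10/02/2018','user3','user2',14),('10/02/2018','user1','user2',14),
 ('10/03/2018','user4','user1',4),('10/03/2018','user4','user5',25);
""")
rows = conn.execute("""
SELECT date, MIN(u1, u2) AS a, MAX(u1, u2) AS b, SUM(n_msg) AS total
FROM messages
GROUP BY date, a, b
ORDER BY date, a, b
""").fetchall()
for row in rows:
    print(row)
```

In databases without two-argument MIN/MAX (e.g., SQL Server), the same normalization is usually written with CASE expressions.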

SQL question challenge (Friends)

Link to Github

DDL Scripts


create table table1(time date,user_id int,app varchar(40),event varchar(40));

 insert into table1 values('2018-10-01',1,'facebook','processed');
 insert into table1 values('2018-10-01',1,'facebook','sent');
 insert into table1 values('2018-10-01',1,'facebook','accepted');
 insert into table1 values('2018-10-01',2,'facebook','sent');
 insert into table1 values('2018-10-02',2,'facebook','accepted');
 insert into table1 values('2018-10-01',3,'facebook','sent');

 select * from table1;
 +------------+---------+----------+-----------+
 | time       | user_id | app      | event     |
 +------------+---------+----------+-----------+
 | 2018-10-01 |       1 | facebook | processed |
 | 2018-10-01 |       1 | facebook | sent      |
 | 2018-10-01 |       1 | facebook | accepted  |
 | 2018-10-01 |       2 | facebook | sent      |
 | 2018-10-02 |       2 | facebook | accepted  |
 | 2018-10-01 |       3 | facebook | sent      |
 +------------+---------+----------+-----------+
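One natural question for this table: what fraction of friend requests that were sent were eventually accepted? A sketch of one way to compute it (SQLite boolean expressions evaluate to 0/1, so they can be summed directly):

```python
import sqlite3

# Acceptance rate = accepted events / sent events.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE table1 (time date, user_id int, app varchar(40), event varchar(40));
INSERT INTO table1 VALUES
 ('2018-10-01',1,'facebook','processed'),('2018-10-01',1,'facebook','sent'),
 ('2018-10-01',1,'facebook','accepted'),('2018-10-01',2,'facebook','sent'),
 ('2018-10-02',2,'facebook','accepted'),('2018-10-01',3,'facebook','sent');
""")
rate = conn.execute("""
SELECT CAST(SUM(event = 'accepted') AS REAL) / SUM(event = 'sent')
FROM table1
""").fetchone()[0]
print(rate)   # 2 accepted out of 3 sent
```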

SQL question challenge (Continents)


Link to github

Table with the columns: continents: continent, country, population
CREATE TABLE continents (
   continent          varchar(80),
   country            varchar(80),
   population         int
  );

  INSERT INTO continents VALUES ('Asia','China', 100);
  INSERT INTO continents VALUES ('Asia','India', 100);
  INSERT INTO continents VALUES ('Africa','South Africa', 50);
  INSERT INTO continents VALUES ('Africa','Egypt', 20);
  INSERT INTO continents VALUES ('North America','USA', 50);
  INSERT INTO continents VALUES ('North America','Canada', 50);
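A sketch of a basic aggregate over this table: total population per continent (the figures are the toy values above, not real populations):

```python
import sqlite3

# Group the continents table by continent and sum the population column.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE continents (continent varchar(80), country varchar(80), population int);
INSERT INTO continents VALUES
 ('Asia','China',100),('Asia','India',100),
 ('Africa','South Africa',50),('Africa','Egypt',20),
 ('North America','USA',50),('North America','Canada',50);
""")
rows = conn.execute("""
SELECT continent, SUM(population) AS total_population
FROM continents
GROUP BY continent
ORDER BY continent
""").fetchall()
print(rows)
```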

SQL question challenge (Contents)


Link to github

  Table: content_id | content_type (comment/post) | target_id
           If it is a comment, target_id is the id of the user who posted it.
           If it is a post, target_id is NULL.

DDL Scripts
CREATE TABLE contents(
  content_id int,
  content_type varchar(80),
  target_id varchar(80));

  INSERT INTO contents VALUES (1,'comment','u1');
  INSERT INTO contents VALUES (2,'post',NULL);
  INSERT INTO contents VALUES (3,'comment','u2');
  INSERT INTO contents VALUES (4,'post',NULL);
  INSERT INTO contents VALUES (5,'comment','u1');
  INSERT INTO contents VALUES (6,'comment','u1');
  INSERT INTO contents VALUES (7,'comment','u3');
  INSERT INTO contents VALUES (8,'post',NULL);
  INSERT INTO contents VALUES (9,'video','u1');
  INSERT INTO contents VALUES (10,'video','u3');
  INSERT INTO contents VALUES (11,'post',NULL);
  INSERT INTO contents VALUES (12,'photo','u1');
  INSERT INTO contents VALUES (13,'photo','u3');
  INSERT INTO contents VALUES (14,'photo','u1');
  INSERT INTO contents VALUES (15,'article','u3');
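One question this table supports: how many comments has each user posted? A sketch that filters to content_type = 'comment' before grouping:

```python
import sqlite3

# Count comments per user, busiest commenter first.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE contents (content_id int, content_type varchar(80), target_id varchar(80));
INSERT INTO contents VALUES
 (1,'comment','u1'),(2,'post',NULL),(3,'comment','u2'),(4,'post',NULL),
 (5,'comment','u1'),(6,'comment','u1'),(7,'comment','u3'),(8,'post',NULL),
 (9,'video','u1'),(10,'video','u3'),(11,'post',NULL),(12,'photo','u1'),
 (13,'photo','u3'),(14,'photo','u1'),(15,'article','u3');
""")
rows = conn.execute("""
SELECT target_id, COUNT(*) AS n_comments
FROM contents
WHERE content_type = 'comment'
GROUP BY target_id
ORDER BY n_comments DESC, target_id
""").fetchall()
print(rows)
```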

SQL question challenge (Companies)


Link to github

DDL Scripts
CREATE TABLE companies (
   member_id            int,
   company_name       varchar(80),
   year_start      int
  );

  INSERT INTO companies VALUES (1,'Google', 1990);
  INSERT INTO companies VALUES (1,'Microsoft', 2000);
  INSERT INTO companies VALUES (2,'Microsoft', 2000);
  INSERT INTO companies VALUES (2,'Google', 2001);
  INSERT INTO companies VALUES (3,'Microsoft', 1997);
  INSERT INTO companies VALUES (3,'Google', 1998);
  INSERT INTO companies VALUES (4,'Microsoft', 1997);
  INSERT INTO companies VALUES (4,'LinkedIn', 1998);
  INSERT INTO companies VALUES (4,'Google', 2000);
  +-----------+--------------+------------+
  | member_id | company_name | year_start |
  +-----------+--------------+------------+
  |         1 | Google       |       1990 |
  |         1 | Microsoft    |       2000 |
  |         2 | Microsoft    |       2000 |
  |         2 | Google       |       2001 |
  |         3 | Microsoft    |       1997 |
  |         3 | Google       |       1998 |
  |         4 | Microsoft    |       1997 |
  |         4 | LinkedIn     |       1998 |
  |         4 | Google       |       2000 |
  +-----------+--------------+------------+
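A question this table is commonly used for: which members worked at Microsoft before joining Google? A self-join sketch comparing start years:

```python
import sqlite3

# Join the table to itself: one alias for the Microsoft stint, one for Google,
# keeping members whose Microsoft start year precedes their Google start year.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE companies (member_id int, company_name varchar(80), year_start int);
INSERT INTO companies VALUES
 (1,'Google',1990),(1,'Microsoft',2000),(2,'Microsoft',2000),(2,'Google',2001),
 (3,'Microsoft',1997),(3,'Google',1998),(4,'Microsoft',1997),(4,'LinkedIn',1998),
 (4,'Google',2000);
""")
rows = conn.execute("""
SELECT m.member_id
FROM companies m
JOIN companies g ON m.member_id = g.member_id
WHERE m.company_name = 'Microsoft' AND g.company_name = 'Google'
  AND m.year_start < g.year_start
ORDER BY m.member_id
""").fetchall()
print(rows)   # member 1 went Google -> Microsoft, so is excluded
```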

SQL question challenge (Comments)

Link to Github

DDL Scripts
CREATE TABLE comments (
  name            varchar(80),
  posts           varchar(80),
  comments         int
 );

 INSERT INTO comments VALUES ('u1', 'page1', '90');
 INSERT INTO comments VALUES ('u1', 'page2', '50');
 INSERT INTO comments VALUES ('u1', 'page3', '40');
 INSERT INTO comments VALUES ('u2', 'page2', '55');
 INSERT INTO comments VALUES ('u2', 'page4', '45');
 INSERT INTO comments VALUES ('u4', 'page4', '30');
 INSERT INTO comments VALUES ('u4', 'page3', '40');
 INSERT INTO comments VALUES ('u3', 'page2', '100');

 Comments I
 +------+-------+----------+
 | name | posts | comments |
 +------+-------+----------+
 | u1   | page1 |       90 |
 | u1   | page2 |       50 |
 | u1   | page3 |       40 |
 | u2   | page2 |       55 |
 | u2   | page4 |       45 |
 | u4   | page4 |       30 |
 | u4   | page3 |       40 |
 | u3   | page2 |      100 |
 +------+-------+----------+
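A sketch of one aggregate this table invites: total comments per user across all their posts:

```python
import sqlite3

# Sum the comments column per user.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE comments (name varchar(80), posts varchar(80), comments int);
INSERT INTO comments VALUES
 ('u1','page1',90),('u1','page2',50),('u1','page3',40),
 ('u2','page2',55),('u2','page4',45),('u4','page4',30),
 ('u4','page3',40),('u3','page2',100);
""")
rows = conn.execute("""
SELECT name, SUM(comments) AS total
FROM comments
GROUP BY name
ORDER BY name
""").fetchall()
print(rows)
```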

SQL question challenge (Article views)

Link to github

DDL Scripts
CREATE TABLE article_views (
   date            timestamp,
   viewer_id       int,
   article_id      int,
   author_id       int
  );
  
  INSERT INTO article_views VALUES ('2017-08-01', 123, 456, 789);
  INSERT INTO article_views VALUES ('2017-08-02', 432, 543, 654);
  INSERT INTO article_views VALUES ('2017-08-01', 789, 456, 789);
  INSERT INTO article_views VALUES ('2017-08-03', 567, 780, 432);
  INSERT INTO article_views VALUES ('2017-08-01', 789, 457, 789);
     
  +------------+-----------+------------+-----------+
  | date       | viewer_id | article_id | author_id |
  +------------+-----------+------------+-----------+
  | 2017-08-01 |       123 |        456 |       789 |
  | 2017-08-02 |       432 |        543 |       654 |
  | 2017-08-01 |       789 |        456 |       789 |
  | 2017-08-03 |       567 |        780 |       432 |
  | 2017-08-01 |       789 |        457 |       789 |
  +------------+-----------+------------+-----------+
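A typical question on this table: which authors viewed at least one of their own articles (viewer_id equals author_id)? A sketch:

```python
import sqlite3

# DISTINCT collapses authors who viewed several of their own articles.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE article_views (date timestamp, viewer_id int, article_id int, author_id int);
INSERT INTO article_views VALUES
 ('2017-08-01',123,456,789),('2017-08-02',432,543,654),
 ('2017-08-01',789,456,789),('2017-08-03',567,780,432),
 ('2017-08-01',789,457,789);
""")
rows = conn.execute("""
SELECT DISTINCT viewer_id
FROM article_views
WHERE viewer_id = author_id
ORDER BY viewer_id
""").fetchall()
print(rows)   # author 789 viewed two of their own articles, listed once
```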

SQL question challenge (Ads conversion)


Link to Github
Here we have 2 tables

adv_info: advertiser_id | ad_id | spend: the amount the advertiser pays for this ad

ad_info: ad_id | user_id | price: the amount the user spends through this ad (assume all prices in this column are > 0)

DDL Scripts


  CREATE TABLE adv_info (
  advertiser_id    int,
  ad_id    int,
  spend    float
 );

 INSERT INTO adv_info VALUES (10, 200,300);
 INSERT INTO adv_info VALUES (11, 100,1000);
 INSERT INTO adv_info VALUES (13, 400,3000);
 INSERT INTO adv_info VALUES (14, 500,5000);

 CREATE TABLE ad_info (
  ad_id    int,
  user_id    int,
  price    float
 );

 INSERT INTO ad_info VALUES (200, 4,30);
 INSERT INTO ad_info VALUES (100, 4,100);
 INSERT INTO ad_info VALUES (400, 4,300);
 INSERT INTO ad_info VALUES (200, 10,31);
 INSERT INTO ad_info VALUES (100, 10,110);
 INSERT INTO ad_info VALUES (400, 10,310);

 select * 
  from ad_info;
  +-------+---------+-------+
  | ad_id | user_id | price |
  +-------+---------+-------+
  |   200 |       4 |    30 |
  |   100 |       4 |   100 |
  |   400 |       4 |   300 |
  |   200 |      10 |    31 |
  |   100 |      10 |   110 |
  |   400 |      10 |   310 |
  +-------+---------+-------+
  
 select * 
  from adv_info;
  +---------------+-------+-------+
  | advertiser_id | ad_id | spend |
  +---------------+-------+-------+
  |            10 |   200 |   300 |
  |            11 |   100 |  1000 |
  |            13 |   400 |  3000 |
  |            14 |   500 |  5000 |
  +---------------+-------+-------+
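A sketch of the conversion question these two tables set up: for each advertiser, compare spend against the revenue their ad generated. A LEFT JOIN keeps advertisers whose ads produced no purchases at all:

```python
import sqlite3

# Revenue per advertiser: sum ad_info.price per ad, keeping unmatched ads
# (ad 500) via LEFT JOIN with revenue coalesced to 0.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE adv_info (advertiser_id int, ad_id int, spend float);
INSERT INTO adv_info VALUES (10,200,300),(11,100,1000),(13,400,3000),(14,500,5000);
CREATE TABLE ad_info (ad_id int, user_id int, price float);
INSERT INTO ad_info VALUES
 (200,4,30),(100,4,100),(400,4,300),(200,10,31),(100,10,110),(400,10,310);
""")
rows = conn.execute("""
SELECT a.advertiser_id, a.spend, COALESCE(SUM(i.price), 0) AS revenue
FROM adv_info a
LEFT JOIN ad_info i ON a.ad_id = i.ad_id
GROUP BY a.advertiser_id, a.spend
ORDER BY a.advertiser_id
""").fetchall()
for row in rows:
    print(row)   # advertiser 14 spent 5000 and converted nothing
```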