Tuesday, March 12, 2019

Core components of self-driving cars


COMPUTER VISION:
How we use camera images to figure out what the world around us looks like.

SENSOR FUSION:
How we incorporate data from other sensors like lasers and radars to get a richer understanding of our environment.

LOCALIZATION:
How we figure out where we are in the world.

PATH PLANNING:
How we chart a path through the world to get us where we'd like to go.

CONTROL:
How we actually turn the steering wheel and hit the throttle or the brake in order to execute the trajectory that we built during path planning.


Monday, February 25, 2019

Impact of scaling and shifting random variables


To make training the network easier, we standardize each of the continuous variables. That is, we'll shift and scale the variables such that they have zero mean and a standard deviation of 1.
The scaling factors are saved so we can go backwards when we use the network for predictions.

SHIFTING
If we have one random variable that is constructed by adding a constant to another random variable:
  • the mean shifts by that constant
  • the standard deviation does not change (see the sketch below)
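
In symbols, a brief sketch of the standardization and shifting rules above (writing \mu and \sigma for the mean and standard deviation of the original variable; the scaling rule is included for completeness):

\[ z = \frac{x - \mu}{\sigma} \quad\Rightarrow\quad E[z] = 0,\ \ SD(z) = 1 \]
\[ E[X + c] = E[X] + c, \qquad SD(X + c) = SD(X) \]
\[ E[aX] = a\,E[X], \qquad SD(aX) = |a|\,SD(X) \]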

Categorical Variables


  • These are variables whose values fall into categories
  • There is no inherent order among the categories
  • They are not quantitative variables

Monday, February 11, 2019

SQL question challenge (Consecutive numbers)



Write a SQL query to find all numbers that appear at least three times consecutively.
+----+-----+
| Id | Num |
+----+-----+
| 1  |  1  |
| 2  |  1  |
| 3  |  1  |
| 4  |  2  |
| 5  |  1  |
| 6  |  2  |
| 7  |  2  |
+----+-----+

For example, given the above Logs table, 1 is the only number that appears at least three times consecutively.

+-----------------+
| ConsecutiveNums |
+-----------------+
| 1               |
+-----------------+
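
One possible solution is a self-join on adjacent Ids (a sketch that assumes the table is named Logs and that the Ids increase by 1 with no gaps):

SELECT DISTINCT l1.Num AS ConsecutiveNums
FROM Logs l1
JOIN Logs l2 ON l2.Id = l1.Id + 1 AND l2.Num = l1.Num
JOIN Logs l3 ON l3.Id = l1.Id + 2 AND l3.Num = l1.Num;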

SQL question challenge (Cancellation rates for trips)



SQL Schema

The Trips table holds all taxi trips.

TRIPS TABLE.
Each trip has a unique Id, while Client_Id and Driver_Id are both foreign keys to Users_Id in the Users table.
+----+-----------+-----------+---------+---------------------+------------+
| Id | Client_Id | Driver_Id | City_Id |       Status        | Request_at |
+----+-----------+-----------+---------+---------------------+------------+
| 1  |     1     |    10     |    1    |      completed      | 2013-10-01 |
| 2  |     2     |    11     |    1    | cancelled_by_driver | 2013-10-01 |
| 3  |     3     |    12     |    6    |      completed      | 2013-10-01 |
| 4  |     4     |    13     |    6    | cancelled_by_client | 2013-10-01 |
| 5  |     1     |    10     |    1    |      completed      | 2013-10-02 |
| 6  |     2     |    11     |    6    |      completed      | 2013-10-02 |
| 7  |     3     |    12     |    6    |      completed      | 2013-10-02 |
| 8  |     2     |    12     |   12    |      completed      | 2013-10-03 |
| 9  |     3     |    10     |   12    |      completed      | 2013-10-03 |
| 10 |     4     |    13     |   12    | cancelled_by_driver | 2013-10-03 |
+----+-----------+-----------+---------+---------------------+------------+
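
A sketch of the per-day cancellation rate computed from the Trips table alone (the Users table referenced above is not shown, so no filtering on users is applied here):

SELECT Request_at AS Day,
       ROUND(AVG(CASE WHEN Status != 'completed' THEN 1.0 ELSE 0.0 END), 2) AS Cancellation_Rate
FROM Trips
GROUP BY Request_at;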

SQL question challenge (Candidate winners)



SQL Schema
Table: Candidate

+-----+---------+
| id  | Name    |
+-----+---------+
| 1   | A       |
| 2   | B       |
| 3   | C       |
| 4   | D       |
| 5   | E       |
+-----+---------+
Table: Vote
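
The contents of the Vote table are not captured here; assuming it has a CandidateId column referencing Candidate.id, a sketch of a query for the winning candidate's name could look like this:

SELECT Name
FROM Candidate
WHERE id = (SELECT CandidateId
            FROM Vote
            GROUP BY CandidateId
            ORDER BY COUNT(*) DESC
            LIMIT 1);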

SQL question challenge (Customer with no orders)

SQL Challenge
Suppose that a website contains two tables, the Customers table and the Orders table. Write a SQL query to find all customers who never order anything.

Table: Customer.

+----+-------+
| Id | Name  |   
+----+-------+
| 1  | Joe   |
| 2  | Henry |
| 3  | Sam   |
| 4  | Max   |
+----+-------+

Using the above tables as an example, return the following:

+-----------+
| Customers |
+-----------+
| Henry     |
| Max       |
+-----------+
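
A sketch of one possible solution, assuming the tables are named Customers and Orders and that Orders has a CustomerId column referencing Customers.Id (the Orders table is not shown above): a left join that keeps only customers with no matching order.

SELECT c.Name AS Customers
FROM Customers c
LEFT JOIN Orders o ON o.CustomerId = c.Id
WHERE o.CustomerId IS NULL;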

DDL SCRIPTS

Monday, February 4, 2019

Twitter location clustering based on tweets (Spark MLlib)



1)  Create a directory for twitter streams
 cd /usr/lib/spark 
 sudo mkdir tweets 
 cd tweets
 sudo mkdir data 
 sudo mkdir training
 sudo chmod  777 /usr/lib/spark/tweets/ 

These are the two folders we will be using in this project:
data: contains the master copy of the CSV files, which we will pretend are arriving from an external source.
training: the source used to train our machine learning algorithm.

SSH to Hortonworks sandbox



1) Download the Hortonworks sandbox
2) Launch VirtualBox
3) Once the sandbox is up and running, we will see a screen as shown below (it has information about the localhost URL and SSH servers)


HORTONWORKS SANDBOX

Saturday, February 2, 2019

Movie ratings project part 3 (Analysis)



***************************************************************
1) WHICH YEAR HAS THE HIGHEST NUMBER OF RATINGS
***************************************************************
select year(from_unixtime(rating_time)) rating_year,
       count(*) as cnt
from latest_ratings 
group by year(from_unixtime(rating_time))
order by rating_year DESC;
    
 RATING_YEAR    CNT
 2018    1086549
 2017    1973721
 2016    2077152
 2015    1907576
 2014    584216
 2013    633573
 2012    792701
 2011    833774
 2010    982104
 2009    993111
 2008    1210356
 2007    1095710
 2006    1210803
 2005    1849719
 2004    1201656
 2003    1079130
 2002    910350
 2001    1239432
 2000    2033738
 1999    1231173
 1998    329704
 1997    763929
 1996    1733263
 1995    4
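
To answer the question directly and return just the year with the most ratings, the same aggregation can be ordered by the count instead (a sketch against the same latest_ratings table):

select year(from_unixtime(rating_time)) as rating_year,
       count(*) as cnt
from latest_ratings
group by year(from_unixtime(rating_time))
order by cnt desc
limit 1;

With the counts above, this returns 2016.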

Thursday, January 31, 2019

Fast Export vs Fast Load (Teradata)





**************
FAST EXPORT
**************
  • Used to export data from Teradata into flat files
  • Can generate data in report format
  • Data can be extracted from one or more tables using joins
  • Deals in block-level export (useful for extracting large volumes)
  • Can ship data over multiple session connections simultaneously, thereby leveraging the total connectivity available between the client platform and the database engine. To do this, FastExport spends more resources executing the query so that the blocks, when exported over multiple sessions, can easily be reassembled in the right order by the client without additional sorting or processing of the rows.

rank vs dense_rank vs row_number with partition

  • One of the most obvious and useful sets of window functions is the ranking functions, where rows from your result set are ranked according to a certain scheme. 
  • There are three ranking functions: ROW_NUMBER(), RANK(), and DENSE_RANK() (see the sketch after the DDL below) 
DDL Scripts
Create table dup_employees
(
     id int,
     first_name nvarchar(50),
     last_name nvarchar(50),
     gender nvarchar(50),
     salary int
);
Insert into dup_employees values (1, 'Mark', 'Hastings', 'Male', 60000);
Insert into dup_employees values (1, 'Mark', 'Hastings', 'Male', 60000);
Insert into dup_employees values (1, 'Mark', 'Hastings', 'Male', 60000);
Insert into dup_employees values (2, 'Mary', 'Lambeth', 'Female', 30000);
Insert into dup_employees values (2, 'Mary', 'Lambeth', 'Female', 30000);
Insert into dup_employees values (3, 'Ben', 'Hoskins', 'Male', 70000);
Insert into dup_employees values (3, 'Ben', 'Hoskins', 'Male', 70000);
Insert into dup_employees values (3, 'Ben', 'Hoskins', 'Male', 70000);
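
A sketch of all three functions against the dup_employees table above: ROW_NUMBER() numbers every row in its partition uniquely, RANK() gives ties the same number but leaves gaps, and DENSE_RANK() gives ties the same number without gaps.

SELECT id, first_name, last_name, salary,
       ROW_NUMBER() OVER (PARTITION BY id ORDER BY salary DESC) AS row_num,
       RANK()       OVER (PARTITION BY id ORDER BY salary DESC) AS rnk,
       DENSE_RANK() OVER (PARTITION BY id ORDER BY salary DESC) AS dense_rnk
FROM dup_employees;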

SQL Basics


SELF JOIN
  • A self join is a join in which a table is joined with itself
Ex: when we want to get the manager's name (the manager is also an employee), the left side of the join condition is the first table (employee e) and the right side is the second table (employee m).
Note: it takes the manager_id from the employee (e) table and looks for the matching employee_id in the employee (m) table

Select e.employee_id, e.employee_name, m.employee_name as manager_name
from employee e
Join employee m
on e.manager_id = m.employee_id;


MapR Products


MapR                         : Started in stealth mode (2009)
MapR-FS                      : A Hadoop-compatible file system
MapR-DB                      : The first in-house DB that ran on the same technology stack
Apache Drill                 : First schema-free analytics
MapR Streams                 : Introduced for global event processing
Converged Data Platform      : Brands all of the above products together.
  • The only converged data platform in the industry
  • Supports all types of data
  • Runs on every cloud, on premises, and at the edge
  • Highly available by design
  • Provides a global data fabric
  • Includes a global database
  • Includes a global event-streaming engine
  • Operates at unlimited scale
  • Supports files, tables, documents, and streams
  • Supports a Docker container data platform to keep it highly available
  • The MapR file system is different from other file systems



AWS CloudFormation


Is a service that helps us model and set up our Amazon Web Services resources so that we can spend less time managing those resources and more time focusing on the applications that run on AWS.

TEMPLATES
We can also create templates in AWS CloudFormation. We can use the designer to create a template and save it.

  • To create a CloudFormation script we need a JSON template. This can also be created using the CloudFormation Designer, as shown below: when we drag a resource onto the canvas, the corresponding JSON is generated.

Wednesday, January 30, 2019

Apache Accumulo



Is a robust, scalable data storage and retrieval system based on Google's Bigtable design and built on top of Apache Hadoop, ZooKeeper, and Thrift.
Its improvements on the Bigtable design are:

  • Server-side programming mechanism that can modify key/value pairs
  • Cell based access control

Tuesday, January 29, 2019

Ingestion in GCP



In the figure, the x-axis represents how close the data is to GCP and the y-axis represents the amount of data.
These are the different approaches we can take for data ingestion into GCP.

TRANSFER SERVICE
Storage Transfer Service allows you to quickly import online data into Cloud Storage. You can also set up a repeating schedule for transferring data, as well as transfer data within Cloud Storage, from one bucket to another.

Wednesday, January 23, 2019

Edge nodes

  • Edge nodes are the interface between the Hadoop cluster and the outside network. For this reason, they’re sometimes referred to as gateway nodes. Most commonly, edge nodes are used to run client applications and cluster administration tools.

Tinyurl design

We have different options for generating a tiny URL. The basic rule of thumb is that we need to convert the long URL into a tiny URL and store it in a database or cache for future retrieval.

1) Sometimes we have restrictions on the tiny URL, e.g. 41 bits
2) The allowed characters in the tiny URL could be upper- and lower-case alphanumeric characters

Fact Table


  • A fact table is the central table in a star schema of a data warehouse. 
  • A fact table stores quantitative information for analysis and is often denormalized.
  • A fact table works with dimension tables. A fact table holds the data to be analyzed, and a dimension table stores data about the ways in which the data in the fact table can be analyzed. Thus, the fact table consists of two types of columns. The foreign keys column allows joins with dimension tables, and the measures columns contain the data that is being analyzed.

Dimensions


  • Dimension is a collection of reference information about a measurable event. 
  • Dimensions categorize and describe data warehouse facts and measures in ways that support meaningful answers to business questions.  They form the very core of dimensional modeling.  

A data warehouse organizes descriptive attributes as columns in dimension tables. 

A dimension table has a primary key column that uniquely identifies each dimension record (row).  The dimension table is associated with a fact table using this key.  Data in the fact table can be filtered and grouped (“sliced and diced”) by various combinations of attributes. 


Galaxy schema


  • Is a combination of both the star schema and the snowflake schema
  • Has many fact tables and some common dimension tables
  • Can also be referred to as a combination of many data marts
  • It is also known as fact constellation schema 

Star schema



Star schema is the simplest form of a dimensional model, in which data is organized into facts and dimensions. 

FACT
A fact is an event that is counted or measured, such as a sale or login. 
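
A minimal sketch of a star schema, with one fact table whose foreign keys point at two dimension tables (the table and column names here are made up for illustration):

-- dimension tables: one row per date / per product, each with a primary key
CREATE TABLE dim_date    (date_id int PRIMARY KEY, calendar_date date);
CREATE TABLE dim_product (product_id int PRIMARY KEY, product_name varchar(80));

-- fact table: foreign keys to the dimensions plus the measures being analyzed
CREATE TABLE fact_sales (
    date_id    int REFERENCES dim_date(date_id),
    product_id int REFERENCES dim_product(product_id),
    quantity   int,
    amount     decimal(10,2)
);

Queries then join fact_sales to the dimension tables and group by dimension attributes to slice and dice the measures.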

Snowflake schema



  • In data warehousing, snowflaking is a form of dimensional modeling in which dimensions are stored in multiple related dimension tables.  A snowflake schema is a variation of the star schema.
  • Snowflaking is used to improve the performance of certain queries. 
  • The schema is diagrammed with each fact surrounded by its associated dimensions (as in a star schema), and those dimensions are further related to other dimensions, branching out into a snowflake pattern.

Tuesday, January 15, 2019

SQL question challenge (Delete duplicates)



We will be using a duplicate employees table. The task is to delete the duplicate rows in SQL.
SQL script to create the dup_employees table:

link to github

Create table dup_employees
(
     id int,
     first_name nvarchar(50),
     last_name nvarchar(50),
     gender nvarchar(50),
     salary int
);

Insert into dup_employees values (1, 'Mark', 'Hastings', 'Male', 60000);
Insert into dup_employees values (1, 'Mark', 'Hastings', 'Male', 60000);
Insert into dup_employees values (1, 'Mark', 'Hastings', 'Male', 60000);
Insert into dup_employees values (2, 'Mary', 'Lambeth', 'Female', 30000);
Insert into dup_employees values (2, 'Mary', 'Lambeth', 'Female', 30000);
Insert into dup_employees values (3, 'Ben', 'Hoskins', 'Male', 70000);
Insert into dup_employees values (3, 'Ben', 'Hoskins', 'Male', 70000);
Insert into dup_employees values (3, 'Ben', 'Hoskins', 'Male', 70000);
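
One possible way to delete the duplicates while keeping a single copy of each row, sketched in SQL Server-style syntax (the nvarchar columns above suggest SQL Server, where deleting through a CTE is allowed):

WITH cte AS (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY id, first_name, last_name, gender, salary
                              ORDER BY id) AS rn
    FROM dup_employees
)
DELETE FROM cte WHERE rn > 1;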

Wednesday, January 9, 2019

SQL question challenge (Messages)


Link to github

Table: date | u1 | u2 | n_msg
n_msg: the number of messages between one unique user pair on a given day.

DDL Scripts
create table messages(
  date varchar(80),
  u1 varchar(80),
  u2 varchar(80),
  n_msg integer
 );
 
 insert into messages values ('10/01/2018','user1','user2',4);
 insert into messages values ('10/01/2018','user2','user1',4);
 insert into messages values ('10/01/2018','user1','user4',2);
 insert into messages values ('10/02/2018','user1','user2',2);
 insert into messages values ('10/02/2018','user2','user1',1);
 insert into messages values ('10/02/2018','user3','user4',10);
 insert into messages values ('10/02/2018','user3','user2',14);
 insert into messages values ('10/02/2018','user1','user2',14);
 insert into messages values ('10/03/2018','user4','user1',4);
 insert into messages values ('10/03/2018','user4','user5',25);
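
The challenge statement itself isn't recorded here. A common question for data shaped like this is the total number of messages per unordered user pair per day, treating (user1, user2) and (user2, user1) as the same pair; a sketch:

SELECT date,
       LEAST(u1, u2)    AS user_a,
       GREATEST(u1, u2) AS user_b,
       SUM(n_msg)       AS total_msg
FROM messages
GROUP BY date, LEAST(u1, u2), GREATEST(u1, u2);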

SQL question challenge (Friends)

Link to Github

DDL Scripts


create table table1(time date,user_id int,app varchar(40),event varchar(40));

 insert into table1 values('2018-10-01',1,'facebook','processed');
 insert into table1 values('2018-10-01',1,'facebook','sent');
 insert into table1 values('2018-10-01',1,'facebook','accepted');
 insert into table1 values('2018-10-01',2,'facebook','sent');
 insert into table1 values('2018-10-02',2,'facebook','accepted');
 insert into table1 values('2018-10-01',3,'facebook','sent');

 select * from table1;
 +------------+---------+----------+-----------+
 | time       | user_id | app      | event     |
 +------------+---------+----------+-----------+
 | 2018-10-01 |       1 | facebook | processed |
 | 2018-10-01 |       1 | facebook | sent      |
 | 2018-10-01 |       1 | facebook | accepted  |
 | 2018-10-01 |       2 | facebook | sent      |
 | 2018-10-02 |       2 | facebook | accepted  |
 | 2018-10-01 |       3 | facebook | sent      |
 +------------+---------+----------+-----------+
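
As with the previous challenge, the exact question isn't recorded. One common question for an event log like this is the overall acceptance rate of friend requests, i.e. accepted events divided by sent events; a sketch:

SELECT SUM(CASE WHEN event = 'accepted' THEN 1 ELSE 0 END) * 1.0
     / SUM(CASE WHEN event = 'sent' THEN 1 ELSE 0 END) AS acceptance_rate
FROM table1;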

SQL question challenge (Continents)


Link to github

Table continents, with the columns: continent, country, population
CREATE TABLE continents (
   continent          varchar(80),
   country            varchar(80),
   population         int
  );

  INSERT INTO continents VALUES ('Asia','China', 100);
  INSERT INTO continents VALUES ('Asia','India', 100);
  INSERT INTO continents VALUES ('Africa','South Africa', 50);
  INSERT INTO continents VALUES ('Africa','Egypt', 20);
  INSERT INTO continents VALUES ('North America','USA', 50);
  INSERT INTO continents VALUES ('North America','Canada', 50);
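
The challenge statement isn't recorded here either; as an illustration of working with this table, the total population per continent can be computed with a simple aggregate:

SELECT continent, SUM(population) AS total_population
FROM continents
GROUP BY continent;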

SQL question challenge (Contents)


Link to github

  Table: content_id | content_type (comment / post) | target_id
           If it is a comment, target_id is the user id of the user who posts it.
           If it is a post, then target_id is NULL.

DDL Scripts
CREATE TABLE contents(
  content_id int,
  content_type varchar(80),
  target_id varchar(80));

  INSERT INTO contents VALUES (1,'comment','u1');
  INSERT INTO contents VALUES (2,'post',NULL);
  INSERT INTO contents VALUES (3,'comment','u2');
  INSERT INTO contents VALUES (4,'post',NULL);
  INSERT INTO contents VALUES (5,'comment','u1');
  INSERT INTO contents VALUES (6,'comment','u1');
  INSERT INTO contents VALUES (7,'comment','u3');
  INSERT INTO contents VALUES (8,'post',NULL);
  INSERT INTO contents VALUES (9,'video','u1');
  INSERT INTO contents VALUES (10,'video','u3');
  INSERT INTO contents VALUES (11,'post',NULL);
  INSERT INTO contents VALUES (12,'photo','u1');
  INSERT INTO contents VALUES (13,'photo','u3');
  INSERT INTO contents VALUES (14,'photo','u1');
  INSERT INTO contents VALUES (15,'article','u3');
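
The original question for this table isn't captured here; as an illustration, counting each user's pieces of content by type only needs a group-by over the non-null target_ids:

SELECT target_id, content_type, COUNT(*) AS cnt
FROM contents
WHERE target_id IS NOT NULL
GROUP BY target_id, content_type;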

SQL question challenge (Companies)


Link to github

DDL Scripts
CREATE TABLE companies (
   member_id            int,
   company_name       varchar(80),
   year_start      int
  );

  INSERT INTO companies VALUES (1,'Google', 1990);
  INSERT INTO companies VALUES (1,'Microsoft', 2000);
  INSERT INTO companies VALUES (2,'Microsoft', 2000);
  INSERT INTO companies VALUES (2,'Google', 2001);
  INSERT INTO companies VALUES (3,'Microsoft', 1997);
  INSERT INTO companies VALUES (3,'Google', 1998);
  INSERT INTO companies VALUES (4,'Microsoft', 1997);
  INSERT INTO companies VALUES (4,'LinkedIn', 1998);
  INSERT INTO companies VALUES (4,'Google', 2000);
  +-----------+--------------+------------+
  | member_id | company_name | year_start |
  +-----------+--------------+------------+
  |         1 | Google       |       1990 |
  |         1 | Microsoft    |       2000 |
  |         2 | Microsoft    |       2000 |
  |         2 | Google       |       2001 |
  |         3 | Microsoft    |       1997 |
  |         3 | Google       |       1998 |
  |         4 | Microsoft    |       1997 |
  |         4 | LinkedIn     |       1998 |
  |         4 | Google       |       2000 |
  +-----------+--------------+------------+
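
The challenge statement isn't recorded here; a common question for this data set is finding the members who worked at Microsoft and later moved to Google, which can be sketched with a self-join:

SELECT DISTINCT ms.member_id
FROM companies ms
JOIN companies g
  ON g.member_id = ms.member_id
 AND ms.company_name = 'Microsoft'
 AND g.company_name  = 'Google'
 AND g.year_start > ms.year_start;

For the rows above this returns members 2, 3, and 4.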