Monday, October 28, 2019

Deep Neural Net simple explanation (NN 1)

Neural net understanding: draw a line that separates the blue and red points.

matrix transpose example (DL)


Getting the transpose of a matrix is really easy in NumPy. Simply access its T attribute. There is also a transpose() function which returns the same thing, but you’ll rarely see that used anywhere because typing T is so much easier. :)
For example:
import numpy as np
m = np.array([[1,2,3,4], [5,6,7,8], [9,10,11,12]])
m
# displays the following result:
# array([[ 1,  2,  3,  4],
#        [ 5,  6,  7,  8],
#        [ 9, 10, 11, 12]])

m.T
# displays the following result:
# array([[ 1,  5,  9],
#        [ 2,  6, 10],
#        [ 3,  7, 11],
#        [ 4,  8, 12]])
NumPy does this without actually moving any data in memory - it simply changes the way it indexes the original matrix - so it’s quite efficient.
However, that also means you need to be careful with how you modify objects, because they are sharing the same data. For example, with the same matrix m from above, let's make a new variable m_t that stores m's transpose. Then look what happens if we modify a value in m_t:
m_t = m.T
m_t[3][1] = 200
m_t
# displays the following result:
# array([[ 1,   5, 9],
#        [ 2,   6, 10],
#        [ 3,   7, 11],
#        [ 4, 200, 12]])

m
# displays the following result:
# array([[ 1,  2,  3,   4],
#        [ 5,  6,  7, 200],
#        [ 9, 10, 11,  12]])

Notice how it modified both the transpose and the original matrix! That's because they share the same copy of the data. So remember to consider the transpose just as a different view of your matrix, rather than a different matrix entirely.
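If you need a transpose that does not share data with the original, one option is to make an explicit copy. A small sketch:

```python
import numpy as np

m = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])

# .copy() allocates new memory, so the two arrays no longer share data
m_t_copy = m.T.copy()
m_t_copy[3][1] = 200

print(m[1][3])         # still 8 - the original is unchanged
print(m_t_copy[3][1])  # 200
```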

Tuesday, May 7, 2019

Companies know more about you

People concerned about privacy often try to be “careful” online. They stay off social media, or if they’re on it, they post cautiously. By doing so, they think they are protecting their privacy.

But they are wrong. Because of technological advances and the sheer amount of data now available about billions of other people, discretion no longer suffices to protect your privacy. Computer algorithms and network analyses can now infer, with a sufficiently high degree of accuracy, a wide range of things about you that you may have never disclosed, including your moods, your political beliefs, your sexual orientation and your health.

There is no longer such a thing as individually “opting out” of our privacy-compromised world.

What is to be done? Designing phones and other devices to be more privacy-protected would be a start, and government regulation of the collection and flow of data would help slow things down. But this is not the complete solution. We also need to start passing laws that directly regulate the use of computational inference: What will we allow to be inferred, and under what conditions, and subject to what kinds of accountability, disclosure, controls and penalties for misuse?

Until we have good answers to these questions, you can expect others to continue to know more and more about you — no matter how discreet you may have been.

Monday, April 29, 2019

One Hot Encoding

One-hot encoding is a process by which categorical variables are converted into a numerical form that machine learning algorithms can use to make better predictions.


Let's take a dataset of food names. If another food name were added to this dataset, it would get the next categorical value (e.g., 4). As the number of unique values increases, the number of categorical values increases.
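As a minimal sketch (plain Python, with a made-up list of food names), one-hot encoding turns each category into a vector with a single 1:

```python
# Hypothetical dataset of food names (categorical values)
foods = ["apple", "chicken", "broccoli", "apple"]

# Assign each unique category an index
categories = sorted(set(foods))            # ['apple', 'broccoli', 'chicken']
index = {name: i for i, name in enumerate(categories)}

# Each food becomes a vector with a single 1 at its category's index
one_hot = [[1 if index[name] == i else 0 for i in range(len(categories))]
           for name in foods]

print(one_hot)
# [[1, 0, 0], [0, 0, 1], [0, 1, 0], [1, 0, 0]]
```

Note how the vector length grows with the number of unique categories, exactly as described above.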

Saturday, April 27, 2019

Slideshow using Notebook

The slides are created in notebooks like normal, but you'll need to designate which cells are slides and the type of slide the cell will be. In the menu bar, click View > Cell Toolbar > Slideshow to bring up the slide cell menu on each cell.

Data dimensions

  • Scalars have 0 dimensions
  • Ex: a person's height would be a scalar

Examples of scalar values: 1, 2.4, -0.3

Friday, April 26, 2019

Bag of words

The Problem with Text
A problem with modeling text is that it is messy, and techniques like machine learning algorithms prefer well-defined, fixed-length inputs and outputs.
Machine learning algorithms cannot work with raw text directly; the text must be converted into numbers. Specifically, vectors of numbers.
In language processing, the vectors x are derived from textual data, in order to reflect various linguistic properties of the text.
This is called feature extraction or feature encoding.
A popular and simple method of feature extraction with text data is called the bag-of-words model of text.
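As a minimal sketch of the bag-of-words idea (plain Python, with two made-up sentences), each document becomes a fixed-length vector of word counts over a shared vocabulary:

```python
from collections import Counter

docs = ["the cat sat on the mat", "the dog sat"]

# Build the vocabulary from every word seen across all documents
vocab = sorted({word for doc in docs for word in doc.split()})

# Each document becomes a fixed-length vector of word counts
def bag_of_words(doc):
    counts = Counter(doc.split())
    return [counts[word] for word in vocab]

vectors = [bag_of_words(doc) for doc in docs]
print(vocab)    # ['cat', 'dog', 'mat', 'on', 'sat', 'the']
print(vectors)  # [[1, 0, 1, 1, 1, 2], [0, 1, 0, 0, 1, 1]]
```

Word order is discarded (it's a "bag"), but the fixed-length vectors are exactly what the algorithms above require.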

Thursday, April 25, 2019


  • In most learning networks, error is calculated as the difference between the actual output and the predicted output.
  • The error function tells us how far we are from the solution.
  • The function that is used to compute this error is known as the loss function.
  • Different loss functions will give different errors for the same prediction, and thus can have a considerable effect on the performance of the model.
Imagine we are standing on top of a mountain (Mount Everest) and we want to descend. It is not that easy: it is cloudy, the mountain is big, and we can't see the big picture. So we would look at all the possible directions in which we can walk, and step in the one that takes us downhill the fastest.
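The mountain analogy describes gradient descent: repeatedly step in the direction that reduces the error the fastest. A minimal sketch (plain Python, minimizing the made-up loss f(w) = (w - 3)^2):

```python
# Minimize a simple loss function f(w) = (w - 3)**2 with gradient descent.
# The gradient f'(w) = 2 * (w - 3) points uphill; we step the opposite way.

def gradient(w):
    return 2 * (w - 3)

w = 0.0              # starting point (top of the "mountain")
learning_rate = 0.1  # size of each step

for _ in range(100):
    w = w - learning_rate * gradient(w)

print(round(w, 4))  # converges close to the minimum at w = 3
```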

Wednesday, April 24, 2019

Industries to be revolutionized by artificial intelligence

Artificial intelligence (AI) and machine learning (ML) have a rapidly growing presence in today’s world, with applications ranging from heavy industry to education. From streamlining operations to informing better decision making, it has become clear that this technology has the potential to truly revolutionize how the everyday world works.

According to a panel of Forbes Technology Council members, here are 13 industries that will soon be revolutionized by AI.

1. Cybersecurity

The enterprise attack surface is massive. With its power to bring complex reasoning and self-learning in an automated fashion at massive scale, AI will be a game-changer in how we improve our cyber-resilience. - Gaurav Banga, Balbix

Monday, April 22, 2019


  • They can make entirely new images that are realistic, even though those images have never been seen before
  • Most of the applications for GANs have been images
  • A GAN can take a textual description of a bird and then generate a high-resolution image of a bird matching that description.
  • These pictures have never been seen before. It is not running an image search on a database; in fact, the GAN is drawing from a probability distribution over all hypothetical images matching that description
  • We can keep running the GAN to get more images.

Tuesday, April 16, 2019

Sage Maker Services


1) Provides a Jupyter notebook instance
  • Used to explore and process data
2) API
  • This simplifies computationally difficult tasks such as training and deploying machine learning models

Machine Learning Workflow

Machine Learning Workflow consists of 3 components
  • Explore and process data
  • Modeling
  • Deployment
The first component consists of exploring and processing the data.

The first step is to retrieve the data, which includes the test and train datasets. Let's take the example of a housing dataset that contains CSV files. We need to download the data from the source.

Tuesday, March 12, 2019

core components of self driving cars

Computer vision: these are like our eyes, where we use camera images to figure out what the world around us looks like.

Sensor fusion: how we incorporate data from other sensors, like lasers and radars, to get a richer understanding of our environment.

Localization: to understand where we are in the world.

Path planning: charting a course through the world to get us where we'd like to go.

Control: how we actually turn the steering wheel and hit the throttle or the brake in order to execute the trajectory that we built during path planning.

Monday, February 25, 2019

Impact of scaling and shifting random variables

To make training the network easier, we standardize each of the continuous variables. That is, we'll shift and scale the variables such that they have zero mean and a standard deviation of 1.
The scaling factors are saved so we can go backwards when we use the network for predictions.

If we have one random variable that is constructed by adding a constant to another random variable:
  • The mean is shifted by that constant
  • The standard deviation is not affected
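A minimal sketch of standardization (plain Python with the `statistics` module, using a made-up list of values): shift by the mean and divide by the standard deviation, saving both so predictions can be un-scaled later.

```python
import statistics

values = [10.0, 12.0, 14.0, 16.0, 18.0]

# Save the scaling factors so we can go backwards later
mean = statistics.mean(values)   # 14.0
std = statistics.pstdev(values)  # population standard deviation

# Standardize: zero mean, standard deviation of 1
scaled = [(v - mean) / std for v in values]

# Going backwards (e.g., after the network makes a prediction)
restored = [s * std + mean for s in scaled]

print(mean, std)
print([round(s, 3) for s in scaled])
print(restored)  # matches the original values
```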

Categorical Variables

  • These are variables that fall into a category
  • There is no order for categorical variables
  • They are not quantitative variables

Monday, February 11, 2019

SQL question challenge (Consecutive numbers)

Write a SQL query to find all numbers that appear at least three times consecutively.
| Id | Num |
| 1  |  1  |
| 2  |  1  |
| 3  |  1  |
| 4  |  2  |
| 5  |  1  |
| 6  |  2  |
| 7  |  2  |

For example, given the above Logs table, 1 is the only number that appears at least three times consecutively.

| ConsecutiveNums |
| 1               |
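One classic solution is a three-way self-join on consecutive Ids (this assumes the Ids are consecutive with no gaps). A sketch using SQLite through Python's sqlite3 module:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Logs (Id INTEGER, Num INTEGER)")
conn.executemany("INSERT INTO Logs VALUES (?, ?)",
                 [(1, 1), (2, 1), (3, 1), (4, 2), (5, 1), (6, 2), (7, 2)])

# Join each row to the next two rows and keep Nums that match all three times
rows = conn.execute("""
    SELECT DISTINCT l1.Num AS ConsecutiveNums
    FROM Logs l1
    JOIN Logs l2 ON l2.Id = l1.Id + 1 AND l2.Num = l1.Num
    JOIN Logs l3 ON l3.Id = l1.Id + 2 AND l3.Num = l1.Num
""").fetchall()

print(rows)  # [(1,)]
```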

SQL question challenge (Cancellation rates for trips)

SQL Schema

The Trips table holds all taxi trips.

Each trip has a unique Id, while Client_Id and Driver_Id are both foreign keys to Users_Id in the Users table.

| Id | Client_Id | Driver_Id | City_Id | Status              | Request_at |
| 1  | 1         | 10        | 1       | completed           | 2013-10-01 |
| 2  | 2         | 11        | 1       | cancelled_by_driver | 2013-10-01 |
| 3  | 3         | 12        | 6       | completed           | 2013-10-01 |
| 4  | 4         | 13        | 6       | cancelled_by_client | 2013-10-01 |
| 5  | 1         | 10        | 1       | completed           | 2013-10-02 |
| 6  | 2         | 11        | 6       | completed           | 2013-10-02 |
| 7  | 3         | 12        | 6       | completed           | 2013-10-02 |
| 8  | 2         | 12        | 12      | completed           | 2013-10-03 |
| 9  | 3         | 10        | 12      | completed           | 2013-10-03 |
| 10 | 4         | 13        | 12      | cancelled_by_driver | 2013-10-03 |


SQL question challenge (candidate winners)

SQL Schema
Table: Candidate

| id  | Name    |
| 1   | A       |
| 2   | B       |
| 3   | C       |
| 4   | D       |
| 5   | E       |
Table: Vote

SQL question challenge (Customer with no orders)

SQL Challenge
Suppose that a website contains two tables, the Customers table and the Orders table. Write a SQL query to find all customers who never order anything.

Table: Customer.

| Id | Name  |   
| 1  | Joe   |
| 2  | Henry |
| 3  | Sam   |
| 4  | Max   |

Using the above tables as example, return the following:

| Customers |
| Henry     |
| Max       |
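The Orders table's columns aren't shown above; assuming it has an Id and a CustomerId (as in the classic version of this problem), one solution uses NOT IN with a subquery. A sketch using SQLite through Python's sqlite3 module:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Customers (Id INTEGER, Name TEXT)")
conn.execute("CREATE TABLE Orders (Id INTEGER, CustomerId INTEGER)")
conn.executemany("INSERT INTO Customers VALUES (?, ?)",
                 [(1, "Joe"), (2, "Henry"), (3, "Sam"), (4, "Max")])
# Hypothetical orders: Sam and Joe each placed one order
conn.executemany("INSERT INTO Orders VALUES (?, ?)", [(1, 3), (2, 1)])

# Customers whose Id never appears in Orders
rows = conn.execute("""
    SELECT Name AS Customers
    FROM Customers
    WHERE Id NOT IN (SELECT CustomerId FROM Orders)
""").fetchall()

print(rows)  # [('Henry',), ('Max',)]
```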


Monday, February 4, 2019

twitter location clustering based on tweets (Spark MLlib)

1)  Create a directory for twitter streams
 cd /usr/lib/spark
 sudo mkdir tweets
 cd tweets
 sudo mkdir data
 sudo mkdir training
 sudo chmod 777 /usr/lib/spark/tweets/

These are the two folders we will be using in this project:
data : would contain the master copy of the csv files, which we would pretend are coming from a streaming source.
training : the source used to train our machine learning algorithm

SSH to Hortonworks sandbox

1) Download the sandbox for hortonworks
2) Launch virtualbox
3) Once the sandbox is up and running, we would see a screen as shown below (it has information about the localhost URL and SSH servers)


Saturday, February 2, 2019

Movie ratings project part 3 (Analysis)

select year(from_unixtime(rating_time)) rating_year,
       count(*) as cnt
from latest_ratings 
group by year(from_unixtime(rating_time))
order by rating_year DESC;
 2018    1086549
 2017    1973721
 2016    2077152
 2015    1907576
 2014    584216
 2013    633573
 2012    792701
 2011    833774
 2010    982104
 2009    993111
 2008    1210356
 2007    1095710
 2006    1210803
 2005    1849719
 2004    1201656
 2003    1079130
 2002    910350
 2001    1239432
 2000    2033738
 1999    1231173
 1998    329704
 1997    763929
 1996    1733263
 1995    4

Thursday, January 31, 2019

Fast Export vs Fast Load (Teradata)

  • Used to export data from Teradata into flat files
  • Can generate data in report format
  • Data can be extracted from one or more tables using join
  • Deals in block export (Useful for extracting large volumes)
  • Has the ability to ship data over multiple session connections simultaneously, thereby leveraging the total connectivity available between the client platform and the database engine. In order to do this, FastExport spends more resources executing the query so as to prepare the blocks in such a way that, when they are exported over multiple sessions, they can easily be reassembled in the right order by the client without additional sorting or processing of the rows.

rank vs dense_rank vs row_number with partition

  • One of the most obvious and useful sets of window functions is the ranking functions, where rows from your result set are ranked according to a certain scheme.
  • There are three ranking functions: ROW_NUMBER(), RANK(), and DENSE_RANK()
DDL Scripts
Create table dup_employees
(
     id int,
     first_name nvarchar(50),
     last_name nvarchar(50),
     gender nvarchar(50),
     salary int
);

Insert into dup_employees values (1, 'Mark', 'Hastings', 'Male', 60000);
Insert into dup_employees values (1, 'Mark', 'Hastings', 'Male', 60000);
Insert into dup_employees values (1, 'Mark', 'Hastings', 'Male', 60000);
Insert into dup_employees values (2, 'Mary', 'Lambeth', 'Female', 30000);
Insert into dup_employees values (2, 'Mary', 'Lambeth', 'Female', 30000);
Insert into dup_employees values (3, 'Ben', 'Hoskins', 'Male', 70000);
Insert into dup_employees values (3, 'Ben', 'Hoskins', 'Male', 70000);
Insert into dup_employees values (3, 'Ben', 'Hoskins', 'Male', 70000);
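To see how the three ranking functions differ on these duplicate rows, here is a sketch using SQLite through Python's sqlite3 module (note: window functions require SQLite 3.25 or later):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE dup_employees
                (id int, first_name nvarchar(50), last_name nvarchar(50),
                 gender nvarchar(50), salary int)""")
rows = [(1, 'Mark', 'Hastings', 'Male', 60000)] * 3 + \
       [(2, 'Mary', 'Lambeth', 'Female', 30000)] * 2 + \
       [(3, 'Ben', 'Hoskins', 'Male', 70000)] * 3
conn.executemany("INSERT INTO dup_employees VALUES (?, ?, ?, ?, ?)", rows)

# ROW_NUMBER gives every row a distinct number; RANK leaves gaps after
# ties; DENSE_RANK does not leave gaps.
result = conn.execute("""
    SELECT first_name, salary,
           ROW_NUMBER() OVER (ORDER BY salary DESC) AS row_num,
           RANK()       OVER (ORDER BY salary DESC) AS rnk,
           DENSE_RANK() OVER (ORDER BY salary DESC) AS dense_rnk
    FROM dup_employees
    ORDER BY row_num
""").fetchall()

for row in result:
    print(row)
# first row:  ('Ben', 70000, 1, 1, 1)
# last row:   ('Mary', 30000, 8, 7, 3)
```

Note how after the three tied 'Ben' rows, RANK jumps to 4 while DENSE_RANK continues with 2.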

SQL Basics

  • A self join is a join in which a table is joined with itself
Ex: when we want to get the name of a manager who is also an employee, the left side of the join condition would be the first table (employee e) and the right side would be the second table (employee m).
Note: it would take the manager_id from the employee table (e) and look for the matching employee_id in the employee table (m)

Select e.employee_id, e.employee_name, m.employee_name as manager_name
from employee e
join employee m
on e.manager_id = m.employee_id
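A runnable sketch of this self-join (SQLite through Python's sqlite3 module, with a made-up employee table):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE employee
                (employee_id int, employee_name text, manager_id int)""")
conn.executemany("INSERT INTO employee VALUES (?, ?, ?)",
                 [(1, "Alice", None),   # Alice has no manager
                  (2, "Bob", 1),        # Bob reports to Alice
                  (3, "Carol", 1)])     # Carol reports to Alice

# Join the table with itself: e is the employee, m is the manager
rows = conn.execute("""
    SELECT e.employee_id, e.employee_name, m.employee_name AS manager_name
    FROM employee e
    JOIN employee m ON e.manager_id = m.employee_id
""").fetchall()

print(rows)  # [(2, 'Bob', 'Alice'), (3, 'Carol', 'Alice')]
```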

MAPR Products

MapR                          : Started in stealth mode (2009)
MapR-FS                       : A Hadoop-compatible file system
MapR-DB                       : The first in-house DB that ran on the same technology stack
Apache Drill                  : The first schema-free analytics engine
MapR Streams                  : Introduced for global event processing
Converged Data Platform       : Brands all the above products into one converged data platform.
  • The only converged data platform in the industry
  • Supports all kinds of data
  • Runs on every cloud, on premise, and on the edge
  • Has a highly available design
  • Provides the capability of a global data fabric
  • Has a global database
  • Has a global event streaming engine
  • Operates at unlimited scale
  • Supports files, tables, documents, and streams
  • Supports a Docker container data platform to make it highly available
  • The file system in MapR is different from others

Aws cloud formation

Is a service that helps you model and set up your Amazon Web Services resources so that you can spend less time managing those resources and more time focusing on the applications that run on AWS.

We can also create templates in AWS CloudFormation. We can use the designer to create a template and save it.

  • To create a CloudFormation stack we need a JSON template. This can also be created using the CloudFormation designer: when we drag a resource in, the JSON script is generated.
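As a minimal sketch of what such a JSON template looks like, here is one that declares a single hypothetical S3 bucket resource (the logical name MyExampleBucket is made up):

```json
{
  "AWSTemplateFormatVersion": "2010-09-09",
  "Description": "Minimal example template: one S3 bucket",
  "Resources": {
    "MyExampleBucket": {
      "Type": "AWS::S3::Bucket"
    }
  }
}
```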

Wednesday, January 30, 2019

Apache Accumulo

Is a robust, scalable data storage and retrieval system based on Google's Bigtable design and built on top of Apache Hadoop, ZooKeeper, and Thrift.
Its improvements on the Bigtable design are:

  • Server-side programming mechanism that can modify key/value pairs
  • Cell based access control

Tuesday, January 29, 2019

Ingestion in GCP

In the figure, the x axis is how close the option is to GCP and the y axis is the amount of data.
These are the different approaches we can take for data ingestion into GCP.

Storage Transfer Service allows you to quickly import online data into Cloud Storage. You can also set up a repeating schedule for transferring data, as well as transfer data within Cloud Storage, from one bucket to another.

Wednesday, January 23, 2019

Edge nodes

  • Edge nodes are the interface between the Hadoop cluster and the outside network. For this reason, they’re sometimes referred to as gateway nodes. Most commonly, edge nodes are used to run client applications and cluster administration tools.

Tinyurl design

We have different options to generate a tiny URL. The basic rule of thumb is that we need to convert the long URL into a tiny URL and store it in a database or cache for future retrieval.

1) Sometimes we would have restrictions on the tiny URL, ex: 41 bits
2) The allowed characters in the tiny URL could be upper- and lower-case alphanumeric characters
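One common approach (a sketch, not the only design) is to store the long URL under an auto-incrementing integer id and base62-encode that id using the allowed alphanumeric characters:

```python
import string

# Allowed characters: digits plus upper- and lower-case letters (62 total)
ALPHABET = string.digits + string.ascii_uppercase + string.ascii_lowercase

def encode(n):
    """Base62-encode a non-negative integer id into a short code."""
    if n == 0:
        return ALPHABET[0]
    chars = []
    while n > 0:
        n, r = divmod(n, 62)
        chars.append(ALPHABET[r])
    return "".join(reversed(chars))

def decode(code):
    """Decode a short code back into the integer id."""
    n = 0
    for c in code:
        n = n * 62 + ALPHABET.index(c)
    return n

print(encode(125))                # '21'
print(decode(encode(123456789)))  # 123456789
```

The short code then serves as the tiny-URL path, and the database maps the decoded id back to the long URL.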

Fact Table

  • A fact table is the central table in a star schema of a data warehouse. 
  • A fact table stores quantitative information for analysis and is often denormalized.
  • A fact table works with dimension tables. A fact table holds the data to be analyzed, and a dimension table stores data about the ways in which the data in the fact table can be analyzed. Thus, the fact table consists of two types of columns. The foreign keys column allows joins with dimension tables, and the measures columns contain the data that is being analyzed.


  • Dimension is a collection of reference information about a measurable event. 
  • Dimensions categorize and describe data warehouse facts and measures in ways that support meaningful answers to business questions.  They form the very core of dimensional modeling.  

A data warehouse organizes descriptive attributes as columns in dimension tables. 

A dimension table has a primary key column that uniquely identifies each dimension record (row).  The dimension table is associated with a fact table using this key.  Data in the fact table can be filtered and grouped (“sliced and diced”) by various combinations of attributes. 

Galaxy schema

  • Is a combination of both the star schema and the snowflake schema
  • Has many fact tables and some common dimension tables
  • Can also be referred to as a combination of many data marts
  • It is also known as a fact constellation schema

Star schema

Star schema is the simplest form of a dimensional model, in which data is organized into facts and dimensions. 

A fact is an event that is counted or measured, such as a sale or login. 

Snowflake schema

  • In data warehousing, snowflaking is a form of dimensional modeling in which dimensions are stored in multiple related dimension tables.  A snowflake schema is a variation of the star schema.
  • Snowflaking is used to improve the performance of certain queries. 
  • The schema is diagrammed with each fact surrounded by its associated dimensions (as in a star schema), and those dimensions are further related to other dimensions, branching out into a snowflake pattern.

Tuesday, January 15, 2019

SQL question challenge (Delete duplicates)

We will be using duplicate employees table.
Delete duplicate rows in sql
SQL Script to create dup_employees table

link to github

Create table dup_employees
(
     id int,
     first_name nvarchar(50),
     last_name nvarchar(50),
     gender nvarchar(50),
     salary int
);

Insert into dup_employees values (1, 'Mark', 'Hastings', 'Male', 60000);
Insert into dup_employees values (1, 'Mark', 'Hastings', 'Male', 60000);
Insert into dup_employees values (1, 'Mark', 'Hastings', 'Male', 60000);
Insert into dup_employees values (2, 'Mary', 'Lambeth', 'Female', 30000);
Insert into dup_employees values (2, 'Mary', 'Lambeth', 'Female', 30000);
Insert into dup_employees values (3, 'Ben', 'Hoskins', 'Male', 70000);
Insert into dup_employees values (3, 'Ben', 'Hoskins', 'Male', 70000);
Insert into dup_employees values (3, 'Ben', 'Hoskins', 'Male', 70000);
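One common way to do the delete itself (sketched here with SQLite through Python's sqlite3 module, using SQLite's implicit rowid to tell otherwise-identical rows apart) is to keep the first row per id and delete the rest:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE dup_employees
                (id int, first_name nvarchar(50), last_name nvarchar(50),
                 gender nvarchar(50), salary int)""")
rows = [(1, 'Mark', 'Hastings', 'Male', 60000)] * 3 + \
       [(2, 'Mary', 'Lambeth', 'Female', 30000)] * 2 + \
       [(3, 'Ben', 'Hoskins', 'Male', 70000)] * 3
conn.executemany("INSERT INTO dup_employees VALUES (?, ?, ?, ?, ?)", rows)

# Keep the lowest rowid per id, delete every other copy
conn.execute("""
    DELETE FROM dup_employees
    WHERE rowid NOT IN (SELECT MIN(rowid) FROM dup_employees GROUP BY id)
""")

print(conn.execute("SELECT * FROM dup_employees").fetchall())
# [(1, 'Mark', 'Hastings', 'Male', 60000),
#  (2, 'Mary', 'Lambeth', 'Female', 30000),
#  (3, 'Ben', 'Hoskins', 'Male', 70000)]
```

On databases without a rowid (e.g., SQL Server), the same idea is usually expressed with ROW_NUMBER() in a CTE and deleting rows whose row number is greater than 1.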