Monday, October 28, 2019

Deep Neural Net simple explanation (NN 1)

Neural net understanding: draw a line that separates the blue and red points.

matrix transpose example (DL)


Getting the transpose of a matrix is really easy in NumPy. Simply access its T attribute. There is also a transpose() function which returns the same thing, but you’ll rarely see that used anywhere because typing T is so much easier. :)
For example:
import numpy as np
m = np.array([[1,2,3,4], [5,6,7,8], [9,10,11,12]])
m
# displays the following result:
# array([[ 1,  2,  3,  4],
#        [ 5,  6,  7,  8],
#        [ 9, 10, 11, 12]])

m.T
# displays the following result:
# array([[ 1,  5,  9],
#        [ 2,  6, 10],
#        [ 3,  7, 11],
#        [ 4,  8, 12]])
NumPy does this without actually moving any data in memory - it simply changes the way it indexes the original matrix - so it’s quite efficient.
However, that also means you need to be careful with how you modify objects, because they are sharing the same data. For example, with the same matrix m from above, let's make a new variable m_t that stores m's transpose. Then look what happens if we modify a value in m_t:
m_t = m.T
m_t[3][1] = 200
m_t
# displays the following result:
# array([[ 1,   5, 9],
#        [ 2,   6, 10],
#        [ 3,   7, 11],
#        [ 4, 200, 12]])

m
# displays the following result:
# array([[ 1,  2,  3,   4],
#        [ 5,  6,  7, 200],
#        [ 9, 10, 11,  12]])

Notice how it modified both the transpose and the original matrix! That's because they share the same copy of the data. So remember to consider the transpose just as a different view of your matrix, rather than a different matrix entirely.
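If you need a transpose that does not share data with the original, one option is to make an explicit copy. A small sketch:

```python
import numpy as np

m = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])

# .copy() allocates new memory, so the two arrays no longer share data
m_t_copy = m.T.copy()
m_t_copy[3][1] = 200

print(m[1][3])         # still 8 - the original is unchanged
print(m_t_copy[3][1])  # 200
```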

Tuesday, May 7, 2019

Companies know more about you

People concerned about privacy often try to be “careful” online. They stay off social media, or if they’re on it, they post cautiously. By doing so, they think they are protecting their privacy.

But they are wrong. Because of technological advances and the sheer amount of data now available about billions of other people, discretion no longer suffices to protect your privacy. Computer algorithms and network analyses can now infer, with a sufficiently high degree of accuracy, a wide range of things about you that you may have never disclosed, including your moods, your political beliefs, your sexual orientation and your health.

There is no longer such a thing as individually “opting out” of our privacy-compromised world.

What is to be done? Designing phones and other devices to be more privacy-protected would be a start, and government regulation of the collection and flow of data would help slow things down. But this is not the complete solution. We also need to start passing laws that directly regulate the use of computational inference: What will we allow to be inferred, and under what conditions, and subject to what kinds of accountability, disclosure, controls and penalties for misuse?

Until we have good answers to these questions, you can expect others to continue to know more and more about you — no matter how discreet you may have been.

Monday, April 29, 2019

One Hot Encoding

One-hot encoding is a process by which categorical variables are converted into a numerical form that machine learning algorithms can use to make better predictions.


Let's take a dataset of food names. If another food name were added to this dataset, it would get the next categorical value (e.g., 4). As the number of unique values increases, the number of categorical values increases.
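As a minimal sketch (plain Python, with a made-up list of food names), one-hot encoding turns each category into a vector with a single 1:

```python
# Hypothetical dataset of food names (categorical values)
foods = ["apple", "chicken", "broccoli", "apple"]

# Assign each unique category an index
categories = sorted(set(foods))            # ['apple', 'broccoli', 'chicken']
index = {name: i for i, name in enumerate(categories)}

# Each food becomes a vector with a single 1 at its category's index
one_hot = [[1 if index[name] == i else 0 for i in range(len(categories))]
           for name in foods]

print(one_hot)
# [[1, 0, 0], [0, 0, 1], [0, 1, 0], [1, 0, 0]]
```

Note how the vector length grows with the number of unique categories, exactly as described above.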

Saturday, April 27, 2019

Slideshow using Notebook

The slides are created in notebooks like normal, but you'll need to designate which cells are slides and the type of slide the cell will be. In the menu bar, click View > Cell Toolbar > Slideshow to bring up the slide cell menu on each cell.

Data dimensions

  • Scalars have 0 dimensions
  • Ex: a person's height would be a scalar

Examples of scalar values: 1, 2.4, -0.3

Friday, April 26, 2019

Bag of words

The Problem with Text
A problem with modeling text is that it is messy, and techniques like machine learning algorithms prefer well-defined, fixed-length inputs and outputs.
Machine learning algorithms cannot work with raw text directly; the text must be converted into numbers. Specifically, vectors of numbers.
In language processing, the vectors x are derived from textual data, in order to reflect various linguistic properties of the text.
This is called feature extraction or feature encoding.
A popular and simple method of feature extraction with text data is called the bag-of-words model of text.
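As a minimal sketch of the bag-of-words idea (plain Python, with two made-up sentences), each document becomes a fixed-length vector of word counts over a shared vocabulary:

```python
from collections import Counter

docs = ["the cat sat on the mat", "the dog sat"]

# Build the vocabulary from every word seen across all documents
vocab = sorted({word for doc in docs for word in doc.split()})

# Each document becomes a fixed-length vector of word counts
def bag_of_words(doc):
    counts = Counter(doc.split())
    return [counts[word] for word in vocab]

vectors = [bag_of_words(doc) for doc in docs]
print(vocab)    # ['cat', 'dog', 'mat', 'on', 'sat', 'the']
print(vectors)  # [[1, 0, 1, 1, 1, 2], [0, 1, 0, 0, 1, 1]]
```

Word order is discarded (it's a "bag"), but the fixed-length vectors are exactly what the algorithms above require.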

Thursday, April 25, 2019


  • In most learning networks, error is calculated as the difference between the actual output and the predicted output.
  • The error function tells us how far we are from the solution.
  • The function that is used to compute this error is known as the loss function.
  • Different loss functions will give different errors for the same prediction, and thus can have a considerable effect on the performance of the model.
Imagine we are standing on top of a mountain (Mount Everest) and we want to descend. It is not that easy: it is cloudy, the mountain is big, and we can't see the big picture. So we would look at all the possible directions in which we can walk, and step in the one that takes us downhill the fastest.
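The mountain analogy describes gradient descent: repeatedly step in the direction that reduces the error the fastest. A minimal sketch (plain Python, minimizing the made-up loss f(w) = (w - 3)^2):

```python
# Minimize a simple loss function f(w) = (w - 3)**2 with gradient descent.
# The gradient f'(w) = 2 * (w - 3) points uphill; we step the opposite way.

def gradient(w):
    return 2 * (w - 3)

w = 0.0              # starting point (top of the "mountain")
learning_rate = 0.1  # size of each step

for _ in range(100):
    w = w - learning_rate * gradient(w)

print(round(w, 4))  # converges close to the minimum at w = 3
```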

Wednesday, April 24, 2019

Industries to be revolutionized by artificial intelligence

Artificial intelligence (AI) and machine learning (ML) have a rapidly growing presence in today’s world, with applications ranging from heavy industry to education. From streamlining operations to informing better decision making, it has become clear that this technology has the potential to truly revolutionize how the everyday world works.

According to a panel of Forbes Technology Council members, here are 13 industries that will soon be revolutionized by AI.

1. Cybersecurity

The enterprise attack surface is massive. With its power to bring complex reasoning and self-learning in an automated fashion at massive scale, AI will be a game-changer in how we improve our cyber-resilience. - Gaurav Banga, Balbix

Monday, April 22, 2019


  • They can make entirely new images that are realistic, even though those images have never been seen before
  • Most of the applications for GANs have been images
  • A GAN can take a textual description of a bird and then generate a high-resolution image of a bird matching that description.
  • These pictures have never been seen before. It is not running an image search on a database; in fact, the GAN is drawing from a probability distribution over all hypothetical images matching that description
  • We can keep running the GAN to get more images.

Tuesday, April 16, 2019

Sage Maker Services


1) Provides a Jupyter notebook instance
  • Used to explore and process data
2) API
  • This simplifies computationally difficult tasks such as training and deploying machine learning models

Machine Learning Workflow

Machine Learning Workflow consists of 3 components
  • Explore and process data
  • Modeling
  • Deployment
The first component consists of exploring and processing the data.

The first step is to retrieve the data, which includes the test and train datasets. Let's take the example of a housing dataset that contains CSV files. We need to download the data from the source.

Tuesday, March 12, 2019

core components of self driving cars

Computer vision: these are like our eyes, where we use camera images to figure out what the world around us looks like.

Sensor fusion: how we incorporate data from other sensors, like lasers and radars, to get a richer understanding of our environment.

Localization: to understand where we are in the world.

Path planning: charting a course through the world to get us where we'd like to go.

Control: how we actually turn the steering wheel and hit the throttle or the brake in order to execute the trajectory that we built during path planning.

Monday, February 25, 2019

Impact of scaling and shifting random variables

To make training the network easier, we standardize each of the continuous variables. That is, we'll shift and scale the variables such that they have zero mean and a standard deviation of 1.
The scaling factors are saved so we can go backwards when we use the network for predictions.

If we have one random variable that is constructed by adding a constant to another random variable:
  • The mean is shifted by that constant
  • The standard deviation is not affected
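A minimal sketch of standardization (plain Python with the `statistics` module, using a made-up list of values): shift by the mean and divide by the standard deviation, saving both so predictions can be un-scaled later.

```python
import statistics

values = [10.0, 12.0, 14.0, 16.0, 18.0]

# Save the scaling factors so we can go backwards later
mean = statistics.mean(values)   # 14.0
std = statistics.pstdev(values)  # population standard deviation

# Standardize: zero mean, standard deviation of 1
scaled = [(v - mean) / std for v in values]

# Going backwards (e.g., after the network makes a prediction)
restored = [s * std + mean for s in scaled]

print(mean, std)
print([round(s, 3) for s in scaled])
print(restored)  # matches the original values
```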

Categorical Variables

  • These are variables that fall into a category
  • There is no order for categorical variables
  • They are not quantitative variables

Monday, February 11, 2019

SQL question challenge (Consecutive numbers)

Write a SQL query to find all numbers that appear at least three times consecutively.
| Id | Num |
| 1  |  1  |
| 2  |  1  |
| 3  |  1  |
| 4  |  2  |
| 5  |  1  |
| 6  |  2  |
| 7  |  2  |

For example, given the above Logs table, 1 is the only number that appears at least three times consecutively.

| ConsecutiveNums |
| 1               |
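One classic solution is a three-way self-join on consecutive Ids (this assumes the Ids are consecutive with no gaps). A sketch using SQLite through Python's sqlite3 module:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Logs (Id INTEGER, Num INTEGER)")
conn.executemany("INSERT INTO Logs VALUES (?, ?)",
                 [(1, 1), (2, 1), (3, 1), (4, 2), (5, 1), (6, 2), (7, 2)])

# Join each row to the next two rows and keep Nums that match all three times
rows = conn.execute("""
    SELECT DISTINCT l1.Num AS ConsecutiveNums
    FROM Logs l1
    JOIN Logs l2 ON l2.Id = l1.Id + 1 AND l2.Num = l1.Num
    JOIN Logs l3 ON l3.Id = l1.Id + 2 AND l3.Num = l1.Num
""").fetchall()

print(rows)  # [(1,)]
```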

SQL question challenge (Cancellation rates for trips)

SQL Schema

The Trips table holds all taxi trips.

Each trip has a unique Id, while Client_Id and Driver_Id are both foreign keys to Users_Id in the Users table.

| Id | Client_Id | Driver_Id | City_Id | Status              | Request_at |
| 1  | 1         | 10        | 1       | completed           | 2013-10-01 |
| 2  | 2         | 11        | 1       | cancelled_by_driver | 2013-10-01 |
| 3  | 3         | 12        | 6       | completed           | 2013-10-01 |
| 4  | 4         | 13        | 6       | cancelled_by_client | 2013-10-01 |
| 5  | 1         | 10        | 1       | completed           | 2013-10-02 |
| 6  | 2         | 11        | 6       | completed           | 2013-10-02 |
| 7  | 3         | 12        | 6       | completed           | 2013-10-02 |
| 8  | 2         | 12        | 12      | completed           | 2013-10-03 |
| 9  | 3         | 10        | 12      | completed           | 2013-10-03 |
| 10 | 4         | 13        | 12      | cancelled_by_driver | 2013-10-03 |


SQL question challenge (candidate winners)

SQL Schema
Table: Candidate

| id  | Name    |
| 1   | A       |
| 2   | B       |
| 3   | C       |
| 4   | D       |
| 5   | E       |
Table: Vote

SQL question challenge (Customer with no orders)

SQL Challenge
Suppose that a website contains two tables, the Customers table and the Orders table. Write a SQL query to find all customers who never order anything.

Table: Customer.

| Id | Name  |   
| 1  | Joe   |
| 2  | Henry |
| 3  | Sam   |
| 4  | Max   |

Using the above tables as example, return the following:

| Customers |
| Henry     |
| Max       |
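The Orders table's columns aren't shown above; assuming it has an Id and a CustomerId (as in the classic version of this problem), one solution uses NOT IN with a subquery. A sketch using SQLite through Python's sqlite3 module:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Customers (Id INTEGER, Name TEXT)")
conn.execute("CREATE TABLE Orders (Id INTEGER, CustomerId INTEGER)")
conn.executemany("INSERT INTO Customers VALUES (?, ?)",
                 [(1, "Joe"), (2, "Henry"), (3, "Sam"), (4, "Max")])
# Hypothetical orders: Sam and Joe each placed one order
conn.executemany("INSERT INTO Orders VALUES (?, ?)", [(1, 3), (2, 1)])

# Customers whose Id never appears in Orders
rows = conn.execute("""
    SELECT Name AS Customers
    FROM Customers
    WHERE Id NOT IN (SELECT CustomerId FROM Orders)
""").fetchall()

print(rows)  # [('Henry',), ('Max',)]
```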


Monday, February 4, 2019

twitter location clustering based on tweets (Spark MLlib)

1)  Create a directory for twitter streams
 cd /usr/lib/spark
 sudo mkdir tweets
 cd tweets
 sudo mkdir data
 sudo mkdir training
 sudo chmod 777 /usr/lib/spark/tweets/

These are the two folders we will be using in this project:
data : would contain the master copy of the csv files, which we would pretend are coming from a streaming source.
training : the source used to train our machine learning algorithm

SSH to Hortonworks sandbox

1) Download the sandbox for hortonworks
2) Launch virtualbox
3) Once the sandbox is up and running, we would see a screen as shown below (it has information about the localhost URL and SSH servers)


Saturday, February 2, 2019

Movie ratings project part 3 (Analysis)

select year(from_unixtime(rating_time)) rating_year,
       count(*) as cnt
from latest_ratings 
group by year(from_unixtime(rating_time))
order by rating_year DESC;
 2018    1086549
 2017    1973721
 2016    2077152
 2015    1907576
 2014    584216
 2013    633573
 2012    792701
 2011    833774
 2010    982104
 2009    993111
 2008    1210356
 2007    1095710
 2006    1210803
 2005    1849719
 2004    1201656
 2003    1079130
 2002    910350
 2001    1239432
 2000    2033738
 1999    1231173
 1998    329704
 1997    763929
 1996    1733263
 1995    4

Thursday, January 31, 2019

Fast Export vs Fast Load (Teradata)

  • Used to export data from Teradata into flat files
  • Can generate data in report format
  • Data can be extracted from one or more tables using join
  • Deals in block export (Useful for extracting large volumes)
  • Has the ability to ship data over multiple session connections simultaneously, thereby leveraging the total connectivity available between the client platform and the database engine. In order to do this, FastExport spends more resources executing the query so as to prepare the blocks in such a way that, when they are exported over multiple sessions, they can easily be reassembled in the right order by the client without additional sorting or processing of the rows.

rank vs dense_rank vs row_number with partition

  • One of the most obvious and useful sets of window functions is the ranking functions, where rows from your result set are ranked according to a certain scheme.
  • There are three ranking functions: ROW_NUMBER(), RANK(), and DENSE_RANK()
DDL Scripts
Create table dup_employees
(
     id int,
     first_name nvarchar(50),
     last_name nvarchar(50),
     gender nvarchar(50),
     salary int
);

Insert into dup_employees values (1, 'Mark', 'Hastings', 'Male', 60000);
Insert into dup_employees values (1, 'Mark', 'Hastings', 'Male', 60000);
Insert into dup_employees values (1, 'Mark', 'Hastings', 'Male', 60000);
Insert into dup_employees values (2, 'Mary', 'Lambeth', 'Female', 30000);
Insert into dup_employees values (2, 'Mary', 'Lambeth', 'Female', 30000);
Insert into dup_employees values (3, 'Ben', 'Hoskins', 'Male', 70000);
Insert into dup_employees values (3, 'Ben', 'Hoskins', 'Male', 70000);
Insert into dup_employees values (3, 'Ben', 'Hoskins', 'Male', 70000);
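To see how the three ranking functions differ on these duplicate rows, here is a sketch using SQLite through Python's sqlite3 module (note: window functions require SQLite 3.25 or later):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE dup_employees
                (id int, first_name nvarchar(50), last_name nvarchar(50),
                 gender nvarchar(50), salary int)""")
rows = [(1, 'Mark', 'Hastings', 'Male', 60000)] * 3 + \
       [(2, 'Mary', 'Lambeth', 'Female', 30000)] * 2 + \
       [(3, 'Ben', 'Hoskins', 'Male', 70000)] * 3
conn.executemany("INSERT INTO dup_employees VALUES (?, ?, ?, ?, ?)", rows)

# ROW_NUMBER gives every row a distinct number; RANK leaves gaps after
# ties; DENSE_RANK does not leave gaps.
result = conn.execute("""
    SELECT first_name, salary,
           ROW_NUMBER() OVER (ORDER BY salary DESC) AS row_num,
           RANK()       OVER (ORDER BY salary DESC) AS rnk,
           DENSE_RANK() OVER (ORDER BY salary DESC) AS dense_rnk
    FROM dup_employees
    ORDER BY row_num
""").fetchall()

for row in result:
    print(row)
# first row:  ('Ben', 70000, 1, 1, 1)
# last row:   ('Mary', 30000, 8, 7, 3)
```

Note how after the three tied 'Ben' rows, RANK jumps to 4 while DENSE_RANK continues with 2.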

SQL Basics

  • A self join is a join in which a table is joined with itself
Ex: when we want to get the name of a manager who is also an employee, the left side of the join condition would be the first table (employee e) and the right side would be the second table (employee m).
Note: it would take the manager_id from the employee table (e) and look for the matching employee_id in the employee table (m)

Select e.employee_id, e.employee_name, m.employee_name as manager_name
from employee e
join employee m
on e.manager_id = m.employee_id
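A runnable sketch of this self-join (SQLite through Python's sqlite3 module, with a made-up employee table):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE employee
                (employee_id int, employee_name text, manager_id int)""")
conn.executemany("INSERT INTO employee VALUES (?, ?, ?)",
                 [(1, "Alice", None),   # Alice has no manager
                  (2, "Bob", 1),        # Bob reports to Alice
                  (3, "Carol", 1)])     # Carol reports to Alice

# Join the table with itself: e is the employee, m is the manager
rows = conn.execute("""
    SELECT e.employee_id, e.employee_name, m.employee_name AS manager_name
    FROM employee e
    JOIN employee m ON e.manager_id = m.employee_id
""").fetchall()

print(rows)  # [(2, 'Bob', 'Alice'), (3, 'Carol', 'Alice')]
```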

MAPR Products

MapR                          : Started in stealth mode (2009)
MapR-FS                       : A Hadoop-compatible file system
MapR-DB                       : The first in-house DB that ran on the same technology stack
Apache Drill                  : The first schema-free analytics engine
MapR Streams                  : Introduced for global event processing
Converged Data Platform       : Brands all the above products into one converged data platform.
  • The only converged data platform in the industry
  • Supports all kinds of data
  • Runs on every cloud, on premise, and on the edge
  • Has a highly available design
  • Provides the capability of a global data fabric
  • Has a global database
  • Has a global event streaming engine
  • Operates at unlimited scale
  • Supports files, tables, documents, and streams
  • Supports a Docker container data platform to make it highly available
  • The file system in MapR is different from others

Aws cloud formation

Is a service that helps you model and set up your Amazon Web Services resources so that you can spend less time managing those resources and more time focusing on the applications that run on AWS.

We can also create templates in AWS CloudFormation. We can use the designer to create a template and save it.

  • To create a CloudFormation stack we need a JSON template. This can also be created using the CloudFormation designer: when we drag a resource in, the JSON script is generated.
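As a minimal sketch of what such a JSON template looks like, here is one that declares a single hypothetical S3 bucket resource (the logical name MyExampleBucket is made up):

```json
{
  "AWSTemplateFormatVersion": "2010-09-09",
  "Description": "Minimal example template: one S3 bucket",
  "Resources": {
    "MyExampleBucket": {
      "Type": "AWS::S3::Bucket"
    }
  }
}
```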

Wednesday, January 30, 2019

Apache Accumulo

Is a robust, scalable data storage and retrieval system based on Google's Bigtable design and built on top of Apache Hadoop, ZooKeeper, and Thrift.
Its improvements on the Bigtable design are:

  • Server-side programming mechanism that can modify key/value pairs
  • Cell based access control

Tuesday, January 29, 2019

Ingestion in GCP

In the figure, the x axis is how close the option is to GCP and the y axis is the amount of data.
These are the different approaches we can take for data ingestion into GCP.

Storage Transfer Service allows you to quickly import online data into Cloud Storage. You can also set up a repeating schedule for transferring data, as well as transfer data within Cloud Storage, from one bucket to another.

Wednesday, January 23, 2019

Edge nodes

  • Edge nodes are the interface between the Hadoop cluster and the outside network. For this reason, they’re sometimes referred to as gateway nodes. Most commonly, edge nodes are used to run client applications and cluster administration tools.

Tinyurl design

We have different options to generate a tiny URL. The basic rule of thumb is that we need to convert the long URL into a tiny URL and store it in a database or cache for future retrieval.

1) Sometimes we would have restrictions on the tiny URL, ex: 41 bits
2) The allowed characters in the tiny URL could be upper- and lower-case alphanumeric characters
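One common approach (a sketch, not the only design) is to store the long URL under an auto-incrementing integer id and base62-encode that id using the allowed alphanumeric characters:

```python
import string

# Allowed characters: digits plus upper- and lower-case letters (62 total)
ALPHABET = string.digits + string.ascii_uppercase + string.ascii_lowercase

def encode(n):
    """Base62-encode a non-negative integer id into a short code."""
    if n == 0:
        return ALPHABET[0]
    chars = []
    while n > 0:
        n, r = divmod(n, 62)
        chars.append(ALPHABET[r])
    return "".join(reversed(chars))

def decode(code):
    """Decode a short code back into the integer id."""
    n = 0
    for c in code:
        n = n * 62 + ALPHABET.index(c)
    return n

print(encode(125))                # '21'
print(decode(encode(123456789)))  # 123456789
```

The short code then serves as the tiny-URL path, and the database maps the decoded id back to the long URL.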

Fact Table

  • A fact table is the central table in a star schema of a data warehouse. 
  • A fact table stores quantitative information for analysis and is often denormalized.
  • A fact table works with dimension tables. A fact table holds the data to be analyzed, and a dimension table stores data about the ways in which the data in the fact table can be analyzed. Thus, the fact table consists of two types of columns. The foreign keys column allows joins with dimension tables, and the measures columns contain the data that is being analyzed.


  • Dimension is a collection of reference information about a measurable event. 
  • Dimensions categorize and describe data warehouse facts and measures in ways that support meaningful answers to business questions.  They form the very core of dimensional modeling.  

A data warehouse organizes descriptive attributes as columns in dimension tables. 

A dimension table has a primary key column that uniquely identifies each dimension record (row).  The dimension table is associated with a fact table using this key.  Data in the fact table can be filtered and grouped (“sliced and diced”) by various combinations of attributes. 

Galaxy schema

  • Is a combination of both the star schema and the snowflake schema
  • Has many fact tables and some common dimension tables
  • Can also be referred to as a combination of many data marts
  • It is also known as a fact constellation schema

Star schema

Star schema is the simplest form of a dimensional model, in which data is organized into facts and dimensions. 

A fact is an event that is counted or measured, such as a sale or login. 

Snowflake schema

  • In data warehousing, snowflaking is a form of dimensional modeling in which dimensions are stored in multiple related dimension tables.  A snowflake schema is a variation of the star schema.
  • Snowflaking is used to improve the performance of certain queries. 
  • The schema is diagrammed with each fact surrounded by its associated dimensions (as in a star schema), and those dimensions are further related to other dimensions, branching out into a snowflake pattern.

Tuesday, January 15, 2019

SQL question challenge (Delete duplicates)

We will be using duplicate employees table.
Delete duplicate rows in sql
SQL Script to create dup_employees table

link to github

Create table dup_employees
(
     id int,
     first_name nvarchar(50),
     last_name nvarchar(50),
     gender nvarchar(50),
     salary int
);

Insert into dup_employees values (1, 'Mark', 'Hastings', 'Male', 60000);
Insert into dup_employees values (1, 'Mark', 'Hastings', 'Male', 60000);
Insert into dup_employees values (1, 'Mark', 'Hastings', 'Male', 60000);
Insert into dup_employees values (2, 'Mary', 'Lambeth', 'Female', 30000);
Insert into dup_employees values (2, 'Mary', 'Lambeth', 'Female', 30000);
Insert into dup_employees values (3, 'Ben', 'Hoskins', 'Male', 70000);
Insert into dup_employees values (3, 'Ben', 'Hoskins', 'Male', 70000);
Insert into dup_employees values (3, 'Ben', 'Hoskins', 'Male', 70000);
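One common way to do the delete itself (sketched here with SQLite through Python's sqlite3 module, using SQLite's implicit rowid to tell otherwise-identical rows apart) is to keep the first row per id and delete the rest:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE dup_employees
                (id int, first_name nvarchar(50), last_name nvarchar(50),
                 gender nvarchar(50), salary int)""")
rows = [(1, 'Mark', 'Hastings', 'Male', 60000)] * 3 + \
       [(2, 'Mary', 'Lambeth', 'Female', 30000)] * 2 + \
       [(3, 'Ben', 'Hoskins', 'Male', 70000)] * 3
conn.executemany("INSERT INTO dup_employees VALUES (?, ?, ?, ?, ?)", rows)

# Keep the lowest rowid per id, delete every other copy
conn.execute("""
    DELETE FROM dup_employees
    WHERE rowid NOT IN (SELECT MIN(rowid) FROM dup_employees GROUP BY id)
""")

print(conn.execute("SELECT * FROM dup_employees").fetchall())
# [(1, 'Mark', 'Hastings', 'Male', 60000),
#  (2, 'Mary', 'Lambeth', 'Female', 30000),
#  (3, 'Ben', 'Hoskins', 'Male', 70000)]
```

On databases without a rowid (e.g., SQL Server), the same idea is usually expressed with ROW_NUMBER() in a CTE and deleting rows whose row number is greater than 1.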