## Monday, February 25, 2019

### Impact of scaling and shifting random variables

To make training the network easier, we standardize each of the continuous variables. That is, we'll shift and scale the variables such that they have zero mean and a standard deviation of 1.
The scaling factors (the mean and standard deviation of each variable) are saved so we can reverse the transformation when we use the network for predictions.
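As a minimal sketch of the idea, the snippet below standardizes one feature and then undoes the scaling with the saved factors. The helper names `standardize` and `unstandardize`, the sample values, and the choice of the population standard deviation are assumptions made for illustration; they are not taken from the original project.

```python
import statistics


def standardize(values):
    """Return standardized values plus the (mean, std) needed to undo the scaling."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # population standard deviation (an assumption)
    scaled = [(v - mean) / std for v in values]
    return scaled, mean, std


def unstandardize(scaled, mean, std):
    """Map standardized values back to the original units."""
    return [s * std + mean for s in scaled]


# Hypothetical continuous feature; not data from the post.
temps = [10.0, 20.0, 30.0, 40.0]

scaled, mean, std = standardize(temps)
restored = unstandardize(scaled, mean, std)

print(scaled)    # values with zero mean and unit standard deviation
print(restored)  # back in the original units, up to floating-point error
```

The same saved `mean` and `std` would be applied to a network's outputs if the target variable had been standardized too.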

SHIFTING
If one random variable is constructed by adding a constant to another random variable:
• The mean shifts by that constant
• The standard deviation does not change
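A quick numerical check of the two points above, as a sketch with made-up sample values and a made-up constant `c`:

```python
import statistics

# Hypothetical sample and shift constant, chosen only for illustration.
x = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
c = 10.0

# y is x shifted by the constant c.
y = [v + c for v in x]

mean_x, mean_y = statistics.mean(x), statistics.mean(y)
std_x, std_y = statistics.pstdev(x), statistics.pstdev(y)

print(mean_x, mean_y)  # the means differ by c
print(std_x, std_y)    # the standard deviations match
```

The spread is unchanged because every deviation from the mean, `(v + c) - (mean_x + c)`, equals `v - mean_x`.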

### Categorical Variables

• These are variables that fall into a category
• There is no order for categorical variables
• They are not quantitative variables

## Monday, February 11, 2019

### SQL question challenge (Consecutive numbers)

Write a SQL query to find all numbers that appear at least three times consecutively.
```
+----+-----+
| Id | Num |
+----+-----+
| 1  |  1  |
| 2  |  1  |
| 3  |  1  |
| 4  |  2  |
| 5  |  1  |
| 6  |  2  |
| 7  |  2  |
+----+-----+
```

For example, given the above Logs table, 1 is the only number that appears at least three times consecutively.

```
+-----------------+
| ConsecutiveNums |
+-----------------+
| 1               |
+-----------------+
```
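One possible approach, sketched here rather than given as the reference answer, is to self-join the Logs table so that each row is matched with the two rows that follow it. The sketch uses Python's built-in `sqlite3` with an in-memory database as a stand-in for whatever SQL engine the challenge targets, and it assumes the `Id` values are consecutive with no gaps.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Logs (Id INTEGER PRIMARY KEY, Num INTEGER)")
conn.executemany(
    "INSERT INTO Logs (Id, Num) VALUES (?, ?)",
    [(1, 1), (2, 1), (3, 1), (4, 2), (5, 1), (6, 2), (7, 2)],
)

# Match each row with the next two rows (Id + 1 and Id + 2) holding the same Num.
# Assumes Id values increase by exactly 1 from row to row.
query = """
SELECT DISTINCT l1.Num AS ConsecutiveNums
FROM Logs l1
JOIN Logs l2 ON l2.Id = l1.Id + 1 AND l2.Num = l1.Num
JOIN Logs l3 ON l3.Id = l1.Id + 2 AND l3.Num = l1.Num
"""

consecutive_nums = [row[0] for row in conn.execute(query)]
print(consecutive_nums)  # [1]
```

If the `Id` column could contain gaps, the join condition would need a row number instead of `Id` arithmetic.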

### SQL question challenge (Cancellation rates for trips)

SQL Schema

The Trips table holds all taxi trips.

TRIPS TABLE.
Each trip has a unique Id, while Client_Id and Driver_Id are both foreign keys to the Users_Id at the
```
+----+-----------+-----------+---------+---------------------+------------+
| Id | Client_Id | Driver_Id | City_Id | Status              | Request_at |
+----+-----------+-----------+---------+---------------------+------------+
| 1  | 1         | 10        | 1       | completed           | 2013-10-01 |
| 2  | 2         | 11        | 1       | cancelled_by_driver | 2013-10-01 |
| 3  | 3         | 12        | 6       | completed           | 2013-10-01 |
| 4  | 4         | 13        | 6       | cancelled_by_client | 2013-10-01 |
| 5  | 1         | 10        | 1       | completed           | 2013-10-02 |
| 6  | 2         | 11        | 6       | completed           | 2013-10-02 |
| 7  | 3         | 12        | 6       | completed           | 2013-10-02 |
| 8  | 2         | 12        | 12      | completed           | 2013-10-03 |
| 9  | 3         | 10        | 12      | completed           | 2013-10-03 |
| 10 | 4         | 13        | 12      | cancelled_by_driver | 2013-10-03 |
+----+-----------+-----------+---------+---------------------+------------+
```

### SQL question challenge (Candidate winners)

SQL Schema
Table: Candidate

+-----+---------+
| id  | Name    |
+-----+---------+
| 1   | A       |
| 2   | B       |
| 3   | C       |
| 4   | D       |
| 5   | E       |
+-----+---------+
Table: Vote

### SQL question challenge (Customers with no orders)

SQL Challenge
Suppose that a website contains two tables, the Customers table and the Orders table. Write a SQL query to find all customers who never order anything.

Table: Customers

+----+-------+
| Id | Name  |
+----+-------+
| 1  | Joe   |
| 2  | Henry |
| 3  | Sam   |
| 4  | Max   |
+----+-------+

Using the above tables as an example, return the following:

+-----------+
| Customers |
+-----------+
| Henry     |
| Max       |
+-----------+
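A common way to express "customers with no orders" is a `LEFT JOIN` that keeps only the customers with no matching order row. The sketch below runs that idea through Python's built-in `sqlite3` as a stand-in SQL engine. The Orders table is not shown in this post, so its columns (`Id`, `CustomerId`) and its two rows are assumptions, chosen so that the result matches the expected output above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Customers (Id INTEGER PRIMARY KEY, Name TEXT)")
# Hypothetical Orders schema; the post does not show this table.
conn.execute("CREATE TABLE Orders (Id INTEGER PRIMARY KEY, CustomerId INTEGER)")

conn.executemany(
    "INSERT INTO Customers (Id, Name) VALUES (?, ?)",
    [(1, "Joe"), (2, "Henry"), (3, "Sam"), (4, "Max")],
)
# Hypothetical order rows: only Sam (3) and Joe (1) have placed orders.
conn.executemany(
    "INSERT INTO Orders (Id, CustomerId) VALUES (?, ?)",
    [(1, 3), (2, 1)],
)

# Customers without a matching order get NULL in the joined Orders columns.
query = """
SELECT c.Name AS Customers
FROM Customers c
LEFT JOIN Orders o ON o.CustomerId = c.Id
WHERE o.Id IS NULL
ORDER BY c.Id
"""

customers_without_orders = [row[0] for row in conn.execute(query)]
print(customers_without_orders)  # ['Henry', 'Max']
```

A `NOT IN` or `NOT EXISTS` subquery would give the same result; the join form is used here only because it reads as a direct translation of the question.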

DDL SCRIPTS

## Monday, February 4, 2019

### Twitter location clustering based on tweets (Spark MLlib)

1) Create a directory for Twitter streams
```
cd /usr/lib/spark
sudo mkdir tweets
cd tweets
sudo mkdir data
sudo mkdir training
sudo chmod 777 /usr/lib/spark/tweets/
```

These are the two folders we will use in this project:
data: holds the master copies of the CSV files, which we will pretend are coming from a training source.
training: the source used to train our machine learning algorithm.

### SSH to Hortonworks sandbox

2) Launch VirtualBox
3) Once the sandbox is up and running, we see a screen like the one shown below (it has information about the localhost URL and SSH servers)

HORTONWORKS SANDBOX

## Saturday, February 2, 2019

### Movie ratings project part 3 (Analysis)

***************************************************************
1) WHICH YEAR HAS THE MOST NO OF RATINGS
***************************************************************
```
select year(from_unixtime(rating_time)) rating_year,
count(*) as cnt
from latest_ratings
group by year(from_unixtime(rating_time))
order by rating_year DESC;

RATING_YEAR    CNT
2018    1086549
2017    1973721
2016    2077152
2015    1907576
2014    584216
2013    633573
2012    792701
2011    833774
2010    982104
2009    993111
2008    1210356
2007    1095710
2006    1210803
2005    1849719
2004    1201656
2003    1079130
2002    910350
2001    1239432
2000    2033738
1999    1231173
1998    329704
1997    763929
1996    1733263
1995    4
```
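The query lists counts for every year, newest first, so the answer still has to be read off the table: 2016 has the highest count, with 2,077,152 ratings. As a rough sketch of the same grouping logic with the top year picked out directly, the Python snippet below counts ratings per year from Unix timestamps. The timestamps are hypothetical, and the conversion uses UTC, whereas `from_unixtime` in the query depends on the time zone of the SQL session.

```python
from collections import Counter
from datetime import datetime, timezone

# Hypothetical Unix timestamps in seconds; not rows from latest_ratings.
rating_times = [1460000000, 1470000000, 1500000000, 1520000000]

# Equivalent of GROUP BY year(from_unixtime(rating_time)) with count(*).
ratings_per_year = Counter(
    datetime.fromtimestamp(ts, tz=timezone.utc).year for ts in rating_times
)

# Equivalent of ordering by the count, descending, and keeping the first row.
top_year, top_count = ratings_per_year.most_common(1)[0]
print(top_year, top_count)  # 2016 2
```

In the SQL itself, ordering by `cnt DESC` instead of `rating_year DESC` would put the busiest year on the first row.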