Web Snippets: February 2019

Monday, February 25, 2019

Impact of scaling and shifting random variables

To make training the network easier, we standardize each of the continuous variables. That is, we'll shift and scale the variables such that they have zero mean and a standard deviation of 1.

The scaling factors (each variable's mean and standard deviation) are saved so we can reverse the transformation when we use the network for predictions.
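A minimal sketch of that standardize-and-restore round trip, using only the Python standard library. The helper names `standardize` and `unstandardize` and the sample values are hypothetical, chosen for illustration, and the sketch assumes the variable is not constant (standard deviation greater than zero).

```python
from statistics import fmean, pstdev


def standardize(values):
    """Return the scaled values plus the mean and std needed to undo the scaling."""
    mean = fmean(values)
    std = pstdev(values)  # population standard deviation; assumed to be > 0
    scaled = [(v - mean) / std for v in values]
    return scaled, mean, std


def unstandardize(scaled, mean, std):
    """Map standardized values back to the original units."""
    return [z * std + mean for z in scaled]


temps = [10.0, 12.0, 14.0, 16.0, 18.0]  # hypothetical continuous variable
scaled, mean, std = standardize(temps)
restored = unstandardize(scaled, mean, std)

print(fmean(scaled), pstdev(scaled))  # approximately 0 and 1
print(restored)                       # approximately the original values
```

Keeping `mean` and `std` next to the trained model is what makes it possible to report predictions in the original units.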

SHIFTING
If one random variable is constructed by adding a constant to another random variable:

The mean shifts by that constant
The standard deviation does not change
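The two statements can be checked numerically. The sketch below uses a made-up sample `x` and a made-up constant `k`; both are assumptions for illustration only.

```python
from statistics import fmean, pstdev

x = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # hypothetical sample
k = 10.0                                       # hypothetical constant shift
y = [v + k for v in x]                         # y = x + k

mean_x, mean_y = fmean(x), fmean(y)
std_x, std_y = pstdev(x), pstdev(y)

print(mean_x, mean_y)  # the means differ by k
print(std_x, std_y)    # the standard deviations match
```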

Categorical Variables

These are variables whose values fall into categories
There is no order for categorical variables
They are not quantitative variables

SQL question challenge (Consecutive numbers)

Write a SQL query to find all numbers that appear at least three times consecutively.

+----+-----+
| Id | Num |
+----+-----+
| 1  |  1  |
| 2  |  1  |
| 3  |  1  |
| 4  |  2  |
| 5  |  1  |
| 6  |  2  |
| 7  |  2  |
+----+-----+

For example, given the above Logs table, 1 is the only number that appears at least three times consecutively.

+-----------------+
| ConsecutiveNums |
+-----------------+
| 1               |
+-----------------+
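One possible approach is a self-join on adjacent Ids. The sketch below runs it against an in-memory SQLite database from Python; it assumes the Id values are consecutive with no gaps, which holds for the sample Logs table but is not stated in the problem.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Logs (Id INTEGER PRIMARY KEY, Num INTEGER)")
conn.executemany(
    "INSERT INTO Logs (Id, Num) VALUES (?, ?)",
    [(1, 1), (2, 1), (3, 1), (4, 2), (5, 1), (6, 2), (7, 2)],
)

# Join each row to the next two rows by Id and keep rows where all three Nums match.
query = """
SELECT DISTINCT a.Num AS ConsecutiveNums
FROM Logs a
JOIN Logs b ON b.Id = a.Id + 1 AND b.Num = a.Num
JOIN Logs c ON c.Id = a.Id + 2 AND c.Num = a.Num
"""
consecutive = [row[0] for row in conn.execute(query)]
print(consecutive)
```

If the Ids could have gaps, a window function such as LAG over the Id order would be a safer basis than `Id + 1`.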

SQL question challenge (Cancellation rates for trips)

The Trips table holds all taxi trips.

TRIPS TABLE.
Each trip has a unique Id, while Client_Id and Driver_Id are both foreign keys to the Users_Id at the

+----+-----------+-----------+---------+---------------------+------------+
| Id | Client_Id | Driver_Id | City_Id | Status              | Request_at |
+----+-----------+-----------+---------+---------------------+------------+
| 1  | 1         | 10        | 1       | completed           | 2013-10-01 |
| 2  | 2         | 11        | 1       | cancelled_by_driver | 2013-10-01 |
| 3  | 3         | 12        | 6       | completed           | 2013-10-01 |
| 4  | 4         | 13        | 6       | cancelled_by_client | 2013-10-01 |
| 5  | 1         | 10        | 1       | completed           | 2013-10-02 |
| 6  | 2         | 11        | 6       | completed           | 2013-10-02 |
| 7  | 3         | 12        | 6       | completed           | 2013-10-02 |
| 8  | 2         | 12        | 12      | completed           | 2013-10-03 |
| 9  | 3         | 10        | 12      | completed           | 2013-10-03 |
| 10 | 4         | 13        | 12      | cancelled_by_driver | 2013-10-03 |
+----+-----------+-----------+---------+---------------------+------------+
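The problem statement for this table is cut off in the snippet, so the following is only a sketch under an assumption: that the goal is the share of cancelled trips per request day, computed from the Trips table alone. Any filtering that depends on the Users table is left out because that table is not shown here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Trips (Id INTEGER PRIMARY KEY, Client_Id INTEGER, "
    "Driver_Id INTEGER, City_Id INTEGER, Status TEXT, Request_at TEXT)"
)
conn.executemany(
    "INSERT INTO Trips VALUES (?, ?, ?, ?, ?, ?)",
    [
        (1, 1, 10, 1, "completed", "2013-10-01"),
        (2, 2, 11, 1, "cancelled_by_driver", "2013-10-01"),
        (3, 3, 12, 6, "completed", "2013-10-01"),
        (4, 4, 13, 6, "cancelled_by_client", "2013-10-01"),
        (5, 1, 10, 1, "completed", "2013-10-02"),
        (6, 2, 11, 6, "completed", "2013-10-02"),
        (7, 3, 12, 6, "completed", "2013-10-02"),
        (8, 2, 12, 12, "completed", "2013-10-03"),
        (9, 3, 10, 12, "completed", "2013-10-03"),
        (10, 4, 13, 12, "cancelled_by_driver", "2013-10-03"),
    ],
)

# Assumed definition: cancelled trips divided by all trips requested that day.
query = """
SELECT Request_at AS Day,
       ROUND(AVG(CASE WHEN Status LIKE 'cancelled%' THEN 1.0 ELSE 0.0 END), 2)
           AS CancellationRate
FROM Trips
GROUP BY Request_at
ORDER BY Request_at
"""
rates = conn.execute(query).fetchall()
for day, rate in rates:
    print(day, rate)
```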

SQL question challenge (candidate winners)

Table: Candidate

+----+------+
| id | Name |
+----+------+
| 1  | A    |
| 2  | B    |
| 3  | C    |
| 4  | D    |
| 5  | E    |
+----+------+
Table: Vote

SQL question challenge (Customer with no orders)
Suppose that a website contains two tables, the Customers table and the Orders table. Write a SQL query to find all customers who never order anything.

Table: Customers

+----+-------+
| Id | Name  |
+----+-------+
| 1  | Joe   |
| 2  | Henry |
| 3  | Sam   |
| 4  | Max   |
+----+-------+

Using the above tables as an example, return the following:

+-----------+
| Customers |
+-----------+
| Henry     |
| Max       |
+-----------+
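One way to express this is a LEFT JOIN that keeps only customers with no matching order. The Orders table is not shown in this snippet, so its columns (`Id`, `CustomerId`) and its two rows below are assumptions, picked so that Joe and Sam have orders and the result matches the expected output.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Customers (Id INTEGER PRIMARY KEY, Name TEXT)")
conn.execute("CREATE TABLE Orders (Id INTEGER PRIMARY KEY, CustomerId INTEGER)")
conn.executemany(
    "INSERT INTO Customers VALUES (?, ?)",
    [(1, "Joe"), (2, "Henry"), (3, "Sam"), (4, "Max")],
)
# Hypothetical Orders rows: only customers 3 (Sam) and 1 (Joe) have ordered.
conn.executemany("INSERT INTO Orders VALUES (?, ?)", [(1, 3), (2, 1)])

# Customers with no matching order come back with NULL order columns.
query = """
SELECT c.Name AS Customers
FROM Customers c
LEFT JOIN Orders o ON o.CustomerId = c.Id
WHERE o.Id IS NULL
ORDER BY c.Id
"""
no_orders = [row[0] for row in conn.execute(query)]
print(no_orders)
```

A `NOT EXISTS` subquery would give the same rows; the join form is used here only because it is compact.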

DDL SCRIPTS

Twitter location clustering based on tweets (Spark MLlib)

1) Create a directory for Twitter streams

 cd /usr/lib/spark
 sudo mkdir tweets
 cd tweets
 sudo mkdir data
 sudo mkdir training
 sudo chmod 777 /usr/lib/spark/tweets/

These are the two folders we will use in this project:
data: contains the master copy of the CSV files, which we pretend are coming from a training source.
training: the source used to train our machine learning algorithm.

SSH to Hortonworks sandbox

1) Download the Hortonworks sandbox
2) Launch VirtualBox
3) Once the sandbox is up and running, a screen appears with information about the localhost URL and the SSH servers

HORTONWORKS SANDBOX

Movie ratings project part 3 (Analysis)

Continued from http://www.prathapkudupublog.com/2018/07/movie-ratings-project-part-2.html

link to github

***************************************************************
1) WHICH YEAR HAS THE MOST RATINGS
***************************************************************

select year(from_unixtime(rating_time)) rating_year,
       count(*) as cnt
from latest_ratings 
group by year(from_unixtime(rating_time))
order by rating_year DESC;
    
 RATING_YEAR    CNT
 2018    1086549
 2017    1973721
 2016    2077152
 2015    1907576
 2014    584216
 2013    633573
 2012    792701
 2011    833774
 2010    982104
 2009    993111
 2008    1210356
 2007    1095710
 2006    1210803
 2005    1849719
 2004    1201656
 2003    1079130
 2002    910350
 2001    1239432
 2000    2033738
 1999    1231173
 1998    329704
 1997    763929
 1996    1733263
 1995    4
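In the output above, the largest count belongs to 2016 (2,077,152 ratings). Sorting by the count instead of the year puts the answer in the first row. The sketch below shows that variant on an in-memory SQLite table; the six timestamps are hypothetical, and SQLite's `strftime(..., 'unixepoch')` stands in for Hive's `year(from_unixtime(...))`. Note that SQLite interprets the epoch in UTC, while Hive uses the session time zone.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE latest_ratings (rating_time INTEGER)")
# Hypothetical Unix timestamps: three in 2016, two in 2017, one in 2018 (UTC).
conn.executemany(
    "INSERT INTO latest_ratings VALUES (?)",
    [(1451606400,), (1451692800,), (1456790400,),
     (1483228800,), (1483315200,), (1514764800,)],
)

# Same grouping as the query above, but ordered by count so the top year comes first.
query = """
SELECT CAST(strftime('%Y', rating_time, 'unixepoch') AS INTEGER) AS rating_year,
       COUNT(*) AS cnt
FROM latest_ratings
GROUP BY rating_year
ORDER BY cnt DESC
"""
by_year = conn.execute(query).fetchall()
top_year, top_count = by_year[0]
print(by_year)
```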

Monday, February 25, 2019

Impact of scaling and shifting random variables

Categorical Variables

Monday, February 11, 2019

SQL question challenge (Consecutive numbers)

SQL question challenge (Cancellation rates for trips)

SQL question challenge (candidate winners)

SQL question challenge (Customer with no orders)

Monday, February 4, 2019

Twitter location clustering based on tweets (Spark MLlib)

SSH to Hortonworks sandbox

Saturday, February 2, 2019

Movie ratings project part 3 (Analysis)
