Friday, October 26, 2018

System design: Twitter


When we design a system, we first need to list the features we need.
Let's build the system feature by feature, then test for performance and improve on it.

Let us consider three main features of the Twitter system (a minimal sketch of the data structures follows the list):
1. Tweeting
    This holds all the tweets.
2. Timeline
    -User
       The list of tweets that this user has posted in a given time window. Its content is relatively small, since it only contains that single user's tweets.
    -Home
       The list of tweets from all the users we are following. Each followed user might have thousands of tweets in a given time window, so this is the heavier read path.
3. Following
      The list of users that we are following. This does not change very dynamically and puts little load on the system.
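
A minimal in-memory Python sketch of the three features (the user and tweet ids are hypothetical; a real system would back each structure with its own store and fan out writes asynchronously):

from collections import defaultdict, deque

tweets = {}                          # 1. Tweeting: tweet_id -> tweet text
following = defaultdict(set)         # 3. Following: user_id -> set of followed user_ids
user_timeline = defaultdict(deque)   # 2. Timeline (user): user_id -> own tweet_ids
home_timeline = defaultdict(deque)   # 2. Timeline (home): user_id -> tweet_ids from followed users

def follow(follower, followee):
    following[follower].add(followee)

def post_tweet(user_id, tweet_id, text):
    tweets[tweet_id] = text
    user_timeline[user_id].appendleft(tweet_id)
    # fan-out on write: push the tweet to every follower's home timeline
    for follower, followees in following.items():
        if user_id in followees:
            home_timeline[follower].appendleft(tweet_id)

# hypothetical usage
follow("bob", "alice")
post_tweet("alice", "t1", "hello world")
print([tweets[t] for t in home_timeline["bob"]])   # ['hello world']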

System design: Uber



Uber's architecture relies on supply and demand.

  • Demand comes from the rider, and supply is provided by the cars
  • It uses a flat map from Google to generate unique cells across world regions
  • Each cell has a unique id
  • To match supply with demand, it draws a circle that covers one or more cells, as sketched below
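
A simplified Python sketch of the cell idea (this is not the real S2 library; the cell size and coordinates are hypothetical):

import math

CELL_SIZE_DEG = 0.01  # hypothetical cell size in degrees

def cell_id(lat, lon):
    # unique id for the grid cell containing (lat, lon)
    return (int(lat // CELL_SIZE_DEG), int(lon // CELL_SIZE_DEG))

def cells_in_circle(lat, lon, radius_deg):
    # approximate the circle around the rider by the cells it overlaps
    steps = int(math.ceil(radius_deg / CELL_SIZE_DEG))
    center = cell_id(lat, lon)
    cells = []
    for dx in range(-steps, steps + 1):
        for dy in range(-steps, steps + 1):
            if math.hypot(dx, dy) * CELL_SIZE_DEG <= radius_deg:
                cells.append((center[0] + dx, center[1] + dy))
    return cells

# demand: look up available cars (supply) indexed by cell id near the rider
print(cells_in_circle(12.9716, 77.5946, 0.02))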

Wednesday, July 25, 2018

File formats in Hive




SEQUENCEFILE
======================
create external table flight_seq (
  year smallint,
  month tinyint,
  dayofmonth tinyint,
  dayofweek tinyint,
  lateaircraftdelay smallint
)
stored as sequencefile
location '/user/raj_ops/rawdata/handson_train/airline_performance/flights_seq';
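
Once the table exists it can be queried from Spark as well; a hedged sketch, assuming a Hive-enabled Spark session pointing at the same metastore:

from pyspark.sql import SparkSession

# a minimal sketch, assuming the flight_seq table above is in the default database
spark = SparkSession.builder.appName("read-flight-seq").enableHiveSupport().getOrCreate()
spark.sql("SELECT year, month, dayofmonth, lateaircraftdelay FROM flight_seq LIMIT 10").show()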


Hive partition

  • Partitioning improves data access time by restricting a query to only the relevant portion of the dataset.
  • Care has to be taken when choosing the partition column.
  • Once a partition has been created, you can alter some of its definitions so that it differs from the other partitions.
  • There is no hard limit on the number of partitions a Hive table can contain; however, a very large number of partitions (and the small files that come with them) puts pressure on the metastore, so we still need to be careful.
  • Filtering without the partition column forces a scan of every partition, so the query can take longer than it would on a non-partitioned table.
  • Prefer static partitioning to dynamic partitioning for day-to-day data ingestion (a sketch follows this list).
  • Pre-empt small-file scenarios.
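
The sketch below creates a partitioned table and performs a static-partition insert through a Hive-enabled Spark session; the table and column names (flights_part, flights_staging, flight_year) are hypothetical:

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# partitioned table: flight_year is the partition column
spark.sql("""
  CREATE TABLE IF NOT EXISTS flights_part (
    carrier STRING,
    dep_delay SMALLINT
  )
  PARTITIONED BY (flight_year SMALLINT)
""")

# static partitioning: the partition value is given explicitly in the INSERT
spark.sql("""
  INSERT OVERWRITE TABLE flights_part PARTITION (flight_year = 2008)
  SELECT carrier, dep_delay FROM flights_staging WHERE year = 2008
""")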

Tuesday, July 17, 2018

Create key pair in AWS




SSH to AWS



Create new AWS cluster



Movie ratings project part 2

RATINGS

RATINGS FILE DESCRIPTION
================================================================================

All ratings are contained in the file "ratings.dat" and are in the
following format:

UserID::MovieID::Rating::Timestamp

- UserIDs range between 1 and 6040 
- MovieIDs range between 1 and 3952
- Ratings are made on a 5-star scale (whole-star ratings only)
- Timestamp is represented in seconds since the epoch as returned by time(2)
- Each user has at least 20 ratings

FIND AND REPLACE DELIMITER
more ratings.dat | sed -e 's/::/@/g' > ratings_clean

CREATE DIRECTORY FOR RATINGS
hdfs dfs -mkdir /hackerday_ratings/million/ratings
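
A hedged sketch of reading the cleaned file back with Spark, assuming ratings_clean has been copied into the HDFS directory created above:

from pyspark.sql import SparkSession

# read the '@'-delimited file and name the four columns from the description above
spark = SparkSession.builder.appName("load-ratings").getOrCreate()
ratings = (spark.read
           .csv("/hackerday_ratings/million/ratings/ratings_clean", sep="@")
           .toDF("user_id", "movie_id", "rating", "ts"))
ratings.show(5)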

Movie ratings project part 1



ADD USER
sudo useradd hduser

CREATE DIRECTORY
hdfs dfs -mkdir /hackerday_ratings

LIST CREATED DIRECTORY
hdfs dfs -ls /

ADD THE USER TO THE HADOOP GROUP
sudo usermod -aG hadoop hduser

CHECK FOR EXISTING USER
id hduser

CREATE PWD FOR THE USER
sudo passwd hduser

CHANGE THE OWNERSHIP OF THAT DIRECTORY (log in as root or as the HDFS superuser)
hdfs dfs -chown -R hduser:hadoop /hackerday_ratings

CHECK FOR OWNERSHIP CHANGES
hdfs dfs -ls /

Tuesday, July 10, 2018

DSL in Spark


DSL
  • Stands for Domain Specific Language
  • A language designed for a specific purpose
  • DataFrames are schema-aware
  • They expose a rich domain-specific language
  • Structured data manipulation
  • In a SQL-like way (think in SQL), as in the sketch below
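
A minimal DSL sketch; the employees.csv path and its columns (name, salary) are hypothetical:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("dsl-demo").getOrCreate()
df = spark.read.csv("employees.csv", header=True, inferSchema=True)

# "think in SQL": SELECT name, salary FROM df WHERE salary > 50000 ORDER BY salary DESC
(df.select("name", "salary")
   .where(col("salary") > 50000)
   .orderBy(col("salary").desc())
   .show())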

Dynamically create DataFrames



We can dynamically create a string of rows and then generate a DataFrame from it.
However, handed to Spark as-is, the whole string is treated as a single record and the conversion throws an error.
We need to split it into lines and fields based on a delimiter, which we do with the split function shown below.

CREATE DATAFRAME
from pyspark.sql import Row

# build one delimited record per (id, hour, value) from the existing
# baseline_row dataframe
data_string = ""
for rw in baseline_row.collect():
    for i in range(24):
        hour = "h" + str(i + 1)
        hour_value = str(rw[hour])
        data_string = data_string + str(rw.id) + "," + hour + "," + hour_value + "\n"

# dynamically generated data for the 24 hours
print(data_string)

# split each line on the delimiter and turn it into a Row
def split_the_line(line):
    row_id, hour, value = line.split(",")
    return Row(id=row_id, hour=hour, value=value)

rdds = spark_session.sparkContext.parallelize(data_string.strip().split("\n"))
rdds.map(split_the_line).toDF().show()

Deployment mode in Spark


DEPLOYMENT MODE 
We can run the driver in client (local) mode or in cluster mode; the property can also be set in code, as sketched after the list.

Cluster mode:
  • The driver runs on the cluster even if the job is launched from outside it. 
  • The driver process is not killed if the machine that submitted the application goes down
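
In practice the deploy mode is passed to spark-submit as --deploy-mode cluster; the corresponding property key is spark.submit.deployMode, shown here as a hedged sketch:

from pyspark import SparkConf
from pyspark.sql import SparkSession

# a minimal sketch: the property normally set via --deploy-mode on spark-submit
conf = SparkConf().setAppName("deploy-mode-demo").set("spark.submit.deployMode", "cluster")
spark = SparkSession.builder.config(conf=conf).getOrCreate()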

Spark samples (RDD, DataFrames, DSL)



SHARK: THE BEGINNING OF THE API 
  • SQL on top of the Spark execution engine
  • Evolved into Spark SQL in 1.0
SCHEMA RDD
  • An RDD with schema information
  • Originally for unit testing and debugging Spark SQL
  • Drew the attention of Spark developers
  • Released as the DataFrame API in 1.3

Examples for compression and file format in spark

PARQUET
  • Design based on Google's Dremel paper
  • Schema is segregated into the footer
  • Column-major format organized into row groups
  • Simpler type model with logical types
  • All data is pushed to the leaves of the tree (a PySpark write/read sketch follows the list)
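
A minimal sketch of writing and reading Parquet from PySpark; the /tmp output path is hypothetical and snappy is assumed as the codec:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-demo").getOrCreate()

# write a small DataFrame as snappy-compressed Parquet, then read it back
df = spark.createDataFrame([(1, "spark"), (2, "parquet")], ["id", "name"])
df.write.mode("overwrite").option("compression", "snappy").parquet("/tmp/demo_parquet")
spark.read.parquet("/tmp/demo_parquet").show()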

Monday, July 9, 2018

Spark samples (Spark SQL, window functions, persist)


WRITE AS CSV
df_sample.write.csv("./spark-warehouse/SAMPLE.csv")

WRITE AS CSV WITH HEADER
df_sample.write.csv("./spark-warehouse/SAMPLE_5.csv",header=True) 


DISPLAY ONLY COLUMNS 
#Load csv as dataframe
data = spark.read.csv("./spark-warehouse/LOADS.csv", header=True)

#Register temp view
data.createOrReplaceTempView("vw_data")

#load data based on the select query
load = spark.sql("Select * from vw_data limit 5")
load.columns
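
The post title also mentions window functions and persist; here is a hedged sketch against the vw_data view registered above, assuming hypothetical carrier and dep_delay columns:

from pyspark.sql.window import Window
from pyspark.sql.functions import col, row_number

# cache the view's data since it is queried repeatedly
data = spark.table("vw_data").persist()

# top 3 departure delays per carrier using a window function
w = Window.partitionBy("carrier").orderBy(col("dep_delay").desc())
(data.withColumn("rank", row_number().over(w))
     .where(col("rank") <= 3)
     .show())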


Sunday, June 24, 2018

Simple transformations in Spark



MAP :
  • map is a transformation operation in Spark, hence it is lazily evaluated
  • It is a narrow operation, as it does not shuffle data from one partition to multiple partitions
Note: In this example map takes each value from the list and pairs it with 1
scala> val x=sc.parallelize(List("spark","rdd","example","sample","example"),3)
x: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[4]  
at parallelize at <console>:27

scala> val y=x.map(x=>(x,1))
y: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[5] 
at map at <console>:29

scala> y.collect
res0: Array[(String, Int)] = Array((spark,1), (rdd,1), (example,1), 
(sample,1), (example,1))



Running cron Job




What is cron?

Cron is the name of the program that enables Unix users to execute commands
or scripts (groups of commands) automatically at a specified time/date.
It is normally used for sysadmin tasks, like makewhatis, which builds a 
search database for the man -k command, or for running a backup script, 
but it can be used for anything. A common use for it today is connecting to 
the internet and downloading your email.

Overview of HBase





  • A non-relational (NoSQL) database that runs on top of HDFS
  • An open-source NoSQL database that provides real-time read/write access to large datasets (a read/write sketch follows the list)
  • Scales linearly to handle huge data sets with billions of rows and millions of columns, and it easily combines data sources that use a wide variety of different structures and schemas 
  • Natively integrated with Hadoop and works seamlessly alongside other data access engines through YARN
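
The sketch uses the happybase client and assumes the HBase Thrift server is running and that a 'ratings' table with column family 'cf' already exists (both hypothetical):

import happybase

# connect via the HBase Thrift gateway (assumed to be on localhost)
connection = happybase.Connection("localhost")
table = connection.table("ratings")

# real-time write and read of a single row
table.put(b"user1|movie42", {b"cf:rating": b"5"})
print(table.row(b"user1|movie42"))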

Permanent UDF in hive


If we create a temporary function, it is available only for the current CLI session.
Every time we want to use the function, we need to add the jar and create the temporary function again:

hive> ADD JAR /home/cloudera/mask.jar;
Added [/home/cloudera/mask.jar] to class path
Added resources: [/home/cloudera/mask.jar]
hive> CREATE TEMPORARY FUNCTION MASK AS 'hiveudf.PImask';


HIVE PERMANENT FUNCTION

Note: If we have already created a temporary function with a given name, we need to choose a new function name when creating the permanent function.

The problem with a temporary function is that it is valid only as long as the session in which it was created is alive, and it is lost as soon as we log off.  
Often we need functions to be permanent so that they can be used across sessions and across different edge nodes. Let us create a permanent function from the same jar file and test it. 
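
A hedged sketch of the permanent function: the jar is placed on HDFS (hypothetical path) so every session and edge node can reach it, and the function is registered with CREATE FUNCTION ... USING JAR, here issued through a Hive-enabled Spark session under the new name mask_perm:

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# the jar lives on HDFS (hypothetical path), so the function survives sessions
spark.sql("""
  CREATE FUNCTION mask_perm AS 'hiveudf.PImask'
  USING JAR 'hdfs:///user/cloudera/udf/mask.jar'
""")

spark.sql("SELECT mask_perm('1234-5678-9012')").show()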

Create UDF functions in hive



HIVE UDF FUNCTIONS
UDFs are built for a specific purpose, to perform mathematical, arithmetic, logical, or relational operations on table columns.

We can write the UDF in Java.
In this example, the UDF replaces the characters of a string with "*", masking characters that should not be shown to the user.

Eclipse market place for scala


We can find the Scala IDE for Eclipse in the Eclipse Marketplace.

STEPS INVOLVED
  • In the Eclipse IDE, go to the Help menu and click on Eclipse Marketplace


Linux file system hierarchy


If you look at the Linux file hierarchy, you find the following :

/bin - Common binaries

/sbin - Binaries used for system administration are placed here.

/boot - Static files of the boot loader. It usually contains the Linux kernel, GRUB boot loader files, and so on.

/dev - Device files such as your CD drive, hard disk, and any other physical device. 

In Linux/Unix, the common premise is that everything is a file.

/home - User HOME directories are found here. On Unix-like systems such as FreeBSD, the HOME directories are found in /usr/home, and on Solaris they are in /export, so there is quite a bit of variation here.

Install SBT using yum


sbt is an open-source build tool for Scala and Java projects, similar to Java's Maven and Ant.

Its main features are:

  • Native support for compiling Scala code and integrating with many Scala test frameworks
  • Continuous compilation, testing, and deployment
  • Incremental testing and compilation (only changed sources are re-compiled, only affected tests are re-run etc.)
  • Build descriptions written in Scala using a DSL


Tuesday, June 19, 2018

Accumulators and Broadcast Variables in Spark


SPARK PROCESSING
  • Distributed and parallel processing
  • Each executor has its own copies of variables and functions
  • Data is not propagated back to the driver, except in certain necessary cases; accumulators (executor to driver) and broadcast variables (driver to executors) cover those cases, as in the sketch below
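
A minimal accumulator/broadcast sketch:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("acc-bcast-demo").getOrCreate()
sc = spark.sparkContext

bad_records = sc.accumulator(0)              # write-only counter, aggregated on the driver
lookup = sc.broadcast({"a": 1, "b": 2})      # read-only copy shipped once per executor

def parse(rec):
    if rec not in lookup.value:
        bad_records.add(1)
        return None
    return (rec, lookup.value[rec])

parsed = sc.parallelize(["a", "b", "x", "a"]).map(parse).filter(lambda r: r is not None)
print(parsed.collect())        # [('a', 1), ('b', 2), ('a', 1)]
print(bad_records.value)       # 1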

Wednesday, June 13, 2018

Simple demo for spark streaming

Login to your shell and open pyspark
[cloudera@quickstart ~]$ pyspark
Python 2.6.6 (r266:84292, Jul 23 2015, 15:22:56) 


Run Netcat
[cloudera@quickstart ~]$  nc -l localhost 2182
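
A minimal word-count sketch over the socket opened by the nc command above (localhost:2182); run it with spark-submit, or reuse the existing sc inside the pyspark shell:

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="streaming-demo")   # in the pyspark shell, reuse the existing sc
ssc = StreamingContext(sc, 5)                 # 5-second micro-batches

# read lines from the netcat socket and count words per batch
lines = ssc.socketTextStream("localhost", 2182)
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()

ssc.start()
ssc.awaitTermination()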


Tuesday, April 24, 2018

Mean, Median and Mode

Mean
The "average" number; found by adding all data points and dividing by the number of data points.


Sunday, April 22, 2018

Kafka partitions

  • Each topic has one or more partitions
  • The number of partitions for a topic depends on the circumstances in which Apache Kafka is intended to be used, and it is configurable
  • A partition is the basis on which Kafka can
    • Scale
    • Become fault-tolerant
    • Achieve a higher level of throughput
  • Each partition is maintained on at least one broker
 Note: Each partition must fit entirely on a single machine. If we had only one partition for a large and growing topic, we would be limited by that one broker node's ability to capture and retain the messages being published to the topic, and we would also run into I/O constraints. A sketch of creating a multi-partition topic follows.
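
The sketch below uses the kafka-python admin client (the broker address and topic name are hypothetical):

from kafka.admin import KafkaAdminClient, NewTopic

admin = KafkaAdminClient(bootstrap_servers="localhost:9092")

# three partitions let the topic spread across up to three brokers
admin.create_topics([NewTopic(name="clicks", num_partitions=3, replication_factor=1)])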

Overview of S3


  • An interface that you can use to store and retrieve any amount of data, at any time, from anywhere on the web (a store-and-retrieve sketch follows the list)
  • An object store, not a file system
  • Highly scalable, reliable, fast, inexpensive data storage infrastructure
  • Uses an eventual-consistency model
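
The sketch uses boto3; the bucket name is hypothetical and credentials are assumed to be configured:

import boto3

s3 = boto3.client("s3")

# store an object, then retrieve it with the same key
s3.put_object(Bucket="my-demo-bucket", Key="notes/hello.txt", Body=b"hello s3")
obj = s3.get_object(Bucket="my-demo-bucket", Key="notes/hello.txt")
print(obj["Body"].read())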


Markdown Cheat Sheet (Jupyter Notebook)

Headers

# H1
## H2
### H3
#### H4
##### H5
###### H6

Alternatively, for H1 and H2, an underline-ish style:

Alt-H1
======

Alt-H2
------

Overview of Pig


  • Apache Pig is a high-level platform for creating programs that run on Apache Hadoop.
  • The language for this platform is called Pig Latin. 
  • Pig can execute its Hadoop jobs in MapReduce, Apache Tez, or Apache Spark
Local mode
  • In local mode, Pig runs in a single JVM and accesses the local file system. This mode is suitable only for small data sets, not for big data sets.
  • We can set the execution type with the -x (or -exectype) option. To run in local mode, set the option to local: pig -x local

Shell script



  • The shell interprets user commands, which are either entered directly by the user or read from a file called a shell script/program
  • Shell scripts are interpreted, not compiled
  • Typical operations performed by shell scripts include file manipulation, program execution, and printing text
Command to identify the shell types that the operating system supports: cat /etc/shells

Overview of Flume



Flume


  • A distributed data collection service
  • Gets streaming event data from different sources
  • Moves large amounts of log data from many different sources to a centralized data store

Note: We cannot use Flume to ingest relational data

Span vs div (CSS)


div is a block element, span is inline.
This means that to use them semantically, divs should be used to wrap sections of a document, while spans should be used to wrap small portions of text, images, etc.
For example:
<div>This is a large main division, with <span>a small bit</span> of spanned text!</div>

Display block vs inline vs inline-block



Sample html
<!DOCTYPE html>
<html>
<head>
<style>
.floating-box {
    display: inline-block;
    width: 150px;
    height: 75px;
    margin: 10px;
    border: 3px solid #73AD21;  
}

.after-box {
    border: 3px solid red; 
}
</style>
</head>
<body>

<h2>The New Way - using inline-block</h2>

<div class="floating-box">Floating box</div>
<div class="floating-box">Floating box</div>
<div class="floating-box">Floating box</div>
<div class="floating-box">Floating box</div>
<div class="floating-box">Floating box</div>
<div class="floating-box">Floating box</div>
<div class="floating-box">Floating box</div>
<div class="floating-box">Floating box</div>

<div class="after-box">Another box, after the floating boxes...</div>

</body>
</html>



CSS for td in the same line (Angular 2)


Html

<table id="table_id"><tr><td>testtesttesttest</td>
<td>testtesttesttest</td>
</tr>
</table>

React Function vs Class Component




React has two kinds of components

Function Component:
  • The simplest form of a React component
  • It receives an object of properties (props) and returns JSX (which looks like HTML)


Saturday, April 14, 2018

Simple demo using Kafka



Start ZooKeeper
bin/zookeeper-server-start.sh config/zookeeper.properties

Start telnet
telnet localhost 2181

Create topic
bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic test

List kafka Topics
bin/kafka-topics.sh --list --zookeeper localhost:2181
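
A hedged sketch of producing to and consuming from the test topic with the kafka-python client, assuming a broker on localhost:9092:

from kafka import KafkaProducer, KafkaConsumer

# publish a couple of messages to the topic created above
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("test", b"hello")
producer.send("test", b"kafka")
producer.flush()

# read them back from the beginning of the topic
consumer = KafkaConsumer("test",
                         bootstrap_servers="localhost:9092",
                         auto_offset_reset="earliest",
                         consumer_timeout_ms=5000)
for message in consumer:
    print(message.value)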

Install kafka


Steps for installing Kafka

  • You need to set up a Java runtime on your system before you can run Apache Kafka properly.
  • We can install the OpenJDK Runtime Environment 1.8.0 using YUM:
sudo yum install java-1.8.0-openjdk.x86_64
  • Validate your installation with:
java -version

Tuesday, February 27, 2018

Message Consumption in Apache kafka



MESSAGE OFFSET

  • A critical concept to understand, because it is how consumers can read messages at their own pace and process them independently. 
  • A placeholder: it is like a bookmark that maintains the last read position
  • In the case of a Kafka topic, it is the last read message.
  • The offset is entirely established and maintained by the consumer, since the consumer is entirely responsible for reading messages and processing them on its own.
  • The consumer keeps track of what it has and has not read
  • An offset refers to a message identifier

STEPS INVOLVED

  1. When a consumer wishes to read from a topic, it must establish a connection with a broker
  2. After establishing the connection, the consumer decides which messages it wants to consume
  3. If the consumer has not previously read from the topic, or it has to start over, it issues a request to read from the beginning of the topic (the consumer establishing that its message offset for the topic is 0); see the sketch below
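
A hedged kafka-python sketch of step 3: the consumer owns its offset, so it can rewind to the beginning of a partition and check its position (the topic name and broker address are hypothetical):

from kafka import KafkaConsumer, TopicPartition

consumer = KafkaConsumer(bootstrap_servers="localhost:9092")

# the consumer, not the broker, decides where to read from
tp = TopicPartition("test", 0)
consumer.assign([tp])
consumer.seek_to_beginning(tp)          # set the offset back to 0
print(consumer.position(tp))            # the consumer-maintained offset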

Apache ZooKeeper


APACHE KAFKA DISTRIBUTED ARCHITECTURE

  • At the heart of Apache Kafka we have a cluster, which can consist of hundreds of independent brokers.
  • Closely associated with the Kafka cluster is a ZooKeeper environment, which provides the brokers within a cluster with the metadata they need to operate at scale and reliably. As this metadata is constantly changing, connectivity and chatter between the cluster members and ZooKeeper is required.


Team formation in Kafka


CONTROLLER ELECTION
  • The hierarchy starts with a controller/supervisor 
  • The controller is a worker node elected by its peers to officiate in the administrative capacity of a controller
  • The worker node selected as controller is the one that has been around the longest

RESPONSIBILITIES OF THE CONTROLLER
  • Maintain an inventory of which workers are available to take on work
  • Maintain a list of work items that have been committed to and assigned to workers
  • Maintain the active status of the workers and their progress on assigned tasks

Overview of kafka



  • Apache Kafka is a distributed commit log service
  • It functions much like a publish/subscribe messaging system
  • It offers better throughput than traditional message brokers
  • Built-in partitioning, replication, and fault tolerance 
  • Increasingly popular for log collection and stream processing