Showing posts with label Apache Spark.
Tuesday, January 21, 2020
Installing Spark on Windows
https://www.youtube.com/watch?v=2CvtwKTjI4Q&vl=en
1) Download the specific version of Spark
http://spark.apache.org/downloads.html
2) Unzip it and create a directory for Spark
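Once Spark is unzipped and on the path, a quick way to check the install is to run a tiny local job. This is a minimal sketch, assuming the pyspark package is importable in your Python environment (for example via pip install pyspark or findspark); it is not part of the original post.

# Minimal install check (assumption: pyspark is importable in this environment)
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("install-check").getOrCreate()
print(spark.version)      # the Spark version that was just unzipped
spark.range(5).show()     # run a tiny local job to confirm execution works
spark.stop()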
Monday, February 4, 2019
Twitter location clustering based on tweets (Spark MLlib)
1) Create a directory for Twitter streams
cd /usr/lib/spark
sudo mkdir tweets
cd tweets
sudo mkdir data
sudo mkdir training
sudo chmod 777 /usr/lib/spark/tweets/
These are the two folders we will use in this project (a sketch of how they fit together follows below):
data : will contain the master CSV files, which we pretend are coming from a streaming source.
training : the source files used to train our machine learning algorithm.
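As a rough sketch of how these folders could feed an MLlib streaming clustering job: the paths, the 4-cluster choice, and the latitude,longitude CSV layout are assumptions on my part, not from the original post.

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.mllib.clustering import StreamingKMeans
from pyspark.mllib.linalg import Vectors

sc = SparkContext("local[2]", "tweet-location-clustering")
ssc = StreamingContext(sc, batchDuration=10)

# Assumed CSV layout: latitude,longitude per line, dropped into the training folder
def parse(line):
    lat, lon = line.split(",")[:2]
    return Vectors.dense([float(lat), float(lon)])

training = ssc.textFileStream("/usr/lib/spark/tweets/training").map(parse)

# Cluster tweet locations into 4 groups, updating the centres as new files arrive
model = StreamingKMeans(k=4, decayFactor=1.0).setRandomCenters(dim=2, weight=1.0, seed=42)
model.trainOn(training)

ssc.start()
ssc.awaitTermination()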
Tuesday, July 10, 2018
Dynamically create DataFrames
We can dynamically build a string of rows and then generate a DataFrame from it.
However, the whole string would be treated as a single line and would throw an error.
We need to split the lines based on the delimiter, which can be done by writing a split function as shown below.
CREATE DATAFRAME
from pyspark.sql.functions import lit

# create rdd for new id
data_string = ""
for rw in baseline_row.collect():
    for i in range(24):
        hour = "h" + str(i + 1)
        hour_value = str(rw[hour])
        data = 'Row(' + str(rw.id) + ', "unique_id"),'
        data_string = data_string + data

# dynamically generated data for hours
print(data_string)

rdds = spark_session.sparkContext.parallelize([data_string])
rdds.map(split_the_line).toDF().show()
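The split function referenced above is not shown in the excerpt. Here is a minimal sketch of what split_the_line could look like, assuming comma-delimited id/label pairs; the field names and sample values are my own, not from the post.

from pyspark.sql import Row, SparkSession

# Hypothetical helper: split one comma-delimited line into a Row
# (the field names `id` and `label` are assumptions, not from the post)
def split_the_line(line):
    parts = [p.strip() for p in line.split(",")]
    return Row(id=parts[0], label=parts[1])

spark = SparkSession.builder.appName("split-sketch").getOrCreate()
rdds = spark.sparkContext.parallelize(["101, unique_id", "102, unique_id"])
rdds.map(split_the_line).toDF().show()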
Monday, July 9, 2018
Spark samples (Spark SQL, Window functions, persist)

WRITE AS CSV
df_sample.write.csv("./spark-warehouse/SAMPLE.csv")
WRITE AS CSV WITH HEADER
df_sample.write.csv("./spark-warehouse/SAMPLE_5.csv",header=True)
DISPLAY ALL COLUMNS
# Load csv as dataframe
data = spark.read.csv("./spark-warehouse/LOADS.csv", header=True)

# Register temp view
data.createOrReplaceTempView("vw_data")

# Load data based on the select query
load = spark.sql("Select * from vw_data limit 5")
load.show()
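The post title also mentions window functions and persist, which the excerpt does not show. A minimal sketch of how both are typically used in PySpark; the sample data and column names are assumptions, not from the post.

from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("window-persist-sketch").getOrCreate()

# Assumed sample data, not from the original post
df = spark.createDataFrame(
    [("A", 1, 10.0), ("A", 2, 20.0), ("B", 1, 5.0), ("B", 2, 15.0)],
    ["id", "hour", "load"],
)

# Running total of load per id, ordered by hour
w = Window.partitionBy("id").orderBy("hour")
df_running = df.withColumn("running_load", F.sum("load").over(w))

# Cache the result so repeated actions reuse the computed DataFrame
df_running.persist()
df_running.show()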
Sunday, June 24, 2018
Simple transformations in Spark
MAP:
- map is a transformation operation in Spark, hence it is lazily evaluated
- It is a narrow operation, as it does not shuffle data from one partition to multiple partitions
scala> val x = sc.parallelize(List("spark","rdd","example","sample","example"), 3)
x: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[4] at parallelize at <console>:27

scala> val y = x.map(x => (x, 1))
y: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[5] at map at <console>:29

scala> y.collect
res0: Array[(String, Int)] = Array((spark,1), (rdd,1), (example,1), (sample,1), (example,1))
Install SBT using yum
sbt is an open-source build tool for Scala and Java projects, similar to Java's Maven and Ant.
Its main features are:
- Native support for compiling Scala code and integrating with many Scala test frameworks
- Continuous compilation, testing, and deployment
- Incremental testing and compilation (only changed sources are re-compiled, only affected tests are re-run etc.)
- Build descriptions written in Scala using a DSL
Tuesday, June 19, 2018
Friday, February 9, 2018
Data Representation in RDD
Spark has 3 data representations:
- RDD (Resilient Distributed Dataset)
- A collection of elements that can be divided across multiple nodes in a cluster for parallel processing.
- It is also a fault-tolerant collection of elements, which means it can automatically recover from failures.
- It is immutable: we can create an RDD once but cannot change it.
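As a quick illustration of those properties, here is a minimal PySpark sketch (not from the original post): transformations return a new RDD and leave the original unchanged.

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("rdd-sketch").getOrCreate()
sc = spark.sparkContext

# An RDD is a distributed, immutable collection of elements split across partitions
numbers = sc.parallelize([1, 2, 3, 4, 5], numSlices=2)

# A transformation produces a new RDD; `numbers` itself is unchanged
doubled = numbers.map(lambda n: n * 2)

print(numbers.collect())  # [1, 2, 3, 4, 5]
print(doubled.collect())  # [2, 4, 6, 8, 10]
spark.stop()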
Overview of Spark streaming
How much data does Google deal with?
- Stores about 15 exabytes (1 exabyte = 1,000,000,000,000,000,000 bytes) of data
- Processes 100 petabytes of data per day
- Has about 60 trillion pages indexed
- Has 1 billion Google Search users per month
Limitations of MapReduce
- An entire MapReduce job is a batch-processing job
- It does not allow real-time processing of the data
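In contrast, Spark Streaming processes data in small micro-batches as it arrives. A minimal sketch of a streaming word count; the socket source, port, and batch interval are assumptions, not from the post.

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

# Process incoming text in 5-second micro-batches instead of one large batch job
sc = SparkContext("local[2]", "streaming-sketch")
ssc = StreamingContext(sc, batchDuration=5)

# Assumed source: a text socket on localhost:9999 (e.g. started with `nc -lk 9999`)
lines = ssc.socketTextStream("localhost", 9999)
counts = (lines.flatMap(lambda line: line.split(" "))
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()

ssc.start()
ssc.awaitTermination()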