Monday, 10 June 2019

Apache Spark

Apache Spark is a fast cluster-computing technology. It builds on the Hadoop MapReduce model and extends it. The main feature of Spark is its in-memory cluster computing, which increases the processing speed of applications.
Spark provides high-level APIs in Java, Scala, Python and R, and it ships with interactive shells for Scala and Python.
The Scala shell can be started with ./bin/spark-shell and the Python shell with ./bin/pyspark.
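For reference, the shells are launched like this (the Windows path below is only an example location; adjust it to wherever you unzip Spark):

```shell
# Start the Scala shell (Linux/macOS), from the Spark home directory
./bin/spark-shell

# Start the Python shell
./bin/pyspark

# On Windows, the same scripts live under the bin folder, e.g.
# C:\spark\bin\spark-shell
```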

Spark can be up to 100 times faster than Hadoop MapReduce for in-memory workloads. It achieves this through in-memory computation and parallel, distributed processing of data split into partitions.
Spark supports multiple data sources such as Parquet, JSON, Hive and Cassandra, in addition to text files, CSV and RDBMS tables.
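As a sketch of what multi-source access looks like inside spark-shell (all file paths and table names below are placeholders; `spark` is the SparkSession the shell creates for you):

```scala
// Inside spark-shell; paths and table names are examples only.
val jsonDF    = spark.read.json("data/people.json")                        // JSON
val parquetDF = spark.read.parquet("data/people.parquet")                  // Parquet
val csvDF     = spark.read.option("header", "true").csv("data/people.csv") // CSV
val hiveDF    = spark.sql("SELECT * FROM some_hive_table")                 // Hive (needs Hive support enabled)
jsonDF.show()
```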


Setup Apache Spark on Windows-

Set JAVA_HOME; use the command below (run from an administrator command prompt):
setx JAVA_HOME "Path" /m. For "Path", paste in your Java installation directory.

setx SCALA_HOME "Path of Scala installation directory" /m

Download and unzip Apache Spark from https://spark.apache.org/downloads.html
setx SPARK_HOME "path of the unzipped Spark folder" /m (SPARK_HOME should point at the Spark root folder, with its bin subfolder added to Path)

Download and unzip the Hadoop common libraries (on Windows this is the folder containing bin\winutils.exe)
setx HADOOP_HOME "folder that contains bin\winutils.exe" /m
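Putting the steps above together, a typical Windows setup looks like this. All paths are examples for your own install locations, and note that appending to the machine Path with setx rewrites the variable, so review it before running:

```batch
REM Run from an administrator command prompt; all paths are examples only.
setx JAVA_HOME "C:\Program Files\Java\jdk1.8.0_211" /m
setx SCALA_HOME "C:\Program Files (x86)\scala" /m
setx SPARK_HOME "C:\spark" /m
REM HADOOP_HOME must be the folder that contains bin\winutils.exe
setx HADOOP_HOME "C:\hadoop" /m
REM Add the bin folders to Path so the commands work from any directory
setx Path "%Path%;%JAVA_HOME%\bin;%SPARK_HOME%\bin;%HADOOP_HOME%\bin" /m
```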

Open a command prompt, navigate to the Spark bin folder and run the spark-shell command.
Refer to the snippet below-
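A first session in spark-shell looks roughly like this (a sketch; the startup banner and result variable names vary by Spark version, and `sc` is the SparkContext the shell creates):

```scala
// scala> is the shell prompt; sc is the pre-built SparkContext.
scala> val data = sc.parallelize(1 to 100, 4)   // an RDD split into 4 partitions
scala> data.getNumPartitions                    // 4
scala> data.filter(_ % 2 == 0).count()          // 50
```

If the shell starts and these commands return results, the environment variables from the setup steps above are configured correctly.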

