Monday, 10 June 2019

Apache Spark

Apache Spark is a fast cluster-computing technology. It builds on the Hadoop MapReduce model and extends it. The main feature of Spark is its in-memory cluster computing, which increases the processing speed of applications.
Spark provides high-level APIs in Java, Scala, Python and R, and it ships with interactive shells for Scala and Python.
The Scala shell can be started with ./bin/spark-shell and the Python shell with ./bin/pyspark.
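For reference, the shells are launched like this (the Windows path below is only an example location; adjust it to wherever you unzip Spark):

```shell
# Start the Scala shell (Linux/macOS), from the Spark home directory
./bin/spark-shell

# Start the Python shell
./bin/pyspark

# On Windows, the same scripts live under the bin folder, e.g.
# C:\spark\bin\spark-shell
```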

Spark can be up to 100 times faster than Hadoop MapReduce for in-memory workloads. It achieves this through in-memory computation and parallel, distributed processing of data split into partitions.
Spark supports multiple data sources such as Parquet, JSON, Hive and Cassandra, in addition to text files, CSV and RDBMS tables.
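As a sketch of what multi-source access looks like inside spark-shell (all file paths and table names below are placeholders; `spark` is the SparkSession the shell creates for you):

```scala
// Inside spark-shell; paths and table names are examples only.
val jsonDF    = spark.read.json("data/people.json")                        // JSON
val parquetDF = spark.read.parquet("data/people.parquet")                  // Parquet
val csvDF     = spark.read.option("header", "true").csv("data/people.csv") // CSV
val hiveDF    = spark.sql("SELECT * FROM some_hive_table")                 // Hive (needs Hive support enabled)
jsonDF.show()
```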


Setup Apache Spark on Windows-

Set JAVA_HOME; use the command below (run from an administrator command prompt):
setx JAVA_HOME "Path" /m. For "Path", paste in your Java installation directory.

setx SCALA_HOME "Path of Scala installation directory" /m

Download and unzip Apache Spark from https://spark.apache.org/downloads.html
setx SPARK_HOME "path of the unzipped Spark folder" /m (SPARK_HOME should point at the Spark root folder, with its bin subfolder added to Path)

Download and unzip the Hadoop common libraries (on Windows this is the folder containing bin\winutils.exe)
setx HADOOP_HOME "folder that contains bin\winutils.exe" /m
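Putting the steps above together, a typical Windows setup looks like this. All paths are examples for your own install locations, and note that appending to the machine Path with setx rewrites the variable, so review it before running:

```batch
REM Run from an administrator command prompt; all paths are examples only.
setx JAVA_HOME "C:\Program Files\Java\jdk1.8.0_211" /m
setx SCALA_HOME "C:\Program Files (x86)\scala" /m
setx SPARK_HOME "C:\spark" /m
REM HADOOP_HOME must be the folder that contains bin\winutils.exe
setx HADOOP_HOME "C:\hadoop" /m
REM Add the bin folders to Path so the commands work from any directory
setx Path "%Path%;%JAVA_HOME%\bin;%SPARK_HOME%\bin;%HADOOP_HOME%\bin" /m
```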

Open a command prompt, navigate to the Spark bin folder and run the spark-shell command.
Refer to the snippet below-
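A first session in spark-shell looks roughly like this (a sketch; the startup banner and result variable names vary by Spark version, and `sc` is the SparkContext the shell creates):

```scala
// scala> is the shell prompt; sc is the pre-built SparkContext.
scala> val data = sc.parallelize(1 to 100, 4)   // an RDD split into 4 partitions
scala> data.getNumPartitions                    // 4
scala> data.filter(_ % 2 == 0).count()          // 50
```

If the shell starts and these commands return results, the environment variables from the setup steps above are configured correctly.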

