Suthesana
3 min read · May 20, 2020

Apache Spark

Introduction

Apache Spark is an open-source framework for big data. It provides a faster and more general data-processing platform, which has made it a standard tool for any developer or data scientist interested in big data. Spark offers high-level APIs in Python, Scala, Java, and R, and it has several advantages over other big data solutions: it is dynamic in nature, it supports in-memory computation of RDDs, and it provides fault tolerance, reusability, and real-time stream processing.

Features

Fault Tolerance and Recovery

Apache Spark achieves fault tolerance through its core abstraction, the RDD (Resilient Distributed Dataset). RDDs are designed to handle the failure of any worker node in the cluster, so the loss of data is kept to a minimum. When we work with Spark, we apply a series of transformations to RDDs, and Spark records a logical execution plan for those operations, also known as the lineage graph. If a machine fails and part of an RDD is lost, Spark can recompute it by replaying the same transformations from the lineage graph on another node, recovering the identical dataset. This recomputation from lineage is what makes Spark fault tolerant and self-recovering.
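The recovery idea above can be sketched in a few lines of plain Python. This is an illustrative toy, not Spark's actual API: the class name MiniRDD and its fields are invented for the example, and it only models the core point that a lost result can be rebuilt by replaying recorded transformations from its parent.

```python
# Minimal sketch of lineage-based recovery, loosely modeled on Spark's RDD
# abstraction. All names here (MiniRDD, parent, fn) are illustrative.

class MiniRDD:
    def __init__(self, data=None, parent=None, fn=None):
        self.parent = parent   # upstream dataset in the lineage graph
        self.fn = fn           # transformation used to derive this dataset
        self._cache = data     # materialized data (may be lost)

    def map(self, fn):
        # Record the transformation lazily; nothing is computed yet.
        return MiniRDD(parent=self, fn=lambda rows: [fn(r) for r in rows])

    def collect(self):
        # If the data was lost, recompute it from the parent by replaying
        # the recorded transformation (self-recovery via lineage).
        if self._cache is None:
            self._cache = self.fn(self.parent.collect())
        return self._cache

base = MiniRDD(data=[1, 2, 3])
doubled = base.map(lambda x: x * 2)
print(doubled.collect())   # [2, 4, 6]
doubled._cache = None      # simulate losing the computed data
print(doubled.collect())   # recomputed from lineage: [2, 4, 6]
```

Note that, as in Spark, the transformation is recorded at the time `map` is called but only executed when the result is actually needed.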

High Availability

A system is considered highly available if its downtime is small; how small depends on how critical the system is, and zero downtime is an unattainable ideal for any system. Suppose a machine has an uptime of 97.7%, so its probability of being down at any moment is 0.023. With two such machines, the probability that both are down at once is 0.023 × 0.023. In most high-availability setups three machines are used, in which case the probability that all three are down is 0.023 × 0.023 × 0.023, which guarantees a system uptime of 99.9987833% — a very acceptable figure.
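The arithmetic above can be wrapped in a small helper. The function name is my own; the calculation simply assumes the replicas fail independently, so the system is down only when every machine is down at the same time.

```python
# Availability of N independent replicas: the system is unavailable only
# if all replicas are down simultaneously.

def system_uptime(machine_uptime: float, replicas: int) -> float:
    p_down = 1.0 - machine_uptime      # e.g. 1 - 0.977 = 0.023
    return 1.0 - p_down ** replicas

print(round(system_uptime(0.977, 1) * 100, 4))   # 97.7
print(round(system_uptime(0.977, 2) * 100, 4))   # 99.9471
print(round(system_uptime(0.977, 3) * 100, 7))   # 99.9987833
```

With three replicas the 0.023³ term matches the 99.9987833% figure quoted in the paragraph.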

Security

Apache Spark supports authentication through a shared secret. The configuration parameter spark.authenticate controls whether Spark's internal communication protocols authenticate using that shared secret. Both the sender and the receiver must hold the same secret in order to communicate; if their secrets do not match, they are not permitted to talk to each other.
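The general idea of shared-secret authentication can be sketched as a challenge–response exchange. This is not Spark's actual wire protocol — the function names below are invented for illustration — but it shows why two parties with different secrets cannot authenticate to each other.

```python
# Illustrative shared-secret authentication (not Spark's real protocol):
# each side proves knowledge of the secret by answering a random challenge
# with an HMAC; digests are compared in constant time.
import hashlib
import hmac
import os

def respond(secret: bytes, challenge: bytes) -> bytes:
    return hmac.new(secret, challenge, hashlib.sha256).digest()

def authenticate(my_secret: bytes, peer_response: bytes, challenge: bytes) -> bool:
    expected = respond(my_secret, challenge)
    return hmac.compare_digest(expected, peer_response)

challenge = os.urandom(16)
secret = b"cluster-shared-secret"
print(authenticate(secret, respond(secret, challenge), challenge))         # True
print(authenticate(secret, respond(b"wrong-secret", challenge), challenge))  # False
```

A peer that holds a different secret produces a different HMAC and is rejected, just as Spark refuses communication when the shared secrets are not identical.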

Predictable and Consistent Performance

Cloud Pak for Data v2.5 provides Analytics Engine powered by Apache Spark, a service integrated into the platform that gives users a unified experience for running and managing many kinds of analytics in a single place. Because Spark jobs run with dedicated resources, their performance is consistent and predictable.

Scalable

Spark supports multiple programming languages (Python, Scala, Java, and R). It also includes libraries for a wide range of tasks, from SQL to streaming and machine learning, and it runs anywhere from a laptop to a cluster of thousands of servers. This makes it a system that is easy to start with and that scales up to very large workloads.
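One concrete way to see this laptop-to-cluster scaling is through spark-submit: the same application runs unchanged, and only the --master URL (and resource flags) change. The host name and script name below are placeholders.

```shell
# Develop and test on a laptop, using all local cores:
spark-submit --master "local[*]" my_job.py

# Scale the same script out to a standalone cluster (placeholder host):
spark-submit --master spark://cluster-host:7077 --executor-memory 4g my_job.py
```

The application code itself does not need to know which environment it is running in; the deployment flags decide that.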

Conclusion

Spark helps streamline the difficult, computationally intensive task of processing high volumes of real-time or archived data, both structured and unstructured, while seamlessly integrating complex capabilities such as machine learning and graph algorithms. Spark brings big data processing to the masses.
