
Clojure for Data Science

GraphX (https://spark.apache.org/graphx/) is a distributed graph processing library that is designed to work with Spark. Like the MLlib library we used in the previous chapter, GraphX provides a set of abstractions that are built on top of Spark's RDDs. By representing the vertices and edges of a graph as RDDs, GraphX is able to process very large graphs in a scalable way.
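To make the idea of holding a graph as two separate collections concrete, here is a minimal plain-Python sketch. It does not use Spark or GraphX; ordinary lists stand in for the vertex and edge RDDs, and the names `vertices`, `edges`, and `out_degrees` are hypothetical, chosen only for illustration.

```python
from collections import Counter

# Hypothetical toy graph. In GraphX these two collections would be RDDs;
# here plain lists stand in for them.
vertices = [(1, "alice"), (2, "bob"), (3, "carol")]  # (vertex id, attribute)
edges = [(1, 2), (1, 3), (2, 3)]                     # (source id, destination id)


def out_degrees(edge_list):
    """Count outgoing edges per source vertex with one pass over the edges."""
    return Counter(src for src, _ in edge_list)


degrees = out_degrees(edges)

# Join the per-vertex counts back onto the vertex collection by id.
# Vertices with no outgoing edges get a count of 0.
named_degrees = {name: degrees.get(vid, 0) for vid, name in vertices}
```

Because each collection is just a flat sequence of records, it can be split up and scanned independently, which is what makes this representation a natural fit for a distributed collection.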
We've seen in previous chapters how to process a large dataset using MapReduce and Hadoop. Hadoop is an example of a data-parallel system: the dataset is divided into groups that are processed in parallel. Spark is also a data-parallel system: RDDs are distributed across the cluster and processed in parallel.
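The data-parallel pattern can be sketched in a few lines of plain Python. This is only an illustration under simplifying assumptions: threads on a single machine stand in for the nodes of a cluster, and the helper names `partition` and `parallel_sum_of_squares` are hypothetical rather than part of Hadoop or Spark.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(data, n):
    """Split data into n groups by taking every n-th element."""
    return [data[i::n] for i in range(n)]


def sum_of_squares(group):
    """Work done independently on a single group."""
    return sum(x * x for x in group)


def parallel_sum_of_squares(data, n_partitions=4):
    """Process each group separately, then combine the partial results."""
    groups = partition(data, n_partitions)
    with ThreadPoolExecutor(max_workers=n_partitions) as pool:
        partials = list(pool.map(sum_of_squares, groups))
    return sum(partials)


total = parallel_sum_of_squares(list(range(10)))
```

The key property is that no group needs to know anything about the others until the final combining step, so the groups can be processed in any order and on any worker.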
Data-parallel systems are an appropriate way to scale data processing when your data closely resembles a table. For graphs, which may have complex internal structure, a table is not the most efficient representation. Although graphs can be represented as edge lists, as...