
Big Data Analytics with Hadoop 3

A schema describes the structure of your data and can be either implicit or explicit. Because DataFrames are internally based on RDDs, there are two main ways to convert existing RDDs into datasets: inferring the schema through reflection, or specifying the schema programmatically and applying it to the RDD.
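As a rough sketch, the two conversion routes that Spark SQL supports (reflection-based inference and a programmatically specified schema) might look like the following. The `StatePopulation` case class, the column names, and the sample rows are made up for illustration and are not taken from the book's dataset; the code assumes Spark is on the classpath and runs in local mode.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Hypothetical record type, used only for this illustration.
case class StatePopulation(state: String, year: Int, population: Int)

object RddToDataFrameSketch {

  // Route 1: reflection. Spark derives column names and types
  // from the fields of the case class.
  def byReflection(spark: SparkSession, rdd: RDD[(String, Int, Int)]): DataFrame = {
    import spark.implicits._
    rdd.map { case (s, y, p) => StatePopulation(s, y, p) }.toDF()
  }

  // Route 2: programmatic. Build a StructType by hand and apply it
  // to an RDD of generic Row objects.
  def bySchema(spark: SparkSession, rdd: RDD[(String, Int, Int)]): DataFrame = {
    val schema = StructType(Seq(
      StructField("state", StringType, nullable = true),
      StructField("year", IntegerType, nullable = true),
      StructField("population", IntegerType, nullable = true)
    ))
    val rowRDD = rdd.map { case (s, y, p) => Row(s, y, p) }
    spark.createDataFrame(rowRDD, schema)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("rdd-to-dataframe-sketch")
      .master("local[*]") // assumption: local run, no cluster
      .getOrCreate()

    // Made-up sample rows, not real population figures.
    val rdd = spark.sparkContext.parallelize(Seq(
      ("Alabama", 2010, 100),
      ("Alaska", 2010, 200)
    ))

    byReflection(spark, rdd).printSchema()
    bySchema(spark, rdd).printSchema()

    spark.stop()
  }
}
```

The reflection route is the shorter of the two when the record type is known at compile time. The programmatic route is useful when the columns are only known at runtime.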
Let's look at an example of loading a comma-separated values (CSV) file into a DataFrame. Whenever a text file contains a header, the read API can take the column names from the header line, and it can optionally infer the column types by scanning the data. We also have the option to specify the separator to be used to split the text file lines.
We read the csv, inferring the schema from the header line, and use the comma (,) as the separator. We also show the use of the schema command and the printSchema command to verify the schema of the input file:
scala> val statesDF = spark.read.option...
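Since the command above is truncated, here is a separate, self-contained sketch of the same idea rather than a reconstruction of the author's exact call. It assumes a small, made-up CSV file written to a temporary location; the file contents, the column names, and the `CsvSchemaSketch` object are hypothetical. The `header`, `inferSchema`, and `sep` options are standard options of Spark's CSV reader.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files
import org.apache.spark.sql.{DataFrame, SparkSession}

object CsvSchemaSketch {

  // Writes a tiny CSV with a header line to a temp file and returns its path.
  // The rows are made up and are not real population figures.
  def writeSampleCsv(): String = {
    val path = Files.createTempFile("states", ".csv")
    val content = "State,Year,Population\nAlabama,2010,100\nAlaska,2010,200\n"
    Files.write(path, content.getBytes(StandardCharsets.UTF_8))
    path.toString
  }

  // Reads the CSV, taking column names from the header line, inferring
  // column types from the data, and splitting fields on commas.
  def loadStates(spark: SparkSession, path: String): DataFrame =
    spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .option("sep", ",")
      .csv(path)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("csv-schema-sketch")
      .master("local[*]") // assumption: local run, no cluster
      .getOrCreate()

    val statesDF = loadStates(spark, writeSampleCsv())

    // schema returns the StructType; printSchema prints it as a tree.
    println(statesDF.schema)
    statesDF.printSchema()

    spark.stop()
  }
}
```

Without `inferSchema`, the reader still picks up the column names from the header but treats every column as a string, so checking the result with `printSchema` is a quick way to confirm the types came out as intended.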