Loading JSON file using Spark (Scala)
In this Apache Spark tutorial we will load a simple JSON file. Nowadays, most of the data files you encounter are in JSON, XML, or flat-file format. The JSON format is very easy to understand, and you will love it once you are familiar with its structure.
JSON File Structure
Before we ingest a JSON file using Spark, it's important to understand the JSON data structure. JSON (JavaScript Object Notation) is a lightweight data-interchange format. It is easy for humans to read and write, and easy for machines to parse and generate. JSON is built on two structures:
A collection of name/value pairs, usually referred to as an object.
An ordered list of values, which you can think of as an array or list.
An object is an unordered set of name/value pairs. An object begins with { (left brace) and ends with } (right brace). Each name is followed by : (colon) and the name/value pairs are separated by , (comma).
An array is an ordered collection of values. An array begins with [ (left bracket) and ends with ] (right bracket). Values are separated by , (comma).
A value can be a string in double quotes, or a number, or true or false or null, or an object or an array. These structures can be nested.
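For example, here is a hypothetical record (the palette and colors names are my own) that nests an array of objects inside an object:

{
  "palette": "primary",
  "colors": [
    { "color": "red", "value": "#f00" },
    { "color": "green", "value": "#0f0" }
  ]
}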
One more fact: JSON files come in two formats. Most of the time you will encounter multiline JSON files.
Multiline JSON, where a single record (or an array of records) spans multiple lines.
Single-line JSON, where each line holds exactly one record.
Multiline JSON would look something like this:
[ { "color": "red", "value": "#f00" }, { "color": "green", "value": "#0f0" }, { "color": "blue", "value": "#00f" }, { "color": "cyan", "value": "#0ff" }, { "color": "magenta", "value": "#f0f" }, { "color": "yellow", "value": "#ff0" }, { "color": "black", "value": "#000" } ]
Single-line JSON would look something like this (try to correlate it with the object, array, and value structures explained earlier):
{ "color": "red", "value": "#f00" }
{ "color": "green", "value": "#0f0" }
{ "color": "blue", "value": "#00f" }
{ "color": "cyan", "value": "#0ff" }
{ "color": "magenta", "value": "#f0f" }
{ "color": "yellow", "value": "#ff0" }
{ "color": "black", "value": "#000" }
Creating Sample JSON file
I have created two sample files with the records shown above (just copy-paste them): a multiline and a single-line JSON file.
singlelinecolors.json
multilinecolors.json
Note: I assume that you have installed the Scala IDE; if not, please refer to my previous blogs for installation steps (Windows & Mac users).
1. Create a new Scala project "jsnReader"
Go to File → New → Project, enter jsnReader in the project name field, and click Finish.
2. Create a new Scala Package "com.dataneb.spark"
Right-click on the jsnReader project in the Package Explorer panel → New → Package, enter the name com.dataneb.spark, and click Finish.
3. Create a Scala object "jsonfileReader"
Expand the jsnReader project tree, right-click on the com.dataneb.spark package → New → Scala Object, enter jsonfileReader as the object name, and click Finish.
4. Add external jar files
Right-click on the jsnReader project → Properties → Java Build Path → Add External JARs.
Now navigate to the path where you installed Spark. You will find all the JAR files under the /spark/jars folder.
After adding these JAR files, you will see a Referenced Libraries folder on the left panel of your screen, below the Scala object. You will also notice that the project has become invalid (red cross sign); we will fix it shortly.
5. Set up the Scala Compiler
Now right-click on the jsnReader project → Properties → Scala Compiler, check the box Use Project Settings, and select Fixed Scala installation: 2.11.11 (built-in) from the drop-down options.
After applying these changes, you will find the project has become valid again (the red cross sign is gone).
6. Sample code
Open jsonfileReader.scala and copy-paste the code written below.
I have written a separate blog post explaining the basic terminology used in Spark, like RDD, SparkContext, SQLContext, and the various transformations and actions. You can go through it for a basic understanding.
However, I have also explained briefly, in a comment above each line of code, what it actually does. For a list of Spark functions you can refer to this.
// Your package name
package com.dataneb.spark
// Each library has its significance; I have commented in the code below how it's being used
import org.apache.spark._
import org.apache.spark.sql._
import org.apache.log4j._
object jsonfileReader {
// Reducing the error level to just "ERROR" messages
// It uses library org.apache.log4j._
// You can apply other logging levels like ALL, DEBUG, ERROR, INFO, FATAL, OFF etc
Logger.getLogger("org").setLevel(Level.ERROR)
// Defining Spark configuration to define application name and the local resources to use
// It uses library org.apache.spark._
val conf = new SparkConf().setAppName("Sample App")
conf.setMaster("local")
// Using above configuration to define our SparkContext
val sc = new SparkContext(conf)
// Defining SQL context to run Spark SQL
// It uses library org.apache.spark.sql._
val sqlContext = new SQLContext(sc)
// Main function where all operations will occur
def main (args: Array[String]): Unit = {
// Reading the multiline JSON file; the multiLine option (Spark 2.2+) tells Spark
// that a single record can span multiple lines (the default expects one record per line)
val df = sqlContext.read.option("multiLine", true).json("/Volumes/MYLAB/testdata/multilinecolors.json")
// Printing schema
df.printSchema()
// Saving as temporary table
df.registerTempTable("JSONdata")
// Retrieving all the records
val data = sqlContext.sql("select * from JSONdata")
// Showing all the records
data.show()
// Stopping Spark Context
sc.stop()
}
}
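By the way, if you are on Spark 2.x, the more idiomatic entry point is SparkSession, which bundles SparkContext and SQLContext together. Here is a minimal sketch of the same flow (the object name JsonSessionReader is my own; the file path and the Spark 2.2+ multiLine option are as above):

import org.apache.spark.sql.SparkSession

object JsonSessionReader {
  def main(args: Array[String]): Unit = {
    // One entry point instead of separate SparkConf, SparkContext and SQLContext
    val spark = SparkSession.builder()
      .appName("Sample App")
      .master("local")
      .getOrCreate()

    // Same read as before; multiLine because records span multiple lines
    val df = spark.read.option("multiLine", true).json("/Volumes/MYLAB/testdata/multilinecolors.json")
    df.printSchema()

    // createOrReplaceTempView is the Spark 2.x replacement for registerTempTable
    df.createOrReplaceTempView("JSONdata")
    spark.sql("select * from JSONdata").show()

    spark.stop()
  }
}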
7. Run the code!
Right-click anywhere in the editor and select Run As → Scala Application.
That's it! If you have followed the steps properly, you will find the result in the Console.
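For reference, with the seven sample records above, the console output should look roughly like this (Spark infers both fields as strings and lists field names alphabetically):

root
 |-- color: string (nullable = true)
 |-- value: string (nullable = true)

+-------+-----+
|  color|value|
+-------+-----+
|    red| #f00|
|  green| #0f0|
|   blue| #00f|
|   cyan| #0ff|
|magenta| #f0f|
| yellow| #ff0|
|  black| #000|
+-------+-----+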
We have successfully loaded a JSON file using Spark SQL DataFrames, printed the JSON schema, and displayed the data.
Now try reading the single-line JSON file we created earlier. One record per line is Spark's default, so you can drop the multiLine option for that file; you only need to set it to true for multiline files like the one above. You can also save this data to HDFS, a database, or a CSV file, depending on your needs. If you have any questions, please don't forget to write them in the comments section below. Thank you.
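If you want to try that, here is a minimal sketch you could drop into the main function above (the output folder colors_csv is hypothetical):

// The single-line file has one record per line, which is Spark's default,
// so no multiLine option is needed here
val singleDf = sqlContext.read.json("/Volumes/MYLAB/testdata/singlelinecolors.json")

// Write the result out as CSV with a header row (Spark 2.x DataFrameWriter)
singleDf.write
  .option("header", "true")
  .csv("/Volumes/MYLAB/testdata/colors_csv")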