
Loading JSON file using Spark (Scala)

In this Apache Spark tutorial, we will load a simple JSON file. Nowadays, most of the data you encounter comes as JSON, XML, or flat files. The JSON format is easy to understand, and you will appreciate it once you are familiar with its structure.


 

JSON File Structure


Before we ingest a JSON file using Spark, it's important to understand the JSON data structure. Basically, JSON (JavaScript Object Notation) is a lightweight data-interchange format. It is easy for humans to read and write. It is easy for machines to parse and generate. JSON is built on two structures:

  • A collection of name/value pairs, usually referred to as an object.

  • An ordered list of values. You can think of it as an array or list of values.


 

  1. An object is an unordered set of name/value pairs. An object begins with { (left brace) and ends with } (right brace). Each name is followed by : (colon) and the name/value pairs are separated by , (comma).

  2. An array is an ordered collection of values. An array begins with [ (left bracket) and ends with ] (right bracket). Values are separated by , (comma).

  3. A value can be a string in double quotes, a number, true or false, null, an object, or an array. These structures can be nested (see the small example below).
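
For instance, here is a small, made-up record (hypothetical, purely for illustration) that nests all three structures: an object whose values include a string, an array, and another object.

{
  "color": "red",
  "value": "#f00",
  "tags": ["warm", "primary"],
  "meta": { "hex": true, "rank": 1 }
}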


One more thing: JSON files come in two layouts, and you will run into both.

  • Multiline JSON, where the whole file is a single JSON document, typically an array holding multiple records (the array can sit on one line, as in the sample below, or be pretty-printed across many lines).

  • Single-line JSON (also called JSON Lines), where each line is exactly one record.


 

Multiline JSON would look something like this:

[ { "color": "red", "value": "#f00" }, { "color": "green", "value": "#0f0" }, { "color": "blue", "value": "#00f" }, { "color": "cyan", "value": "#0ff" }, { "color": "magenta", "value": "#f0f" }, { "color": "yellow", "value": "#ff0" }, { "color": "black", "value": "#000" } ]


 

Single-line JSON would look something like this (try to correlate it with the object, array, and value structures explained earlier):


{ "color": "red", "value": "#f00" }

{ "color": "green", "value": "#0f0" }

{ "color": "blue", "value": "#00f" }

{ "color": "cyan", "value": "#0ff" }

{ "color": "magenta", "value": "#f0f" }

{ "color": "yellow", "value": "#ff0" }

{ "color": "black", "value": "#000" }



Creating Sample JSON Files


I have created two sample files, a multiline and a single-line JSON file, with the records shown above (just copy and paste them):

  • singlelinecolors.json

  • multilinecolors.json


Note: I assume that you have installed Scala IDE; if not, please refer to my previous blogs for installation steps (Windows & Mac users).


1. Create a new Scala project "jsnReader"

  • Go to File → New → Project, enter jsnReader in the project name field, and click Finish.




 

2. Create a new Scala Package "com.dataneb.spark"

  • Right-click on the jsnReader project in the Package Explorer panel → New → Package, enter the name com.dataneb.spark, and click Finish.




 

3. Create a Scala object "jsonfileReader"

  • Expand the jsnReader project tree, right-click on the com.dataneb.spark package → New → Scala Object, enter jsonfileReader as the object name, and press Finish.


 



4. Add external jar files

  • Right-click on the jsnReader project → Properties → Java Build Path → Add External JARs.

  • Now navigate to the path where you have installed Spark. You will find all the jar files under the /spark/jars folder.


 


 

After adding these jar files, you will find a Referenced Libraries folder on the left panel of your screen below the Scala object. You will also notice that the project has become invalid (red cross sign); we will fix it shortly.



5. Setup Scala Compiler

  • Now right-click on the jsnReader project → Properties → Scala Compiler, check the box Use Project Settings, and select Fixed Scala installation: 2.11.11 (built-in) from the drop-down options.

  • After applying these changes, you will find the project has become valid again (the red cross sign is gone).


 

6. Sample code

  • Open jsonfileReader.scala and copy-paste the code written below.

  • I have written a separate blog post explaining the basic terminology used in Spark, like RDD, SparkContext, SQLContext, and the various transformations and actions; you can go through it for a basic understanding.

However, I have briefly explained in the comments above each line of code what it actually does. For a list of Spark functions, you can refer to this.


 

// Your package name

package com.dataneb.spark


// Each library has its significance; I have commented in the code below how each one is being used

import org.apache.spark._

import org.apache.spark.sql._

import org.apache.log4j._


object jsonfileReader {


// Reducing the error level to just "ERROR" messages

// It uses library org.apache.log4j._

// You can apply other logging levels like ALL, DEBUG, ERROR, INFO, FATAL, OFF, etc.

Logger.getLogger("org").setLevel(Level.ERROR)


// Defining the Spark configuration: the application name and the local resources to use

// It uses library org.apache.spark._

val conf = new SparkConf().setAppName("Sample App")

conf.setMaster("local")


// Using above configuration to define our SparkContext

val sc = new SparkContext(conf)


// Defining SQL context to run Spark SQL

// It uses library org.apache.spark.sql._
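// Note: in Spark 2.x and later, SparkSession is the preferred entry point,
// but SQLContext still works and is kept here to match this tutorial's older API style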

val sqlContext = new SQLContext(sc)


// Main function where all operations will occur

def main(args: Array[String]): Unit = {


// Reading the json file

val df = sqlContext.read.json("/Volumes/MYLAB/testdata/multilinecolors.json")


// Printing schema

df.printSchema()


// Saving as temporary table
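// Note: registerTempTable is deprecated in Spark 2.x; createOrReplaceTempView is the newer equivalent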

df.registerTempTable("JSONdata")


// Retrieving all the records

val data = sqlContext.sql("select * from JSONdata")


// Showing all the records

data.show()


// Stopping Spark Context

sc.stop()

}

}


 

7. Run the code!

  • Right-click anywhere in the code editor and select Run As → Scala Application.


 

That's it! If you have followed the steps properly, you will see the result in the Console.
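
Given the multilinecolors.json contents shown earlier, the console output should look roughly like this (exact column widths and row order can vary by Spark version):

root
 |-- color: string (nullable = true)
 |-- value: string (nullable = true)

+-------+-----+
|  color|value|
+-------+-----+
|    red| #f00|
|  green| #0f0|
|   blue| #00f|
|   cyan| #0ff|
|magenta| #f0f|
| yellow| #ff0|
|  black| #000|
+-------+-----+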



We have successfully loaded a JSON file using Spark SQL DataFrames, printed its schema, and displayed the data.


Try reading the single-line JSON file we created earlier; it needs no extra options, because Spark expects one complete JSON record per line by default. It is JSON that spans multiple lines (for example, a pretty-printed array) that needs the multiLine flag set to true, as sketched below. You can also save this data to HDFS, a database, or a CSV file, depending on your need. If you have any questions, please don't forget to write them in the comments section below. Thank you.
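
Here is a minimal sketch of both reads, assuming Spark 2.2 or later (where the DataFrameReader supports the multiLine option); the file paths are the same ones used above, and the CSV output path is just a placeholder:

// multiLine tells Spark that a single JSON record (or array) may span several lines
val multilineDf = sqlContext.read.option("multiLine", "true").json("/Volumes/MYLAB/testdata/multilinecolors.json")

// The single-line (JSON Lines) file needs no extra options; one record per line is the default
val singlelineDf = sqlContext.read.json("/Volumes/MYLAB/testdata/singlelinecolors.json")

// Saving the result, for example as CSV (placeholder output path)
singlelineDf.write.csv("/Volumes/MYLAB/testdata/colors_csv")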




