Write CSV/JSON data to Elasticsearch using Spark dataframes

The elasticsearch-hadoop connector lets Apache Spark integrate with Elasticsearch from Scala and Java.

Contents:

Write JSON data to Elasticsearch using Spark dataframe
Write CSV file to Elasticsearch using Spark dataframe

I am using Elasticsearch version [7.3.0], Spark [2.3.1] and Scala [2.11].

Download Jar

To run Spark with Elasticsearch, you need to download the proper version of the spark-elasticsearch jar file and add it to Spark's classpath. If you are running Spark in local mode, the jar only has to be added on one machine, but if you are running on a cluster, you need to add it on every node.

I assume you have already installed Elasticsearch; if not, please follow these installation steps (Linux | Mac users). Elasticsearch installation is very easy and takes only a few minutes. I would encourage you to install Kibana as well.

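Now, you can download the complete set of elasticsearch-hadoop libraries (Storm, MapReduce, Hive and Pig, as shown below) from here. I have added elasticsearch-spark-20_2.11-7.3.0.jar because I am running Elasticsearch 7.3 with Scala 2.11; the _2.11 suffix in the jar name is the Scala version and has to match yours.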

[Tip] Make sure you download the correct version of the jar, otherwise you will get this error during execution: org.elasticsearch.hadoop.EsHadoopIllegalArgumentException: Unsupported/Unknown Elasticsearch version x.x.x
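If the project is built with sbt rather than by adding the jar by hand, the same connector can be declared as a dependency. This is only a sketch: sbt is an assumption (this post adds the jar manually), and the versions are the ones listed above. The %% operator appends the Scala binary version, so with Scala 2.11 it should resolve to elasticsearch-spark-20_2.11.

```scala
// build.sbt (assumed build tool; the post itself adds the jar manually)
scalaVersion := "2.11.12"

libraryDependencies ++= Seq(
  // Spark SQL, matching the Spark version used in this post
  "org.apache.spark" %% "spark-sql" % "2.3.1",
  // Elasticsearch connector for Spark 2.x; keep the version in line with the cluster
  "org.elasticsearch" %% "elasticsearch-spark-20" % "7.3.0"
)
```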

Adding Jar (Scala IDE)

If you are using Scala IDE, right-click the project folder => Properties => Java Build Path => Add External JARs and add the downloaded jar file. Apply and close.

Adding Jar (Spark-shell)

If you are using spark-shell, navigate to the Spark library directory where all the other jar files are kept (the jars folder of the Spark installation) and add the downloaded jar file there.

Start Elasticsearch & Kibana

Now, make sure Elasticsearch is running. If it is not, Spark will not be able to connect and you will get this error:

org.elasticsearch.hadoop.rest.EsHadoopNoNodesLeftException: Connection error (check network and/or proxy settings)- all nodes failed.

To start Elasticsearch and Kibana, run these commands in your terminal:

$ elasticsearch

$ kibana

Writing JSON data to Elasticsearch

In all sections these three steps are mandatory:

Import necessary elasticsearch spark library
Configure ES nodes
Configure ES port

If you are running ES on AWS, add this line to your configuration: .config("spark.es.nodes.wan.only","true")
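The three settings above can be collected in one place and applied to the session builder, which makes the optional wan.only flag easy to switch on. This is a minimal sketch, not part of the original program: the object name, the helper esSettings and the host name are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

object EsSessionConfig {

  // Builds the connector settings. The connector accepts its es.* options
  // with a "spark." prefix, which is the form used in this post.
  def esSettings(host: String, port: Int, wanOnly: Boolean): Map[String, String] = {
    val base = Map(
      "spark.es.nodes" -> host,
      "spark.es.port"  -> port.toString
    )
    // Only added when the cluster is reached through a public endpoint,
    // for example ES running on AWS
    if (wanOnly) base + ("spark.es.nodes.wan.only" -> "true") else base
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical host name, replace with your own endpoint
    val settings = esSettings("es.example.internal", 9200, wanOnly = true)

    val builder = SparkSession.builder().appName("EsSessionConfig").master("local[*]")
    // Apply every key/value pair to the builder, then create the session
    val spark = settings.foldLeft(builder) { case (b, (k, v)) => b.config(k, v) }.getOrCreate()

    // Print the ES-related settings carried by the session
    spark.conf.getAll.filter(_._1.startsWith("spark.es.")).foreach(println)
    spark.stop()
  }
}
```

For a local Elasticsearch, calling esSettings("localhost", 9200, wanOnly = false) gives the same two settings as the programs in this post.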

JSON file

multilinecolors.json sample data:

[ { "color": "red", "value": "#f00" }, { "color": "green", "value": "#0f0" }, { "color": "blue", "value": "#00f" }, { "color": "cyan", "value": "#0ff" }, { "color": "magenta", "value": "#f0f" }, { "color": "yellow", "value": "#ff0" }, { "color": "black", "value": "#000" } ]

package com.dataneb.spark

import org.apache.spark.sql.SparkSession
import org.elasticsearch.spark.sql._

object toES {

  def main(args: Array[String]): Unit = {

    // Configuration: local Spark session pointing at the local ES node
    val spark = SparkSession
      .builder()
      .appName("WriteJSONToES")
      .master("local[*]")
      .config("spark.es.nodes", "localhost")
      .config("spark.es.port", "9200")
      .getOrCreate()

    // Create dataframe
    val colorsDF = spark.read.json("/Volumes/MYLAB/testdata/multilinecolors.json")

    // Write to ES with index name in lower case
    colorsDF.saveToEs("dataframejsonindex")

  }
}
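The sample file is a single JSON array. If the array in your file is spread over several lines (which the name multilinecolors.json hints at, though that is an assumption), Spark's default line-by-line JSON reader will not parse it and the multiLine option is needed. The sketch below is not the original program: the object and method names are made up, and it writes a small temporary file so that it can run without Elasticsearch.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files
import org.apache.spark.sql.{DataFrame, SparkSession}

object ReadMultiLineJson {

  // multiLine tells Spark to parse each file as one JSON document
  // instead of expecting one JSON value per line
  def readJsonArray(spark: SparkSession, path: String): DataFrame =
    spark.read.option("multiLine", "true").json(path)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("ReadMultiLineJson")
      .master("local[*]")
      .getOrCreate()

    // Temporary stand-in for the colors file, with the array split over lines
    val json =
      """[
        |  { "color": "red",   "value": "#f00" },
        |  { "color": "green", "value": "#0f0" }
        |]""".stripMargin
    val path = Files.createTempFile("colors", ".json")
    Files.write(path, json.getBytes(StandardCharsets.UTF_8))

    val colorsDF = readJsonArray(spark, path.toString)
    colorsDF.show()

    spark.stop()
  }
}
```

The resulting dataframe can be passed to saveToEs exactly as in the program above.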

[Tip] Make sure you write the index name in lower case, otherwise you will get this error:

org.elasticsearch.hadoop.EsHadoopIllegalArgumentException: Illegal write index name [ABCindex]. Write resources must be lowercase singular index names, with no illegal pattern characters except for multi-resource writes.
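To catch a bad index name before Spark starts writing, the name can be checked up front. The helper below is a hypothetical sketch based on Elasticsearch's general index naming rules (lower case, no special characters, no leading -, _ or +); it is not the connector's own validation and may be stricter or looser than it.

```scala
import java.util.Locale

object IndexNameCheck {

  // Characters that Elasticsearch does not allow in index names
  private val illegalChars: Set[Char] =
    Set('\\', '/', '*', '?', '"', '<', '>', '|', ' ', ',', '#')

  // Characters an index name must not start with
  private val illegalFirstChars: Set[Char] = Set('-', '_', '+')

  // Approximate check; not the connector's own validation logic
  def isValidWriteIndex(name: String): Boolean =
    name.nonEmpty &&
      name == name.toLowerCase(Locale.ROOT) &&
      !name.exists(illegalChars.contains) &&
      !illegalFirstChars.contains(name.head)

  def main(args: Array[String]): Unit = {
    // Print the verdict for the index used in this post and two bad names
    Seq("dataframejsonindex", "ABCindex", "my index").foreach { n =>
      println(s"$n -> ${isValidWriteIndex(n)}")
    }
  }
}
```

A call such as require(IndexNameCheck.isValidWriteIndex(indexName)) placed before saveToEs would fail fast on a name like ABCindex.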

Here is the Scala IDE output,

You can also check the index created in Elasticsearch: go to Management => ES Index Management.

You can further discover the index pattern in Kibana.

Writing CSV data to Elasticsearch

books.csv sample data:

bookID,title,authors,average_rating,isbn,isbn13,language_code,# num_pages,ratings_count,text_reviews_count
1,Harry Potter and the Half-Blood Prince (Harry Potter #6),J.K. Rowling-Mary GrandPré,4.56,0439785960,9780439785969,eng,652,1944099,26249
2,Harry Potter and the Order of the Phoenix (Harry Potter #5),J.K. Rowling-Mary GrandPré,4.49,0439358078,9780439358071,eng,870,1996446,27613
3,Harry Potter and the Sorcerer's Stone (Harry Potter #1),J.K. Rowling-Mary GrandPré,4.47,0439554934,9780439554930,eng,320,5629932,70390
4,Harry Potter and the Chamber of Secrets (Harry Potter #2),J.K. Rowling,4.41,0439554896,9780439554893,eng,352,6267,272
5,Harry Potter and the Prisoner of Azkaban (Harry Potter #3),J.K. Rowling-Mary GrandPré,4.55,043965548X,9780439655484,eng,435,2149872,33964
8,Harry Potter Boxed Set Books 1-5 (Harry Potter #1-5),J.K. Rowling-Mary GrandPré,4.78,0439682584,9780439682589,eng,2690,38872,154

Everything is the same except the read method (json => csv) and the index name.

package com.dataneb.spark

import org.apache.spark.sql.SparkSession
import org.elasticsearch.spark.sql._

object toES {

  def main(args: Array[String]): Unit = {

    // Configuration: local Spark session pointing at the local ES node
    val spark = SparkSession
      .builder()
      .appName("WriteCSVToES")
      .master("local[*]")
      .config("spark.es.nodes", "localhost")
      .config("spark.es.port", "9200")
      .getOrCreate()

    // Create dataframe from every file matching books*.csv (no header option set)
    val booksDF = spark.read.csv("/Volumes/MYLAB/testdata/books*.csv")

    // Write to ES with index name in lower case
    booksDF.saveToEs("dataframecsvindex")

  }
}

Here is the Scala IDE output. I have two CSV files, books1.csv and books2.csv, so you see two task IDs in the result.

You can also check the index created in Elasticsearch: go to Management => ES Index Management.

You can further create the index pattern in Kibana.

You can further discover the index pattern in Kibana. I haven't applied the format option to read the header when calling the csv method in the Spark program, hence you see the header record in the index.
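To keep the header record out of the index, the csv reader can be told that the first line of each file holds the column names. This is a minimal sketch rather than the program used above: the object and method names are hypothetical, the rename of "# num_pages" is an optional tidy-up, and a temporary file stands in for books.csv so that the example runs without Elasticsearch.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files
import org.apache.spark.sql.{DataFrame, SparkSession}

object ReadCsvWithHeader {

  // Reads CSV files, treating the first line of each file as column names
  def readBooks(spark: SparkSession, path: String): DataFrame =
    spark.read
      .option("header", "true")
      .csv(path)
      // Optional: drop the "# " prefix from this one column name
      .withColumnRenamed("# num_pages", "num_pages")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("ReadCsvWithHeader")
      .master("local[*]")
      .getOrCreate()

    // Temporary stand-in for books.csv: the header plus one row of the sample
    val csv =
      """bookID,title,authors,average_rating,isbn,isbn13,language_code,# num_pages,ratings_count,text_reviews_count
        |4,Harry Potter and the Chamber of Secrets (Harry Potter #2),J.K. Rowling,4.41,0439554896,9780439554893,eng,352,6267,272
        |""".stripMargin
    val path = Files.createTempFile("books", ".csv")
    Files.write(path, csv.getBytes(StandardCharsets.UTF_8))

    val booksDF = readBooks(spark, path.toString)
    booksDF.printSchema()
    booksDF.show(false)

    spark.stop()
  }
}
```

Without further options every column is read as a string, which also keeps the leading zeros of the isbn values. The dataframe can then be written with saveToEs as before.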

Thank you. If you have any questions, please write them in the comments section below.
