
Tech blog: Heavy load from MS SQL into Elasticsearch

The Instarea Technology Blog

At Instarea | Big Data Monetization we work with companies such as telecommunication operators to unlock the potential hidden away in their anonymized & aggregated big data. Our labs team (responsible for coming up with new ways to process, visualize and handle the data) has to solve a variety of issues when looking for ways to improve how we process, query or visualize data. In this technical blog series they will share their findings (some of them worth days of research & tens of coffees), so that you don’t have to go through the same pain yourself 🙂

And if you are from Bratislava & want to solve similar challenges on a daily basis, then check out our Careers page & stop by for a coffee.

Heavy load from MS SQL into Elasticsearch

When the volume of data behind our searches started to grow and could definitely be labeled with the fancy adjective “big”, we started looking around for that cool thing called a “NoSQL database” to replace our existing SQL approach. Elasticsearch (ES) was candidate number one, as we needed to search by really complex queries.

Let me explain what “big” means for us. We need to search within more than 500 million records. That alone is completely fine – Elasticsearch is built to cope with it. But… we need to completely replace the whole dataset every day (!) and, as we found out very early, indexing data into ES is not that fast.

Option 1: Connecting to SQL server via JDBC

As we needed to move data from MS SQL Server to ES, the most straightforward way would be to use a JDBC driver to connect to the MS SQL Server and pour the data directly into ES. And here comes Logstash, a very good friend of ES made by the same developers.

You can download the JDBC driver here, and then create a configuration file import_jdbc.conf which copies data from the table testtable to the ES index test.

input {
  jdbc {
    jdbc_connection_string => "jdbc:sqlserver://localhost"
    jdbc_user => "ms_sql_login"
    jdbc_password => "ms_sql_password"
    jdbc_driver_library => "C:\temp\sqljdbc42.jar"
    jdbc_driver_class => "com.microsoft.sqlserver.jdbc.SQLServerDriver"
    jdbc_validate_connection => true
    statement => "SELECT * FROM testtable"
  }
}
filter {
  # emit a throughput metric event every 60 seconds
  metrics {
    meter => "documents"
    add_tag => "metric"
    flush_interval => 60
  }
}
output {
  # index the real rows, not the metric events
  if "metric" not in [tags] {
    elasticsearch {
      action => "index"
      hosts => ["127.0.0.1:9200"]
      index => "test"
      document_type => "doc"
    }
  }
  # print the 1-minute indexing rate to the console
  if "metric" in [tags] {
    stdout {
      codec => line {
        format => "1m rate: %{[documents][rate_1m]} ( %{[documents][count]} )"
      }
    }
  }
}
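With the driver and configuration in place, the import is started by pointing Logstash at the file. A minimal sketch, assuming Logstash is run from its installation directory and the configuration file sits there as well:

# start the import; -f points Logstash at the configuration file
# (on Windows use bin\logstash.bat instead of bin/logstash)
bin/logstash -f import_jdbc.conf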

For testing purposes, we created an index with 4 shards on a single node with 8 virtual CPUs and 12 GB RAM – not exactly great metal, but our mission was to squeeze the most out of lightweight hardware.

From our table with 11 columns, ES was able to eat 3370 rows per second. What did that mean for us? That we would need to wait around 41 hours to move the entire table (500 million rows / 3370 rows per second ≈ 41 hours)!

Option 2: Importing CSV using Logstash

OK, we did not have two days for importing the data. So we tried something else.

We suspected the JDBC driver of being the bottleneck. So another option was to export the data from MS SQL and import it separately (a sketch of the export itself follows after the configuration). The configuration file could look like this:

# file: from_csv.conf
input {
  file {
    path => "C:\temp\export.csv"
    start_position => "beginning"
    # do not remember the read position between runs
    # (on Windows, use "NUL" instead of "/dev/null")
    sincedb_path => "/dev/null"
  }
}
filter {
  csv {
    separator => ","
    skip_empty_columns => true
    columns => ["column1","column2","column3"]
  }
  # emit a throughput metric event every 60 seconds
  metrics {
    meter => "documents"
    add_tag => "metric"
    flush_interval => 60
  }
}
output {
  # index the real rows, not the metric events
  if "metric" not in [tags] {
    elasticsearch {
      action => "index"
      hosts => ["localhost:9200"]
      index => "test"
      document_type => "doc"
      # use the id column from the CSV as the document id
      document_id => "%{column1}"
    }
  }
  # print the 1-minute indexing rate to the console
  if "metric" in [tags] {
    stdout {
      codec => line {
        format => "1m rate: %{[documents][rate_1m]} ( %{[documents][count]} )"
      }
    }
  }
}
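As for getting the CSV out of MS SQL in the first place, one possible way is the bcp command-line utility. This is only a minimal sketch run from a Windows command prompt, assuming a trusted connection to a local SQL Server instance and a hypothetical database testdb holding our testtable:

REM export the three columns to a comma-separated file in character mode
REM -S server, -T trusted (Windows) authentication, -c character mode, -t field terminator
REM note: bcp does not quote fields, so commas inside values would need extra care
bcp "SELECT column1, column2, column3 FROM testdb.dbo.testtable" queryout C:\temp\export.csv -S localhost -T -c -t,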

This leads to only a slightly better rate, 3390 rows/s. But by default Logstash adds a number of fields to each document in the index which are useless for us, such as message, path, @timestamp, host and @version. Let’s get rid of them by adding this line to the csv section:

remove_field => [ "host", "message", "path", "@timestamp", "@version" ]

We can also tell Logstash to convert the values to smaller types where possible (also in the csv section), so that ES stores them as numbers rather than strings:

convert => {
    "column1" => "integer"
    "column2" => "integer"
    "column3" => "integer"
}

Now the rate is 3770 rows/s. An improvement of roughly 10% – we are moving forward.

Then there are a lot of settings you can play around with. But after many attempts you will most probably realize that most of them have little or no effect. Interestingly, storing the index on a spinning disk was no worse than on an SSD. And creating the index with 8 shards actually made importing slower (on the other hand, our search queries performed better with more shards).

The only helpful setting was refresh_interval, set to a higher value when creating the index. For testing we used “60s” (setting even higher values makes only a small difference). The slightly higher import rate of 3910 rows/s is, however, still completely unusable for us. Even importing the data in JSON format (the one ES uses to store documents) instead of CSV made no difference. This led us to suspect Logstash itself as the bottleneck. But is it?
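For reference, this is roughly how the test index can be created with the relaxed refresh_interval; a sketch using the standard create-index API, with the index name and shard count from our tests:

# create the test index with 4 shards and a 60 s refresh interval
curl -XPUT 'localhost:9200/test' -H 'Content-Type: application/json' -d '
{
  "settings": {
    "number_of_shards": 4,
    "refresh_interval": "60s"
  }
}'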

Option 3: Bulk load via Elasticsearch API

So we decided to avoid Logstash and rely on the ES API to import the data. ES has exactly the right API for this: the bulk API, which supports, among other operations, the index operation for storing a JSON document. You only need to export the data from MS SQL in JSON format first. You can create a view to format the data, or you can leverage the built-in JSON support (FOR JSON) available since MS SQL 2016. Each record of the exported file then consists of a pair of lines – an action line and the document itself:

{"index":{}}
{"column1":1092137,"column2":146212,"column3":146216}

And then we make an HTTP request using curl (in Cygwin, for example):

curl -o nul -H 'Content-Type: application/x-ndjson' -XPOST localhost:9200/test/doc/_bulk --data-binary @export.json

Wow, 100 thousand rows in less than 5 seconds. This looks much more optimistic. However, when we need to import a file of several gigabytes, it has to be split into smaller parts. Therefore, we created a bash script which splits the file, makes the requests and deletes the temporary files:

#!/bin/sh
# split the export into chunks (split names them xaa, xab, ...);
# the chunk size must be even so an action line is never separated from its document
LINES=400000
split -l $LINES export.json

# send each chunk to the bulk API and delete it afterwards
for xf in $(ls | grep 'x..$')
do
  curl -o nul -H 'Content-Type: application/x-ndjson' -XPOST localhost:9200/test/doc/_bulk --data-binary @$xf
  rm $xf
done

This script allows us to import the file at a rate of 20 200 rows/s into the index with 4 shards. With 8 shards it was even faster, 28 700 rows/s, which is more than 7 times faster than the Logstash approach.
Note that exporting the data from MS SQL took the same time for CSV and JSON; however, the JSON file is almost 3 times bigger.
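Once the script finishes, a quick sanity check with the standard count API (a sketch, using our test index) confirms that everything made it in:

# force a refresh so freshly indexed documents become visible,
# then report the total number of documents in the index
curl -XPOST 'localhost:9200/test/_refresh'
curl 'localhost:9200/test/_count?pretty'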


Kudos to Peter & Tomas

Special thanks to our engineers Peter Kratky and Tomas Repik for putting this blog together for you! If you have any follow-up questions, drop us a message & we will convey them to the guys.

Peter Kratky

Data Scientist @ Instarea. PhD graduate of informatics at Slovak University of Technology in Bratislava (FIIT STU). Experienced in data analysis and machine learning with several international research publications.
