Things to Consider While Indexing a Large Volume of Data into Elasticsearch

Indexing a large volume of data (usually more than a TB) from Hadoop to
Elasticsearch may not work as you expect, and to get the job done you
may need to make specific changes on both sides:

  • Elasticsearch cluster side
  • Hadoop cluster side

For Elasticsearch

  • raise max_open_file_descriptors to at least 32k or 64k
  • try setting no replicas at all and change it back to the original afterwards
  • turn off refreshing and change it back to the original afterwards
  • disable flushing and change it back to the original afterwards
  • raise indices.memory.index_buffer_size above its default of 10% and change it back to the original afterwards
  • disable index warming and change it back to the original afterwards
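
The per-index tweaks above can be sketched roughly as follows, using only the Python standard library. The cluster address http://localhost:9200 and the index name my_index are placeholder assumptions, as are the restored values (your original replica count and refresh interval may differ); note that indices.memory.index_buffer_size is a node-level setting configured in elasticsearch.yml, not a per-index setting.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # hypothetical cluster address
INDEX = "my_index"                # hypothetical index name

# Settings applied before a heavy indexing run: no replicas and
# refreshing turned off (refresh_interval of -1 disables it).
BULK_SETTINGS = {
    "index": {
        "number_of_replicas": 0,
        "refresh_interval": "-1",
    }
}

# Settings restored after indexing finishes; substitute the values
# your index actually used before the run.
RESTORE_SETTINGS = {
    "index": {
        "number_of_replicas": 1,
        "refresh_interval": "1s",
    }
}

def put_settings(settings):
    """PUT index settings to the cluster (needs a running Elasticsearch)."""
    req = urllib.request.Request(
        f"{ES_URL}/{INDEX}/_settings",
        data=json.dumps(settings).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

In use, you would call put_settings(BULK_SETTINGS) before the job and put_settings(RESTORE_SETTINGS) once it completes.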

For Hadoop

  • set the number of tasks to 10-20
  • use bulk requests with an appropriate request size
  • try a node client instead of the transport client
  • try multiple parallel clients

    (In Hadoop this happens naturally, since each task runs in parallel, but keep in mind that nothing comes for free: too many parallel clients may put a high load on the Elasticsearch cluster.)
  • set the query parameter ReplicationType to async
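
The bulk-request point above can be sketched as a helper that assembles newline-delimited _bulk bodies and caps each one by size. The 5 MB cap is an illustrative starting point to tune against your cluster, not an Elasticsearch requirement, and the document format (an "index" action line followed by the source line) follows the _bulk API convention:

```python
import json

MAX_BULK_BYTES = 5 * 1024 * 1024  # illustrative cap per bulk request (~5 MB)

def bulk_chunks(index, docs, max_bytes=MAX_BULK_BYTES):
    """Yield newline-delimited _bulk request bodies, each at most max_bytes.

    Each document becomes two lines: an action line targeting `index`
    and the JSON source itself, as the _bulk endpoint expects.
    """
    lines, size = [], 0
    for doc in docs:
        action = json.dumps({"index": {"_index": index}})
        source = json.dumps(doc)
        entry_size = len(action) + len(source) + 2  # two trailing newlines
        if lines and size + entry_size > max_bytes:
            yield "\n".join(lines) + "\n"
            lines, size = [], 0
        lines.extend([action, source])
        size += entry_size
    if lines:
        yield "\n".join(lines) + "\n"
```

Each yielded body can then be POSTed to the cluster's /_bulk endpoint; one chunk per request keeps individual requests from growing unbounded as the input does.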

Note

Again, keep in mind that nothing comes for free: you may get a poor search experience if you optimize your Elasticsearch cluster only for high indexing throughput. A good approach is to optimize the cluster for maximum indexing throughput while indexing, and once the data is fully indexed, optimize it for search.
