Things to Consider While Indexing a Large Volume of Data into Elasticsearch

Indexing a large volume of data (usually more than a TB) from Hadoop to
Elasticsearch may not work as you expect, and to get the job done you
may need to make specific changes on both sides:

  • Elasticsearch cluster side
  • Hadoop cluster side

For Elasticsearch

  • raise max_open_file_descriptors to at least 32k or 64k
  • try setting no replicas at all and change it back to the original afterwards
  • turn off refreshing and change it back to the original afterwards
  • disable flushing and change it back to the original afterwards
  • raise indices.memory.index_buffer_size above its default of 10% and change it back to the original afterwards
  • disable index warming and change it back to the original afterwards
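
The per-index tweaks above can be sketched roughly as follows, using only the Python standard library. The cluster address http://localhost:9200 and the index name my_index are placeholder assumptions, as are the restored values (your original replica count and refresh interval may differ); note that indices.memory.index_buffer_size is a node-level setting configured in elasticsearch.yml, not a per-index setting.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # hypothetical cluster address
INDEX = "my_index"                # hypothetical index name

# Settings applied before a heavy indexing run: no replicas and
# refreshing turned off (refresh_interval of -1 disables it).
BULK_SETTINGS = {
    "index": {
        "number_of_replicas": 0,
        "refresh_interval": "-1",
    }
}

# Settings restored after indexing finishes; substitute the values
# your index actually used before the run.
RESTORE_SETTINGS = {
    "index": {
        "number_of_replicas": 1,
        "refresh_interval": "1s",
    }
}

def put_settings(settings):
    """PUT index settings to the cluster (needs a running Elasticsearch)."""
    req = urllib.request.Request(
        f"{ES_URL}/{INDEX}/_settings",
        data=json.dumps(settings).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

In use, you would call put_settings(BULK_SETTINGS) before the job and put_settings(RESTORE_SETTINGS) once it completes.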

For Hadoop

  • set the number of tasks to 10-20
  • use bulk requests with an appropriate request size
  • try a node client instead of the transport client
  • try multiple parallel clients

    (In Hadoop this happens naturally, since each task runs in parallel, but keep in mind that nothing comes for free: too many parallel clients may put a high load on the Elasticsearch cluster.)
  • set the query parameter ReplicationType to async
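
The bulk-request point above can be sketched as a helper that assembles newline-delimited _bulk bodies and caps each one by size. The 5 MB cap is an illustrative starting point to tune against your cluster, not an Elasticsearch requirement, and the document format (an "index" action line followed by the source line) follows the _bulk API convention:

```python
import json

MAX_BULK_BYTES = 5 * 1024 * 1024  # illustrative cap per bulk request (~5 MB)

def bulk_chunks(index, docs, max_bytes=MAX_BULK_BYTES):
    """Yield newline-delimited _bulk request bodies, each at most max_bytes.

    Each document becomes two lines: an action line targeting `index`
    and the JSON source itself, as the _bulk endpoint expects.
    """
    lines, size = [], 0
    for doc in docs:
        action = json.dumps({"index": {"_index": index}})
        source = json.dumps(doc)
        entry_size = len(action) + len(source) + 2  # two trailing newlines
        if lines and size + entry_size > max_bytes:
            yield "\n".join(lines) + "\n"
            lines, size = [], 0
        lines.extend([action, source])
        size += entry_size
    if lines:
        yield "\n".join(lines) + "\n"
```

Each yielded body can then be POSTed to the cluster's /_bulk endpoint; one chunk per request keeps individual requests from growing unbounded as the input does.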

Note

Again, keep in mind that nothing comes for free: you may get a poor search experience if you optimize your Elasticsearch cluster only for high indexing throughput. A good approach is to optimize the cluster for maximum indexing throughput while indexing, and once the data is fully indexed, optimize it for search.
