Friday, February 19, 2021

Elasticsearch on Workspace ONE Access (formerly vIDM): 'elasticsearch start' exits with status 7

Issue


This week I had a problem with the elasticsearch service on Workspace ONE Access (formerly vIDM), part of a new VMware Cloud Foundation environment (VCF version 4.x). The service failed during the startup phase on all the nodes that compose the cluster: 'elasticsearch start' exits with status 7.
The Workspace ONE Access version is 3.3.2-15951611.

Opening the console, an error message was displayed: “Error: Error log is in /var/log/boot.msg.”
Part of that log is reported below:
 No JSON object could be decoded
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/lib64/python2.6/json/__init__.py", line 267, in load
    parse_constant=parse_constant, **kw)
  File "/usr/lib64/python2.6/json/__init__.py", line 307, in loads
    return _default_decoder.decode(s)
  File "/usr/lib64/python2.6/json/decoder.py", line 319, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib64/python2.6/json/decoder.py", line 338, in raw_decode
    raise ValueError("No JSON object could be decoded")
ValueError: No JSON object could be decoded
Number of nodes in cluster is : 
Configuring /opt/vmware/elasticsearch/config/elasticsearch.yml file
Starting elasticsearch: 
<notice -- Feb 15 15:05:17.122319000> 'elasticsearch start' exits with status 7
<notice -- Feb 15 15:05:17.130417000> hzn-dots start
Application Server already running.
<notice -- Feb 15 15:05:17.339108000> 'hzn-dots start' exits with status 0
Master Resource Control: runlevel 3 has been reached
Failed services in runlevel 3: elasticsearch
Skipped services in runlevel 3: splash
<notice -- Feb 15 15:05:17.340630000> 
killproc: kill(456,3)
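
For context, the traceback comes from Python 2.6's json module being fed an empty document: the appliance's startup script apparently pipes the body of a cluster query into json.load, and when elasticsearch is not answering the body is empty, which would also explain the blank "Number of nodes in cluster is :" line. A minimal reproduction of the error, assuming that is roughly what the script does (this is my sketch, not the actual appliance code):

    # With elasticsearch down, curl prints an empty body and Python 2's
    # json.load raises "No JSON object could be decoded", as in boot.msg.
    curl -s -XGET http://localhost:9200/_cluster/health | \
        python -c 'import json, sys; print json.load(sys.stdin)'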

Solution


Disclaimer: if you are not fully aware of what the procedures described below change, it is advisable to apply them with the help of VMware GSS, to prevent the environment from becoming unstable. Use them at your own risk.

Short Answer
We just need to run the following commands on each Workspace ONE Access appliance to verify whether the nodes communicate with each other (a small helper that wraps these checks follows the list).

  • Check how many nodes are part of the cluster:
    curl -s -XGET http://localhost:9200/_cat/nodes
  • Check cluster health:
    curl http://localhost:9200/_cluster/health?pretty=true
  • Check the RabbitMQ queue list:
    rabbitmqctl list_queues | grep analytics
  • If the cluster health is red, run these commands:
    • to find the UNASSIGNED shards:
      curl -XGET localhost:9200/_cat/shards?h=index,shard,prirep,state,unassigned.reason | grep UNASSIGNED
    • to DELETE the indices that contain unassigned shards:
      curl -XGET http://localhost:9200/_cat/shards | grep UNASSIGNED | awk '{print $1}' | xargs -i curl -XDELETE "http://localhost:9200/{}"
  • Recheck the health to ensure it is green, and once green ...
    curl http://localhost:9200/_cluster/health?pretty=true
  • ... then check whether elasticsearch is working.

  • The nodes may need to be restarted. Proceed as follows:
    • turn off two nodes and leave one active
    • turn the nodes back on one at a time, waiting for each to appear in the cluster and start correctly
    • do the same with the third node
    • when the third node is active and present in the cluster, perform a clean restart cycle for the first node as well.
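
For convenience, the checks above can be wrapped in a small helper. The sketch below is mine, not part of the appliance (the script name and the 60-second timeout are arbitrary): it prints the node list, then polls the cluster health until it reports green, which is handy while powering the nodes back on one at a time.

    #!/bin/bash
    # es-check.sh -- hypothetical helper; only wraps the curl commands above.
    ES=http://localhost:9200

    # Show which nodes have joined the cluster.
    curl -s -XGET "$ES/_cat/nodes"

    # Poll the cluster health until it reports green (up to ~60 seconds).
    for i in $(seq 1 12); do
        # The compact JSON comes back on a single line, so a simple sed
        # can pull the value of the "status" field out of it.
        status=$(curl -s "$ES/_cluster/health" | sed 's/.*"status":"\([a-z]*\)".*/\1/')
        echo "cluster status: ${status:-unreachable}"
        [ "$status" = "green" ] && exit 0
        sleep 5
    done
    echo "cluster did not reach green in time" >&2
    exit 1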


Long Answer (with command output)
The commands in the long answer are the same as those explained above, but here we also report the output (from one node only). Remember that the commands must be run on every node of the cluster.

  • Check how many nodes are part of the cluster:
    custm-vrsidm1:~ # curl -s -XGET http://localhost:9200/_cat/nodes
    10.174.28.18 10.174.28.18 6 98 0.31 d * Exploding Man
  • Check cluster health:
    custm-vrsidm1:~ # curl http://localhost:9200/_cluster/health?pretty=true
    {
      "cluster_name" : "horizon",
      "status" : "red",
      "timed_out" : false,
      "number_of_nodes" : 1,
      "number_of_data_nodes" : 1,
      "active_primary_shards" : 74,
      "active_shards" : 74,
      "relocating_shards" : 0,
      "initializing_shards" : 0,
      "unassigned_shards" : 146,
      "delayed_unassigned_shards" : 0,
      "number_of_pending_tasks" : 0,
      "number_of_in_flight_fetch" : 0,
      "task_max_waiting_in_queue_millis" : 0,
      "active_shards_percent_as_number" : 33.63636363636363
    }
  • Check the RabbitMQ queue list:
    custm-vrsidm1:~ #  rabbitmqctl list_queues | grep analytics
    -.analytics.127.0.0.1   0
  • If the cluster health is red, run these commands:
    • to find the UNASSIGNED shards:
      custm-vrsidm1:~ # curl -XGET localhost:9200/_cat/shards?h=index,shard,prirep,state,unassigned.reason | grep UNASSIGNED
        % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                       Dload  Upload   Total   Spent    Left  Speed
      100 11440  100 11440    0     0   270k      0 --:--:-- --:--:-- --:--:--  279k
      v4_2021-02-14     4 r UNASSIGNED CLUSTER_RECOVERED
      v4_2021-02-14     1 r UNASSIGNED CLUSTER_RECOVERED
      v4_2021-02-14     2 p UNASSIGNED CLUSTER_RECOVERED
      v4_2021-02-14     2 r UNASSIGNED CLUSTER_RECOVERED
      v4_2021-02-14     3 r UNASSIGNED CLUSTER_RECOVERED
      v4_2021-02-14     0 r UNASSIGNED CLUSTER_RECOVERED
      v4_2021-02-03     4 p UNASSIGNED CLUSTER_RECOVERED
      v4_2021-02-03     4 r UNASSIGNED CLUSTER_RECOVERED
      v4_2021-02-03     3 p UNASSIGNED CLUSTER_RECOVERED
      v4_2021-02-03     3 r UNASSIGNED CLUSTER_RECOVERED
      v4_2021-01-28     4 r UNASSIGNED CLUSTER_RECOVERED
      v4_2021-01-28     3 r UNASSIGNED CLUSTER_RECOVERED
      v4_2021-01-28     2 r UNASSIGNED CLUSTER_RECOVERED
      v4_2021-01-28     1 r UNASSIGNED CLUSTER_RECOVERED
      v4_2021-01-28     0 r UNASSIGNED CLUSTER_RECOVERED
      v2_searchentities 4 p UNASSIGNED CLUSTER_RECOVERED
      v2_searchentities 4 r UNASSIGNED CLUSTER_RECOVERED
      v2_searchentities 1 r UNASSIGNED CLUSTER_RECOVERED
      v2_searchentities 2 r UNASSIGNED CLUSTER_RECOVERED
      v2_searchentities 3 r UNASSIGNED CLUSTER_RECOVERED
      v2_searchentities 0 r UNASSIGNED CLUSTER_RECOVERED
      v4_2021-02-06     4 p UNASSIGNED CLUSTER_RECOVERED
      v4_2021-02-06     4 r UNASSIGNED CLUSTER_RECOVERED
      v4_2021-01-27     0 r UNASSIGNED CLUSTER_RECOVERED
      v4_2021-02-05     4 p UNASSIGNED CLUSTER_RECOVERED
      v4_2021-02-05     4 r UNASSIGNED CLUSTER_RECOVERED
      .................................................
      v4_2021-02-05     2 r UNASSIGNED CLUSTER_RECOVERED
      v4_2021-02-05     1 r UNASSIGNED CLUSTER_RECOVERED
      v4_2021-02-05     0 r UNASSIGNED CLUSTER_RECOVERED
      v4_2021-01-26     4 p UNASSIGNED CLUSTER_RECOVERED
      v4_2021-01-26     4 r UNASSIGNED CLUSTER_RECOVERED
      v4_2021-02-04     1 r UNASSIGNED CLUSTER_RECOVERED
      v4_2021-02-04     0 r UNASSIGNED CLUSTER_RECOVERED
    • to DELETE the indices that contain unassigned shards (a dry-run sketch follows this list; note that grep emits one line per shard, so after the first DELETE of an index succeeds, the repeated DELETEs for that same index return the index_not_found_exception errors seen below, which is expected):
      custm-vrsidm1:~ # curl -XGET http://localhost:9200/_cat/shards | grep UNASSIGNED | awk '{print $1}' | xargs -i curl -XDELETE "http://localhost:9200/{}"
        % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                       Dload  Upload   Total   Spent    Left  Speed
      100 16060  100 16060    0     0   589k      0 --:--:-- --:--:-- --:--:--  627k
      {"acknowledged":true}{"error":{"root_cause":[{"type":"index_not_found_exception","reason":"no such index","resource.type":"index_or_alias","resource.id":"v4_2021-02-14","index":"v4_2021-02-14"}],"type":"index_not_found_exception","reason":"no such index","resource.type":"index_or_alias","resource.id":"v4_2021-02-14","index":"v4_2021-02-14"},"status":404}{"error":{"root_cause":[{"type":"index_not_found_exception","reason":"no such index","resource.type":"index_or_alias","resource.id":"v4_2021-02-14","index":"v4_2021-02-14"}],"type":"index_not_found_exception","reason":"no such index","resource.type":"index_or_alias","resource.id":"v4_2021-02-14","index":"v4_2021-02-14"},"status":404}{"error":{"root_cause":[{"type":"index_not_found_exception","reason":"no such index","resource.type":"index_or_alias","resource.id":"v4_2021-02-14","index":"v4_2021-02-14"}],"type":"index_not_found_exception","reason":"no such index","resource.type":"index_or_alias","resource.id":"v4_2021-02-14","index":"v4_2021-02-14"},"status":404}{"error":{"root_cause":[{"type":"index_not_found_exception","reason":"no such index","resource.type":"index_or_alias","resource.id":"v4_2021-02-14","index":"v4_2021-02-14"}],"type":"index_not_found_exception","reason":"no such index","resource.type":"index_or_alias","resource.id":"v4_2021-02-14","index":"v4_2021-02-14"},"status":404}{"error":{"root_cause":[{"type":"index_not_found_exception","reason":"no such index","resource.type":"index_or_alias","resource.id":"v4_2021-02-14","index":"v4_2021-02-14"}],"type":"index_not_found_exception","reason":"no such index","resource.type":"index_or_alias","resource.id":"v4_2021-02-14","index":"v4_2021-02-14"},"status":404}{"acknowledged":true}{"error":{"root_cause":[{"type":"index_not_found_exception","reason":"no such index","resource.type":"index_or_alias","resource.id":"v4_2021-02-03","index":"v4_2021-02-03"}],"type":"index_not_found_exception","reason":"no such index","resource.type":"index_or_alias","resource.id":"v4_2021-02-03","index":"v4_2021-02-03"},"status":404}
      ..........................................................
      {"error":{"root_cause":[{"type":"index_not_found_exception","reason":"no such index","resource.type":"index_or_alias","resource.id":"v4_2021-01-28","index":"v4_2021-01-28"}],"type":"index_not_found_exception","reason":"no such index","resource.type":"index_or_alias","resource.id":"v4_2021-01-28","index":"v4_2021-01-28"},"status":404}{"acknowledged":true}{"error":{"root_cause":[{"type":"index_not_found_exception","reason":"no such index","resource.type":"index_or_alias","resource.id":"v2_searchentities","index":"v2_searchentities"}],"type":"index_not_found_exception","reason":"no such index","resource.type":"index_or_alias","resource.id":"v2_searchentities","index":"v2_searchentities"},"status":404}{"error":{"root_cause":[{"type":"index_not_found_exception","reason":"no such index","resource.type":"index_or_alias","resource.id":"v2_searchentities","index":"v2_searchentities"}],"type":"index_not_found_exception","reason":"no such index","resource.type":"index_or_alias","resource.id":"v2_searchentities","index":"v2_searchentities"},"status":404}{"error":{"root_cause":[{"type":"index_not_found_exception","reason":"no such index","resource.type":"index_or_alias","resource.id":"v2_searchentities","index":"v2_searchentities"}],"type":"index_not_found_exception","reason":"no such index","resource.type":"index_or_alias","resource.id":"v2_searchentities","index":"v2_searchentities"},"status":404}{"error":{"root_cause":[{"type":"index_not_found_exception","reason":"no such index","resource.type":"index_or_alias","resource.id":"v2_searchentities","index":"v2_searchentities"}],"type":"index_not_found_exception","reason":"no such index","resource.type":"index_or_alias","resource.id":"v2_searchentities","index":"v2_searchentities"},"status":404}{"error":{"root_cause":[{"type":"index_not_found_exception","reason":"no such index","resource.type":"index_or_alias","resource.id":"v2_searchentities","index":"v2_searchentities"}],"type":"index_not_found_exception","reason":"no such index","resource.type":"index_or_alias","resource.id":"v2_searchentities","index":"v2_searchentities"},"status":404}{"acknowledged":true}{"error":{"root_cause":[{"type":"index_not_found_exception","reason":"no such index","resource.type":"index_or_alias","resource.id":"v4_2021-02-06","index":"v4_2021-02-06"}],"type":"index_not_found_exception","reason":"no such index","resource.type":"index_or_alias","resource.id":"v4_2021-02-06","index":"v4_2021-02-06"},"status":404}{"error":{"root_cause":[{"type":"index_not_found_exception","reason":"no such index","resource.type":"index_or_alias","resource.id":"v4_2021-02-06","index":"v4_2021-02-06"}],"type":"index_not_found_exception","reason":"no such index","resource.type":"index_or_alias","resource.id":"v4_2021-02-06","index":"v4_2021-02-06"},"status":404}{"error":{"root_cause":[{"type":"index_not_found_exception","reason":"no such index","resource.type":"index_or_alias","resource.id":"v4_2021-02-06","index":"v4_2021-02-06"}],"type":"index_not_found_exception","reason":"no such index","resource.type":"index_or_alias","resource.id":"v4_2021-02-06","index":"v4_2021-02-06"},"status":404}{"error":{"root_cause":[{"type":"index_not_found_exception","reason":"no such index","resource.type":"index_or_alias","resource.id":"v4_2021-02-04","index":"v4_2021-02-04"}],"type":"index_not_found_exception","reason":"no such index","resource.type":"index_or_alias","resource.id":"v4_2021-02-04","index":"v4_2021-02-04"},"status":404}{"error":{"root_cause":[{"type":"index_not_found_exception","reason":"no 
such index","resource.type":"index_or_alias","resource.id":"v4_2021-02-04","index":"v4_2021-02-04"}],"type":"index_not_found_exception","reason":"no such index","resource.type":"index_or_alias","resource.id":"v4_2021-02-04","index":"v4_2021-02-04"},"status":404}
  • Recheck the health to ensure it is green, and once green ...
    custm-vrsidm1:~ # curl http://localhost:9200/_cluster/health?pretty=true
    {
      "cluster_name" : "horizon",
      "status" : "green",
      "timed_out" : false,
      "number_of_nodes" : 1,
      "number_of_data_nodes" : 1,
      "active_primary_shards" : 0,
      "active_shards" : 0,
      "relocating_shards" : 0,
      "initializing_shards" : 0,
      "unassigned_shards" : 0,
      "delayed_unassigned_shards" : 0,
      "number_of_pending_tasks" : 0,
      "number_of_in_flight_fetch" : 0,
      "task_max_waiting_in_queue_millis" : 0,
      "active_shards_percent_as_number" : 100.0
    }
  • ... then check whether elasticsearch is working.

  • After rebooting all the nodes, number_of_nodes and number_of_data_nodes are now three (in my case), as they should be:
    custm-vrsidm1:~ # curl http://localhost:9200/_cluster/health?pretty=true
    {
      "cluster_name" : "horizon",
      "status" : "green",
      "timed_out" : false,
      "number_of_nodes" : 3,
      "number_of_data_nodes" : 3,
      "active_primary_shards" : 5,
      "active_shards" : 10,
      "relocating_shards" : 0,
      "initializing_shards" : 0,
      "unassigned_shards" : 0,
      "delayed_unassigned_shards" : 0,
      "number_of_pending_tasks" : 0,
      "number_of_in_flight_fetch" : 0,
      "task_max_waiting_in_queue_millis" : 0,
      "active_shards_percent_as_number" : 100.0
    }
    custm-vrsidm1:~ #
    custm-vrsidm1:~ #  curl -s -XGET http://localhost:9200/_cat/nodes
    10.174.28.19 10.174.28.19 14 97 0.20 d * Orka
    10.174.28.20 10.174.28.20  5 97 0.18 d m Mongoose
    10.174.28.18 10.174.28.18 11 96 0.47 d m Urthona
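
A side note on the delete step: that pipeline removes every index that has at least one UNASSIGNED shard. Before running it, the affected index names can be previewed with the same pipeline, swapping the DELETE for a unique sort (a dry run only):

    # Dry run: print the unique index names the delete pipeline would remove.
    curl -s -XGET http://localhost:9200/_cat/shards | grep UNASSIGNED | awk '{print $1}' | sort -u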


So now vIDM seems to be up and running. If we check the NSX-T load balancer, we can see that the pool is successfully contacting all nodes. We are also able to log in and check graphically that everything is fine.

A double check can be done by verifying the file /var/log/boot.msg:
<notice -- Feb 16 18:31:28.776900000> 
elasticsearch start

horizon-workspace service is running
Waiting for IDM: ..........
<notice -- Feb 16 18:33:44.203450000> checkproc: /opt/likewise/sbin/lwsmd 1419
<notice -- Feb 16 18:33:44.530367000> 
checkproc: /opt/likewise/sbin/lwsmd 
1419

... Ok.
Number of nodes in cluster is : 3
Configuring /opt/vmware/elasticsearch/config/elasticsearch.yml file
Starting elasticsearch: done.
    elasticsearch logs: /opt/vmware/elasticsearch/logs
    elasticsearch data: /db/elasticsearch
<notice -- Feb 16 18:34:39.403558000> 
'elasticsearch start' exits with status 0
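
To pull the relevant lines out of the log without scrolling, a simple grep is enough:

    grep "'elasticsearch start' exits" /var/log/boot.msg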


That's it.
