Evaluator hangs while training #589

**Environment:**
 - Python version 3.7
 - Spark version 2.4
 - TensorFlow version 2.5
 - TensorFlowOnSpark version 2.2.3
 - Cluster version Hadoop

**Describe the bug:**
I found that the evaluator node stops working partway through training, while the training nodes keep running fine and the cluster as a whole does not crash. Training runs for 80000 steps in total, but the evaluator only evaluates up to a little over 10000 steps. After that, it outputs no more logs.
![image](https://user-images.githubusercontent.com/8109984/182289029-f5fcc126-104c-433c-88e7-df567c7ec0d8.png)

![image](https://user-images.githubusercontent.com/8109984/182289062-21a60251-69e8-4a4c-a865-8330665f371b.png)
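
To narrow down where the evaluator is stuck, one option is to have it dump its thread stacks periodically. The following is only a minimal sketch using Python's standard `faulthandler` module; the helper names `start_hang_watchdog` and `stop_hang_watchdog` and the 600-second default are assumptions made for illustration, not part of TensorFlowOnSpark.

```python
import faulthandler
import sys


def start_hang_watchdog(timeout_s=600, log_file=None):
    """Dump the stack of every thread to log_file each timeout_s seconds.

    Hypothetical helper: the name and the default interval are illustrative.
    log_file must be a real file object (it needs a file descriptor).
    """
    if log_file is None:
        log_file = sys.stderr
    # repeat=True keeps dumping on every interval until cancelled, so the
    # last dumps before the logs go quiet show where the process is blocked.
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=log_file)


def stop_hang_watchdog():
    """Cancel the periodic dump started by start_hang_watchdog."""
    faulthandler.cancel_dump_traceback_later()
```

Calling `start_hang_watchdog()` at the start of the code that runs on the evaluator node would write the stack traces to that executor's stderr log, next to the existing evaluation output.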

