Commit 3acdd12

Merge 1.5.1 Release into master
2 parents 794380d + 5f59871 commit 3acdd12

7 files changed: 140 additions, 10 deletions.

.github/PULL_REQUEST_TEMPLATE.md

Lines changed: 6 additions & 0 deletions
@@ -0,0 +1,6 @@
+*Issue #, if available:*
+
+*Description of changes:*
+
+
+By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

CHANGELOG.md

Lines changed: 12 additions & 0 deletions
@@ -0,0 +1,12 @@
+cfncluster-node CHANGELOG
+=========================
+
+This file is used to list changes made in each version of the cfncluster-node package.
+
+1.5.1
+-----
+
+Bug fixes/minor improvements:
+
+- Fixed Torque behaviour when scaling up from an empty cluster
+- Avoid Torque server restart when adding and removing compute nodes

CODE_OF_CONDUCT.md

Lines changed: 4 additions & 0 deletions
@@ -0,0 +1,4 @@
+## Code of Conduct
+This project has adopted the [Amazon Open Source Code of Conduct](https://aws.github.io/code-of-conduct).
+For more information see the [Code of Conduct FAQ](https://aws.github.io/code-of-conduct-faq) or contact
+opensource-codeofconduct@amazon.com with any additional questions or comments.

CONTRIBUTING.md

Lines changed: 61 additions & 0 deletions
@@ -0,0 +1,61 @@
+# Contributing Guidelines
+
+Thank you for your interest in contributing to our project. Whether it's a bug report, new feature, correction, or additional
+documentation, we greatly value feedback and contributions from our community.
+
+Please read through this document before submitting any issues or pull requests to ensure we have all the necessary
+information to effectively respond to your bug report or contribution.
+
+
+## Reporting Bugs/Feature Requests
+
+We welcome you to use the GitHub issue tracker to report bugs or suggest features.
+
+When filing an issue, please check [existing open](https://github.com/awslabs/cfncluster-node/issues), or [recently closed](https://github.com/awslabs/cfncluster-node/issues?utf8=%E2%9C%93&q=is%3Aissue%20is%3Aclosed%20), issues to make sure somebody else hasn't already
+reported the issue. Please try to include as much information as you can. Details like these are incredibly useful:
+
+* A reproducible test case or series of steps
+* The version of our code being used
+* Any modifications you've made relevant to the bug
+* Anything unusual about your environment or deployment
+
+
+## Contributing via Pull Requests
+Contributions via pull requests are much appreciated. Before sending us a pull request, please ensure that:
+
+1. You are working against the latest source on the *develop* branch.
+2. You check existing open, and recently merged, pull requests to make sure someone else hasn't addressed the problem already.
+3. You open an issue to discuss any significant work - we would hate for your time to be wasted.
+
+To send us a pull request, please:
+
+1. Fork the repository.
+2. Modify the source; please focus on the specific change you are contributing. If you also reformat all the code, it will be hard for us to focus on your change.
+3. Ensure local tests pass.
+4. Commit to your fork using clear commit messages.
+5. Send us a pull request, answering any default questions in the pull request interface.
+6. Pay attention to any automated CI failures reported in the pull request, and stay involved in the conversation.
+
+GitHub provides additional document on [forking a repository](https://help.github.com/articles/fork-a-repo/) and
+[creating a pull request](https://help.github.com/articles/creating-a-pull-request/).
+
+
+## Finding contributions to work on
+Looking at the existing issues is a great way to find something to contribute on. As our projects, by default, use the default GitHub issue labels ((enhancement/bug/duplicate/help wanted/invalid/question/wontfix), looking at any ['help wanted'](https://github.com/awslabs/cfncluster-node/labels/help%20wanted) issues is a great place to start.
+
+
+## Code of Conduct
+This project has adopted the [Amazon Open Source Code of Conduct](https://aws.github.io/code-of-conduct).
+For more information see the [Code of Conduct FAQ](https://aws.github.io/code-of-conduct-faq) or contact
+opensource-codeofconduct@amazon.com with any additional questions or comments.
+
+
+## Security issue notifications
+If you discover a potential security issue in this project we ask that you notify AWS/Amazon Security via our [vulnerability reporting page](http://aws.amazon.com/security/vulnerability-reporting/). Please do **not** create a public github issue.
+
+
+## Licensing
+
+See the [LICENSE](https://github.com/awslabs/cfncluster-node/blob/develop/LICENSE) file for our project's licensing. We will ask you to confirm the licensing of your contribution.
+
+We may ask you to sign a [Contributor License Agreement (CLA)](http://en.wikipedia.org/wiki/Contributor_License_Agreement) for larger changes.

nodewatcher/plugins/torque.py

Lines changed: 1 addition & 1 deletion
@@ -41,7 +41,7 @@ def getJobs(hostname):
     try:
         status, output = runPipe(commands)
     except subprocess.CalledProcessError:
-        log.error("Failed to run %s\n" % _command)
+        log.error("Failed to run %s\n" % commands)
 
     if output == "":
         _jobs = False
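
The one-line change above swaps `_command` for `commands` in the error handler. The hunk does not show the rest of `getJobs`, so the following is only a sketch under the assumption that `_command` was not defined in that scope. If so, the old handler would itself raise `NameError` and hide the original failure. The names `run_pipe`, `get_jobs_before`, and `get_jobs_after` are hypothetical stand-ins, not functions from the repository.

```python
import logging
import subprocess

log = logging.getLogger("nodewatcher.sketch")


def run_pipe(commands):
    # Hypothetical stand-in for runPipe: always fails, to exercise the handler.
    raise subprocess.CalledProcessError(1, commands)


def get_jobs_before(commands):
    try:
        return run_pipe(commands)
    except subprocess.CalledProcessError:
        # Assumption: `_command` is not defined here, so building the
        # message raises NameError before anything is logged.
        log.error("Failed to run %s\n" % _command)  # noqa: F821


def get_jobs_after(commands):
    try:
        return run_pipe(commands)
    except subprocess.CalledProcessError:
        # `commands` is the argument in scope, so the failure is logged.
        log.error("Failed to run %s", commands)
        return None
```

Calling `get_jobs_before(["qstat"])` in this sketch raises `NameError`, while `get_jobs_after(["qstat"])` logs the failed command and returns `None`.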

setup.py

Lines changed: 1 addition & 1 deletion
@@ -21,7 +21,7 @@ def read(fname):
 
 console_scripts = ['sqswatcher = sqswatcher.sqswatcher:main',
                    'nodewatcher = nodewatcher.nodewatcher:main']
-version = "1.4.3"
+version = "1.5.1"
 requires = ['boto>=2.48.0', 'python-dateutil>=2.6.1']
 
 if sys.version_info[:2] == (2, 6):

sqswatcher/plugins/torque.py

Lines changed: 55 additions & 8 deletions
@@ -16,17 +16,68 @@
 import paramiko
 import logging
 import shlex
+import time
+import xml.etree.ElementTree as xmltree
+import socket
 
 log = logging.getLogger(__name__)
 
 def __runCommand(command):
     log.debug(repr(command))
     _command = shlex.split(str(command))
     log.debug(_command)
+
+    DEV_NULL = open(os.devnull, "rb")
     try:
-        sub.check_call(_command, env=dict(os.environ))
-    except sub.CalledProcessError:
-        log.error("Failed to run %s\n" % _command)
+        process = sub.Popen(_command, env=dict(os.environ), stdout=sub.PIPE, stderr=sub.STDOUT, stdin=DEV_NULL)
+        stdout = process.communicate()[0]
+        exitcode = process.poll()
+        if exitcode != 0:
+            log.error("Failed to run %s:\n%s" % (_command, stdout))
+        return stdout
+    finally:
+        DEV_NULL.close()
+
+
+def isHostInitState(host_state):
+    # Node states http://docs.adaptivecomputing.com/torque/6-0-2/adminGuide/help.htm#topics/torque/8-resources/resources.htm#nodeStates
+    init_states = ("down", "offline", "unknown", str(None))
+    return str(host_state).startswith(init_states)
+
+def wakeupSchedOn(hostname):
+    log.info('Waking up scheduler on host %s', hostname)
+    command = ("/opt/torque/bin/pbsnodes -x %s" % (hostname))
+
+    sleep_time = 3
+    times = 20
+    host_state = None
+    while isHostInitState(host_state) and times > 0:
+        output = __runCommand(command)
+        try:
+            # Ex.1: <Data><Node><name>ip-10-0-76-39</name><state>down,offline,MOM-list-not-sent</state><power_state>Running</power_state>
+            # <np>1</np><ntype>cluster</ntype><mom_service_port>15002</mom_service_port><mom_manager_port>15003</mom_manager_port></Node></Data>
+            # Ex 2: <Data><Node><name>ip-10-0-76-39</name><state>free</state><power_state>Running</power_state><np>1</np><ntype>cluster</ntype>
+            # <status>rectime=1527799181,macaddr=02:e4:00:b0:b1:72,cpuclock=Fixed,varattr=,jobs=,state=free,netload=210647044,gres=,loadave=0.00,
+            # ncpus=1,physmem=1017208kb,availmem=753728kb,totmem=1017208kb,idletime=856,nusers=1,nsessions=1,sessions=19698,
+            # uname=Linux ip-10-0-76-39 4.9.75-25.55.amzn1.x86_64 #1 SMP Fri Jan 5 23:50:27 UTC 2018 x86_64,opsys=linux</status>
+            # <mom_service_port>15002</mom_service_port><mom_manager_port>15003</mom_manager_port></Node></Data>
+            xmlnode = xmltree.XML(output)
+            host_state = xmlnode.findtext("./Node/state")
+        except:
+            log.error("Error parsing XML from %s" % output)
+
+        if isHostInitState(host_state):
+            log.debug("Host %s is still in state %s" % (hostname, host_state))
+            time.sleep(sleep_time)
+            times -= 1
+
+    if host_state == "free":
+        command = "/opt/torque/bin/qmgr -c \"set server scheduling=true\""
+        __runCommand(command)
+    elif times == 0:
+        log.error("Host %s is still in state %s" % (hostname, host_state))
+    else:
+        log.debug("Host %s is in state %s" % (hostname, host_state))
 
 def addHost(hostname,cluster_user,slots):
     log.info('Adding %s', hostname)
@@ -64,8 +115,7 @@ def addHost(hostname,cluster_user,slots):
     ssh.save_host_keys(hosts_key_file)
     ssh.close()
 
-    command = ('/etc/init.d/pbs_server restart')
-    __runCommand(command)
+    wakeupSchedOn(hostname)
 
 def removeHost(hostname, cluster_user):
     log.info('Removing %s', hostname)
@@ -76,6 +126,3 @@ def removeHost(hostname, cluster_user):
     command = ("/opt/torque/bin/qmgr -c 'delete node %s'" % hostname)
     __runCommand(command)
 
-    command = ('/etc/init.d/pbs_server restart')
-    __runCommand(command)
-
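
The change to `sqswatcher/plugins/torque.py` replaces the `pbs_server` restart with a polling loop: run `pbsnodes -x <hostname>`, read the node's `<state>` from the XML, and retry while the state still starts with `down`, `offline`, or `unknown`. The sketch below restates that idea in Python 3 so the pieces can be tried in isolation. It is not the committed code: `run_command`, `parse_state`, `is_init_state`, and `wait_for_node` are hypothetical names, and `fetch_xml` and `sleep` are injected so the loop can run without a Torque installation.

```python
import subprocess
import time
import xml.etree.ElementTree as ET

# States treated as "not ready yet"; "None" covers a state that was never parsed.
INIT_STATES = ("down", "offline", "unknown", "None")


def run_command(argv):
    """Run argv with stdin closed off; return combined stdout/stderr as text."""
    result = subprocess.run(
        argv,
        stdin=subprocess.DEVNULL,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
        check=False,
    )
    return result.stdout


def parse_state(xml_text):
    """Return the text of <Data><Node><state>, or None if the XML is invalid."""
    try:
        return ET.fromstring(xml_text).findtext("./Node/state")
    except ET.ParseError:
        return None


def is_init_state(state):
    # str.startswith accepts a tuple of prefixes, so a compound state such as
    # "down,offline,MOM-list-not-sent" matches on its first component.
    return str(state).startswith(INIT_STATES)


def wait_for_node(fetch_xml, attempts=20, delay=3.0, sleep=time.sleep):
    """Poll fetch_xml() until the node leaves its init states or attempts run out."""
    state = None
    for _ in range(attempts):
        state = parse_state(fetch_xml())
        if not is_init_state(state):
            break
        sleep(delay)
    return state
```

A caller on a real head node might pass something like `lambda: run_command(["/opt/torque/bin/pbsnodes", "-x", hostname])` as `fetch_xml` and enable scheduling only when the returned state is `"free"`, which mirrors the branch in `wakeupSchedOn`. Catching `ET.ParseError` rather than using a bare `except` is a choice made for this sketch, not something the commit does.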
