1 year ago
#312331
knkn1711
OpenMPI does not recognize multiple nodes?
I am trying to run a Julia script in parallel on a cluster. The cluster uses Moab and Torque as the scheduler and resource manager. Since SSH seems to be restricted, I use MPI for multiprocessing.
I submit the following job, requesting 3 nodes:
#!/bin/bash
#PBS -l walltime=1:00:00
#PBS -l pmem=10gb
#PBS -l nodes=3:ppn=1
#PBS -j oe
#PBS -A open
#PBS -o (some path)
#PBS -e (some path)
cd (some path)
echo ""
echo "JOB Started on $(hostname -s) at $(date)"
echo ""
module purge
module use (some path)/modules
module load julia
module load openmpi
mpirun -np 3 -display-allocation julia --project=. "(some path)/test.jl"
echo ""
echo "JOB ended at $(date)"
But if I look at the job output, it seems that only one node, comp-bc-0384, is recognized:
JOB Started on comp-bc-0384 at Sat Mar 19 22:05:12 EDT 2022
====================== ALLOCATED NODES ======================
comp-bc-0384: slots=24 max_slots=0 slots_inuse=0 state=UP
=================================================================
--------------------------------------------------------------------------
[[12308,1],2]: A high-performance Open MPI point-to-point messaging module
was unable to find any relevant network interfaces:
Module: OpenFabrics (openib)
Host: comp-bc-0384
Another transport will be used instead, although this may result in
lower performance.
NOTE: You can disable this warning by setting the MCA parameter
btl_base_warn_component_unused to 0.
--------------------------------------------------------------------------
[comp-bc-0384.acib.production.int.aci.ics.psu.edu:10656] 2 more processes have sent help message help-mpi-btl-base.txt / btl:no-nics
[comp-bc-0384.acib.production.int.aci.ics.psu.edu:10656] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
10.214858 seconds (116.21 k allocations: 6.110 MiB)
JOB ended at Sat Mar 19 22:05:36 EDT 2022
I was expecting the ALLOCATED NODES section to display the other node(s) I was assigned.
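To separate what Torque allocated from what mpirun reports, one option is to print the contents of $PBS_NODEFILE in the job script before the mpirun line. The following is a minimal sketch, not something I have confirmed on this cluster. The function name summarize_nodefile is hypothetical, and it assumes the node file lists one hostname per line, repeated once per slot.

```shell
#!/bin/bash
# Hypothetical helper: print one line per distinct host in a Torque-style
# node file, with the number of slots (repeated lines) for that host.
# Assumes one hostname per line, which is how $PBS_NODEFILE is normally laid out.
summarize_nodefile() {
    sort "$1" | uniq -c | awk '{ printf "%s slots=%d\n", $2, $1 }'
}

# Inside a job, Torque sets PBS_NODEFILE; outside a job this part is skipped.
if [ -n "${PBS_NODEFILE:-}" ] && [ -r "$PBS_NODEFILE" ]; then
    echo "Hosts listed in \$PBS_NODEFILE:"
    summarize_nodefile "$PBS_NODEFILE"
fi
```

With nodes=3:ppn=1, three distinct hosts with one slot each would be expected here. If they appear in the node file but not under ALLOCATED NODES, that would suggest mpirun is not reading the allocation from Torque.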
A similar past question (openMPI/mpich2 doesn't run on multiple nodes) suggests that it has something to do with the host file.
Therefore I also tried mpirun -hostfile $PBS_NODEFILE -np 3 -display-allocation julia --project=. "(some path)/test.jl". It then returns the following:
JOB Started on comp-bc-0384 at Sat Mar 19 22:16:15 EDT 2022
Host key verification failed.
--------------------------------------------------------------------------
ORTE was unable to reliably start one or more daemons.
This usually is caused by:
* not finding the required libraries and/or binaries on
one or more nodes. Please check your PATH and LD_LIBRARY_PATH
settings, or configure OMPI with --enable-orterun-prefix-by-default
* lack of authority to execute on one or more specified nodes.
Please verify your allocation and authorities.
* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
Please check with your sys admin to determine the correct location to use.
* compilation of the orted with dynamic libraries when static are required
(e.g., on Cray). Please check your configure cmd line and consider using
one of the contrib/platform definitions for your system type.
* an inability to create a connection back to mpirun due to a
lack of common network interfaces and/or no route found between
them. Please check network connectivity (including firewalls
and network routing requirements).
--------------------------------------------------------------------------
JOB ended at Sat Mar 19 22:16:16 EDT 2022
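The "Host key verification failed." line suggests that mpirun tried to start its remote daemons over SSH. An Open MPI build with Torque (tm) support normally launches through Torque instead, so it may be worth checking whether the loaded openmpi module lists any tm components in ompi_info. This is a rough sketch only. The function name has_tm_support is hypothetical, and the pattern assumes ompi_info prints component lines of the form "MCA ras: tm (...)".

```shell
#!/bin/bash
# Hypothetical helper: read `ompi_info` output on stdin and succeed if a
# Torque (tm) allocation (ras) or launch (plm) component is listed.
# The line format matched here is an assumption about ompi_info's output.
has_tm_support() {
    grep -Eq 'MCA[[:space:]]+(ras|plm):[[:space:]]+tm([[:space:]]|$)'
}

# Only run the check where ompi_info is available (after `module load openmpi`).
if command -v ompi_info >/dev/null 2>&1; then
    if ompi_info | has_tm_support; then
        echo "tm components found: this Open MPI build appears to support Torque"
    else
        echo "no tm components found: mpirun may be falling back to SSH launch"
    fi
fi
```

If no tm components are listed, that could explain both the single-node allocation and the SSH failure, and the choice of Open MPI build would be a question for the cluster administrators.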
What could be the cause here?
julia
mpi
openmpi
pbs
torque