1 year ago

#312331

knkn1711

OpenMPI does not recognize multiple nodes?

I am trying to run a Julia script in parallel on a cluster. The cluster uses Moab as the scheduler and Torque as the resource manager. Since SSH between nodes seems to be restricted, I use MPI for multiprocessing.

I submit the following job, requesting 3 nodes:

#!/bin/bash
#PBS -l walltime=1:00:00
#PBS -l pmem=10gb   
#PBS -l nodes=3:ppn=1
#PBS -j oe
#PBS -A open
#PBS -o (some path)
#PBS -e (some path)

cd (some path)
echo ""
echo "JOB Started on $(hostname -s) at $(date)"

echo ""
module purge
module use (some path)/modules
module load julia
module load openmpi
mpirun -np 3 -display-allocation julia --project=.  "(some path)/test.jl"

echo ""
echo "JOB ended at $(date)"
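
One way to see what the scheduler actually assigned, independently of Open MPI, is to summarize $PBS_NODEFILE from inside the job before calling mpirun. The following is a minimal sketch, assuming Torque sets PBS_NODEFILE to a file with one line per allocated slot; summarize_nodefile is a hypothetical helper name, not part of Torque or Open MPI.

```shell
# Hypothetical helper: print each distinct host in a node file with its slot count.
summarize_nodefile() {
    # One line per allocated slot, so a host repeated N times has N slots.
    sort "$1" | uniq -c | awk '{ print $2 ": slots=" $1 }'
}

# Only meaningful inside a Torque job, where PBS_NODEFILE is set and readable.
if [ -n "${PBS_NODEFILE:-}" ] && [ -r "$PBS_NODEFILE" ]; then
    summarize_nodefile "$PBS_NODEFILE"
fi
```

If this lists three hosts while -display-allocation shows only one, the allocation itself is fine and the mismatch lies in how mpirun discovers it.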

But if I look at the job output, it seems that only one node, comp-bc-0384, is recognized:

JOB Started on comp-bc-0384 at Sat Mar 19 22:05:12 EDT 2022


======================   ALLOCATED NODES   ======================
    comp-bc-0384: slots=24 max_slots=0 slots_inuse=0 state=UP
=================================================================
--------------------------------------------------------------------------
[[12308,1],2]: A high-performance Open MPI point-to-point messaging module
was unable to find any relevant network interfaces:

Module: OpenFabrics (openib)
  Host: comp-bc-0384

Another transport will be used instead, although this may result in
lower performance.

NOTE: You can disable this warning by setting the MCA parameter
btl_base_warn_component_unused to 0.
--------------------------------------------------------------------------
[comp-bc-0384.acib.production.int.aci.ics.psu.edu:10656] 2 more processes have sent help message help-mpi-btl-base.txt / btl:no-nics
[comp-bc-0384.acib.production.int.aci.ics.psu.edu:10656] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
 10.214858 seconds (116.21 k allocations: 6.110 MiB)

JOB ended at Sat Mar 19 22:05:36 EDT 2022

I expected the ALLOCATED NODES section to list the other nodes assigned to the job. A similar past question (openMPI/mpich2 doesn't run on multiple nodes) suggests that the problem is related to the host file, so I also tried mpirun -hostfile $PBS_NODEFILE -np 3 -display-allocation julia --project=. "(some path)/test.jl". That returns the following:

JOB Started on comp-bc-0384 at Sat Mar 19 22:16:15 EDT 2022

Host key verification failed.
--------------------------------------------------------------------------
ORTE was unable to reliably start one or more daemons.
This usually is caused by:

* not finding the required libraries and/or binaries on
  one or more nodes. Please check your PATH and LD_LIBRARY_PATH
  settings, or configure OMPI with --enable-orterun-prefix-by-default

* lack of authority to execute on one or more specified nodes.
  Please verify your allocation and authorities.

* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
  Please check with your sys admin to determine the correct location to use.

*  compilation of the orted with dynamic libraries when static are required
  (e.g., on Cray). Please check your configure cmd line and consider using
  one of the contrib/platform definitions for your system type.

* an inability to create a connection back to mpirun due to a
  lack of common network interfaces and/or no route found between
  them. Please check network connectivity (including firewalls
  and network routing requirements).
--------------------------------------------------------------------------

JOB ended at Sat Mar 19 22:16:16 EDT 2022
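
"Host key verification failed." is an SSH client message, which suggests that mpirun tried to start the remote daemons over SSH. One thing that may be worth checking is whether the loaded Open MPI build includes Torque (tm) components. This is only an assumption about the cause, since the logs above do not confirm it. A minimal sketch, where has_tm_support is a hypothetical helper that scans ompi_info output for tm entries in the ras and plm frameworks:

```shell
# Hypothetical helper: succeed if ompi_info output on stdin lists a tm component
# for resource allocation (ras) or process launch (plm).
has_tm_support() {
    grep -Eq 'MCA (ras|plm): tm'
}

# Run the check only where ompi_info is available (e.g. after "module load openmpi").
if command -v ompi_info >/dev/null 2>&1; then
    if ompi_info | has_tm_support; then
        echo "Open MPI reports Torque (tm) components"
    else
        echo "No tm components found in ompi_info output"
    fi
fi
```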

What could be the cause here?

julia

mpi

openmpi

pbs

torque

0 Answers
