setting up the environment by loading in conda environment at Thu Sep 4 11:14:26 CDT 2025
running the olmo labeling job at Thu Sep 4 11:14:26 CDT 2025
----------------------------------------
srun job start: Thu Sep 4 11:14:27 CDT 2025
Job ID: 3273582
Username: nws8519
Queue: gengpu
Account: p32852
----------------------------------------
The following variables are not guaranteed to be the same in the prologue and the job run script
----------------------------------------
PATH (in prologue) : /home/nws8519/.conda/envs/olmo/bin:/software/miniconda3/4.12.0/condabin:/home/nws8519/.local/bin:/home/nws8519/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/usr/lpp/mmfs/bin:/hpc/usertools
WORKDIR is: /home/nws8519
----------------------------------------
W0904 11:14:40.413000 1736745 /gpfs/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/run.py:766]
W0904 11:14:40.413000 1736745 /gpfs/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/run.py:766] *****************************************
W0904 11:14:40.413000 1736745 /gpfs/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/run.py:766] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0904 11:14:40.413000 1736745 /gpfs/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/run.py:766] *****************************************
W0904 11:14:40.413000 1736746 /gpfs/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/run.py:766]
W0904 11:14:40.413000 1736746 /gpfs/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/run.py:766] *****************************************
W0904 11:14:40.413000 1736746 /gpfs/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/run.py:766] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0904 11:14:40.413000 1736746 /gpfs/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/run.py:766] *****************************************
W0904 11:14:40.413000 2769136 /gpfs/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/run.py:766]
W0904 11:14:40.413000 2769136 /gpfs/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/run.py:766] *****************************************
W0904 11:14:40.413000 2769136 /gpfs/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/run.py:766] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0904 11:14:40.413000 2769136 /gpfs/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/run.py:766] *****************************************
W0904 11:14:40.413000 2769137 /gpfs/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/run.py:766]
W0904 11:14:40.413000 2769137 /gpfs/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/run.py:766] *****************************************
W0904 11:14:40.413000 2769137 /gpfs/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/run.py:766] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0904 11:14:40.413000 2769137 /gpfs/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/run.py:766] *****************************************
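Note on the torchrun warnings above: each agent pins OMP_NUM_THREADS to 1 by default. If the workers do CPU-heavy preprocessing (the NLTK tokenization below, for instance), the value can be raised before the heavy libraries start their thread pools. A minimal sketch, assuming 8 threads per process is a reasonable value for these nodes (the number is not from the log):

    # Hypothetical tuning sketch: raise the thread count torchrun pinned to 1.
    # Must run before torch/numpy create their thread pools, e.g. at the very top
    # of olmo_parallel_cat.py (or via an exported OMP_NUM_THREADS in the job script).
    import os
    os.environ.setdefault("OMP_NUM_THREADS", "8")  # assumed value, tune per node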
[nltk_data] Downloading package punkt to /home/nws8519/nltk_data...
[nltk_data] Downloading package punkt to /home/nws8519/nltk_data...
[nltk_data] Downloading package punkt to /home/nws8519/nltk_data...
[nltk_data] Downloading package punkt to /home/nws8519/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data]   Package punkt is already up-to-date!
[nltk_data]   Package punkt is already up-to-date!
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /home/nws8519/nltk_data...
[nltk_data] Downloading package punkt_tab to /home/nws8519/nltk_data...
[nltk_data] Downloading package punkt_tab to /home/nws8519/nltk_data...
[nltk_data] Downloading package punkt_tab to /home/nws8519/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data]   Package punkt_tab is already up-to-date!
/home/nws8519/git/mw-lifecycle-analysis/p2/quest/python_scripts/olmo_parallel_cat.py:120: DtypeWarning: Columns (21) have mixed types. Specify dtype option on import or set low_memory=False.
  df = pd.read_csv("/home/nws8519/git/mw-lifecycle-analysis/p2/quest/072525_pp_biberplus_labels.csv")
/home/nws8519/git/mw-lifecycle-analysis/p2/quest/python_scripts/olmo_parallel_cat.py:120: DtypeWarning: Columns (21) have mixed types. Specify dtype option on import or set low_memory=False.
  df = pd.read_csv("/home/nws8519/git/mw-lifecycle-analysis/p2/quest/072525_pp_biberplus_labels.csv")
/home/nws8519/git/mw-lifecycle-analysis/p2/quest/python_scripts/olmo_parallel_cat.py:120: DtypeWarning: Columns (21) have mixed types. Specify dtype option on import or set low_memory=False.
  df = pd.read_csv("/home/nws8519/git/mw-lifecycle-analysis/p2/quest/072525_pp_biberplus_labels.csv")
[rank3]:[W904 11:15:22.374478896 ProcessGroupNCCL.cpp:4715] [PG ID 0 PG GUID 0 Rank 3] using GPU 1 as device used by this process is currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. You can pecify device_id in init_process_group() to force use of a particular device.
/home/nws8519/git/mw-lifecycle-analysis/p2/quest/python_scripts/olmo_parallel_cat.py:120: DtypeWarning: Columns (21) have mixed types. Specify dtype option on import or set low_memory=False.
  df = pd.read_csv("/home/nws8519/git/mw-lifecycle-analysis/p2/quest/072525_pp_biberplus_labels.csv")
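Note on the DtypeWarning above: column 21 of 072525_pp_biberplus_labels.csv is parsed in chunks that infer different dtypes. A minimal sketch of the two fixes the warning itself suggests; the column name in the dtype mapping is a placeholder, since the log only gives the column index:

    import pandas as pd

    CSV = "/home/nws8519/git/mw-lifecycle-analysis/p2/quest/072525_pp_biberplus_labels.csv"

    # Option 1: read the whole file in one pass so dtype inference is consistent.
    df = pd.read_csv(CSV, low_memory=False)

    # Option 2: pin the offending column explicitly (column 21 by position;
    # "mixed_col" is a hypothetical name, not taken from the log).
    # df = pd.read_csv(CSV, dtype={"mixed_col": "string"})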
[rank1]:[W904 11:15:22.049509730 ProcessGroupNCCL.cpp:4715] [PG ID 0 PG GUID 0 Rank 1] using GPU 1 as device used by this process is currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. You can pecify device_id in init_process_group() to force use of a particular device.
[rank2]:[W904 11:15:22.461549051 ProcessGroupNCCL.cpp:4715] [PG ID 0 PG GUID 0 Rank 2] using GPU 0 as device used by this process is currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. You can pecify device_id in init_process_group() to force use of a particular device.
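Note on the ProcessGroupNCCL warnings above: the ranks never told NCCL which GPU they own, so the rank-to-GPU mapping is guessed. A minimal sketch of the binding the warning asks for, assuming the script reads LOCAL_RANK from torchrun; the device_id argument is available in recent PyTorch releases:

    import os
    import torch
    import torch.distributed as dist

    local_rank = int(os.environ["LOCAL_RANK"])      # set by torchrun for each worker
    torch.cuda.set_device(local_rank)               # bind this rank to its GPU up front
    dist.init_process_group(
        backend="nccl",
        device_id=torch.device(f"cuda:{local_rank}"),  # addresses the device_id warning above
    )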
Fetching 12 files:   0%|          | 0/12 [00:00<?, ?it/s]
[rank2]: Traceback (most recent call last):
[rank2]:   File "/home/nws8519/git/mw-lifecycle-analysis/p2/quest/python_scripts/olmo_parallel_cat.py", line 188, in <module>
[rank2]:     main()
[rank2]:   File "/home/nws8519/git/mw-lifecycle-analysis/p2/quest/python_scripts/olmo_parallel_cat.py", line 143, in main
[rank2]:     ddp_olmo = DDP(olmo, device_ids=[local_rank])
[rank2]:                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]:   File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/nn/parallel/distributed.py", line 850, in __init__
[rank2]:     self._ddp_init_helper(
[rank2]:   File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/nn/parallel/distributed.py", line 1201, in _ddp_init_helper
[rank2]:     self.reducer = dist.Reducer(
[rank2]:                    ^^^^^^^^^^^^^
[rank2]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 51.10 GiB. GPU 0 has a total capacity of 79.25 GiB of which 27.52 GiB is free. Including non-PyTorch memory, this process has 51.72 GiB memory in use. Of the allocated memory 51.10 GiB is allocated by PyTorch, and 875.00 KiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
[rank3]: Traceback (most recent call last):
[rank3]:   File "/home/nws8519/git/mw-lifecycle-analysis/p2/quest/python_scripts/olmo_parallel_cat.py", line 188, in <module>
[rank3]:     main()
[rank3]:   File "/home/nws8519/git/mw-lifecycle-analysis/p2/quest/python_scripts/olmo_parallel_cat.py", line 143, in main
[rank3]:     ddp_olmo = DDP(olmo, device_ids=[local_rank])
[rank3]:                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/nn/parallel/distributed.py", line 850, in __init__
[rank3]:     self._ddp_init_helper(
[rank3]:   File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/nn/parallel/distributed.py", line 1201, in _ddp_init_helper
[rank3]:     self.reducer = dist.Reducer(
[rank3]:                    ^^^^^^^^^^^^^
[rank3]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 51.10 GiB. GPU 1 has a total capacity of 79.25 GiB of which 27.52 GiB is free. Including non-PyTorch memory, this process has 51.72 GiB memory in use. Of the allocated memory 51.10 GiB is allocated by PyTorch, and 875.00 KiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
[rank0]: Traceback (most recent call last):
[rank0]:   File "/home/nws8519/git/mw-lifecycle-analysis/p2/quest/python_scripts/olmo_parallel_cat.py", line 188, in <module>
[rank0]:     main()
[rank0]:   File "/home/nws8519/git/mw-lifecycle-analysis/p2/quest/python_scripts/olmo_parallel_cat.py", line 143, in main
[rank0]:     ddp_olmo = DDP(olmo, device_ids=[local_rank])
[rank0]:                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/nn/parallel/distributed.py", line 850, in __init__
[rank0]:     self._ddp_init_helper(
[rank0]:   File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/nn/parallel/distributed.py", line 1201, in _ddp_init_helper
[rank0]:     self.reducer = dist.Reducer(
[rank0]:                    ^^^^^^^^^^^^^
[rank0]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 51.10 GiB. GPU 0 has a total capacity of 79.25 GiB of which 27.52 GiB is free. Including non-PyTorch memory, this process has 51.72 GiB memory in use. Of the allocated memory 51.10 GiB is allocated by PyTorch, and 875.00 KiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
[rank1]: Traceback (most recent call last):
[rank1]:   File "/home/nws8519/git/mw-lifecycle-analysis/p2/quest/python_scripts/olmo_parallel_cat.py", line 188, in <module>
[rank1]:     main()
[rank1]:   File "/home/nws8519/git/mw-lifecycle-analysis/p2/quest/python_scripts/olmo_parallel_cat.py", line 143, in main
[rank1]:     ddp_olmo = DDP(olmo, device_ids=[local_rank])
[rank1]:                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/nn/parallel/distributed.py", line 850, in __init__
[rank1]:     self._ddp_init_helper(
[rank1]:   File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/nn/parallel/distributed.py", line 1201, in _ddp_init_helper
[rank1]:     self.reducer = dist.Reducer(
[rank1]:                    ^^^^^^^^^^^^^
[rank1]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 51.10 GiB. GPU 1 has a total capacity of 79.25 GiB of which 27.52 GiB is free. Including non-PyTorch memory, this process has 51.72 GiB memory in use. Of the allocated memory 51.10 GiB is allocated by PyTorch, and 875.00 KiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
[rank2]:[W904 11:27:15.787618003 ProcessGroupNCCL.cpp:1476] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
[rank0]:[W904 11:27:15.409824698 ProcessGroupNCCL.cpp:1476] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
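Note on the four OutOfMemoryErrors above, which are the root cause of everything that follows: each process already has 51.72 GiB in use (the loaded model), and DDP's Reducer then tries to allocate another 51.10 GiB of gradient buckets, which cannot fit on a 79.25 GiB card. Since olmo_parallel_cat.py appears to do inference-only labeling, one way out is to drop the DDP wrapper from line 143 (or at least freeze the parameters so no reducer buckets are needed) and load the weights in half precision. A minimal sketch under those assumptions; the checkpoint name and the transformers loading path are guesses, not taken from the log:

    import os
    import torch
    import torch.distributed as dist
    from transformers import AutoModelForCausalLM  # assumption: the 12-file fetch above is a Hugging Face checkpoint

    # Suggested by the OOM message itself; must be set before any CUDA allocation.
    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

    local_rank = int(os.environ["LOCAL_RANK"])
    device = torch.device(f"cuda:{local_rank}")

    olmo = AutoModelForCausalLM.from_pretrained(
        "allenai/OLMo-2-1124-13B",      # hypothetical model id; the log never names the checkpoint
        torch_dtype=torch.bfloat16,     # roughly halves the weight footprint seen in the OOM report
    ).to(device)
    olmo.eval()
    olmo.requires_grad_(False)          # no gradients -> the 51.10 GiB reducer buckets are never needed

    # For labeling, each rank can simply process its own shard of the dataframe;
    # DDP(olmo, device_ids=[local_rank]) is only required if gradients must be synchronized.

    # ...inference over this rank's rows goes here...

    if dist.is_initialized():
        dist.destroy_process_group()    # also addresses the shutdown warnings above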
W0904 11:27:17.571000 1736746 /gpfs/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/api.py:900] Sending process 1736801 closing signal SIGTERM
E0904 11:27:17.635000 1736746 /gpfs/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/api.py:874] failed (exitcode: 1) local_rank: 1 (pid: 1736802) of binary: /home/nws8519/.conda/envs/olmo/bin/python3.11
Traceback (most recent call last):
  File "/home/nws8519/.conda/envs/olmo/bin/torchrun", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/run.py", line 892, in main
    run(args)
  File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/run.py", line 883, in run
    elastic_launch(
  File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 139, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 270, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/home/nws8519/git/mw-lifecycle-analysis/p2/quest/python_scripts/olmo_parallel_cat.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2025-09-04_11:27:17
  host      : qgpu2013
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 1736802)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
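Note on the "traceback : To enable traceback see ..." line in the summary above: torchelastic can put the worker's actual traceback into the Root Cause block, instead of "error_file: <N/A>", when the entrypoint is wrapped with its record decorator. A minimal sketch of that wiring in olmo_parallel_cat.py:

    from torch.distributed.elastic.multiprocessing.errors import record

    @record                  # writes this worker's traceback to an error file torchrun can report
    def main():
        ...                  # existing body of olmo_parallel_cat.py's main()

    if __name__ == "__main__":
        main()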
[W904 11:27:17.168398358 TCPStore.cpp:115] [c10d] recvVector failed on SocketImpl(fd=3, addr=[qgpu2014]:57300, remote=[qgpu2013]:29502): failed to recv, got 0 bytes
Exception raised from recvBytes at /pytorch/torch/csrc/distributed/c10d/Utils.hpp:678 (most recent call first):
[C++/Python stack frames through libc10.so, libtorch_cpu.so, libtorch_python.so, and python3.11 omitted]
W0904 11:27:17.959000 2769136 /gpfs/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1341] The node 'qgpu2014_2769136_0' has failed to send a keep-alive heartbeat to the rendezvous '3273582' due to an error of type RendezvousConnectionError.
W0904 11:27:17.959000 2769136 /gpfs/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/api.py:900] Sending process 2769219 closing signal SIGTERM
[W904 11:27:17.170100534 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=3, addr=[qgpu2014]:57290, remote=[qgpu2013]:29502): Broken pipe
Exception raised from sendBytes at /pytorch/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first):
[C++/Python stack frames through libc10.so, libtorch_cpu.so, libtorch_python.so, and python3.11 omitted]
W0904 11:27:17.963000 2769137 /gpfs/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1292] The node 'qgpu2014_2769137_0' has failed to shutdown the rendezvous '3273582' due to an error of type RendezvousConnectionError.
[W904 11:27:17.194777840 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=3, addr=[qgpu2014]:57290, remote=[qgpu2013]:29502): Broken pipe
Exception raised from sendBytes at /pytorch/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first):
[C++/Python stack frames through libc10.so, libtorch_cpu.so, libtorch_python.so, and python3.11 omitted]
W0904 11:27:17.986000 2769137 /gpfs/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1292] The node 'qgpu2014_2769137_0' has failed to shutdown the rendezvous '3273582' due to an error of type RendezvousConnectionError.
Traceback (most recent call last):
  File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 117, in _call_store
    return getattr(self._store, store_op)(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch.distributed.DistNetworkError: failed to recv, got 0 bytes

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/nws8519/.conda/envs/olmo/bin/torchrun", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/run.py", line 892, in main
    run(args)
  File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/run.py", line 883, in run
    elastic_launch(
  File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 139, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 261, in launch_agent
    result = agent.run()
             ^^^^^^^^^^^
  File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/elastic/metrics/api.py", line 138, in wrapper
    result = f(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^
  File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/elastic/agent/server/api.py", line 711, in run
    result = self._invoke_run(role)
             ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/elastic/agent/server/api.py", line 864, in _invoke_run
    self._initialize_workers(self._worker_group)
  File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/elastic/metrics/api.py", line 138, in wrapper
    result = f(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^
  File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/elastic/agent/server/api.py", line 683, in _initialize_workers
    self._rendezvous(worker_group)
  File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/elastic/metrics/api.py", line 138, in wrapper
    result = f(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^
  File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/elastic/agent/server/api.py", line 500, in _rendezvous
    rdzv_info = spec.rdzv_handler.next_rendezvous()
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 1170, in next_rendezvous
    self._op_executor.run(join_op, deadline, self._get_deadline)
  File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 648, in run
    has_set = self._state_holder.sync()
              ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 437, in sync
    get_response = self._backend.get_state()
                   ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 75, in get_state
    base64_state: bytes = self._call_store("get", self._key)
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 119, in _call_store
    raise RendezvousConnectionError(
torch.distributed.elastic.rendezvous.api.RendezvousConnectionError: The connection to the C10d store has failed. See inner exception for details.
E0904 11:27:18.023000 2769136 /gpfs/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/api.py:874] failed (exitcode: 1) local_rank: 0 (pid: 2769218) of binary: /home/nws8519/.conda/envs/olmo/bin/python3.11
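Note on the RendezvousConnectionError above and the Broken pipe / sendBytes noise that follows: these are downstream of the worker OOMs. The c10d rendezvous keeps its shared state in a TCPStore served from [qgpu2013]:29502; once the agent hosting that store tears down, the qgpu2014 agent's heartbeat, shutdown, and state reads all fail. A small illustrative sketch of what that store connection is (host and port copied from the log; this is explanatory, not a fix):

    from datetime import timedelta
    from torch.distributed import TCPStore

    # The torchrun agent on qgpu2013 hosts the store (is_master=True); the qgpu2014
    # agent only holds a client connection. When the host process exits, every later
    # read/write surfaces as "Broken pipe" or "failed to recv, got 0 bytes".
    store = TCPStore("qgpu2013", 29502, is_master=False, timeout=timedelta(seconds=30))
    state = store.get("torchelastic/rdzv/state")  # hypothetical key; raises once the host is gone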
[W904 11:27:18.239612027 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=3, addr=[qgpu2014]:57300, remote=[qgpu2013]:29502): Broken pipe
Exception raised from sendBytes at /pytorch/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first):
[C++/Python stack frames through libc10.so, libtorch_cpu.so, libtorch_python.so, and python3.11 omitted]
W0904 11:27:18.030000 2769136 /gpfs/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1292] The node 'qgpu2014_2769136_0' has failed to shutdown the rendezvous '3273582' due to an error of type RendezvousConnectionError.
[W904 11:27:18.248039930 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=3, addr=[qgpu2014]:57300, remote=[qgpu2013]:29502): Broken pipe
Exception raised from sendBytes at /pytorch/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first):
[C++/Python stack frames through libc10.so, libtorch_cpu.so, libtorch_python.so, and python3.11 omitted]
W0904 11:27:18.038000 2769136 /gpfs/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1292] The node 'qgpu2014_2769136_0' has failed to shutdown the rendezvous '3273582' due to an error of type RendezvousConnectionError.
[W904 11:27:18.255885548 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=3, addr=[qgpu2014]:57300, remote=[qgpu2013]:29502): Broken pipe
Exception raised from sendBytes at /pytorch/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first):
[C++/Python stack frames through libc10.so, libtorch_cpu.so, libtorch_python.so, and python3.11 omitted]
W0904 11:27:18.046000 2769136 /gpfs/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1292] The node 'qgpu2014_2769136_0' has failed to shutdown the rendezvous '3273582' due to an error of type RendezvousConnectionError.
Traceback (most recent call last):
  File "/home/nws8519/.conda/envs/olmo/bin/torchrun", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/run.py", line 892, in main
    run(args)
  File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/run.py", line 883, in run
    elastic_launch(
  File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 139, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 270, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/home/nws8519/git/mw-lifecycle-analysis/p2/quest/python_scripts/olmo_parallel_cat.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2025-09-04_11:27:17
  host      : qgpu2014
  rank      : 2 (local_rank: 0)
  exitcode  : 1 (pid: 2769218)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
srun: error: qgpu2013: task 1: Exited with exit code 1
srun: error: qgpu2014: tasks 2-3: Exited with exit code 1
[W904 11:27:18.383886513 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=3, addr=[qgpu2013]:36246, remote=[qgpu2013]:29502): Broken pipe
Exception raised from sendBytes at /pytorch/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first):
[C++/Python stack frames through libc10.so, libtorch_cpu.so, libtorch_python.so, and python3.11 omitted]
W0904 11:27:18.554000 1736745 /gpfs/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1292] The node 'qgpu2013_1736745_0' has failed to shutdown the rendezvous '3273582' due to an error of type RendezvousConnectionError.
[W904 11:27:18.394906553 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=3, addr=[qgpu2013]:36246, remote=[qgpu2013]:29502): Broken pipe
Exception raised from sendBytes at /pytorch/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first):
[C++/Python stack frames through libc10.so, libtorch_cpu.so, libtorch_python.so, and python3.11 omitted]
W0904 11:27:18.565000 1736745 /gpfs/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1292] The node 'qgpu2013_1736745_0' has failed to shutdown the rendezvous '3273582' due to an error of type RendezvousConnectionError.
Traceback (most recent call last):
  File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 117, in _call_store
    return getattr(self._store, store_op)(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch.distributed.DistNetworkError: failed to recv, got 0 bytes

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/nws8519/.conda/envs/olmo/bin/torchrun", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/run.py", line 892, in main
    run(args)
  File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/run.py", line 883, in run
    elastic_launch(
  File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 139, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 261, in launch_agent
    result = agent.run()
             ^^^^^^^^^^^
  File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/elastic/metrics/api.py", line 138, in wrapper
    result = f(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^
  File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/elastic/agent/server/api.py", line 711, in run
    result = self._invoke_run(role)
             ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/elastic/agent/server/api.py", line 864, in _invoke_run
    self._initialize_workers(self._worker_group)
  File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/elastic/metrics/api.py", line 138, in wrapper
    result = f(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^
  File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/elastic/agent/server/api.py", line 683, in _initialize_workers
    self._rendezvous(worker_group)
  File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/elastic/metrics/api.py", line 138, in wrapper
    result = f(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^
  File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/elastic/agent/server/api.py", line 500, in _rendezvous
    rdzv_info = spec.rdzv_handler.next_rendezvous()
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 1170, in next_rendezvous
    self._op_executor.run(join_op, deadline, self._get_deadline)
  File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 648, in run
    has_set = self._state_holder.sync()
              ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 437, in sync
    get_response = self._backend.get_state()
                   ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 75, in get_state
    base64_state: bytes = self._call_store("get", self._key)
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 119, in _call_store
    raise RendezvousConnectionError(
torch.distributed.elastic.rendezvous.api.RendezvousConnectionError: The connection to the C10d store has failed. See inner exception for details.
srun: error: qgpu2013: task 0: Exited with exit code 1
unsupervised olmo categorization pau at Thu Sep 4 11:27:18 CDT 2025