setting up the environment by loading the conda environment at Thu Sep 4 11:14:26 CDT 2025
running the olmo labeling job at Thu Sep 4 11:14:26 CDT 2025
----------------------------------------
srun job start: Thu Sep 4 11:14:27 CDT 2025
Job ID: 3273582
Username: nws8519
Queue: gengpu
Account: p32852
----------------------------------------
The following variables are not guaranteed to be the same in the prologue and the job run script
----------------------------------------
PATH (in prologue) : /home/nws8519/.conda/envs/olmo/bin:/software/miniconda3/4.12.0/condabin:/home/nws8519/.local/bin:/home/nws8519/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/usr/lpp/mmfs/bin:/hpc/usertools
WORKDIR is: /home/nws8519
----------------------------------------
W0904 11:14:40.413000 1736745 /gpfs/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/run.py:766]
|
|
W0904 11:14:40.413000 1736745 /gpfs/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/run.py:766] *****************************************
|
|
W0904 11:14:40.413000 1736745 /gpfs/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/run.py:766] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
|
|
W0904 11:14:40.413000 1736745 /gpfs/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/run.py:766] *****************************************
|
|
W0904 11:14:40.413000 1736746 /gpfs/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/run.py:766]
|
|
W0904 11:14:40.413000 1736746 /gpfs/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/run.py:766] *****************************************
|
|
W0904 11:14:40.413000 1736746 /gpfs/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/run.py:766] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
|
|
W0904 11:14:40.413000 1736746 /gpfs/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/run.py:766] *****************************************
|
|
W0904 11:14:40.413000 2769136 /gpfs/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/run.py:766]
|
|
W0904 11:14:40.413000 2769136 /gpfs/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/run.py:766] *****************************************
|
|
W0904 11:14:40.413000 2769136 /gpfs/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/run.py:766] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
|
|
W0904 11:14:40.413000 2769136 /gpfs/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/run.py:766] *****************************************
|
|
W0904 11:14:40.413000 2769137 /gpfs/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/run.py:766]
|
|
W0904 11:14:40.413000 2769137 /gpfs/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/run.py:766] *****************************************
|
|
W0904 11:14:40.413000 2769137 /gpfs/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/run.py:766] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
|
|
W0904 11:14:40.413000 2769137 /gpfs/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/run.py:766] *****************************************
|
|
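The repeated torchrun warning above only notes that each worker process defaults to OMP_NUM_THREADS=1. If the labeling script does meaningful CPU work (tokenization, pandas), the thread count can be raised to match the Slurm allocation. A minimal sketch, assuming the job exports SLURM_CPUS_PER_TASK; nothing below is taken from olmo_parallel_cat.py itself:

import os

# Pick up the CPUs Slurm actually granted this task; the env var should be set
# before libraries that spin up OpenMP thread pools are imported.
cpus = int(os.environ.get("SLURM_CPUS_PER_TASK", "1"))
os.environ.setdefault("OMP_NUM_THREADS", str(cpus))

import torch

# Intra-op parallelism can also be adjusted at runtime.
torch.set_num_threads(cpus)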
[nltk_data] Downloading package punkt to /home/nws8519/nltk_data...
[nltk_data] Downloading package punkt to /home/nws8519/nltk_data...
[nltk_data] Downloading package punkt to /home/nws8519/nltk_data...
[nltk_data] Downloading package punkt to /home/nws8519/nltk_data...
[nltk_data] Package punkt is already up-to-date!
[nltk_data] Package punkt is already up-to-date!
[nltk_data] Package punkt is already up-to-date!
[nltk_data] Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /home/nws8519/nltk_data...
[nltk_data] Downloading package punkt_tab to /home/nws8519/nltk_data...
[nltk_data] Downloading package punkt_tab to /home/nws8519/nltk_data...
[nltk_data] Downloading package punkt_tab to /home/nws8519/nltk_data...
[nltk_data] Package punkt_tab is already up-to-date!
[nltk_data] Package punkt_tab is already up-to-date!
[nltk_data] Package punkt_tab is already up-to-date!
[nltk_data] Package punkt_tab is already up-to-date!
/home/nws8519/git/mw-lifecycle-analysis/p2/quest/python_scripts/olmo_parallel_cat.py:120: DtypeWarning: Columns (21) have mixed types. Specify dtype option on import or set low_memory=False.
df = pd.read_csv("/home/nws8519/git/mw-lifecycle-analysis/p2/quest/072525_pp_biberplus_labels.csv")
/home/nws8519/git/mw-lifecycle-analysis/p2/quest/python_scripts/olmo_parallel_cat.py:120: DtypeWarning: Columns (21) have mixed types. Specify dtype option on import or set low_memory=False.
df = pd.read_csv("/home/nws8519/git/mw-lifecycle-analysis/p2/quest/072525_pp_biberplus_labels.csv")
/home/nws8519/git/mw-lifecycle-analysis/p2/quest/python_scripts/olmo_parallel_cat.py:120: DtypeWarning: Columns (21) have mixed types. Specify dtype option on import or set low_memory=False.
df = pd.read_csv("/home/nws8519/git/mw-lifecycle-analysis/p2/quest/072525_pp_biberplus_labels.csv")
[rank3]:[W904 11:15:22.374478896 ProcessGroupNCCL.cpp:4715] [PG ID 0 PG GUID 0 Rank 3] using GPU 1 as device used by this process is currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. You can specify device_id in init_process_group() to force use of a particular device.
/home/nws8519/git/mw-lifecycle-analysis/p2/quest/python_scripts/olmo_parallel_cat.py:120: DtypeWarning: Columns (21) have mixed types. Specify dtype option on import or set low_memory=False.
df = pd.read_csv("/home/nws8519/git/mw-lifecycle-analysis/p2/quest/072525_pp_biberplus_labels.csv")
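pandas raises the DtypeWarning above because column 21 comes out with mixed types when the CSV is inferred in chunks; the warning itself names the two usual remedies. A minimal sketch, reusing the path from the script; the explicit column name in the dtype mapping is a placeholder, since the real header is not shown in this log:

import pandas as pd

csv_path = "/home/nws8519/git/mw-lifecycle-analysis/p2/quest/072525_pp_biberplus_labels.csv"

# Option 1: disable chunked type inference so the whole column is scanned at once.
df = pd.read_csv(csv_path, low_memory=False)

# Option 2: pin the offending column (index 21 per the warning) to one dtype;
# "column_21" is a placeholder for the actual header name.
df = pd.read_csv(csv_path, dtype={"column_21": str})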
[rank1]:[W904 11:15:22.049509730 ProcessGroupNCCL.cpp:4715] [PG ID 0 PG GUID 0 Rank 1] using GPU 1 as device used by this process is currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. You can specify device_id in init_process_group() to force use of a particular device.
[rank2]:[W904 11:15:22.461549051 ProcessGroupNCCL.cpp:4715] [PG ID 0 PG GUID 0 Rank 2] using GPU 0 as device used by this process is currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. You can specify device_id in init_process_group() to force use of a particular device.
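The ProcessGroupNCCL warnings above (and the matching one from rank 0 further down) go away when each rank tells init_process_group which CUDA device it owns, exactly as the message suggests. A minimal sketch, assuming local_rank is read from the LOCAL_RANK variable that torchrun exports; how olmo_parallel_cat.py actually initialises the group is not visible in this log:

import os

import torch
import torch.distributed as dist

local_rank = int(os.environ["LOCAL_RANK"])  # exported by torchrun per worker
torch.cuda.set_device(local_rank)

# Binding the process group to an explicit device pins the rank-to-GPU mapping,
# which is what the warning asks for.
dist.init_process_group(backend="nccl", device_id=torch.device(f"cuda:{local_rank}"))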
Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`
Fetching 12 files: 100%|██████████| 12/12 [07:49<00:00, 39.09s/it]
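The progress bar above shows the job spending nearly eight minutes downloading the twelve checkpoint shards (over plain HTTP, per the hf_xet warning) before any GPU work starts. One way to avoid paying that cost inside the GPU allocation is to warm the Hugging Face cache beforehand; a minimal sketch using huggingface_hub, with a placeholder repo id since the log does not name the OLMo checkpoint being loaded:

from huggingface_hub import snapshot_download

# Placeholder id -- substitute the checkpoint that olmo_parallel_cat.py loads.
repo_id = "allenai/OLMo-2-1124-13B-Instruct"

# Downloads (or verifies) all shards into the shared HF cache ahead of the job.
snapshot_download(repo_id=repo_id)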
Loading checkpoint shards: 100%|██████████| 12/12 [00:04<00:00, 2.73it/s]
[rank0]:[W904 11:25:15.000410288 ProcessGroupNCCL.cpp:4715] [PG ID 0 PG GUID 0 Rank 0] using GPU 0 as device used by this process is currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. You can specify device_id in init_process_group() to force use of a particular device.
Loading checkpoint shards: 100%|██████████| 12/12 [00:02<00:00, 5.96it/s]
Loading checkpoint shards: 100%|██████████| 12/12 [00:05<00:00, 2.09it/s]
Loading checkpoint shards: 100%|██████████| 12/12 [00:05<00:00, 2.09it/s]
[rank2]: Traceback (most recent call last):
|
|
[rank2]: File "/home/nws8519/git/mw-lifecycle-analysis/p2/quest/python_scripts/olmo_parallel_cat.py", line 188, in <module>
|
|
[rank2]: main()
|
|
[rank2]: File "/home/nws8519/git/mw-lifecycle-analysis/p2/quest/python_scripts/olmo_parallel_cat.py", line 143, in main
|
|
[rank2]: ddp_olmo = DDP(olmo, device_ids=[local_rank])
|
|
[rank2]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
|
[rank2]: File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/nn/parallel/distributed.py", line 850, in __init__
|
|
[rank2]: self._ddp_init_helper(
|
|
[rank2]: File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/nn/parallel/distributed.py", line 1201, in _ddp_init_helper
|
|
[rank2]: self.reducer = dist.Reducer(
|
|
[rank2]: ^^^^^^^^^^^^^
|
|
[rank2]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 51.10 GiB. GPU 0 has a total capacity of 79.25 GiB of which 27.52 GiB is free. Including non-PyTorch memory, this process has 51.72 GiB memory in use. Of the allocated memory 51.10 GiB is allocated by PyTorch, and 875.00 KiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
|
|
[rank3]: Traceback (most recent call last):
|
|
[rank3]: File "/home/nws8519/git/mw-lifecycle-analysis/p2/quest/python_scripts/olmo_parallel_cat.py", line 188, in <module>
|
|
[rank3]: main()
|
|
[rank3]: File "/home/nws8519/git/mw-lifecycle-analysis/p2/quest/python_scripts/olmo_parallel_cat.py", line 143, in main
|
|
[rank3]: ddp_olmo = DDP(olmo, device_ids=[local_rank])
|
|
[rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
|
[rank3]: File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/nn/parallel/distributed.py", line 850, in __init__
|
|
[rank3]: self._ddp_init_helper(
|
|
[rank3]: File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/nn/parallel/distributed.py", line 1201, in _ddp_init_helper
|
|
[rank3]: self.reducer = dist.Reducer(
|
|
[rank3]: ^^^^^^^^^^^^^
|
|
[rank3]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 51.10 GiB. GPU 1 has a total capacity of 79.25 GiB of which 27.52 GiB is free. Including non-PyTorch memory, this process has 51.72 GiB memory in use. Of the allocated memory 51.10 GiB is allocated by PyTorch, and 875.00 KiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
|
|
[rank0]: Traceback (most recent call last):
|
|
[rank0]: File "/home/nws8519/git/mw-lifecycle-analysis/p2/quest/python_scripts/olmo_parallel_cat.py", line 188, in <module>
|
|
[rank0]: main()
|
|
[rank0]: File "/home/nws8519/git/mw-lifecycle-analysis/p2/quest/python_scripts/olmo_parallel_cat.py", line 143, in main
|
|
[rank0]: ddp_olmo = DDP(olmo, device_ids=[local_rank])
|
|
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
|
[rank0]: File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/nn/parallel/distributed.py", line 850, in __init__
|
|
[rank0]: self._ddp_init_helper(
|
|
[rank0]: File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/nn/parallel/distributed.py", line 1201, in _ddp_init_helper
|
|
[rank0]: self.reducer = dist.Reducer(
|
|
[rank0]: ^^^^^^^^^^^^^
|
|
[rank0]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 51.10 GiB. GPU 0 has a total capacity of 79.25 GiB of which 27.52 GiB is free. Including non-PyTorch memory, this process has 51.72 GiB memory in use. Of the allocated memory 51.10 GiB is allocated by PyTorch, and 875.00 KiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
|
|
[rank1]: Traceback (most recent call last):
|
|
[rank1]: File "/home/nws8519/git/mw-lifecycle-analysis/p2/quest/python_scripts/olmo_parallel_cat.py", line 188, in <module>
|
|
[rank1]: main()
|
|
[rank1]: File "/home/nws8519/git/mw-lifecycle-analysis/p2/quest/python_scripts/olmo_parallel_cat.py", line 143, in main
|
|
[rank1]: ddp_olmo = DDP(olmo, device_ids=[local_rank])
|
|
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
|
[rank1]: File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/nn/parallel/distributed.py", line 850, in __init__
|
|
[rank1]: self._ddp_init_helper(
|
|
[rank1]: File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/nn/parallel/distributed.py", line 1201, in _ddp_init_helper
|
|
[rank1]: self.reducer = dist.Reducer(
|
|
[rank1]: ^^^^^^^^^^^^^
|
|
[rank1]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 51.10 GiB. GPU 1 has a total capacity of 79.25 GiB of which 27.52 GiB is free. Including non-PyTorch memory, this process has 51.72 GiB memory in use. Of the allocated memory 51.10 GiB is allocated by PyTorch, and 875.00 KiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
|
|
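All four ranks fail at the same point: DDP's Reducer tries to allocate a 51.10 GiB gradient bucket for a model that already occupies about 51.7 GiB on each GPU, and the remaining 27.5 GiB cannot hold it. Since this is a labeling (inference-only) job, the gradient buckets are never needed; a minimal sketch of running without the DDP wrapper, under torch.no_grad(). The names olmo and local_rank come from the traceback above; the checkpoint id and the use of transformers are assumptions, not the script's actual code:

import os

import torch
from transformers import AutoModelForCausalLM

# The allocator hint suggested by the error message; it mitigates fragmentation
# but does not by itself recover 51 GiB.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

local_rank = int(os.environ["LOCAL_RANK"])
device = torch.device(f"cuda:{local_rank}")

olmo = AutoModelForCausalLM.from_pretrained(
    "allenai/OLMo-2-1124-13B-Instruct",  # placeholder checkpoint id
    torch_dtype=torch.bfloat16,
).to(device).eval()

# No DDP wrapper and no autograd: each rank just runs its slice of the rows.
with torch.no_grad():
    ...  # labeling loop over this rank's share of the dataframe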
[rank2]:[W904 11:27:15.787618003 ProcessGroupNCCL.cpp:1476] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
|
|
[rank0]:[W904 11:27:15.409824698 ProcessGroupNCCL.cpp:1476] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
|
|
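The two warnings above are harmless but easy to silence: the process group should be destroyed explicitly before the workers exit, as the linked shutdown documentation describes. A minimal sketch of the pattern, assuming the group is created inside main():

import torch.distributed as dist

def main():
    dist.init_process_group(backend="nccl")
    try:
        ...  # model loading and the labeling work
    finally:
        # Explicit teardown releases NCCL resources and avoids the
        # "destroy_process_group() was not called" warning, even on error paths.
        dist.destroy_process_group()

if __name__ == "__main__":
    main()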
W0904 11:27:17.571000 1736746 /gpfs/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/api.py:900] Sending process 1736801 closing signal SIGTERM
|
|
E0904 11:27:17.635000 1736746 /gpfs/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/api.py:874] failed (exitcode: 1) local_rank: 1 (pid: 1736802) of binary: /home/nws8519/.conda/envs/olmo/bin/python3.11
|
|
Traceback (most recent call last):
|
|
File "/home/nws8519/.conda/envs/olmo/bin/torchrun", line 8, in <module>
|
|
sys.exit(main())
|
|
^^^^^^
|
|
File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
|
|
return f(*args, **kwargs)
|
|
^^^^^^^^^^^^^^^^^^
|
|
File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/run.py", line 892, in main
|
|
run(args)
|
|
File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/run.py", line 883, in run
|
|
elastic_launch(
|
|
File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 139, in __call__
|
|
return launch_agent(self._config, self._entrypoint, list(args))
|
|
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
|
File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 270, in launch_agent
|
|
raise ChildFailedError(
|
|
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
|
|
============================================================
|
|
/home/nws8519/git/mw-lifecycle-analysis/p2/quest/python_scripts/olmo_parallel_cat.py FAILED
|
|
------------------------------------------------------------
|
|
Failures:
|
|
<NO_OTHER_FAILURES>
|
|
------------------------------------------------------------
|
|
Root Cause (first observed failure):
|
|
[0]:
|
|
time : 2025-09-04_11:27:17
|
|
host : qgpu2013
|
|
rank : 1 (local_rank: 1)
|
|
exitcode : 1 (pid: 1736802)
|
|
error_file: <N/A>
|
|
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
|
|
============================================================
|
|
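The failure summary above ends with error_file: <N/A> and points at the elastic errors documentation; wrapping the entrypoint with the record decorator makes each worker write a structured error file that torchrun can report instead. A minimal sketch of that pattern:

from torch.distributed.elastic.multiprocessing.errors import record

@record
def main():
    ...  # existing body of olmo_parallel_cat.py's main()

if __name__ == "__main__":
    main()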
[W904 11:27:17.168398358 TCPStore.cpp:115] [c10d] recvVector failed on SocketImpl(fd=3, addr=[qgpu2014]:57300, remote=[qgpu2013]:29502): failed to recv, got 0 bytes
|
|
Exception raised from recvBytes at /pytorch/torch/csrc/distributed/c10d/Utils.hpp:678 (most recent call first):
|
|
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x14873d0d85e8 in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libc10.so)
|
|
frame #1: <unknown function> + 0x5ba8afe (0x1487811fbafe in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
|
|
frame #2: <unknown function> + 0x5baa0d0 (0x1487811fd0d0 in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
|
|
frame #3: <unknown function> + 0x5baa81d (0x1487811fd81d in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
|
|
frame #4: <unknown function> + 0x5bab4a9 (0x1487811fe4a9 in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
|
|
frame #5: c10d::TCPStore::compareSet(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<unsigned char, std::allocator<unsigned char> > const&, std::vector<unsigned char, std::allocator<unsigned char> > const&) + 0x1fb (0x1487811f84cb in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
|
|
frame #6: <unknown function> + 0xc2a761 (0x148790587761 in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libtorch_python.so)
|
|
frame #7: <unknown function> + 0x38a0cc (0x14878fce70cc in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libtorch_python.so)
|
|
frame #8: /home/nws8519/.conda/envs/olmo/bin/python3.11() [0x528b17]
|
|
frame #9: _PyObject_MakeTpCall + 0x27c (0x50452c in /home/nws8519/.conda/envs/olmo/bin/python3.11)
|
|
frame #10: /home/nws8519/.conda/envs/olmo/bin/python3.11() [0x557ac9]
|
|
frame #11: _PyEval_EvalFrameDefault + 0x47c0 (0x515b90 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
|
|
frame #12: _PyFunction_Vectorcall + 0x173 (0x539153 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
|
|
frame #13: _PyEval_EvalFrameDefault + 0x47c0 (0x515b90 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
|
|
frame #14: _PyFunction_Vectorcall + 0x173 (0x539153 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
|
|
frame #15: _PyEval_EvalFrameDefault + 0x47c0 (0x515b90 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
|
|
frame #16: /home/nws8519/.conda/envs/olmo/bin/python3.11() [0x5581df]
|
|
frame #17: /home/nws8519/.conda/envs/olmo/bin/python3.11() [0x557a20]
|
|
frame #18: /home/nws8519/.conda/envs/olmo/bin/python3.11() [0x62a8a3]
|
|
frame #19: /home/nws8519/.conda/envs/olmo/bin/python3.11() [0x5fa3c4]
|
|
frame #20: <unknown function> + 0x81ca (0x1487a64991ca in /lib64/libpthread.so.0)
|
|
frame #21: clone + 0x43 (0x1487a596a8d3 in /lib64/libc.so.6)
|
|
|
|
W0904 11:27:17.959000 2769136 /gpfs/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1341] The node 'qgpu2014_2769136_0' has failed to send a keep-alive heartbeat to the rendezvous '3273582' due to an error of type RendezvousConnectionError.
|
|
W0904 11:27:17.959000 2769136 /gpfs/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/api.py:900] Sending process 2769219 closing signal SIGTERM
|
|
[W904 11:27:17.170100534 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=3, addr=[qgpu2014]:57290, remote=[qgpu2013]:29502): Broken pipe
|
|
Exception raised from sendBytes at /pytorch/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first):
|
|
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x14c218b8e5e8 in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libc10.so)
|
|
frame #1: <unknown function> + 0x5ba8afe (0x14c25ccb1afe in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
|
|
frame #2: <unknown function> + 0x5baa358 (0x14c25ccb3358 in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
|
|
frame #3: <unknown function> + 0x5babb3e (0x14c25ccb4b3e in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
|
|
frame #4: c10d::TCPStore::doWait(c10::ArrayRef<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::chrono::duration<long, std::ratio<1l, 1000l> >) + 0x1a6 (0x14c25ccaeac6 in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
|
|
frame #5: c10d::TCPStore::doGet(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x33 (0x14c25ccaeea3 in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
|
|
frame #6: c10d::TCPStore::get(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xab (0x14c25ccaff8b in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
|
|
frame #7: <unknown function> + 0xc2a390 (0x14c26c03d390 in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libtorch_python.so)
|
|
frame #8: <unknown function> + 0x38a0cc (0x14c26b79d0cc in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libtorch_python.so)
|
|
frame #9: /home/nws8519/.conda/envs/olmo/bin/python3.11() [0x528b17]
|
|
frame #10: _PyObject_MakeTpCall + 0x27c (0x50452c in /home/nws8519/.conda/envs/olmo/bin/python3.11)
|
|
frame #11: /home/nws8519/.conda/envs/olmo/bin/python3.11() [0x557ac9]
|
|
frame #12: _PyEval_EvalFrameDefault + 0x47c0 (0x515b90 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
|
|
frame #13: _PyFunction_Vectorcall + 0x173 (0x539153 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
|
|
frame #14: _PyEval_EvalFrameDefault + 0x47c0 (0x515b90 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
|
|
frame #15: _PyFunction_Vectorcall + 0x173 (0x539153 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
|
|
frame #16: _PyObject_FastCallDictTstate + 0x65 (0x508e05 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
|
|
frame #17: _PyObject_Call_Prepend + 0x66 (0x540ac6 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
|
|
frame #18: /home/nws8519/.conda/envs/olmo/bin/python3.11() [0x611dd7]
|
|
frame #19: PyObject_Call + 0xbd (0x54303d in /home/nws8519/.conda/envs/olmo/bin/python3.11)
|
|
frame #20: _PyEval_EvalFrameDefault + 0x47c0 (0x515b90 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
|
|
frame #21: _PyFunction_Vectorcall + 0x173 (0x539153 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
|
|
frame #22: _PyEval_EvalFrameDefault + 0x47c0 (0x515b90 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
|
|
frame #23: /home/nws8519/.conda/envs/olmo/bin/python3.11() [0x5cc3aa]
|
|
frame #24: PyEval_EvalCode + 0x9f (0x5cba7f in /home/nws8519/.conda/envs/olmo/bin/python3.11)
|
|
frame #25: /home/nws8519/.conda/envs/olmo/bin/python3.11() [0x5ecba7]
|
|
frame #26: /home/nws8519/.conda/envs/olmo/bin/python3.11() [0x5e8740]
|
|
frame #27: /home/nws8519/.conda/envs/olmo/bin/python3.11() [0x5fd5f2]
|
|
frame #28: _PyRun_SimpleFileObject + 0x19f (0x5fc9bf in /home/nws8519/.conda/envs/olmo/bin/python3.11)
|
|
frame #29: _PyRun_AnyFileObject + 0x43 (0x5fc6e3 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
|
|
frame #30: Py_RunMain + 0x2ee (0x5f73fe in /home/nws8519/.conda/envs/olmo/bin/python3.11)
|
|
frame #31: Py_BytesMain + 0x39 (0x5bc149 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
|
|
frame #32: __libc_start_main + 0xe5 (0x14c2814217e5 in /lib64/libc.so.6)
|
|
frame #33: /home/nws8519/.conda/envs/olmo/bin/python3.11() [0x5bbf93]
|
|
|
|
W0904 11:27:17.963000 2769137 /gpfs/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1292] The node 'qgpu2014_2769137_0' has failed to shutdown the rendezvous '3273582' due to an error of type RendezvousConnectionError.
|
|
[W904 11:27:17.194777840 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=3, addr=[qgpu2014]:57290, remote=[qgpu2013]:29502): Broken pipe
|
|
Exception raised from sendBytes at /pytorch/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first):
|
|
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x14c218b8e5e8 in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libc10.so)
|
|
frame #1: <unknown function> + 0x5ba8afe (0x14c25ccb1afe in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
|
|
frame #2: <unknown function> + 0x5baa358 (0x14c25ccb3358 in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
|
|
frame #3: <unknown function> + 0x5babb3e (0x14c25ccb4b3e in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
|
|
frame #4: c10d::TCPStore::doWait(c10::ArrayRef<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::chrono::duration<long, std::ratio<1l, 1000l> >) + 0x1a6 (0x14c25ccaeac6 in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
|
|
frame #5: c10d::TCPStore::doGet(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x33 (0x14c25ccaeea3 in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
|
|
frame #6: c10d::TCPStore::get(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xab (0x14c25ccaff8b in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
|
|
frame #7: <unknown function> + 0xc2a390 (0x14c26c03d390 in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libtorch_python.so)
|
|
frame #8: <unknown function> + 0x38a0cc (0x14c26b79d0cc in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libtorch_python.so)
|
|
frame #9: /home/nws8519/.conda/envs/olmo/bin/python3.11() [0x528b17]
|
|
frame #10: _PyObject_MakeTpCall + 0x27c (0x50452c in /home/nws8519/.conda/envs/olmo/bin/python3.11)
|
|
frame #11: /home/nws8519/.conda/envs/olmo/bin/python3.11() [0x557ac9]
|
|
frame #12: _PyEval_EvalFrameDefault + 0x47c0 (0x515b90 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
|
|
frame #13: _PyFunction_Vectorcall + 0x173 (0x539153 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
|
|
frame #14: _PyObject_FastCallDictTstate + 0x65 (0x508e05 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
|
|
frame #15: _PyObject_Call_Prepend + 0x66 (0x540ac6 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
|
|
frame #16: /home/nws8519/.conda/envs/olmo/bin/python3.11() [0x611dd7]
|
|
frame #17: PyObject_Call + 0xbd (0x54303d in /home/nws8519/.conda/envs/olmo/bin/python3.11)
|
|
frame #18: _PyEval_EvalFrameDefault + 0x47c0 (0x515b90 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
|
|
frame #19: _PyFunction_Vectorcall + 0x173 (0x539153 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
|
|
frame #20: _PyEval_EvalFrameDefault + 0x47c0 (0x515b90 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
|
|
frame #21: /home/nws8519/.conda/envs/olmo/bin/python3.11() [0x5cc3aa]
|
|
frame #22: PyEval_EvalCode + 0x9f (0x5cba7f in /home/nws8519/.conda/envs/olmo/bin/python3.11)
|
|
frame #23: /home/nws8519/.conda/envs/olmo/bin/python3.11() [0x5ecba7]
|
|
frame #24: /home/nws8519/.conda/envs/olmo/bin/python3.11() [0x5e8740]
|
|
frame #25: /home/nws8519/.conda/envs/olmo/bin/python3.11() [0x5fd5f2]
|
|
frame #26: _PyRun_SimpleFileObject + 0x19f (0x5fc9bf in /home/nws8519/.conda/envs/olmo/bin/python3.11)
|
|
frame #27: _PyRun_AnyFileObject + 0x43 (0x5fc6e3 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
|
|
frame #28: Py_RunMain + 0x2ee (0x5f73fe in /home/nws8519/.conda/envs/olmo/bin/python3.11)
|
|
frame #29: Py_BytesMain + 0x39 (0x5bc149 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
|
|
frame #30: __libc_start_main + 0xe5 (0x14c2814217e5 in /lib64/libc.so.6)
|
|
frame #31: /home/nws8519/.conda/envs/olmo/bin/python3.11() [0x5bbf93]
|
|
|
|
W0904 11:27:17.986000 2769137 /gpfs/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1292] The node 'qgpu2014_2769137_0' has failed to shutdown the rendezvous '3273582' due to an error of type RendezvousConnectionError.
|
|
Traceback (most recent call last):
|
|
File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 117, in _call_store
|
|
return getattr(self._store, store_op)(*args, **kwargs)
|
|
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
|
torch.distributed.DistNetworkError: failed to recv, got 0 bytes
|
|
|
|
The above exception was the direct cause of the following exception:
|
|
|
|
Traceback (most recent call last):
|
|
File "/home/nws8519/.conda/envs/olmo/bin/torchrun", line 8, in <module>
|
|
sys.exit(main())
|
|
^^^^^^
|
|
File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
|
|
return f(*args, **kwargs)
|
|
^^^^^^^^^^^^^^^^^^
|
|
File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/run.py", line 892, in main
|
|
run(args)
|
|
File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/run.py", line 883, in run
|
|
elastic_launch(
|
|
File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 139, in __call__
|
|
return launch_agent(self._config, self._entrypoint, list(args))
|
|
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
|
File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 261, in launch_agent
|
|
result = agent.run()
|
|
^^^^^^^^^^^
|
|
File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/elastic/metrics/api.py", line 138, in wrapper
|
|
result = f(*args, **kwargs)
|
|
^^^^^^^^^^^^^^^^^^
|
|
File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/elastic/agent/server/api.py", line 711, in run
|
|
result = self._invoke_run(role)
|
|
^^^^^^^^^^^^^^^^^^^^^^
|
|
File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/elastic/agent/server/api.py", line 864, in _invoke_run
|
|
self._initialize_workers(self._worker_group)
|
|
File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/elastic/metrics/api.py", line 138, in wrapper
|
|
result = f(*args, **kwargs)
|
|
^^^^^^^^^^^^^^^^^^
|
|
File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/elastic/agent/server/api.py", line 683, in _initialize_workers
|
|
self._rendezvous(worker_group)
|
|
File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/elastic/metrics/api.py", line 138, in wrapper
|
|
result = f(*args, **kwargs)
|
|
^^^^^^^^^^^^^^^^^^
|
|
File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/elastic/agent/server/api.py", line 500, in _rendezvous
|
|
rdzv_info = spec.rdzv_handler.next_rendezvous()
|
|
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
|
File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 1170, in next_rendezvous
|
|
self._op_executor.run(join_op, deadline, self._get_deadline)
|
|
File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 648, in run
|
|
has_set = self._state_holder.sync()
|
|
^^^^^^^^^^^^^^^^^^^^^^^^^
|
|
File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 437, in sync
|
|
get_response = self._backend.get_state()
|
|
^^^^^^^^^^^^^^^^^^^^^^^^^
|
|
File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 75, in get_state
|
|
base64_state: bytes = self._call_store("get", self._key)
|
|
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
|
File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 119, in _call_store
|
|
raise RendezvousConnectionError(
|
|
torch.distributed.elastic.rendezvous.api.RendezvousConnectionError: The connection to the C10d store has failed. See inner exception for details.
|
|
E0904 11:27:18.023000 2769136 /gpfs/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/api.py:874] failed (exitcode: 1) local_rank: 0 (pid: 2769218) of binary: /home/nws8519/.conda/envs/olmo/bin/python3.11
|
|
[W904 11:27:18.239612027 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=3, addr=[qgpu2014]:57300, remote=[qgpu2013]:29502): Broken pipe
|
|
Exception raised from sendBytes at /pytorch/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first):
|
|
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x14873d0d85e8 in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libc10.so)
|
|
frame #1: <unknown function> + 0x5ba8afe (0x1487811fbafe in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
|
|
frame #2: <unknown function> + 0x5baa358 (0x1487811fd358 in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
|
|
frame #3: <unknown function> + 0x5babb3e (0x1487811feb3e in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
|
|
frame #4: c10d::TCPStore::compareSet(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<unsigned char, std::allocator<unsigned char> > const&, std::vector<unsigned char, std::allocator<unsigned char> > const&) + 0x299 (0x1487811f8569 in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
|
|
frame #5: <unknown function> + 0xc2a761 (0x148790587761 in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libtorch_python.so)
|
|
frame #6: <unknown function> + 0x38a0cc (0x14878fce70cc in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libtorch_python.so)
|
|
frame #7: /home/nws8519/.conda/envs/olmo/bin/python3.11() [0x528b17]
|
|
frame #8: _PyObject_MakeTpCall + 0x27c (0x50452c in /home/nws8519/.conda/envs/olmo/bin/python3.11)
|
|
frame #9: /home/nws8519/.conda/envs/olmo/bin/python3.11() [0x557ac9]
|
|
frame #10: _PyEval_EvalFrameDefault + 0x47c0 (0x515b90 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
|
|
frame #11: _PyFunction_Vectorcall + 0x173 (0x539153 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
|
|
frame #12: _PyEval_EvalFrameDefault + 0x47c0 (0x515b90 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
|
|
frame #13: _PyFunction_Vectorcall + 0x173 (0x539153 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
|
|
frame #14: _PyEval_EvalFrameDefault + 0x47c0 (0x515b90 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
|
|
frame #15: _PyFunction_Vectorcall + 0x173 (0x539153 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
|
|
frame #16: _PyObject_FastCallDictTstate + 0x65 (0x508e05 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
|
|
frame #17: _PyObject_Call_Prepend + 0x66 (0x540ac6 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
|
|
frame #18: /home/nws8519/.conda/envs/olmo/bin/python3.11() [0x611dd7]
|
|
frame #19: PyObject_Call + 0xbd (0x54303d in /home/nws8519/.conda/envs/olmo/bin/python3.11)
|
|
frame #20: _PyEval_EvalFrameDefault + 0x47c0 (0x515b90 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
|
|
frame #21: _PyFunction_Vectorcall + 0x173 (0x539153 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
|
|
frame #22: _PyEval_EvalFrameDefault + 0x47c0 (0x515b90 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
|
|
frame #23: /home/nws8519/.conda/envs/olmo/bin/python3.11() [0x5cc3aa]
|
|
frame #24: PyEval_EvalCode + 0x9f (0x5cba7f in /home/nws8519/.conda/envs/olmo/bin/python3.11)
|
|
frame #25: /home/nws8519/.conda/envs/olmo/bin/python3.11() [0x5ecba7]
|
|
frame #26: /home/nws8519/.conda/envs/olmo/bin/python3.11() [0x5e8740]
|
|
frame #27: /home/nws8519/.conda/envs/olmo/bin/python3.11() [0x5fd5f2]
|
|
frame #28: _PyRun_SimpleFileObject + 0x19f (0x5fc9bf in /home/nws8519/.conda/envs/olmo/bin/python3.11)
|
|
frame #29: _PyRun_AnyFileObject + 0x43 (0x5fc6e3 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
|
|
frame #30: Py_RunMain + 0x2ee (0x5f73fe in /home/nws8519/.conda/envs/olmo/bin/python3.11)
|
|
frame #31: Py_BytesMain + 0x39 (0x5bc149 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
|
|
frame #32: __libc_start_main + 0xe5 (0x1487a596b7e5 in /lib64/libc.so.6)
|
|
frame #33: /home/nws8519/.conda/envs/olmo/bin/python3.11() [0x5bbf93]
|
|
|
|
W0904 11:27:18.030000 2769136 /gpfs/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1292] The node 'qgpu2014_2769136_0' has failed to shutdown the rendezvous '3273582' due to an error of type RendezvousConnectionError.
|
|
[W904 11:27:18.248039930 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=3, addr=[qgpu2014]:57300, remote=[qgpu2013]:29502): Broken pipe
|
|
Exception raised from sendBytes at /pytorch/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first):
|
|
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x14873d0d85e8 in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libc10.so)
|
|
frame #1: <unknown function> + 0x5ba8afe (0x1487811fbafe in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
|
|
frame #2: <unknown function> + 0x5baa358 (0x1487811fd358 in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
|
|
frame #3: <unknown function> + 0x5babb3e (0x1487811feb3e in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
|
|
frame #4: c10d::TCPStore::compareSet(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<unsigned char, std::allocator<unsigned char> > const&, std::vector<unsigned char, std::allocator<unsigned char> > const&) + 0x299 (0x1487811f8569 in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
|
|
frame #5: <unknown function> + 0xc2a761 (0x148790587761 in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libtorch_python.so)
|
|
frame #6: <unknown function> + 0x38a0cc (0x14878fce70cc in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libtorch_python.so)
|
|
frame #7: /home/nws8519/.conda/envs/olmo/bin/python3.11() [0x528b17]
|
|
frame #8: _PyObject_MakeTpCall + 0x27c (0x50452c in /home/nws8519/.conda/envs/olmo/bin/python3.11)
|
|
frame #9: /home/nws8519/.conda/envs/olmo/bin/python3.11() [0x557ac9]
|
|
frame #10: _PyEval_EvalFrameDefault + 0x47c0 (0x515b90 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
|
|
frame #11: _PyFunction_Vectorcall + 0x173 (0x539153 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
|
|
frame #12: _PyEval_EvalFrameDefault + 0x47c0 (0x515b90 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
|
|
frame #13: _PyFunction_Vectorcall + 0x173 (0x539153 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
|
|
frame #14: _PyObject_FastCallDictTstate + 0x65 (0x508e05 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
|
|
frame #15: _PyObject_Call_Prepend + 0x66 (0x540ac6 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
|
|
frame #16: /home/nws8519/.conda/envs/olmo/bin/python3.11() [0x611dd7]
|
|
frame #17: PyObject_Call + 0xbd (0x54303d in /home/nws8519/.conda/envs/olmo/bin/python3.11)
|
|
frame #18: _PyEval_EvalFrameDefault + 0x47c0 (0x515b90 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
|
|
frame #19: _PyFunction_Vectorcall + 0x173 (0x539153 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
|
|
frame #20: _PyEval_EvalFrameDefault + 0x47c0 (0x515b90 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
|
|
frame #21: /home/nws8519/.conda/envs/olmo/bin/python3.11() [0x5cc3aa]
|
|
frame #22: PyEval_EvalCode + 0x9f (0x5cba7f in /home/nws8519/.conda/envs/olmo/bin/python3.11)
|
|
frame #23: /home/nws8519/.conda/envs/olmo/bin/python3.11() [0x5ecba7]
|
|
frame #24: /home/nws8519/.conda/envs/olmo/bin/python3.11() [0x5e8740]
|
|
frame #25: /home/nws8519/.conda/envs/olmo/bin/python3.11() [0x5fd5f2]
|
|
frame #26: _PyRun_SimpleFileObject + 0x19f (0x5fc9bf in /home/nws8519/.conda/envs/olmo/bin/python3.11)
|
|
frame #27: _PyRun_AnyFileObject + 0x43 (0x5fc6e3 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
|
|
frame #28: Py_RunMain + 0x2ee (0x5f73fe in /home/nws8519/.conda/envs/olmo/bin/python3.11)
|
|
frame #29: Py_BytesMain + 0x39 (0x5bc149 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
|
|
frame #30: __libc_start_main + 0xe5 (0x1487a596b7e5 in /lib64/libc.so.6)
|
|
frame #31: /home/nws8519/.conda/envs/olmo/bin/python3.11() [0x5bbf93]
|
|
|
|
W0904 11:27:18.038000 2769136 /gpfs/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1292] The node 'qgpu2014_2769136_0' has failed to shutdown the rendezvous '3273582' due to an error of type RendezvousConnectionError.
|
|
[W904 11:27:18.255885548 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=3, addr=[qgpu2014]:57300, remote=[qgpu2013]:29502): Broken pipe
|
|
Exception raised from sendBytes at /pytorch/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first):
|
|
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x14873d0d85e8 in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libc10.so)
|
|
frame #1: <unknown function> + 0x5ba8afe (0x1487811fbafe in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
|
|
frame #2: <unknown function> + 0x5baa358 (0x1487811fd358 in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
|
|
frame #3: <unknown function> + 0x5babb3e (0x1487811feb3e in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
|
|
frame #4: c10d::TCPStore::compareSet(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<unsigned char, std::allocator<unsigned char> > const&, std::vector<unsigned char, std::allocator<unsigned char> > const&) + 0x299 (0x1487811f8569 in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
|
|
frame #5: <unknown function> + 0xc2a761 (0x148790587761 in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libtorch_python.so)
|
|
frame #6: <unknown function> + 0x38a0cc (0x14878fce70cc in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libtorch_python.so)
|
|
frame #7: /home/nws8519/.conda/envs/olmo/bin/python3.11() [0x528b17]
|
|
frame #8: _PyObject_MakeTpCall + 0x27c (0x50452c in /home/nws8519/.conda/envs/olmo/bin/python3.11)
|
|
frame #9: /home/nws8519/.conda/envs/olmo/bin/python3.11() [0x557ac9]
|
|
frame #10: _PyEval_EvalFrameDefault + 0x47c0 (0x515b90 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
|
|
frame #11: _PyFunction_Vectorcall + 0x173 (0x539153 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
|
|
frame #12: _PyObject_FastCallDictTstate + 0x65 (0x508e05 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
|
|
frame #13: _PyObject_Call_Prepend + 0x66 (0x540ac6 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
|
|
frame #14: /home/nws8519/.conda/envs/olmo/bin/python3.11() [0x611dd7]
|
|
frame #15: PyObject_Call + 0xbd (0x54303d in /home/nws8519/.conda/envs/olmo/bin/python3.11)
|
|
frame #16: _PyEval_EvalFrameDefault + 0x47c0 (0x515b90 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
|
|
frame #17: _PyFunction_Vectorcall + 0x173 (0x539153 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
|
|
frame #18: _PyEval_EvalFrameDefault + 0x47c0 (0x515b90 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
|
|
frame #19: /home/nws8519/.conda/envs/olmo/bin/python3.11() [0x5cc3aa]
|
|
frame #20: PyEval_EvalCode + 0x9f (0x5cba7f in /home/nws8519/.conda/envs/olmo/bin/python3.11)
|
|
frame #21: /home/nws8519/.conda/envs/olmo/bin/python3.11() [0x5ecba7]
|
|
frame #22: /home/nws8519/.conda/envs/olmo/bin/python3.11() [0x5e8740]
|
|
frame #23: /home/nws8519/.conda/envs/olmo/bin/python3.11() [0x5fd5f2]
|
|
frame #24: _PyRun_SimpleFileObject + 0x19f (0x5fc9bf in /home/nws8519/.conda/envs/olmo/bin/python3.11)
|
|
frame #25: _PyRun_AnyFileObject + 0x43 (0x5fc6e3 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
|
|
frame #26: Py_RunMain + 0x2ee (0x5f73fe in /home/nws8519/.conda/envs/olmo/bin/python3.11)
|
|
frame #27: Py_BytesMain + 0x39 (0x5bc149 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
|
|
frame #28: __libc_start_main + 0xe5 (0x1487a596b7e5 in /lib64/libc.so.6)
|
|
frame #29: /home/nws8519/.conda/envs/olmo/bin/python3.11() [0x5bbf93]
|
|
|
|
W0904 11:27:18.046000 2769136 /gpfs/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1292] The node 'qgpu2014_2769136_0' has failed to shutdown the rendezvous '3273582' due to an error of type RendezvousConnectionError.
|
|
Traceback (most recent call last):
|
|
File "/home/nws8519/.conda/envs/olmo/bin/torchrun", line 8, in <module>
|
|
sys.exit(main())
|
|
^^^^^^
|
|
File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
|
|
return f(*args, **kwargs)
|
|
^^^^^^^^^^^^^^^^^^
|
|
File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/run.py", line 892, in main
|
|
run(args)
|
|
File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/run.py", line 883, in run
|
|
elastic_launch(
|
|
File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 139, in __call__
|
|
return launch_agent(self._config, self._entrypoint, list(args))
|
|
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
|
File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 270, in launch_agent
|
|
raise ChildFailedError(
|
|
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
|
|
============================================================
|
|
/home/nws8519/git/mw-lifecycle-analysis/p2/quest/python_scripts/olmo_parallel_cat.py FAILED
|
|
------------------------------------------------------------
|
|
Failures:
|
|
<NO_OTHER_FAILURES>
|
|
------------------------------------------------------------
|
|
Root Cause (first observed failure):
|
|
[0]:
|
|
time : 2025-09-04_11:27:17
|
|
host : qgpu2014
|
|
rank : 2 (local_rank: 0)
|
|
exitcode : 1 (pid: 2769218)
|
|
error_file: <N/A>
|
|
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
|
|
============================================================
|
|
srun: error: qgpu2013: task 1: Exited with exit code 1
|
|
srun: error: qgpu2014: tasks 2-3: Exited with exit code 1
|
|
[W904 11:27:18.383886513 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=3, addr=[qgpu2013]:36246, remote=[qgpu2013]:29502): Broken pipe
|
|
Exception raised from sendBytes at /pytorch/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first):
|
|
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x14977ddbe5e8 in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libc10.so)
|
|
frame #1: <unknown function> + 0x5ba8afe (0x1497c1ee1afe in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
|
|
frame #2: <unknown function> + 0x5baa358 (0x1497c1ee3358 in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
|
|
frame #3: <unknown function> + 0x5babb3e (0x1497c1ee4b3e in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
|
|
frame #4: c10d::TCPStore::doWait(c10::ArrayRef<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::chrono::duration<long, std::ratio<1l, 1000l> >) + 0x1a6 (0x1497c1edeac6 in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
|
|
frame #5: c10d::TCPStore::doGet(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x33 (0x1497c1edeea3 in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
|
|
frame #6: c10d::TCPStore::get(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xab (0x1497c1edff8b in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
|
|
frame #7: <unknown function> + 0xc2a390 (0x1497d126d390 in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libtorch_python.so)
|
|
frame #8: <unknown function> + 0x38a0cc (0x1497d09cd0cc in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libtorch_python.so)
|
|
frame #9: /home/nws8519/.conda/envs/olmo/bin/python3.11() [0x528b17]
|
|
frame #10: _PyObject_MakeTpCall + 0x27c (0x50452c in /home/nws8519/.conda/envs/olmo/bin/python3.11)
|
|
frame #11: /home/nws8519/.conda/envs/olmo/bin/python3.11() [0x557ac9]
|
|
frame #12: _PyEval_EvalFrameDefault + 0x47c0 (0x515b90 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #13: _PyFunction_Vectorcall + 0x173 (0x539153 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #14: _PyEval_EvalFrameDefault + 0x47c0 (0x515b90 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #15: _PyFunction_Vectorcall + 0x173 (0x539153 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #16: _PyObject_FastCallDictTstate + 0x65 (0x508e05 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #17: _PyObject_Call_Prepend + 0x66 (0x540ac6 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #18: /home/nws8519/.conda/envs/olmo/bin/python3.11() [0x611dd7]
frame #19: PyObject_Call + 0xbd (0x54303d in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #20: _PyEval_EvalFrameDefault + 0x47c0 (0x515b90 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #21: _PyFunction_Vectorcall + 0x173 (0x539153 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #22: _PyEval_EvalFrameDefault + 0x47c0 (0x515b90 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #23: /home/nws8519/.conda/envs/olmo/bin/python3.11() [0x5cc3aa]
frame #24: PyEval_EvalCode + 0x9f (0x5cba7f in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #25: /home/nws8519/.conda/envs/olmo/bin/python3.11() [0x5ecba7]
frame #26: /home/nws8519/.conda/envs/olmo/bin/python3.11() [0x5e8740]
frame #27: /home/nws8519/.conda/envs/olmo/bin/python3.11() [0x5fd5f2]
frame #28: _PyRun_SimpleFileObject + 0x19f (0x5fc9bf in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #29: _PyRun_AnyFileObject + 0x43 (0x5fc6e3 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #30: Py_RunMain + 0x2ee (0x5f73fe in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #31: Py_BytesMain + 0x39 (0x5bc149 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #32: __libc_start_main + 0xe5 (0x1497e66517e5 in /lib64/libc.so.6)
frame #33: /home/nws8519/.conda/envs/olmo/bin/python3.11() [0x5bbf93]
W0904 11:27:18.554000 1736745 /gpfs/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1292] The node 'qgpu2013_1736745_0' has failed to shutdown the rendezvous '3273582' due to an error of type RendezvousConnectionError.
[W904 11:27:18.394906553 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=3, addr=[qgpu2013]:36246, remote=[qgpu2013]:29502): Broken pipe
Exception raised from sendBytes at /pytorch/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x14977ddbe5e8 in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x5ba8afe (0x1497c1ee1afe in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #2: <unknown function> + 0x5baa358 (0x1497c1ee3358 in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #3: <unknown function> + 0x5babb3e (0x1497c1ee4b3e in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #4: c10d::TCPStore::doWait(c10::ArrayRef<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::chrono::duration<long, std::ratio<1l, 1000l> >) + 0x1a6 (0x1497c1edeac6 in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #5: c10d::TCPStore::doGet(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x33 (0x1497c1edeea3 in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #6: c10d::TCPStore::get(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xab (0x1497c1edff8b in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #7: <unknown function> + 0xc2a390 (0x1497d126d390 in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libtorch_python.so)
frame #8: <unknown function> + 0x38a0cc (0x1497d09cd0cc in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libtorch_python.so)
frame #9: /home/nws8519/.conda/envs/olmo/bin/python3.11() [0x528b17]
frame #10: _PyObject_MakeTpCall + 0x27c (0x50452c in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #11: /home/nws8519/.conda/envs/olmo/bin/python3.11() [0x557ac9]
frame #12: _PyEval_EvalFrameDefault + 0x47c0 (0x515b90 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #13: _PyFunction_Vectorcall + 0x173 (0x539153 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #14: _PyObject_FastCallDictTstate + 0x65 (0x508e05 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #15: _PyObject_Call_Prepend + 0x66 (0x540ac6 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #16: /home/nws8519/.conda/envs/olmo/bin/python3.11() [0x611dd7]
frame #17: PyObject_Call + 0xbd (0x54303d in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #18: _PyEval_EvalFrameDefault + 0x47c0 (0x515b90 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #19: _PyFunction_Vectorcall + 0x173 (0x539153 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #20: _PyEval_EvalFrameDefault + 0x47c0 (0x515b90 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #21: /home/nws8519/.conda/envs/olmo/bin/python3.11() [0x5cc3aa]
frame #22: PyEval_EvalCode + 0x9f (0x5cba7f in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #23: /home/nws8519/.conda/envs/olmo/bin/python3.11() [0x5ecba7]
frame #24: /home/nws8519/.conda/envs/olmo/bin/python3.11() [0x5e8740]
frame #25: /home/nws8519/.conda/envs/olmo/bin/python3.11() [0x5fd5f2]
frame #26: _PyRun_SimpleFileObject + 0x19f (0x5fc9bf in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #27: _PyRun_AnyFileObject + 0x43 (0x5fc6e3 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #28: Py_RunMain + 0x2ee (0x5f73fe in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #29: Py_BytesMain + 0x39 (0x5bc149 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #30: __libc_start_main + 0xe5 (0x1497e66517e5 in /lib64/libc.so.6)
frame #31: /home/nws8519/.conda/envs/olmo/bin/python3.11() [0x5bbf93]
W0904 11:27:18.565000 1736745 /gpfs/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1292] The node 'qgpu2013_1736745_0' has failed to shutdown the rendezvous '3273582' due to an error of type RendezvousConnectionError.
Traceback (most recent call last):
File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 117, in _call_store
return getattr(self._store, store_op)(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch.distributed.DistNetworkError: failed to recv, got 0 bytes
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/nws8519/.conda/envs/olmo/bin/torchrun", line 8, in <module>
sys.exit(main())
^^^^^^
File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
return f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/run.py", line 892, in main
run(args)
File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/run.py", line 883, in run
elastic_launch(
File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 139, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 261, in launch_agent
result = agent.run()
^^^^^^^^^^^
File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/elastic/metrics/api.py", line 138, in wrapper
result = f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/elastic/agent/server/api.py", line 711, in run
result = self._invoke_run(role)
^^^^^^^^^^^^^^^^^^^^^^
File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/elastic/agent/server/api.py", line 864, in _invoke_run
self._initialize_workers(self._worker_group)
File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/elastic/metrics/api.py", line 138, in wrapper
result = f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/elastic/agent/server/api.py", line 683, in _initialize_workers
self._rendezvous(worker_group)
File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/elastic/metrics/api.py", line 138, in wrapper
result = f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/elastic/agent/server/api.py", line 500, in _rendezvous
rdzv_info = spec.rdzv_handler.next_rendezvous()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 1170, in next_rendezvous
self._op_executor.run(join_op, deadline, self._get_deadline)
File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 648, in run
has_set = self._state_holder.sync()
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 437, in sync
get_response = self._backend.get_state()
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 75, in get_state
base64_state: bytes = self._call_store("get", self._key)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 119, in _call_store
raise RendezvousConnectionError(
torch.distributed.elastic.rendezvous.api.RendezvousConnectionError: The connection to the C10d store has failed. See inner exception for details.
srun: error: qgpu2013: task 0: Exited with exit code 1
unsupervised olmo categorization pau at Thu Sep 4 11:27:18 CDT 2025
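Note on the trailing errors: the two TCPStore "Broken pipe" dumps and the final RendezvousConnectionError are downstream of the worker failure at 11:27:17. Once the workers exited, the C10d rendezvous store serving qgpu2013:29502 shut down, so the remaining agent on qgpu2013 could neither sync rendezvous state nor close rendezvous '3273582' cleanly. The item to debug is therefore the rank-2 worker failure summarized above, not the store connection itself. If the store's liveness needs to be checked while a future job is still running, a small client probe along these lines could be used (host and port taken from this log; a diagnostic sketch, not part of the job script):

    # Diagnostic sketch: connect to the C10d TCPStore used by torchrun's
    # rendezvous as a client and round-trip a key. Host/port are the values
    # seen in this log (qgpu2013:29502); adjust for the job being checked.
    from datetime import timedelta
    from torch.distributed import TCPStore

    try:
        store = TCPStore("qgpu2013", 29502, is_master=False,
                         timeout=timedelta(seconds=10))
        store.set("probe", "ok")
        print("store reachable:", store.get("probe"))  # returns b'ok' if the store answers
    except Exception as exc:
        print("store unreachable:", exc)  # connection refused / timeout once the job tears down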