
should be updated and refined pca analysis

mgaughan 2025-09-04 15:47:11 -05:00
parent a770d9c668
commit f2afb7c981
13 changed files with 136413 additions and 15 deletions

Binary file not shown. (Before: 266 KiB)

File diff suppressed because one or more lines are too long

File diff suppressed because one or more lines are too long

Binary file not shown. (Before: 1.4 MiB)

Binary file not shown. (Before: 2.1 MiB)

View File (image): Before 1.4 MiB, After 1.4 MiB

View File

@@ -1,9 +1,66 @@
starting the job at: Thu Sep 4 11:02:03 CDT 2025
starting the job at: Thu Sep 4 15:41:55 CDT 2025
setting up the environment
running the neurobiber labeling script
Variance of each PCA component: [259.38215213 83.11803664 67.16301107 61.78747188 38.94875996
32.78688889 26.45592105 21.9280629 18.734197 16.29485568
13.48304855 11.50594609 10.77855857 9.30674176 8.96113511
8.35521401 8.17815209 7.13194427]
Variance of each PCA component: [88.92832185 39.46471687 32.34601523 20.19544345 14.0083261 11.5837521
7.82584723 6.89064989 6.07988254 5.80726367 5.49782354 4.50587747
4.31482409 2.81997326 2.62989708 2.27205352 2.09396341 2.00076119]
Top 10 PC1 values:
PC1 PC2 ... priority closed_relevance
19873 125.128650 24.461032 ... Medium False
21956 125.128650 24.461032 ... Needs Triage True
22010 125.128650 24.461032 ... Needs Triage True
24528 125.128650 24.461032 ... Needs Triage True
24529 125.128650 24.461032 ... Needs Triage False
25549 125.128650 24.461032 ... Medium False
6329 72.728923 28.262157 ... Medium False
11288 72.728923 28.262157 ... Low False
22332 72.728923 28.262157 ... High True
22731 72.728923 28.262157 ... Medium True
[10 rows x 26 columns]
Bottom 10 PC1 values:
PC1 PC2 ... priority closed_relevance
12503 -16.333841 17.142328 ... Low NaN
3462 -15.759184 15.368325 ... High NaN
23838 -14.821270 17.471553 ... Low False
25791 -14.806017 12.439508 ... Medium True
23053 -14.399838 15.867529 ... Medium False
24180 -14.046494 12.993193 ... Low True
11814 -14.009692 13.953416 ... Low False
24699 -13.848945 15.308788 ... Needs Triage True
24214 -13.701324 11.951003 ... Low False
24467 -13.680693 11.614764 ... Needs Triage True
[10 rows x 26 columns]
Top 10 PC2 values:
PC1 PC2 PC3 ... week_index priority closed_relevance
6329 72.728923 28.262157 -52.466963 ... 10 Medium False
11288 72.728923 28.262157 -52.466963 ... 2 Low False
22332 72.728923 28.262157 -52.466963 ... 4 High True
22731 72.728923 28.262157 -52.466963 ... 10 Medium True
23016 72.728923 28.262157 -52.466963 ... 7 Medium False
23022 72.728923 28.262157 -52.466963 ... 7 Medium False
23086 72.728923 28.262157 -52.466963 ... 6 Medium False
23238 72.728923 28.262157 -52.466963 ... 4 Medium False
25606 72.728923 28.262157 -52.466963 ... -22 Medium True
25843 72.728923 28.262157 -52.466963 ... -31 Medium True
[10 rows x 26 columns]
Bottom 10 PC2 values:
PC1 PC2 PC3 ... week_index priority closed_relevance
741 1.197394 -18.726602 -5.305851 ... -33 Unbreak Now! True
6492 1.197394 -18.726602 -5.305851 ... 8 Medium False
6495 1.197394 -18.726602 -5.305851 ... 8 Medium False
8834 1.197394 -18.726602 -5.305851 ... -2 Medium False
9292 1.197394 -18.726602 -5.305851 ... -4 Medium True
9419 1.197394 -18.726602 -5.305851 ... -6 Medium NaN
10686 1.197394 -18.726602 -5.305851 ... 8 Low NaN
11301 1.197394 -18.726602 -5.305851 ... 2 Low True
11306 1.197394 -18.726602 -5.305851 ... 2 Low True
11312 1.197394 -18.726602 -5.305851 ... 2 Low True
[10 rows x 26 columns]
job finished, cleaning up
job pau at: Thu Sep 4 11:02:32 CDT 2025
job pau at: Thu Sep 4 15:42:13 CDT 2025
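
For context, a minimal sketch of how the "Variance of each PCA component" lines above can be produced, assuming an 18-component scikit-learn PCA; the random matrix `X` below is a stand-in for the Biber feature vectors, not the repository's data:

```python
# Hedged sketch: stand-in data, 18-component PCA, per-component variance as in the log.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 96))     # stand-in for the Biber feature matrix (assumption)

pca = PCA(n_components=18)
X_pca = pca.fit_transform(X)        # project each comment onto 18 principal components

# variance of each projected component, mirroring the printout in the log above
component_variances = np.var(X_pca, axis=0)
print("Variance of each PCA component:", component_variances)
```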

View File

@@ -71,3 +71,531 @@ Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed.
Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`
Fetching 12 files: 8%|▊ | 1/12 [03:13<35:25, 193.27s/it]Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`
Fetching 12 files: 17%|█▋ | 2/12 [04:23<20:11, 121.10s/it]Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`
Fetching 12 files: 25%|██▌ | 3/12 [05:03<12:33, 83.77s/it] Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`
Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`
Fetching 12 files: 100%|██████████| 12/12 [07:49<00:00, 39.09s/it]
Loading checkpoint shards: 100%|██████████| 12/12 [00:04<00:00,  2.73it/s]
[rank0]:[W904 11:25:15.000410288 ProcessGroupNCCL.cpp:4715] [PG ID 0 PG GUID 0 Rank 0] using GPU 0 as device used by this process is currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. You can specify device_id in init_process_group() to force use of a particular device.
Loading checkpoint shards: 100%|██████████| 12/12 [00:02<00:00,  5.96it/s]
Loading checkpoint shards: 100%|██████████| 12/12 [00:05<00:00,  2.09it/s]
[rank2]: Traceback (most recent call last):
[rank2]: File "/home/nws8519/git/mw-lifecycle-analysis/p2/quest/python_scripts/olmo_parallel_cat.py", line 188, in <module>
[rank2]: main()
[rank2]: File "/home/nws8519/git/mw-lifecycle-analysis/p2/quest/python_scripts/olmo_parallel_cat.py", line 143, in main
[rank2]: ddp_olmo = DDP(olmo, device_ids=[local_rank])
[rank2]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]: File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/nn/parallel/distributed.py", line 850, in __init__
[rank2]: self._ddp_init_helper(
[rank2]: File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/nn/parallel/distributed.py", line 1201, in _ddp_init_helper
[rank2]: self.reducer = dist.Reducer(
[rank2]: ^^^^^^^^^^^^^
[rank2]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 51.10 GiB. GPU 0 has a total capacity of 79.25 GiB of which 27.52 GiB is free. Including non-PyTorch memory, this process has 51.72 GiB memory in use. Of the allocated memory 51.10 GiB is allocated by PyTorch, and 875.00 KiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
[rank3]: Traceback (most recent call last):
[rank3]: File "/home/nws8519/git/mw-lifecycle-analysis/p2/quest/python_scripts/olmo_parallel_cat.py", line 188, in <module>
[rank3]: main()
[rank3]: File "/home/nws8519/git/mw-lifecycle-analysis/p2/quest/python_scripts/olmo_parallel_cat.py", line 143, in main
[rank3]: ddp_olmo = DDP(olmo, device_ids=[local_rank])
[rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]: File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/nn/parallel/distributed.py", line 850, in __init__
[rank3]: self._ddp_init_helper(
[rank3]: File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/nn/parallel/distributed.py", line 1201, in _ddp_init_helper
[rank3]: self.reducer = dist.Reducer(
[rank3]: ^^^^^^^^^^^^^
[rank3]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 51.10 GiB. GPU 1 has a total capacity of 79.25 GiB of which 27.52 GiB is free. Including non-PyTorch memory, this process has 51.72 GiB memory in use. Of the allocated memory 51.10 GiB is allocated by PyTorch, and 875.00 KiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
[rank0]: Traceback (most recent call last):
[rank0]: File "/home/nws8519/git/mw-lifecycle-analysis/p2/quest/python_scripts/olmo_parallel_cat.py", line 188, in <module>
[rank0]: main()
[rank0]: File "/home/nws8519/git/mw-lifecycle-analysis/p2/quest/python_scripts/olmo_parallel_cat.py", line 143, in main
[rank0]: ddp_olmo = DDP(olmo, device_ids=[local_rank])
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/nn/parallel/distributed.py", line 850, in __init__
[rank0]: self._ddp_init_helper(
[rank0]: File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/nn/parallel/distributed.py", line 1201, in _ddp_init_helper
[rank0]: self.reducer = dist.Reducer(
[rank0]: ^^^^^^^^^^^^^
[rank0]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 51.10 GiB. GPU 0 has a total capacity of 79.25 GiB of which 27.52 GiB is free. Including non-PyTorch memory, this process has 51.72 GiB memory in use. Of the allocated memory 51.10 GiB is allocated by PyTorch, and 875.00 KiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
[rank1]: Traceback (most recent call last):
[rank1]: File "/home/nws8519/git/mw-lifecycle-analysis/p2/quest/python_scripts/olmo_parallel_cat.py", line 188, in <module>
[rank1]: main()
[rank1]: File "/home/nws8519/git/mw-lifecycle-analysis/p2/quest/python_scripts/olmo_parallel_cat.py", line 143, in main
[rank1]: ddp_olmo = DDP(olmo, device_ids=[local_rank])
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/nn/parallel/distributed.py", line 850, in __init__
[rank1]: self._ddp_init_helper(
[rank1]: File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/nn/parallel/distributed.py", line 1201, in _ddp_init_helper
[rank1]: self.reducer = dist.Reducer(
[rank1]: ^^^^^^^^^^^^^
[rank1]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 51.10 GiB. GPU 1 has a total capacity of 79.25 GiB of which 27.52 GiB is free. Including non-PyTorch memory, this process has 51.72 GiB memory in use. Of the allocated memory 51.10 GiB is allocated by PyTorch, and 875.00 KiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
[rank2]:[W904 11:27:15.787618003 ProcessGroupNCCL.cpp:1476] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
[rank0]:[W904 11:27:15.409824698 ProcessGroupNCCL.cpp:1476] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
W0904 11:27:17.571000 1736746 /gpfs/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/api.py:900] Sending process 1736801 closing signal SIGTERM
E0904 11:27:17.635000 1736746 /gpfs/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/api.py:874] failed (exitcode: 1) local_rank: 1 (pid: 1736802) of binary: /home/nws8519/.conda/envs/olmo/bin/python3.11
Traceback (most recent call last):
File "/home/nws8519/.conda/envs/olmo/bin/torchrun", line 8, in <module>
sys.exit(main())
^^^^^^
File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
return f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/run.py", line 892, in main
run(args)
File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/run.py", line 883, in run
elastic_launch(
File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 139, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 270, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/home/nws8519/git/mw-lifecycle-analysis/p2/quest/python_scripts/olmo_parallel_cat.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2025-09-04_11:27:17
host : qgpu2013
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 1736802)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
[W904 11:27:17.168398358 TCPStore.cpp:115] [c10d] recvVector failed on SocketImpl(fd=3, addr=[qgpu2014]:57300, remote=[qgpu2013]:29502): failed to recv, got 0 bytes
Exception raised from recvBytes at /pytorch/torch/csrc/distributed/c10d/Utils.hpp:678 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x14873d0d85e8 in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x5ba8afe (0x1487811fbafe in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #2: <unknown function> + 0x5baa0d0 (0x1487811fd0d0 in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #3: <unknown function> + 0x5baa81d (0x1487811fd81d in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #4: <unknown function> + 0x5bab4a9 (0x1487811fe4a9 in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #5: c10d::TCPStore::compareSet(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<unsigned char, std::allocator<unsigned char> > const&, std::vector<unsigned char, std::allocator<unsigned char> > const&) + 0x1fb (0x1487811f84cb in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #6: <unknown function> + 0xc2a761 (0x148790587761 in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libtorch_python.so)
frame #7: <unknown function> + 0x38a0cc (0x14878fce70cc in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libtorch_python.so)
frame #8: /home/nws8519/.conda/envs/olmo/bin/python3.11() [0x528b17]
frame #9: _PyObject_MakeTpCall + 0x27c (0x50452c in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #10: /home/nws8519/.conda/envs/olmo/bin/python3.11() [0x557ac9]
frame #11: _PyEval_EvalFrameDefault + 0x47c0 (0x515b90 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #12: _PyFunction_Vectorcall + 0x173 (0x539153 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #13: _PyEval_EvalFrameDefault + 0x47c0 (0x515b90 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #14: _PyFunction_Vectorcall + 0x173 (0x539153 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #15: _PyEval_EvalFrameDefault + 0x47c0 (0x515b90 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #16: /home/nws8519/.conda/envs/olmo/bin/python3.11() [0x5581df]
frame #17: /home/nws8519/.conda/envs/olmo/bin/python3.11() [0x557a20]
frame #18: /home/nws8519/.conda/envs/olmo/bin/python3.11() [0x62a8a3]
frame #19: /home/nws8519/.conda/envs/olmo/bin/python3.11() [0x5fa3c4]
frame #20: <unknown function> + 0x81ca (0x1487a64991ca in /lib64/libpthread.so.0)
frame #21: clone + 0x43 (0x1487a596a8d3 in /lib64/libc.so.6)
W0904 11:27:17.959000 2769136 /gpfs/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1341] The node 'qgpu2014_2769136_0' has failed to send a keep-alive heartbeat to the rendezvous '3273582' due to an error of type RendezvousConnectionError.
W0904 11:27:17.959000 2769136 /gpfs/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/api.py:900] Sending process 2769219 closing signal SIGTERM
[W904 11:27:17.170100534 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=3, addr=[qgpu2014]:57290, remote=[qgpu2013]:29502): Broken pipe
Exception raised from sendBytes at /pytorch/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x14c218b8e5e8 in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x5ba8afe (0x14c25ccb1afe in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #2: <unknown function> + 0x5baa358 (0x14c25ccb3358 in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #3: <unknown function> + 0x5babb3e (0x14c25ccb4b3e in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #4: c10d::TCPStore::doWait(c10::ArrayRef<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::chrono::duration<long, std::ratio<1l, 1000l> >) + 0x1a6 (0x14c25ccaeac6 in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #5: c10d::TCPStore::doGet(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x33 (0x14c25ccaeea3 in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #6: c10d::TCPStore::get(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xab (0x14c25ccaff8b in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #7: <unknown function> + 0xc2a390 (0x14c26c03d390 in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libtorch_python.so)
frame #8: <unknown function> + 0x38a0cc (0x14c26b79d0cc in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libtorch_python.so)
frame #9: /home/nws8519/.conda/envs/olmo/bin/python3.11() [0x528b17]
frame #10: _PyObject_MakeTpCall + 0x27c (0x50452c in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #11: /home/nws8519/.conda/envs/olmo/bin/python3.11() [0x557ac9]
frame #12: _PyEval_EvalFrameDefault + 0x47c0 (0x515b90 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #13: _PyFunction_Vectorcall + 0x173 (0x539153 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #14: _PyEval_EvalFrameDefault + 0x47c0 (0x515b90 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #15: _PyFunction_Vectorcall + 0x173 (0x539153 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #16: _PyObject_FastCallDictTstate + 0x65 (0x508e05 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #17: _PyObject_Call_Prepend + 0x66 (0x540ac6 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #18: /home/nws8519/.conda/envs/olmo/bin/python3.11() [0x611dd7]
frame #19: PyObject_Call + 0xbd (0x54303d in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #20: _PyEval_EvalFrameDefault + 0x47c0 (0x515b90 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #21: _PyFunction_Vectorcall + 0x173 (0x539153 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #22: _PyEval_EvalFrameDefault + 0x47c0 (0x515b90 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #23: /home/nws8519/.conda/envs/olmo/bin/python3.11() [0x5cc3aa]
frame #24: PyEval_EvalCode + 0x9f (0x5cba7f in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #25: /home/nws8519/.conda/envs/olmo/bin/python3.11() [0x5ecba7]
frame #26: /home/nws8519/.conda/envs/olmo/bin/python3.11() [0x5e8740]
frame #27: /home/nws8519/.conda/envs/olmo/bin/python3.11() [0x5fd5f2]
frame #28: _PyRun_SimpleFileObject + 0x19f (0x5fc9bf in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #29: _PyRun_AnyFileObject + 0x43 (0x5fc6e3 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #30: Py_RunMain + 0x2ee (0x5f73fe in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #31: Py_BytesMain + 0x39 (0x5bc149 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #32: __libc_start_main + 0xe5 (0x14c2814217e5 in /lib64/libc.so.6)
frame #33: /home/nws8519/.conda/envs/olmo/bin/python3.11() [0x5bbf93]
W0904 11:27:17.963000 2769137 /gpfs/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1292] The node 'qgpu2014_2769137_0' has failed to shutdown the rendezvous '3273582' due to an error of type RendezvousConnectionError.
[W904 11:27:17.194777840 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=3, addr=[qgpu2014]:57290, remote=[qgpu2013]:29502): Broken pipe
Exception raised from sendBytes at /pytorch/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x14c218b8e5e8 in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x5ba8afe (0x14c25ccb1afe in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #2: <unknown function> + 0x5baa358 (0x14c25ccb3358 in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #3: <unknown function> + 0x5babb3e (0x14c25ccb4b3e in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #4: c10d::TCPStore::doWait(c10::ArrayRef<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::chrono::duration<long, std::ratio<1l, 1000l> >) + 0x1a6 (0x14c25ccaeac6 in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #5: c10d::TCPStore::doGet(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x33 (0x14c25ccaeea3 in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #6: c10d::TCPStore::get(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xab (0x14c25ccaff8b in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #7: <unknown function> + 0xc2a390 (0x14c26c03d390 in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libtorch_python.so)
frame #8: <unknown function> + 0x38a0cc (0x14c26b79d0cc in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libtorch_python.so)
frame #9: /home/nws8519/.conda/envs/olmo/bin/python3.11() [0x528b17]
frame #10: _PyObject_MakeTpCall + 0x27c (0x50452c in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #11: /home/nws8519/.conda/envs/olmo/bin/python3.11() [0x557ac9]
frame #12: _PyEval_EvalFrameDefault + 0x47c0 (0x515b90 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #13: _PyFunction_Vectorcall + 0x173 (0x539153 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #14: _PyObject_FastCallDictTstate + 0x65 (0x508e05 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #15: _PyObject_Call_Prepend + 0x66 (0x540ac6 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #16: /home/nws8519/.conda/envs/olmo/bin/python3.11() [0x611dd7]
frame #17: PyObject_Call + 0xbd (0x54303d in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #18: _PyEval_EvalFrameDefault + 0x47c0 (0x515b90 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #19: _PyFunction_Vectorcall + 0x173 (0x539153 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #20: _PyEval_EvalFrameDefault + 0x47c0 (0x515b90 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #21: /home/nws8519/.conda/envs/olmo/bin/python3.11() [0x5cc3aa]
frame #22: PyEval_EvalCode + 0x9f (0x5cba7f in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #23: /home/nws8519/.conda/envs/olmo/bin/python3.11() [0x5ecba7]
frame #24: /home/nws8519/.conda/envs/olmo/bin/python3.11() [0x5e8740]
frame #25: /home/nws8519/.conda/envs/olmo/bin/python3.11() [0x5fd5f2]
frame #26: _PyRun_SimpleFileObject + 0x19f (0x5fc9bf in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #27: _PyRun_AnyFileObject + 0x43 (0x5fc6e3 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #28: Py_RunMain + 0x2ee (0x5f73fe in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #29: Py_BytesMain + 0x39 (0x5bc149 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #30: __libc_start_main + 0xe5 (0x14c2814217e5 in /lib64/libc.so.6)
frame #31: /home/nws8519/.conda/envs/olmo/bin/python3.11() [0x5bbf93]
W0904 11:27:17.986000 2769137 /gpfs/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1292] The node 'qgpu2014_2769137_0' has failed to shutdown the rendezvous '3273582' due to an error of type RendezvousConnectionError.
Traceback (most recent call last):
File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 117, in _call_store
return getattr(self._store, store_op)(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch.distributed.DistNetworkError: failed to recv, got 0 bytes
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/nws8519/.conda/envs/olmo/bin/torchrun", line 8, in <module>
sys.exit(main())
^^^^^^
File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
return f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/run.py", line 892, in main
run(args)
File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/run.py", line 883, in run
elastic_launch(
File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 139, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 261, in launch_agent
result = agent.run()
^^^^^^^^^^^
File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/elastic/metrics/api.py", line 138, in wrapper
result = f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/elastic/agent/server/api.py", line 711, in run
result = self._invoke_run(role)
^^^^^^^^^^^^^^^^^^^^^^
File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/elastic/agent/server/api.py", line 864, in _invoke_run
self._initialize_workers(self._worker_group)
File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/elastic/metrics/api.py", line 138, in wrapper
result = f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/elastic/agent/server/api.py", line 683, in _initialize_workers
self._rendezvous(worker_group)
File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/elastic/metrics/api.py", line 138, in wrapper
result = f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/elastic/agent/server/api.py", line 500, in _rendezvous
rdzv_info = spec.rdzv_handler.next_rendezvous()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 1170, in next_rendezvous
self._op_executor.run(join_op, deadline, self._get_deadline)
File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 648, in run
has_set = self._state_holder.sync()
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 437, in sync
get_response = self._backend.get_state()
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 75, in get_state
base64_state: bytes = self._call_store("get", self._key)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 119, in _call_store
raise RendezvousConnectionError(
torch.distributed.elastic.rendezvous.api.RendezvousConnectionError: The connection to the C10d store has failed. See inner exception for details.
E0904 11:27:18.023000 2769136 /gpfs/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/api.py:874] failed (exitcode: 1) local_rank: 0 (pid: 2769218) of binary: /home/nws8519/.conda/envs/olmo/bin/python3.11
[W904 11:27:18.239612027 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=3, addr=[qgpu2014]:57300, remote=[qgpu2013]:29502): Broken pipe
Exception raised from sendBytes at /pytorch/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x14873d0d85e8 in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x5ba8afe (0x1487811fbafe in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #2: <unknown function> + 0x5baa358 (0x1487811fd358 in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #3: <unknown function> + 0x5babb3e (0x1487811feb3e in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #4: c10d::TCPStore::compareSet(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<unsigned char, std::allocator<unsigned char> > const&, std::vector<unsigned char, std::allocator<unsigned char> > const&) + 0x299 (0x1487811f8569 in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #5: <unknown function> + 0xc2a761 (0x148790587761 in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libtorch_python.so)
frame #6: <unknown function> + 0x38a0cc (0x14878fce70cc in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libtorch_python.so)
frame #7: /home/nws8519/.conda/envs/olmo/bin/python3.11() [0x528b17]
frame #8: _PyObject_MakeTpCall + 0x27c (0x50452c in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #9: /home/nws8519/.conda/envs/olmo/bin/python3.11() [0x557ac9]
frame #10: _PyEval_EvalFrameDefault + 0x47c0 (0x515b90 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #11: _PyFunction_Vectorcall + 0x173 (0x539153 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #12: _PyEval_EvalFrameDefault + 0x47c0 (0x515b90 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #13: _PyFunction_Vectorcall + 0x173 (0x539153 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #14: _PyEval_EvalFrameDefault + 0x47c0 (0x515b90 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #15: _PyFunction_Vectorcall + 0x173 (0x539153 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #16: _PyObject_FastCallDictTstate + 0x65 (0x508e05 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #17: _PyObject_Call_Prepend + 0x66 (0x540ac6 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #18: /home/nws8519/.conda/envs/olmo/bin/python3.11() [0x611dd7]
frame #19: PyObject_Call + 0xbd (0x54303d in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #20: _PyEval_EvalFrameDefault + 0x47c0 (0x515b90 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #21: _PyFunction_Vectorcall + 0x173 (0x539153 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #22: _PyEval_EvalFrameDefault + 0x47c0 (0x515b90 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #23: /home/nws8519/.conda/envs/olmo/bin/python3.11() [0x5cc3aa]
frame #24: PyEval_EvalCode + 0x9f (0x5cba7f in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #25: /home/nws8519/.conda/envs/olmo/bin/python3.11() [0x5ecba7]
frame #26: /home/nws8519/.conda/envs/olmo/bin/python3.11() [0x5e8740]
frame #27: /home/nws8519/.conda/envs/olmo/bin/python3.11() [0x5fd5f2]
frame #28: _PyRun_SimpleFileObject + 0x19f (0x5fc9bf in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #29: _PyRun_AnyFileObject + 0x43 (0x5fc6e3 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #30: Py_RunMain + 0x2ee (0x5f73fe in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #31: Py_BytesMain + 0x39 (0x5bc149 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #32: __libc_start_main + 0xe5 (0x1487a596b7e5 in /lib64/libc.so.6)
frame #33: /home/nws8519/.conda/envs/olmo/bin/python3.11() [0x5bbf93]
W0904 11:27:18.030000 2769136 /gpfs/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1292] The node 'qgpu2014_2769136_0' has failed to shutdown the rendezvous '3273582' due to an error of type RendezvousConnectionError.
[W904 11:27:18.248039930 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=3, addr=[qgpu2014]:57300, remote=[qgpu2013]:29502): Broken pipe
Exception raised from sendBytes at /pytorch/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x14873d0d85e8 in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x5ba8afe (0x1487811fbafe in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #2: <unknown function> + 0x5baa358 (0x1487811fd358 in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #3: <unknown function> + 0x5babb3e (0x1487811feb3e in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #4: c10d::TCPStore::compareSet(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<unsigned char, std::allocator<unsigned char> > const&, std::vector<unsigned char, std::allocator<unsigned char> > const&) + 0x299 (0x1487811f8569 in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #5: <unknown function> + 0xc2a761 (0x148790587761 in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libtorch_python.so)
frame #6: <unknown function> + 0x38a0cc (0x14878fce70cc in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libtorch_python.so)
frame #7: /home/nws8519/.conda/envs/olmo/bin/python3.11() [0x528b17]
frame #8: _PyObject_MakeTpCall + 0x27c (0x50452c in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #9: /home/nws8519/.conda/envs/olmo/bin/python3.11() [0x557ac9]
frame #10: _PyEval_EvalFrameDefault + 0x47c0 (0x515b90 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #11: _PyFunction_Vectorcall + 0x173 (0x539153 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #12: _PyEval_EvalFrameDefault + 0x47c0 (0x515b90 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #13: _PyFunction_Vectorcall + 0x173 (0x539153 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #14: _PyObject_FastCallDictTstate + 0x65 (0x508e05 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #15: _PyObject_Call_Prepend + 0x66 (0x540ac6 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #16: /home/nws8519/.conda/envs/olmo/bin/python3.11() [0x611dd7]
frame #17: PyObject_Call + 0xbd (0x54303d in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #18: _PyEval_EvalFrameDefault + 0x47c0 (0x515b90 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #19: _PyFunction_Vectorcall + 0x173 (0x539153 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #20: _PyEval_EvalFrameDefault + 0x47c0 (0x515b90 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #21: /home/nws8519/.conda/envs/olmo/bin/python3.11() [0x5cc3aa]
frame #22: PyEval_EvalCode + 0x9f (0x5cba7f in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #23: /home/nws8519/.conda/envs/olmo/bin/python3.11() [0x5ecba7]
frame #24: /home/nws8519/.conda/envs/olmo/bin/python3.11() [0x5e8740]
frame #25: /home/nws8519/.conda/envs/olmo/bin/python3.11() [0x5fd5f2]
frame #26: _PyRun_SimpleFileObject + 0x19f (0x5fc9bf in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #27: _PyRun_AnyFileObject + 0x43 (0x5fc6e3 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #28: Py_RunMain + 0x2ee (0x5f73fe in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #29: Py_BytesMain + 0x39 (0x5bc149 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #30: __libc_start_main + 0xe5 (0x1487a596b7e5 in /lib64/libc.so.6)
frame #31: /home/nws8519/.conda/envs/olmo/bin/python3.11() [0x5bbf93]
W0904 11:27:18.038000 2769136 /gpfs/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1292] The node 'qgpu2014_2769136_0' has failed to shutdown the rendezvous '3273582' due to an error of type RendezvousConnectionError.
[W904 11:27:18.255885548 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=3, addr=[qgpu2014]:57300, remote=[qgpu2013]:29502): Broken pipe
Exception raised from sendBytes at /pytorch/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x14873d0d85e8 in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x5ba8afe (0x1487811fbafe in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #2: <unknown function> + 0x5baa358 (0x1487811fd358 in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #3: <unknown function> + 0x5babb3e (0x1487811feb3e in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #4: c10d::TCPStore::compareSet(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<unsigned char, std::allocator<unsigned char> > const&, std::vector<unsigned char, std::allocator<unsigned char> > const&) + 0x299 (0x1487811f8569 in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #5: <unknown function> + 0xc2a761 (0x148790587761 in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libtorch_python.so)
frame #6: <unknown function> + 0x38a0cc (0x14878fce70cc in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libtorch_python.so)
frame #7: /home/nws8519/.conda/envs/olmo/bin/python3.11() [0x528b17]
frame #8: _PyObject_MakeTpCall + 0x27c (0x50452c in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #9: /home/nws8519/.conda/envs/olmo/bin/python3.11() [0x557ac9]
frame #10: _PyEval_EvalFrameDefault + 0x47c0 (0x515b90 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #11: _PyFunction_Vectorcall + 0x173 (0x539153 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #12: _PyObject_FastCallDictTstate + 0x65 (0x508e05 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #13: _PyObject_Call_Prepend + 0x66 (0x540ac6 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #14: /home/nws8519/.conda/envs/olmo/bin/python3.11() [0x611dd7]
frame #15: PyObject_Call + 0xbd (0x54303d in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #16: _PyEval_EvalFrameDefault + 0x47c0 (0x515b90 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #17: _PyFunction_Vectorcall + 0x173 (0x539153 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #18: _PyEval_EvalFrameDefault + 0x47c0 (0x515b90 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #19: /home/nws8519/.conda/envs/olmo/bin/python3.11() [0x5cc3aa]
frame #20: PyEval_EvalCode + 0x9f (0x5cba7f in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #21: /home/nws8519/.conda/envs/olmo/bin/python3.11() [0x5ecba7]
frame #22: /home/nws8519/.conda/envs/olmo/bin/python3.11() [0x5e8740]
frame #23: /home/nws8519/.conda/envs/olmo/bin/python3.11() [0x5fd5f2]
frame #24: _PyRun_SimpleFileObject + 0x19f (0x5fc9bf in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #25: _PyRun_AnyFileObject + 0x43 (0x5fc6e3 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #26: Py_RunMain + 0x2ee (0x5f73fe in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #27: Py_BytesMain + 0x39 (0x5bc149 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #28: __libc_start_main + 0xe5 (0x1487a596b7e5 in /lib64/libc.so.6)
frame #29: /home/nws8519/.conda/envs/olmo/bin/python3.11() [0x5bbf93]
W0904 11:27:18.046000 2769136 /gpfs/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1292] The node 'qgpu2014_2769136_0' has failed to shutdown the rendezvous '3273582' due to an error of type RendezvousConnectionError.
Traceback (most recent call last):
File "/home/nws8519/.conda/envs/olmo/bin/torchrun", line 8, in <module>
sys.exit(main())
^^^^^^
File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
return f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/run.py", line 892, in main
run(args)
File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/run.py", line 883, in run
elastic_launch(
File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 139, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 270, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/home/nws8519/git/mw-lifecycle-analysis/p2/quest/python_scripts/olmo_parallel_cat.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2025-09-04_11:27:17
host : qgpu2014
rank : 2 (local_rank: 0)
exitcode : 1 (pid: 2769218)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
srun: error: qgpu2013: task 1: Exited with exit code 1
srun: error: qgpu2014: tasks 2-3: Exited with exit code 1
[W904 11:27:18.383886513 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=3, addr=[qgpu2013]:36246, remote=[qgpu2013]:29502): Broken pipe
Exception raised from sendBytes at /pytorch/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x14977ddbe5e8 in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x5ba8afe (0x1497c1ee1afe in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #2: <unknown function> + 0x5baa358 (0x1497c1ee3358 in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #3: <unknown function> + 0x5babb3e (0x1497c1ee4b3e in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #4: c10d::TCPStore::doWait(c10::ArrayRef<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::chrono::duration<long, std::ratio<1l, 1000l> >) + 0x1a6 (0x1497c1edeac6 in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #5: c10d::TCPStore::doGet(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x33 (0x1497c1edeea3 in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #6: c10d::TCPStore::get(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xab (0x1497c1edff8b in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #7: <unknown function> + 0xc2a390 (0x1497d126d390 in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libtorch_python.so)
frame #8: <unknown function> + 0x38a0cc (0x1497d09cd0cc in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libtorch_python.so)
frame #9: /home/nws8519/.conda/envs/olmo/bin/python3.11() [0x528b17]
frame #10: _PyObject_MakeTpCall + 0x27c (0x50452c in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #11: /home/nws8519/.conda/envs/olmo/bin/python3.11() [0x557ac9]
frame #12: _PyEval_EvalFrameDefault + 0x47c0 (0x515b90 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #13: _PyFunction_Vectorcall + 0x173 (0x539153 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #14: _PyEval_EvalFrameDefault + 0x47c0 (0x515b90 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #15: _PyFunction_Vectorcall + 0x173 (0x539153 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #16: _PyObject_FastCallDictTstate + 0x65 (0x508e05 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #17: _PyObject_Call_Prepend + 0x66 (0x540ac6 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #18: /home/nws8519/.conda/envs/olmo/bin/python3.11() [0x611dd7]
frame #19: PyObject_Call + 0xbd (0x54303d in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #20: _PyEval_EvalFrameDefault + 0x47c0 (0x515b90 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #21: _PyFunction_Vectorcall + 0x173 (0x539153 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #22: _PyEval_EvalFrameDefault + 0x47c0 (0x515b90 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #23: /home/nws8519/.conda/envs/olmo/bin/python3.11() [0x5cc3aa]
frame #24: PyEval_EvalCode + 0x9f (0x5cba7f in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #25: /home/nws8519/.conda/envs/olmo/bin/python3.11() [0x5ecba7]
frame #26: /home/nws8519/.conda/envs/olmo/bin/python3.11() [0x5e8740]
frame #27: /home/nws8519/.conda/envs/olmo/bin/python3.11() [0x5fd5f2]
frame #28: _PyRun_SimpleFileObject + 0x19f (0x5fc9bf in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #29: _PyRun_AnyFileObject + 0x43 (0x5fc6e3 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #30: Py_RunMain + 0x2ee (0x5f73fe in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #31: Py_BytesMain + 0x39 (0x5bc149 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #32: __libc_start_main + 0xe5 (0x1497e66517e5 in /lib64/libc.so.6)
frame #33: /home/nws8519/.conda/envs/olmo/bin/python3.11() [0x5bbf93]
W0904 11:27:18.554000 1736745 /gpfs/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1292] The node 'qgpu2013_1736745_0' has failed to shutdown the rendezvous '3273582' due to an error of type RendezvousConnectionError.
[W904 11:27:18.394906553 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=3, addr=[qgpu2013]:36246, remote=[qgpu2013]:29502): Broken pipe
Exception raised from sendBytes at /pytorch/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x14977ddbe5e8 in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x5ba8afe (0x1497c1ee1afe in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #2: <unknown function> + 0x5baa358 (0x1497c1ee3358 in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #3: <unknown function> + 0x5babb3e (0x1497c1ee4b3e in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #4: c10d::TCPStore::doWait(c10::ArrayRef<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::chrono::duration<long, std::ratio<1l, 1000l> >) + 0x1a6 (0x1497c1edeac6 in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #5: c10d::TCPStore::doGet(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x33 (0x1497c1edeea3 in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #6: c10d::TCPStore::get(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xab (0x1497c1edff8b in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #7: <unknown function> + 0xc2a390 (0x1497d126d390 in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libtorch_python.so)
frame #8: <unknown function> + 0x38a0cc (0x1497d09cd0cc in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libtorch_python.so)
frame #9: /home/nws8519/.conda/envs/olmo/bin/python3.11() [0x528b17]
frame #10: _PyObject_MakeTpCall + 0x27c (0x50452c in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #11: /home/nws8519/.conda/envs/olmo/bin/python3.11() [0x557ac9]
frame #12: _PyEval_EvalFrameDefault + 0x47c0 (0x515b90 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #13: _PyFunction_Vectorcall + 0x173 (0x539153 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #14: _PyObject_FastCallDictTstate + 0x65 (0x508e05 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #15: _PyObject_Call_Prepend + 0x66 (0x540ac6 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #16: /home/nws8519/.conda/envs/olmo/bin/python3.11() [0x611dd7]
frame #17: PyObject_Call + 0xbd (0x54303d in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #18: _PyEval_EvalFrameDefault + 0x47c0 (0x515b90 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #19: _PyFunction_Vectorcall + 0x173 (0x539153 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #20: _PyEval_EvalFrameDefault + 0x47c0 (0x515b90 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #21: /home/nws8519/.conda/envs/olmo/bin/python3.11() [0x5cc3aa]
frame #22: PyEval_EvalCode + 0x9f (0x5cba7f in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #23: /home/nws8519/.conda/envs/olmo/bin/python3.11() [0x5ecba7]
frame #24: /home/nws8519/.conda/envs/olmo/bin/python3.11() [0x5e8740]
frame #25: /home/nws8519/.conda/envs/olmo/bin/python3.11() [0x5fd5f2]
frame #26: _PyRun_SimpleFileObject + 0x19f (0x5fc9bf in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #27: _PyRun_AnyFileObject + 0x43 (0x5fc6e3 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #28: Py_RunMain + 0x2ee (0x5f73fe in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #29: Py_BytesMain + 0x39 (0x5bc149 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #30: __libc_start_main + 0xe5 (0x1497e66517e5 in /lib64/libc.so.6)
frame #31: /home/nws8519/.conda/envs/olmo/bin/python3.11() [0x5bbf93]
W0904 11:27:18.565000 1736745 /gpfs/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1292] The node 'qgpu2013_1736745_0' has failed to shutdown the rendezvous '3273582' due to an error of type RendezvousConnectionError.
Traceback (most recent call last):
File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 117, in _call_store
return getattr(self._store, store_op)(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch.distributed.DistNetworkError: failed to recv, got 0 bytes
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/nws8519/.conda/envs/olmo/bin/torchrun", line 8, in <module>
sys.exit(main())
^^^^^^
File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
return f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/run.py", line 892, in main
run(args)
File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/run.py", line 883, in run
elastic_launch(
File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 139, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 261, in launch_agent
result = agent.run()
^^^^^^^^^^^
File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/elastic/metrics/api.py", line 138, in wrapper
result = f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/elastic/agent/server/api.py", line 711, in run
result = self._invoke_run(role)
^^^^^^^^^^^^^^^^^^^^^^
File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/elastic/agent/server/api.py", line 864, in _invoke_run
self._initialize_workers(self._worker_group)
File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/elastic/metrics/api.py", line 138, in wrapper
result = f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/elastic/agent/server/api.py", line 683, in _initialize_workers
self._rendezvous(worker_group)
File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/elastic/metrics/api.py", line 138, in wrapper
result = f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/elastic/agent/server/api.py", line 500, in _rendezvous
rdzv_info = spec.rdzv_handler.next_rendezvous()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 1170, in next_rendezvous
self._op_executor.run(join_op, deadline, self._get_deadline)
File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 648, in run
has_set = self._state_holder.sync()
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 437, in sync
get_response = self._backend.get_state()
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 75, in get_state
base64_state: bytes = self._call_store("get", self._key)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 119, in _call_store
raise RendezvousConnectionError(
torch.distributed.elastic.rendezvous.api.RendezvousConnectionError: The connection to the C10d store has failed. See inner exception for details.
srun: error: qgpu2013: task 0: Exited with exit code 1
unsupervised olmo categorization pau at Thu Sep 4 11:27:18 CDT 2025
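
In the traceback above, the extra 51.10 GiB allocation comes from DDP's Reducer, which sets up gradient buckets comparable in size to the model's parameters; next to the ~51.7 GiB already held by the loaded checkpoint, that does not fit on a 79.25 GiB card. For inference-only categorization, one option is to skip the DDP wrapper and simply shard the input data across ranks, while still pinning each process to its local GPU as the earlier "specify device_id in init_process_group()" warning suggests. A minimal sketch under those assumptions, with a small `nn.Linear` standing in for the OLMo checkpoint:

```python
# Hedged sketch (not the repository's olmo_parallel_cat.py): per-rank device pinning
# and data sharding for inference, without the DDP wrapper whose Reducer triggered
# the 51.10 GiB allocation above.
import os
import torch
import torch.distributed as dist

local_rank = int(os.environ["LOCAL_RANK"])       # set by torchrun
rank = int(os.environ["RANK"])
world_size = int(os.environ["WORLD_SIZE"])
device = torch.device(f"cuda:{local_rank}")

torch.cuda.set_device(device)                    # bind this rank to one GPU
dist.init_process_group(backend="nccl", device_id=device)  # per the log's warning

model = torch.nn.Linear(8, 8).to(device).eval()  # stand-in for the OLMo model (assumption)

items = list(range(100))                         # stand-in for the comments to label
my_items = items[rank::world_size]               # each rank handles a slice of the data

with torch.no_grad():                            # no gradients needed for labeling
    for _ in my_items:
        _ = model(torch.randn(8, device=device))

dist.destroy_process_group()                     # avoids the shutdown warning in the log
```

Setting `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True`, as the error text recommends, addresses fragmentation but not the Reducer's full-size bucket allocation.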

View File

@@ -17,7 +17,9 @@ def format_df_data(df):
if __name__ == "__main__":
biber_vec_df = pd.read_csv("/home/nws8519/git/mw-lifecycle-analysis/p2/quest/072525_pp_biberplus_labels.csv", low_memory=False)
biber_vec_df = biber_vec_df[biber_vec_df['comment_type'] == 'task_subcomment']
biber_vec_df = biber_vec_df[biber_vec_df['comment_type'] == 'task_description']
biber_vec_df = biber_vec_df[biber_vec_df['AuthorPHID'] != "PHID-USER-idceizaw6elwiwm5xshb"]
#biber_vec_df = biber_vec_df[biber_vec_df['comment_text'] != 'nan']
biber_vecs = format_df_data(biber_vec_df)
#handoff to PCA model
'''
@@ -37,17 +39,33 @@ if __name__ == "__main__":
component_variances = np.var(biber_vecs_pca, axis=0)
print("Variance of each PCA component:", component_variances)
#first looking at comment_type
le = LabelEncoder()
colors = le.fit_transform(biber_vec_df[selected_axis])
pc_dict = {f"PC{i+1}": biber_vecs_pca[:, i] for i in range(18)}
pc_dict[selected_axis] = biber_vec_df[selected_axis].astype(str)
pc_dict["source"] = biber_vec_df['source'].astype(str)
pc_dict["phase"] = biber_vec_df['phase'].astype(str)
pc_dict["text"] = biber_vec_df['comment_text'].astype(str)
pc_dict['id'] = biber_vec_df['id']
pc_dict['week_index'] = biber_vec_df['week_index']
pc_dict['priority'] = biber_vec_df['priority']
pc_dict['closed_relevance'] = biber_vec_df['closed_relevance']
plot_df = pd.DataFrame({
"PC1": biber_vecs_pca[:, 0],
"PC2": biber_vecs_pca[:, 1],
selected_axis: biber_vec_df[selected_axis].astype(str),
"source":biber_vec_df['source'].astype(str),
"phase":biber_vec_df['phase'].astype(str)
})
plot_df = pd.DataFrame(pc_dict)
plot_df.to_csv("090425_description_PCA_df.csv", index=False)
print("Top 10 PC1 values:")
print(plot_df.nlargest(10, "PC1"))
print("\nBottom 10 PC1 values:")
print(plot_df.nsmallest(10, "PC1"))
print("Top 10 PC2 values:")
print(plot_df.nlargest(10, "PC2"))
print("\nBottom 10 PC2 values:")
print(plot_df.nsmallest(10, "PC2"))
g = sns.FacetGrid(plot_df, col="source", row="phase", hue=selected_axis, palette="tab10", height=4, sharex=False, sharey=False)
@@ -74,5 +92,5 @@ if __name__ == "__main__":
plt.legend(title=selected_axis, bbox_to_anchor=(1.05, 1), loc=2)
'''
g.fig.tight_layout()
g.savefig(f"subcomment_{selected_axis}_090425_biber_pca.png", dpi=300)
g.savefig(f"description_{selected_axis}_090425_biber_pca_final.png", dpi=300)
plt.show()
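
As a possible follow-up to the Top/Bottom PC1 and PC2 printouts in the job log, a hedged sketch of one way to relate those component scores back to individual Biber features through the fitted PCA's loadings; the fitted `pca` object and the Biber column names are assumed to come from this script and are not defined here:

```python
# Hedged sketch: rank features by the magnitude of their loading on one principal
# component; pca.components_ has shape (n_components, n_features) in scikit-learn.
import pandas as pd

def top_loadings(pca, feature_names, component=0, k=10):
    loadings = pd.Series(pca.components_[component], index=feature_names)
    order = loadings.abs().sort_values(ascending=False).index
    return loadings.reindex(order).head(k)

# usage (hypothetical names, assuming the fitted `pca` and Biber columns from this script):
# print(top_loadings(pca, biber_feature_columns, component=0))   # drivers of PC1
```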

Binary file not shown. (Before: 1.5 MiB)

Binary file not shown. (Before: 2.4 MiB)

Binary file not shown. (Before: 2.4 MiB)

Binary file not shown. (After: 2.1 MiB)