This should be the updated and refined PCA analysis.
38874
p2/quest/090425_description_PCA_df.csv
Normal file
96921
p2/quest/090425_subcomment_PCA_df.csv
Normal file
@@ -1,9 +1,66 @@
-starting the job at: Thu Sep 4 11:02:03 CDT 2025
+starting the job at: Thu Sep 4 15:41:55 CDT 2025
 setting up the environment
 running the neurobiber labeling script
-Variance of each PCA component: [259.38215213 83.11803664 67.16301107 61.78747188 38.94875996
- 32.78688889 26.45592105 21.9280629 18.734197 16.29485568
- 13.48304855 11.50594609 10.77855857 9.30674176 8.96113511
- 8.35521401 8.17815209 7.13194427]
+Variance of each PCA component: [88.92832185 39.46471687 32.34601523 20.19544345 14.0083261 11.5837521
+ 7.82584723 6.89064989 6.07988254 5.80726367 5.49782354 4.50587747
+ 4.31482409 2.81997326 2.62989708 2.27205352 2.09396341 2.00076119]
+Top 10 PC1 values:
+              PC1        PC2  ...      priority  closed_relevance
+19873  125.128650  24.461032  ...        Medium             False
+21956  125.128650  24.461032  ...  Needs Triage              True
+22010  125.128650  24.461032  ...  Needs Triage              True
+24528  125.128650  24.461032  ...  Needs Triage              True
+24529  125.128650  24.461032  ...  Needs Triage             False
+25549  125.128650  24.461032  ...        Medium             False
+6329    72.728923  28.262157  ...        Medium             False
+11288   72.728923  28.262157  ...           Low             False
+22332   72.728923  28.262157  ...          High              True
+22731   72.728923  28.262157  ...        Medium              True
+
+[10 rows x 26 columns]
+
+Bottom 10 PC1 values:
+              PC1        PC2  ...      priority  closed_relevance
+12503  -16.333841  17.142328  ...           Low               NaN
+3462   -15.759184  15.368325  ...          High               NaN
+23838  -14.821270  17.471553  ...           Low             False
+25791  -14.806017  12.439508  ...        Medium              True
+23053  -14.399838  15.867529  ...        Medium             False
+24180  -14.046494  12.993193  ...           Low              True
+11814  -14.009692  13.953416  ...           Low             False
+24699  -13.848945  15.308788  ...  Needs Triage              True
+24214  -13.701324  11.951003  ...           Low             False
+24467  -13.680693  11.614764  ...  Needs Triage              True
+
+[10 rows x 26 columns]
+Top 10 PC2 values:
+             PC1        PC2         PC3  ...  week_index  priority  closed_relevance
+6329   72.728923  28.262157  -52.466963  ...          10    Medium             False
+11288  72.728923  28.262157  -52.466963  ...           2       Low             False
+22332  72.728923  28.262157  -52.466963  ...           4      High              True
+22731  72.728923  28.262157  -52.466963  ...          10    Medium              True
+23016  72.728923  28.262157  -52.466963  ...           7    Medium             False
+23022  72.728923  28.262157  -52.466963  ...           7    Medium             False
+23086  72.728923  28.262157  -52.466963  ...           6    Medium             False
+23238  72.728923  28.262157  -52.466963  ...           4    Medium             False
+25606  72.728923  28.262157  -52.466963  ...         -22    Medium              True
+25843  72.728923  28.262157  -52.466963  ...         -31    Medium              True
+
+[10 rows x 26 columns]
+
+Bottom 10 PC2 values:
+            PC1         PC2        PC3  ...  week_index      priority  closed_relevance
+741    1.197394  -18.726602  -5.305851  ...         -33  Unbreak Now!              True
+6492   1.197394  -18.726602  -5.305851  ...           8        Medium             False
+6495   1.197394  -18.726602  -5.305851  ...           8        Medium             False
+8834   1.197394  -18.726602  -5.305851  ...          -2        Medium             False
+9292   1.197394  -18.726602  -5.305851  ...          -4        Medium              True
+9419   1.197394  -18.726602  -5.305851  ...          -6        Medium               NaN
+10686  1.197394  -18.726602  -5.305851  ...           8           Low               NaN
+11301  1.197394  -18.726602  -5.305851  ...           2           Low              True
+11306  1.197394  -18.726602  -5.305851  ...           2           Low              True
+11312  1.197394  -18.726602  -5.305851  ...           2           Low              True
+
+[10 rows x 26 columns]
 job finished, cleaning up
-job pau at: Thu Sep 4 11:02:32 CDT 2025
+job pau at: Thu Sep 4 15:42:13 CDT 2025
@@ -71,3 +71,531 @@ Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed.
 Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`
 Fetching 12 files: 8%|▊ | 1/12 [03:13<35:25, 193.27s/it]Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`
 Fetching 12 files: 17%|█▋ | 2/12 [04:23<20:11, 121.10s/it]Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`
+Fetching 12 files: 25%|██▌ | 3/12 [05:03<12:33, 83.77s/it] Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`
+Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`
+Fetching 12 files: 33%|███▎ | 4/12 [06:42<11:59, 89.90s/it]
+Fetching 12 files: 42%|████▏ | 5/12 [07:04<07:37, 65.41s/it]
+Fetching 12 files: 50%|█████ | 6/12 [07:08<04:26, 44.47s/it]
+Fetching 12 files: 75%|███████▌ | 9/12 [07:37<01:11, 23.91s/it]
+Fetching 12 files: 83%|████████▎ | 10/12 [07:49<00:42, 21.14s/it]
+Fetching 12 files: 100%|██████████| 12/12 [07:49<00:00, 39.09s/it]
+Loading checkpoint shards: 0%| | 0/12 [00:00<?, ?it/s]
+Loading checkpoint shards: 8%|▊ | 1/12 [00:00<00:04, 2.49it/s]
+Loading checkpoint shards: 17%|█▋ | 2/12 [00:01<00:05, 1.81it/s]
+Loading checkpoint shards: 25%|██▌ | 3/12 [00:01<00:04, 1.83it/s]
+Loading checkpoint shards: 33%|███▎ | 4/12 [00:02<00:04, 1.94it/s]
+Loading checkpoint shards: 42%|████▏ | 5/12 [00:02<00:03, 2.09it/s]
+Loading checkpoint shards: 50%|█████ | 6/12 [00:02<00:02, 2.21it/s]
+Loading checkpoint shards: 58%|█████▊ | 7/12 [00:03<00:02, 2.31it/s]
+Loading checkpoint shards: 67%|██████▋ | 8/12 [00:03<00:01, 2.14it/s]
+Loading checkpoint shards: 75%|███████▌ | 9/12 [00:04<00:01, 2.39it/s]
+Loading checkpoint shards: 83%|████████▎ | 10/12 [00:04<00:00, 2.92it/s]
+Loading checkpoint shards: 100%|██████████| 12/12 [00:04<00:00, 4.85it/s]
+Loading checkpoint shards: 100%|██████████| 12/12 [00:04<00:00, 2.73it/s]
+[rank0]:[W904 11:25:15.000410288 ProcessGroupNCCL.cpp:4715] [PG ID 0 PG GUID 0 Rank 0] using GPU 0 as device used by this process is currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. You can pecify device_id in init_process_group() to force use of a particular device.
+Loading checkpoint shards: 0%| | 0/12 [00:00<?, ?it/s]
+Loading checkpoint shards: 8%|▊ | 1/12 [00:00<00:03, 3.04it/s]
+Loading checkpoint shards: 17%|█▋ | 2/12 [00:00<00:04, 2.20it/s]
+Loading checkpoint shards: 25%|██▌ | 3/12 [00:01<00:04, 2.10it/s]
+Loading checkpoint shards: 33%|███▎ | 4/12 [00:01<00:04, 1.91it/s]
+Loading checkpoint shards: 100%|██████████| 12/12 [00:02<00:00, 5.96it/s]
+Loading checkpoint shards: 0%| | 0/12 [00:00<?, ?it/s]
+Loading checkpoint shards: 0%| | 0/12 [00:00<?, ?it/s]
+Loading checkpoint shards: 8%|▊ | 1/12 [00:00<00:03, 3.02it/s]
+Loading checkpoint shards: 8%|▊ | 1/12 [00:00<00:03, 3.02it/s]
+Loading checkpoint shards: 17%|█▋ | 2/12 [00:00<00:04, 2.18it/s]
+Loading checkpoint shards: 17%|█▋ | 2/12 [00:00<00:04, 2.18it/s]
+Loading checkpoint shards: 25%|██▌ | 3/12 [00:01<00:04, 2.09it/s]
+Loading checkpoint shards: 25%|██▌ | 3/12 [00:01<00:04, 2.09it/s]
+Loading checkpoint shards: 33%|███▎ | 4/12 [00:01<00:04, 1.89it/s]
+Loading checkpoint shards: 33%|███▎ | 4/12 [00:01<00:04, 1.89it/s]
+Loading checkpoint shards: 42%|████▏ | 5/12 [00:02<00:03, 1.80it/s]
+Loading checkpoint shards: 42%|████▏ | 5/12 [00:02<00:03, 1.80it/s]
+Loading checkpoint shards: 50%|█████ | 6/12 [00:03<00:03, 1.63it/s]
+Loading checkpoint shards: 50%|█████ | 6/12 [00:03<00:03, 1.63it/s]
+Loading checkpoint shards: 58%|█████▊ | 7/12 [00:03<00:03, 1.65it/s]
+Loading checkpoint shards: 58%|█████▊ | 7/12 [00:03<00:03, 1.65it/s]
+Loading checkpoint shards: 67%|██████▋ | 8/12 [00:04<00:02, 1.75it/s]
+Loading checkpoint shards: 67%|██████▋ | 8/12 [00:04<00:02, 1.75it/s]
+Loading checkpoint shards: 75%|███████▌ | 9/12 [00:04<00:01, 1.82it/s]
+Loading checkpoint shards: 75%|███████▌ | 9/12 [00:04<00:01, 1.82it/s]
+Loading checkpoint shards: 83%|████████▎ | 10/12 [00:05<00:01, 1.94it/s]
+Loading checkpoint shards: 83%|████████▎ | 10/12 [00:05<00:01, 1.94it/s]
+Loading checkpoint shards: 92%|█████████▏| 11/12 [00:05<00:00, 2.08it/s]
+Loading checkpoint shards: 92%|█████████▏| 11/12 [00:05<00:00, 2.08it/s]
+Loading checkpoint shards: 100%|██████████| 12/12 [00:05<00:00, 2.09it/s]
+Loading checkpoint shards: 100%|██████████| 12/12 [00:05<00:00, 2.09it/s]
+
+[rank2]: Traceback (most recent call last):
+[rank2]:   File "/home/nws8519/git/mw-lifecycle-analysis/p2/quest/python_scripts/olmo_parallel_cat.py", line 188, in <module>
+[rank2]:     main()
+[rank2]:   File "/home/nws8519/git/mw-lifecycle-analysis/p2/quest/python_scripts/olmo_parallel_cat.py", line 143, in main
+[rank2]:     ddp_olmo = DDP(olmo, device_ids=[local_rank])
+[rank2]:                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+[rank2]:   File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/nn/parallel/distributed.py", line 850, in __init__
+[rank2]:     self._ddp_init_helper(
+[rank2]:   File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/nn/parallel/distributed.py", line 1201, in _ddp_init_helper
+[rank2]:     self.reducer = dist.Reducer(
+[rank2]:                    ^^^^^^^^^^^^^
+[rank2]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 51.10 GiB. GPU 0 has a total capacity of 79.25 GiB of which 27.52 GiB is free. Including non-PyTorch memory, this process has 51.72 GiB memory in use. Of the allocated memory 51.10 GiB is allocated by PyTorch, and 875.00 KiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
+[rank3]: Traceback (most recent call last):
+[rank3]:   File "/home/nws8519/git/mw-lifecycle-analysis/p2/quest/python_scripts/olmo_parallel_cat.py", line 188, in <module>
+[rank3]:     main()
+[rank3]:   File "/home/nws8519/git/mw-lifecycle-analysis/p2/quest/python_scripts/olmo_parallel_cat.py", line 143, in main
+[rank3]:     ddp_olmo = DDP(olmo, device_ids=[local_rank])
+[rank3]:                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+[rank3]:   File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/nn/parallel/distributed.py", line 850, in __init__
+[rank3]:     self._ddp_init_helper(
+[rank3]:   File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/nn/parallel/distributed.py", line 1201, in _ddp_init_helper
+[rank3]:     self.reducer = dist.Reducer(
+[rank3]:                    ^^^^^^^^^^^^^
+[rank3]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 51.10 GiB. GPU 1 has a total capacity of 79.25 GiB of which 27.52 GiB is free. Including non-PyTorch memory, this process has 51.72 GiB memory in use. Of the allocated memory 51.10 GiB is allocated by PyTorch, and 875.00 KiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
+[rank0]: Traceback (most recent call last):
+[rank0]:   File "/home/nws8519/git/mw-lifecycle-analysis/p2/quest/python_scripts/olmo_parallel_cat.py", line 188, in <module>
+[rank0]:     main()
+[rank0]:   File "/home/nws8519/git/mw-lifecycle-analysis/p2/quest/python_scripts/olmo_parallel_cat.py", line 143, in main
+[rank0]:     ddp_olmo = DDP(olmo, device_ids=[local_rank])
+[rank0]:                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+[rank0]:   File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/nn/parallel/distributed.py", line 850, in __init__
+[rank0]:     self._ddp_init_helper(
+[rank0]:   File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/nn/parallel/distributed.py", line 1201, in _ddp_init_helper
+[rank0]:     self.reducer = dist.Reducer(
+[rank0]:                    ^^^^^^^^^^^^^
+[rank0]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 51.10 GiB. GPU 0 has a total capacity of 79.25 GiB of which 27.52 GiB is free. Including non-PyTorch memory, this process has 51.72 GiB memory in use. Of the allocated memory 51.10 GiB is allocated by PyTorch, and 875.00 KiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
+[rank1]: Traceback (most recent call last):
+[rank1]:   File "/home/nws8519/git/mw-lifecycle-analysis/p2/quest/python_scripts/olmo_parallel_cat.py", line 188, in <module>
+[rank1]:     main()
+[rank1]:   File "/home/nws8519/git/mw-lifecycle-analysis/p2/quest/python_scripts/olmo_parallel_cat.py", line 143, in main
+[rank1]:     ddp_olmo = DDP(olmo, device_ids=[local_rank])
+[rank1]:                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+[rank1]:   File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/nn/parallel/distributed.py", line 850, in __init__
+[rank1]:     self._ddp_init_helper(
+[rank1]:   File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/nn/parallel/distributed.py", line 1201, in _ddp_init_helper
+[rank1]:     self.reducer = dist.Reducer(
+[rank1]:                    ^^^^^^^^^^^^^
+[rank1]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 51.10 GiB. GPU 1 has a total capacity of 79.25 GiB of which 27.52 GiB is free. Including non-PyTorch memory, this process has 51.72 GiB memory in use. Of the allocated memory 51.10 GiB is allocated by PyTorch, and 875.00 KiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
+[rank2]:[W904 11:27:15.787618003 ProcessGroupNCCL.cpp:1476] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
+[rank0]:[W904 11:27:15.409824698 ProcessGroupNCCL.cpp:1476] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
+W0904 11:27:17.571000 1736746 /gpfs/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/api.py:900] Sending process 1736801 closing signal SIGTERM
+E0904 11:27:17.635000 1736746 /gpfs/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/api.py:874] failed (exitcode: 1) local_rank: 1 (pid: 1736802) of binary: /home/nws8519/.conda/envs/olmo/bin/python3.11
+Traceback (most recent call last):
+  File "/home/nws8519/.conda/envs/olmo/bin/torchrun", line 8, in <module>
+    sys.exit(main())
+             ^^^^^^
+  File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
+    return f(*args, **kwargs)
+           ^^^^^^^^^^^^^^^^^^
+  File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/run.py", line 892, in main
+    run(args)
+  File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/run.py", line 883, in run
+    elastic_launch(
+  File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 139, in __call__
+    return launch_agent(self._config, self._entrypoint, list(args))
+           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+  File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 270, in launch_agent
+    raise ChildFailedError(
+torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
+============================================================
+/home/nws8519/git/mw-lifecycle-analysis/p2/quest/python_scripts/olmo_parallel_cat.py FAILED
+------------------------------------------------------------
+Failures:
+  <NO_OTHER_FAILURES>
+------------------------------------------------------------
+Root Cause (first observed failure):
+[0]:
+  time      : 2025-09-04_11:27:17
+  host      : qgpu2013
+  rank      : 1 (local_rank: 1)
+  exitcode  : 1 (pid: 1736802)
+  error_file: <N/A>
+  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
+============================================================
+[W904 11:27:17.168398358 TCPStore.cpp:115] [c10d] recvVector failed on SocketImpl(fd=3, addr=[qgpu2014]:57300, remote=[qgpu2013]:29502): failed to recv, got 0 bytes
+Exception raised from recvBytes at /pytorch/torch/csrc/distributed/c10d/Utils.hpp:678 (most recent call first):
+frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x14873d0d85e8 in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libc10.so)
+frame #1: <unknown function> + 0x5ba8afe (0x1487811fbafe in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
+frame #2: <unknown function> + 0x5baa0d0 (0x1487811fd0d0 in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
+frame #3: <unknown function> + 0x5baa81d (0x1487811fd81d in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
+frame #4: <unknown function> + 0x5bab4a9 (0x1487811fe4a9 in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
+frame #5: c10d::TCPStore::compareSet(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<unsigned char, std::allocator<unsigned char> > const&, std::vector<unsigned char, std::allocator<unsigned char> > const&) + 0x1fb (0x1487811f84cb in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
+frame #6: <unknown function> + 0xc2a761 (0x148790587761 in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libtorch_python.so)
+frame #7: <unknown function> + 0x38a0cc (0x14878fce70cc in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libtorch_python.so)
+frame #8: /home/nws8519/.conda/envs/olmo/bin/python3.11() [0x528b17]
+frame #9: _PyObject_MakeTpCall + 0x27c (0x50452c in /home/nws8519/.conda/envs/olmo/bin/python3.11)
+frame #10: /home/nws8519/.conda/envs/olmo/bin/python3.11() [0x557ac9]
+frame #11: _PyEval_EvalFrameDefault + 0x47c0 (0x515b90 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
+frame #12: _PyFunction_Vectorcall + 0x173 (0x539153 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
+frame #13: _PyEval_EvalFrameDefault + 0x47c0 (0x515b90 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
+frame #14: _PyFunction_Vectorcall + 0x173 (0x539153 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
+frame #15: _PyEval_EvalFrameDefault + 0x47c0 (0x515b90 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
+frame #16: /home/nws8519/.conda/envs/olmo/bin/python3.11() [0x5581df]
+frame #17: /home/nws8519/.conda/envs/olmo/bin/python3.11() [0x557a20]
+frame #18: /home/nws8519/.conda/envs/olmo/bin/python3.11() [0x62a8a3]
+frame #19: /home/nws8519/.conda/envs/olmo/bin/python3.11() [0x5fa3c4]
+frame #20: <unknown function> + 0x81ca (0x1487a64991ca in /lib64/libpthread.so.0)
+frame #21: clone + 0x43 (0x1487a596a8d3 in /lib64/libc.so.6)
+
+W0904 11:27:17.959000 2769136 /gpfs/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1341] The node 'qgpu2014_2769136_0' has failed to send a keep-alive heartbeat to the rendezvous '3273582' due to an error of type RendezvousConnectionError.
+W0904 11:27:17.959000 2769136 /gpfs/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/api.py:900] Sending process 2769219 closing signal SIGTERM
+[W904 11:27:17.170100534 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=3, addr=[qgpu2014]:57290, remote=[qgpu2013]:29502): Broken pipe
+Exception raised from sendBytes at /pytorch/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first):
+frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x14c218b8e5e8 in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libc10.so)
+frame #1: <unknown function> + 0x5ba8afe (0x14c25ccb1afe in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
+frame #2: <unknown function> + 0x5baa358 (0x14c25ccb3358 in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
+frame #3: <unknown function> + 0x5babb3e (0x14c25ccb4b3e in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
+frame #4: c10d::TCPStore::doWait(c10::ArrayRef<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::chrono::duration<long, std::ratio<1l, 1000l> >) + 0x1a6 (0x14c25ccaeac6 in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
+frame #5: c10d::TCPStore::doGet(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x33 (0x14c25ccaeea3 in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
+frame #6: c10d::TCPStore::get(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xab (0x14c25ccaff8b in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
+frame #7: <unknown function> + 0xc2a390 (0x14c26c03d390 in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libtorch_python.so)
+frame #8: <unknown function> + 0x38a0cc (0x14c26b79d0cc in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libtorch_python.so)
+frame #9: /home/nws8519/.conda/envs/olmo/bin/python3.11() [0x528b17]
+frame #10: _PyObject_MakeTpCall + 0x27c (0x50452c in /home/nws8519/.conda/envs/olmo/bin/python3.11)
+frame #11: /home/nws8519/.conda/envs/olmo/bin/python3.11() [0x557ac9]
+frame #12: _PyEval_EvalFrameDefault + 0x47c0 (0x515b90 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
+frame #13: _PyFunction_Vectorcall + 0x173 (0x539153 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
+frame #14: _PyEval_EvalFrameDefault + 0x47c0 (0x515b90 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
+frame #15: _PyFunction_Vectorcall + 0x173 (0x539153 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
+frame #16: _PyObject_FastCallDictTstate + 0x65 (0x508e05 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
+frame #17: _PyObject_Call_Prepend + 0x66 (0x540ac6 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
+frame #18: /home/nws8519/.conda/envs/olmo/bin/python3.11() [0x611dd7]
+frame #19: PyObject_Call + 0xbd (0x54303d in /home/nws8519/.conda/envs/olmo/bin/python3.11)
+frame #20: _PyEval_EvalFrameDefault + 0x47c0 (0x515b90 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
+frame #21: _PyFunction_Vectorcall + 0x173 (0x539153 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
+frame #22: _PyEval_EvalFrameDefault + 0x47c0 (0x515b90 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
+frame #23: /home/nws8519/.conda/envs/olmo/bin/python3.11() [0x5cc3aa]
+frame #24: PyEval_EvalCode + 0x9f (0x5cba7f in /home/nws8519/.conda/envs/olmo/bin/python3.11)
+frame #25: /home/nws8519/.conda/envs/olmo/bin/python3.11() [0x5ecba7]
+frame #26: /home/nws8519/.conda/envs/olmo/bin/python3.11() [0x5e8740]
+frame #27: /home/nws8519/.conda/envs/olmo/bin/python3.11() [0x5fd5f2]
+frame #28: _PyRun_SimpleFileObject + 0x19f (0x5fc9bf in /home/nws8519/.conda/envs/olmo/bin/python3.11)
+frame #29: _PyRun_AnyFileObject + 0x43 (0x5fc6e3 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
+frame #30: Py_RunMain + 0x2ee (0x5f73fe in /home/nws8519/.conda/envs/olmo/bin/python3.11)
+frame #31: Py_BytesMain + 0x39 (0x5bc149 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
+frame #32: __libc_start_main + 0xe5 (0x14c2814217e5 in /lib64/libc.so.6)
+frame #33: /home/nws8519/.conda/envs/olmo/bin/python3.11() [0x5bbf93]
+
|
W0904 11:27:17.963000 2769137 /gpfs/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1292] The node 'qgpu2014_2769137_0' has failed to shutdown the rendezvous '3273582' due to an error of type RendezvousConnectionError.
|
||||||
|
[W904 11:27:17.194777840 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=3, addr=[qgpu2014]:57290, remote=[qgpu2013]:29502): Broken pipe
|
||||||
|
Exception raised from sendBytes at /pytorch/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first):
|
||||||
|
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x14c218b8e5e8 in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libc10.so)
|
||||||
|
frame #1: <unknown function> + 0x5ba8afe (0x14c25ccb1afe in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
|
||||||
|
frame #2: <unknown function> + 0x5baa358 (0x14c25ccb3358 in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
|
||||||
|
frame #3: <unknown function> + 0x5babb3e (0x14c25ccb4b3e in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
|
||||||
|
frame #4: c10d::TCPStore::doWait(c10::ArrayRef<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::chrono::duration<long, std::ratio<1l, 1000l> >) + 0x1a6 (0x14c25ccaeac6 in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
|
||||||
|
frame #5: c10d::TCPStore::doGet(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x33 (0x14c25ccaeea3 in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
|
||||||
|
frame #6: c10d::TCPStore::get(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xab (0x14c25ccaff8b in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
|
||||||
|
frame #7: <unknown function> + 0xc2a390 (0x14c26c03d390 in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libtorch_python.so)
|
||||||
|
frame #8: <unknown function> + 0x38a0cc (0x14c26b79d0cc in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libtorch_python.so)
|
||||||
|
frame #9: /home/nws8519/.conda/envs/olmo/bin/python3.11() [0x528b17]
|
||||||
|
frame #10: _PyObject_MakeTpCall + 0x27c (0x50452c in /home/nws8519/.conda/envs/olmo/bin/python3.11)
|
||||||
|
frame #11: /home/nws8519/.conda/envs/olmo/bin/python3.11() [0x557ac9]
|
||||||
|
frame #12: _PyEval_EvalFrameDefault + 0x47c0 (0x515b90 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
|
||||||
|
frame #13: _PyFunction_Vectorcall + 0x173 (0x539153 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
|
||||||
|
frame #14: _PyObject_FastCallDictTstate + 0x65 (0x508e05 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
|
||||||
|
frame #15: _PyObject_Call_Prepend + 0x66 (0x540ac6 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
|
||||||
|
frame #16: /home/nws8519/.conda/envs/olmo/bin/python3.11() [0x611dd7]
|
||||||
|
frame #17: PyObject_Call + 0xbd (0x54303d in /home/nws8519/.conda/envs/olmo/bin/python3.11)
|
||||||
|
frame #18: _PyEval_EvalFrameDefault + 0x47c0 (0x515b90 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
|
||||||
|
frame #19: _PyFunction_Vectorcall + 0x173 (0x539153 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
|
||||||
|
frame #20: _PyEval_EvalFrameDefault + 0x47c0 (0x515b90 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
|
||||||
|
frame #21: /home/nws8519/.conda/envs/olmo/bin/python3.11() [0x5cc3aa]
|
||||||
|
frame #22: PyEval_EvalCode + 0x9f (0x5cba7f in /home/nws8519/.conda/envs/olmo/bin/python3.11)
|
||||||
|
frame #23: /home/nws8519/.conda/envs/olmo/bin/python3.11() [0x5ecba7]
|
||||||
|
frame #24: /home/nws8519/.conda/envs/olmo/bin/python3.11() [0x5e8740]
|
||||||
|
frame #25: /home/nws8519/.conda/envs/olmo/bin/python3.11() [0x5fd5f2]
|
||||||
|
frame #26: _PyRun_SimpleFileObject + 0x19f (0x5fc9bf in /home/nws8519/.conda/envs/olmo/bin/python3.11)
|
||||||
|
frame #27: _PyRun_AnyFileObject + 0x43 (0x5fc6e3 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #28: Py_RunMain + 0x2ee (0x5f73fe in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #29: Py_BytesMain + 0x39 (0x5bc149 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #30: __libc_start_main + 0xe5 (0x14c2814217e5 in /lib64/libc.so.6)
frame #31: /home/nws8519/.conda/envs/olmo/bin/python3.11() [0x5bbf93]
W0904 11:27:17.986000 2769137 /gpfs/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1292] The node 'qgpu2014_2769137_0' has failed to shutdown the rendezvous '3273582' due to an error of type RendezvousConnectionError.
Traceback (most recent call last):
File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 117, in _call_store
return getattr(self._store, store_op)(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch.distributed.DistNetworkError: failed to recv, got 0 bytes
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/nws8519/.conda/envs/olmo/bin/torchrun", line 8, in <module>
sys.exit(main())
^^^^^^
File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
return f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/run.py", line 892, in main
run(args)
File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/run.py", line 883, in run
elastic_launch(
File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 139, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 261, in launch_agent
result = agent.run()
^^^^^^^^^^^
File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/elastic/metrics/api.py", line 138, in wrapper
result = f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/elastic/agent/server/api.py", line 711, in run
result = self._invoke_run(role)
^^^^^^^^^^^^^^^^^^^^^^
File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/elastic/agent/server/api.py", line 864, in _invoke_run
self._initialize_workers(self._worker_group)
File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/elastic/metrics/api.py", line 138, in wrapper
result = f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/elastic/agent/server/api.py", line 683, in _initialize_workers
self._rendezvous(worker_group)
File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/elastic/metrics/api.py", line 138, in wrapper
result = f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/elastic/agent/server/api.py", line 500, in _rendezvous
rdzv_info = spec.rdzv_handler.next_rendezvous()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 1170, in next_rendezvous
self._op_executor.run(join_op, deadline, self._get_deadline)
File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 648, in run
has_set = self._state_holder.sync()
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 437, in sync
get_response = self._backend.get_state()
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 75, in get_state
base64_state: bytes = self._call_store("get", self._key)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 119, in _call_store
raise RendezvousConnectionError(
torch.distributed.elastic.rendezvous.api.RendezvousConnectionError: The connection to the C10d store has failed. See inner exception for details.
E0904 11:27:18.023000 2769136 /gpfs/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/api.py:874] failed (exitcode: 1) local_rank: 0 (pid: 2769218) of binary: /home/nws8519/.conda/envs/olmo/bin/python3.11
[W904 11:27:18.239612027 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=3, addr=[qgpu2014]:57300, remote=[qgpu2013]:29502): Broken pipe
Exception raised from sendBytes at /pytorch/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x14873d0d85e8 in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x5ba8afe (0x1487811fbafe in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #2: <unknown function> + 0x5baa358 (0x1487811fd358 in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #3: <unknown function> + 0x5babb3e (0x1487811feb3e in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #4: c10d::TCPStore::compareSet(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<unsigned char, std::allocator<unsigned char> > const&, std::vector<unsigned char, std::allocator<unsigned char> > const&) + 0x299 (0x1487811f8569 in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #5: <unknown function> + 0xc2a761 (0x148790587761 in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libtorch_python.so)
frame #6: <unknown function> + 0x38a0cc (0x14878fce70cc in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libtorch_python.so)
frame #7: /home/nws8519/.conda/envs/olmo/bin/python3.11() [0x528b17]
frame #8: _PyObject_MakeTpCall + 0x27c (0x50452c in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #9: /home/nws8519/.conda/envs/olmo/bin/python3.11() [0x557ac9]
frame #10: _PyEval_EvalFrameDefault + 0x47c0 (0x515b90 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #11: _PyFunction_Vectorcall + 0x173 (0x539153 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #12: _PyEval_EvalFrameDefault + 0x47c0 (0x515b90 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #13: _PyFunction_Vectorcall + 0x173 (0x539153 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #14: _PyEval_EvalFrameDefault + 0x47c0 (0x515b90 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #15: _PyFunction_Vectorcall + 0x173 (0x539153 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #16: _PyObject_FastCallDictTstate + 0x65 (0x508e05 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #17: _PyObject_Call_Prepend + 0x66 (0x540ac6 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #18: /home/nws8519/.conda/envs/olmo/bin/python3.11() [0x611dd7]
frame #19: PyObject_Call + 0xbd (0x54303d in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #20: _PyEval_EvalFrameDefault + 0x47c0 (0x515b90 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #21: _PyFunction_Vectorcall + 0x173 (0x539153 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #22: _PyEval_EvalFrameDefault + 0x47c0 (0x515b90 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #23: /home/nws8519/.conda/envs/olmo/bin/python3.11() [0x5cc3aa]
frame #24: PyEval_EvalCode + 0x9f (0x5cba7f in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #25: /home/nws8519/.conda/envs/olmo/bin/python3.11() [0x5ecba7]
frame #26: /home/nws8519/.conda/envs/olmo/bin/python3.11() [0x5e8740]
frame #27: /home/nws8519/.conda/envs/olmo/bin/python3.11() [0x5fd5f2]
frame #28: _PyRun_SimpleFileObject + 0x19f (0x5fc9bf in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #29: _PyRun_AnyFileObject + 0x43 (0x5fc6e3 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #30: Py_RunMain + 0x2ee (0x5f73fe in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #31: Py_BytesMain + 0x39 (0x5bc149 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #32: __libc_start_main + 0xe5 (0x1487a596b7e5 in /lib64/libc.so.6)
frame #33: /home/nws8519/.conda/envs/olmo/bin/python3.11() [0x5bbf93]
W0904 11:27:18.030000 2769136 /gpfs/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1292] The node 'qgpu2014_2769136_0' has failed to shutdown the rendezvous '3273582' due to an error of type RendezvousConnectionError.
[W904 11:27:18.248039930 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=3, addr=[qgpu2014]:57300, remote=[qgpu2013]:29502): Broken pipe
Exception raised from sendBytes at /pytorch/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x14873d0d85e8 in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x5ba8afe (0x1487811fbafe in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #2: <unknown function> + 0x5baa358 (0x1487811fd358 in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #3: <unknown function> + 0x5babb3e (0x1487811feb3e in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #4: c10d::TCPStore::compareSet(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<unsigned char, std::allocator<unsigned char> > const&, std::vector<unsigned char, std::allocator<unsigned char> > const&) + 0x299 (0x1487811f8569 in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #5: <unknown function> + 0xc2a761 (0x148790587761 in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libtorch_python.so)
frame #6: <unknown function> + 0x38a0cc (0x14878fce70cc in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libtorch_python.so)
frame #7: /home/nws8519/.conda/envs/olmo/bin/python3.11() [0x528b17]
frame #8: _PyObject_MakeTpCall + 0x27c (0x50452c in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #9: /home/nws8519/.conda/envs/olmo/bin/python3.11() [0x557ac9]
frame #10: _PyEval_EvalFrameDefault + 0x47c0 (0x515b90 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #11: _PyFunction_Vectorcall + 0x173 (0x539153 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #12: _PyEval_EvalFrameDefault + 0x47c0 (0x515b90 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #13: _PyFunction_Vectorcall + 0x173 (0x539153 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #14: _PyObject_FastCallDictTstate + 0x65 (0x508e05 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #15: _PyObject_Call_Prepend + 0x66 (0x540ac6 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #16: /home/nws8519/.conda/envs/olmo/bin/python3.11() [0x611dd7]
frame #17: PyObject_Call + 0xbd (0x54303d in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #18: _PyEval_EvalFrameDefault + 0x47c0 (0x515b90 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #19: _PyFunction_Vectorcall + 0x173 (0x539153 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #20: _PyEval_EvalFrameDefault + 0x47c0 (0x515b90 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #21: /home/nws8519/.conda/envs/olmo/bin/python3.11() [0x5cc3aa]
frame #22: PyEval_EvalCode + 0x9f (0x5cba7f in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #23: /home/nws8519/.conda/envs/olmo/bin/python3.11() [0x5ecba7]
frame #24: /home/nws8519/.conda/envs/olmo/bin/python3.11() [0x5e8740]
frame #25: /home/nws8519/.conda/envs/olmo/bin/python3.11() [0x5fd5f2]
frame #26: _PyRun_SimpleFileObject + 0x19f (0x5fc9bf in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #27: _PyRun_AnyFileObject + 0x43 (0x5fc6e3 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #28: Py_RunMain + 0x2ee (0x5f73fe in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #29: Py_BytesMain + 0x39 (0x5bc149 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #30: __libc_start_main + 0xe5 (0x1487a596b7e5 in /lib64/libc.so.6)
frame #31: /home/nws8519/.conda/envs/olmo/bin/python3.11() [0x5bbf93]
W0904 11:27:18.038000 2769136 /gpfs/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1292] The node 'qgpu2014_2769136_0' has failed to shutdown the rendezvous '3273582' due to an error of type RendezvousConnectionError.
[W904 11:27:18.255885548 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=3, addr=[qgpu2014]:57300, remote=[qgpu2013]:29502): Broken pipe
Exception raised from sendBytes at /pytorch/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x14873d0d85e8 in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x5ba8afe (0x1487811fbafe in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #2: <unknown function> + 0x5baa358 (0x1487811fd358 in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #3: <unknown function> + 0x5babb3e (0x1487811feb3e in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #4: c10d::TCPStore::compareSet(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<unsigned char, std::allocator<unsigned char> > const&, std::vector<unsigned char, std::allocator<unsigned char> > const&) + 0x299 (0x1487811f8569 in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #5: <unknown function> + 0xc2a761 (0x148790587761 in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libtorch_python.so)
frame #6: <unknown function> + 0x38a0cc (0x14878fce70cc in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libtorch_python.so)
frame #7: /home/nws8519/.conda/envs/olmo/bin/python3.11() [0x528b17]
frame #8: _PyObject_MakeTpCall + 0x27c (0x50452c in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #9: /home/nws8519/.conda/envs/olmo/bin/python3.11() [0x557ac9]
frame #10: _PyEval_EvalFrameDefault + 0x47c0 (0x515b90 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #11: _PyFunction_Vectorcall + 0x173 (0x539153 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #12: _PyObject_FastCallDictTstate + 0x65 (0x508e05 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #13: _PyObject_Call_Prepend + 0x66 (0x540ac6 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #14: /home/nws8519/.conda/envs/olmo/bin/python3.11() [0x611dd7]
frame #15: PyObject_Call + 0xbd (0x54303d in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #16: _PyEval_EvalFrameDefault + 0x47c0 (0x515b90 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #17: _PyFunction_Vectorcall + 0x173 (0x539153 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #18: _PyEval_EvalFrameDefault + 0x47c0 (0x515b90 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #19: /home/nws8519/.conda/envs/olmo/bin/python3.11() [0x5cc3aa]
frame #20: PyEval_EvalCode + 0x9f (0x5cba7f in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #21: /home/nws8519/.conda/envs/olmo/bin/python3.11() [0x5ecba7]
frame #22: /home/nws8519/.conda/envs/olmo/bin/python3.11() [0x5e8740]
frame #23: /home/nws8519/.conda/envs/olmo/bin/python3.11() [0x5fd5f2]
frame #24: _PyRun_SimpleFileObject + 0x19f (0x5fc9bf in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #25: _PyRun_AnyFileObject + 0x43 (0x5fc6e3 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #26: Py_RunMain + 0x2ee (0x5f73fe in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #27: Py_BytesMain + 0x39 (0x5bc149 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #28: __libc_start_main + 0xe5 (0x1487a596b7e5 in /lib64/libc.so.6)
frame #29: /home/nws8519/.conda/envs/olmo/bin/python3.11() [0x5bbf93]
W0904 11:27:18.046000 2769136 /gpfs/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1292] The node 'qgpu2014_2769136_0' has failed to shutdown the rendezvous '3273582' due to an error of type RendezvousConnectionError.
Traceback (most recent call last):
File "/home/nws8519/.conda/envs/olmo/bin/torchrun", line 8, in <module>
sys.exit(main())
^^^^^^
File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
return f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/run.py", line 892, in main
run(args)
File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/run.py", line 883, in run
elastic_launch(
File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 139, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 270, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/home/nws8519/git/mw-lifecycle-analysis/p2/quest/python_scripts/olmo_parallel_cat.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2025-09-04_11:27:17
host : qgpu2014
rank : 2 (local_rank: 0)
exitcode : 1 (pid: 2769218)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
srun: error: qgpu2013: task 1: Exited with exit code 1
srun: error: qgpu2014: tasks 2-3: Exited with exit code 1
[W904 11:27:18.383886513 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=3, addr=[qgpu2013]:36246, remote=[qgpu2013]:29502): Broken pipe
Exception raised from sendBytes at /pytorch/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x14977ddbe5e8 in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x5ba8afe (0x1497c1ee1afe in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #2: <unknown function> + 0x5baa358 (0x1497c1ee3358 in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #3: <unknown function> + 0x5babb3e (0x1497c1ee4b3e in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #4: c10d::TCPStore::doWait(c10::ArrayRef<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::chrono::duration<long, std::ratio<1l, 1000l> >) + 0x1a6 (0x1497c1edeac6 in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #5: c10d::TCPStore::doGet(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x33 (0x1497c1edeea3 in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #6: c10d::TCPStore::get(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xab (0x1497c1edff8b in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #7: <unknown function> + 0xc2a390 (0x1497d126d390 in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libtorch_python.so)
frame #8: <unknown function> + 0x38a0cc (0x1497d09cd0cc in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libtorch_python.so)
frame #9: /home/nws8519/.conda/envs/olmo/bin/python3.11() [0x528b17]
frame #10: _PyObject_MakeTpCall + 0x27c (0x50452c in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #11: /home/nws8519/.conda/envs/olmo/bin/python3.11() [0x557ac9]
frame #12: _PyEval_EvalFrameDefault + 0x47c0 (0x515b90 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #13: _PyFunction_Vectorcall + 0x173 (0x539153 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #14: _PyEval_EvalFrameDefault + 0x47c0 (0x515b90 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #15: _PyFunction_Vectorcall + 0x173 (0x539153 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #16: _PyObject_FastCallDictTstate + 0x65 (0x508e05 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #17: _PyObject_Call_Prepend + 0x66 (0x540ac6 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #18: /home/nws8519/.conda/envs/olmo/bin/python3.11() [0x611dd7]
frame #19: PyObject_Call + 0xbd (0x54303d in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #20: _PyEval_EvalFrameDefault + 0x47c0 (0x515b90 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #21: _PyFunction_Vectorcall + 0x173 (0x539153 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #22: _PyEval_EvalFrameDefault + 0x47c0 (0x515b90 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #23: /home/nws8519/.conda/envs/olmo/bin/python3.11() [0x5cc3aa]
frame #24: PyEval_EvalCode + 0x9f (0x5cba7f in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #25: /home/nws8519/.conda/envs/olmo/bin/python3.11() [0x5ecba7]
frame #26: /home/nws8519/.conda/envs/olmo/bin/python3.11() [0x5e8740]
frame #27: /home/nws8519/.conda/envs/olmo/bin/python3.11() [0x5fd5f2]
frame #28: _PyRun_SimpleFileObject + 0x19f (0x5fc9bf in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #29: _PyRun_AnyFileObject + 0x43 (0x5fc6e3 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #30: Py_RunMain + 0x2ee (0x5f73fe in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #31: Py_BytesMain + 0x39 (0x5bc149 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #32: __libc_start_main + 0xe5 (0x1497e66517e5 in /lib64/libc.so.6)
frame #33: /home/nws8519/.conda/envs/olmo/bin/python3.11() [0x5bbf93]
W0904 11:27:18.554000 1736745 /gpfs/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1292] The node 'qgpu2013_1736745_0' has failed to shutdown the rendezvous '3273582' due to an error of type RendezvousConnectionError.
[W904 11:27:18.394906553 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=3, addr=[qgpu2013]:36246, remote=[qgpu2013]:29502): Broken pipe
Exception raised from sendBytes at /pytorch/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x14977ddbe5e8 in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x5ba8afe (0x1497c1ee1afe in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #2: <unknown function> + 0x5baa358 (0x1497c1ee3358 in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #3: <unknown function> + 0x5babb3e (0x1497c1ee4b3e in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #4: c10d::TCPStore::doWait(c10::ArrayRef<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::chrono::duration<long, std::ratio<1l, 1000l> >) + 0x1a6 (0x1497c1edeac6 in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #5: c10d::TCPStore::doGet(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x33 (0x1497c1edeea3 in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #6: c10d::TCPStore::get(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xab (0x1497c1edff8b in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #7: <unknown function> + 0xc2a390 (0x1497d126d390 in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libtorch_python.so)
frame #8: <unknown function> + 0x38a0cc (0x1497d09cd0cc in /home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/lib/libtorch_python.so)
frame #9: /home/nws8519/.conda/envs/olmo/bin/python3.11() [0x528b17]
frame #10: _PyObject_MakeTpCall + 0x27c (0x50452c in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #11: /home/nws8519/.conda/envs/olmo/bin/python3.11() [0x557ac9]
frame #12: _PyEval_EvalFrameDefault + 0x47c0 (0x515b90 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #13: _PyFunction_Vectorcall + 0x173 (0x539153 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #14: _PyObject_FastCallDictTstate + 0x65 (0x508e05 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #15: _PyObject_Call_Prepend + 0x66 (0x540ac6 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #16: /home/nws8519/.conda/envs/olmo/bin/python3.11() [0x611dd7]
frame #17: PyObject_Call + 0xbd (0x54303d in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #18: _PyEval_EvalFrameDefault + 0x47c0 (0x515b90 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #19: _PyFunction_Vectorcall + 0x173 (0x539153 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #20: _PyEval_EvalFrameDefault + 0x47c0 (0x515b90 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #21: /home/nws8519/.conda/envs/olmo/bin/python3.11() [0x5cc3aa]
frame #22: PyEval_EvalCode + 0x9f (0x5cba7f in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #23: /home/nws8519/.conda/envs/olmo/bin/python3.11() [0x5ecba7]
frame #24: /home/nws8519/.conda/envs/olmo/bin/python3.11() [0x5e8740]
frame #25: /home/nws8519/.conda/envs/olmo/bin/python3.11() [0x5fd5f2]
frame #26: _PyRun_SimpleFileObject + 0x19f (0x5fc9bf in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #27: _PyRun_AnyFileObject + 0x43 (0x5fc6e3 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #28: Py_RunMain + 0x2ee (0x5f73fe in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #29: Py_BytesMain + 0x39 (0x5bc149 in /home/nws8519/.conda/envs/olmo/bin/python3.11)
frame #30: __libc_start_main + 0xe5 (0x1497e66517e5 in /lib64/libc.so.6)
frame #31: /home/nws8519/.conda/envs/olmo/bin/python3.11() [0x5bbf93]
W0904 11:27:18.565000 1736745 /gpfs/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1292] The node 'qgpu2013_1736745_0' has failed to shutdown the rendezvous '3273582' due to an error of type RendezvousConnectionError.
Traceback (most recent call last):
File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 117, in _call_store
return getattr(self._store, store_op)(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch.distributed.DistNetworkError: failed to recv, got 0 bytes
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/nws8519/.conda/envs/olmo/bin/torchrun", line 8, in <module>
sys.exit(main())
^^^^^^
File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
return f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/run.py", line 892, in main
run(args)
File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/run.py", line 883, in run
|
||||||
|
elastic_launch(
|
||||||
|
File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 139, in __call__
|
||||||
|
return launch_agent(self._config, self._entrypoint, list(args))
|
||||||
|
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||||
|
File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 261, in launch_agent
|
||||||
|
result = agent.run()
|
||||||
|
^^^^^^^^^^^
|
||||||
|
File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/elastic/metrics/api.py", line 138, in wrapper
|
||||||
|
result = f(*args, **kwargs)
|
||||||
|
^^^^^^^^^^^^^^^^^^
|
||||||
|
File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/elastic/agent/server/api.py", line 711, in run
|
||||||
|
result = self._invoke_run(role)
|
||||||
|
^^^^^^^^^^^^^^^^^^^^^^
|
||||||
|
File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/elastic/agent/server/api.py", line 864, in _invoke_run
|
||||||
|
self._initialize_workers(self._worker_group)
|
||||||
|
File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/elastic/metrics/api.py", line 138, in wrapper
|
||||||
|
result = f(*args, **kwargs)
|
||||||
|
^^^^^^^^^^^^^^^^^^
|
||||||
|
File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/elastic/agent/server/api.py", line 683, in _initialize_workers
|
||||||
|
self._rendezvous(worker_group)
|
||||||
|
File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/elastic/metrics/api.py", line 138, in wrapper
|
||||||
|
result = f(*args, **kwargs)
|
||||||
|
^^^^^^^^^^^^^^^^^^
|
||||||
|
File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/elastic/agent/server/api.py", line 500, in _rendezvous
|
||||||
|
rdzv_info = spec.rdzv_handler.next_rendezvous()
|
||||||
|
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||||
|
File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 1170, in next_rendezvous
|
||||||
|
self._op_executor.run(join_op, deadline, self._get_deadline)
|
||||||
|
File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 648, in run
|
||||||
|
has_set = self._state_holder.sync()
|
||||||
|
^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||||
|
File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 437, in sync
|
||||||
|
get_response = self._backend.get_state()
|
||||||
|
^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||||
|
File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 75, in get_state
|
||||||
|
base64_state: bytes = self._call_store("get", self._key)
|
||||||
|
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||||
|
File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 119, in _call_store
|
||||||
|
raise RendezvousConnectionError(
|
||||||
|
torch.distributed.elastic.rendezvous.api.RendezvousConnectionError: The connection to the C10d store has failed. See inner exception for details.
|
||||||
|
srun: error: qgpu2013: task 0: Exited with exit code 1
|
||||||
|
unsupervised olmo categorization pau at Thu Sep 4 11:27:18 CDT 2025
|
||||||
|
@@ -17,7 +17,9 @@ def format_df_data(df):
 
 if __name__ == "__main__":
     biber_vec_df = pd.read_csv("/home/nws8519/git/mw-lifecycle-analysis/p2/quest/072525_pp_biberplus_labels.csv", low_memory=False)
-    biber_vec_df = biber_vec_df[biber_vec_df['comment_type'] == 'task_subcomment']
+    biber_vec_df = biber_vec_df[biber_vec_df['comment_type'] == 'task_description']
+    biber_vec_df = biber_vec_df[biber_vec_df['AuthorPHID'] != "PHID-USER-idceizaw6elwiwm5xshb"]
+    #biber_vec_df = biber_vec_df[biber_vec_df['comment_text'] != 'nan']
     biber_vecs = format_df_data(biber_vec_df)
     #handoff to PCA model
     '''
@@ -37,17 +39,33 @@ if __name__ == "__main__":
     component_variances = np.var(biber_vecs_pca, axis=0)
     print("Variance of each PCA component:", component_variances)
 
 
     #first looking at comment_type
     le = LabelEncoder()
     colors = le.fit_transform(biber_vec_df[selected_axis])
 
-    plot_df = pd.DataFrame({
-        "PC1": biber_vecs_pca[:, 0],
-        "PC2": biber_vecs_pca[:, 1],
-        selected_axis: biber_vec_df[selected_axis].astype(str),
-        "source":biber_vec_df['source'].astype(str),
-        "phase":biber_vec_df['phase'].astype(str)
-    })
+    pc_dict = {f"PC{i+1}": biber_vecs_pca[:, i] for i in range(18)}
+    pc_dict[selected_axis] = biber_vec_df[selected_axis].astype(str)
+    pc_dict["source"] = biber_vec_df['source'].astype(str)
+    pc_dict["phase"] = biber_vec_df['phase'].astype(str)
+    pc_dict["text"] = biber_vec_df['comment_text'].astype(str)
+    pc_dict['id'] = biber_vec_df['id']
+    pc_dict['week_index'] = biber_vec_df['week_index']
+    pc_dict['priority'] = biber_vec_df['priority']
+    pc_dict['closed_relevance'] = biber_vec_df['closed_relevance']
+
+    plot_df = pd.DataFrame(pc_dict)
+    plot_df.to_csv("090425_description_PCA_df.csv", index=False)
+
+    print("Top 10 PC1 values:")
+    print(plot_df.nlargest(10, "PC1"))
+    print("\nBottom 10 PC1 values:")
+    print(plot_df.nsmallest(10, "PC1"))
+
+    print("Top 10 PC2 values:")
+    print(plot_df.nlargest(10, "PC2"))
+    print("\nBottom 10 PC2 values:")
+    print(plot_df.nsmallest(10, "PC2"))
+
 
     g = sns.FacetGrid(plot_df, col="source", row="phase", hue=selected_axis, palette="tab10", height=4, sharex=False, sharey=False)
@@ -74,5 +92,5 @@ if __name__ == "__main__":
     plt.legend(title=selected_axis, bbox_to_anchor=(1.05, 1), loc=2)
     '''
     g.fig.tight_layout()
-    g.savefig(f"subcomment_{selected_axis}_090425_biber_pca.png", dpi=300)
+    g.savefig(f"description_{selected_axis}_090425_biber_pca_final.png", dpi=300)
     plt.show()
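The change above builds the PCA export frame from a dict with a hard-coded `range(18)` and then prints the extreme rows per component. The following is only a minimal sketch of the same idea on synthetic data, not the script's actual code: the helper name `build_pc_frame`, the toy metadata, and the random scores are all assumptions made for illustration. It derives the number of `PC` columns from the score array's shape and reuses the metadata index, so filtered (non-contiguous) row labels carry over unchanged.

```python
import numpy as np
import pandas as pd


def build_pc_frame(scores, meta, meta_cols):
    """Combine PCA scores with metadata columns, one row per comment.

    Hypothetical helper: `scores` is an (n_rows, n_components) array and
    `meta` is the already-filtered DataFrame the scores were computed from.
    """
    if len(scores) != len(meta):
        raise ValueError("scores and metadata must have the same number of rows")
    # One column per component, taken from the array shape rather than range(18).
    pc_cols = [f"PC{i + 1}" for i in range(scores.shape[1])]
    # Reuse the metadata index so rows stay aligned after earlier filtering.
    frame = pd.DataFrame(scores, columns=pc_cols, index=meta.index)
    for col in meta_cols:
        frame[col] = meta[col]
    return frame


# Synthetic stand-ins for biber_vecs_pca and biber_vec_df (assumed values).
rng = np.random.default_rng(0)
meta = pd.DataFrame(
    {"id": [10, 11, 12, 13, 14],
     "priority": ["Low", "High", "Medium", "Low", "High"]},
    index=[3, 7, 8, 20, 21],  # non-contiguous, as after boolean filtering
)
scores = rng.normal(size=(5, 3))

pc_frame = build_pc_frame(scores, meta, ["id", "priority"])
top = pc_frame.nlargest(2, "PC1")
bottom = pc_frame.nsmallest(2, "PC1")
print(top)
print(bottom)
```

Deriving the column count from `scores.shape[1]` means the export keeps working if the number of retained components changes; whether that matters here depends on how the PCA model is configured elsewhere in the script.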
Before Width: | Height: | Size: 1.5 MiB |
Before Width: | Height: | Size: 2.4 MiB |
Before Width: | Height: | Size: 2.4 MiB |
BIN
p2/quest/subcomment_AuthorWMFAffil_090425_biber_pca_final.png
Normal file
After Width: | Height: | Size: 2.1 MiB |