1
0

updating PCA to account for sentence count and median length

This commit is contained in:
mgaughan 2025-10-14 23:15:14 -05:00
parent cb2fe737cd
commit f60f3ef120
12 changed files with 146695 additions and 40 deletions

View File

@ -1,15 +1,16 @@
setting up the environment by loading in conda environment at Sat Oct 11 00:24:37 CDT 2025
running the batched olmo categorization job at Sat Oct 11 00:24:37 CDT 2025
setting up the environment by loading in conda environment at Sat Oct 11 07:52:03 CDT 2025
running the batched olmo categorization job at Sat Oct 11 07:52:03 CDT 2025
[nltk_data] Downloading package punkt_tab to
[nltk_data] /home/nws8519/nltk_data...
[nltk_data] Package punkt_tab is already up-to-date!
cuda
NVIDIA A100-SXM4-80GB
_CudaDeviceProperties(name='NVIDIA A100-SXM4-80GB', major=8, minor=0, total_memory=81153MB, multi_processor_count=108, uuid=393ab5c3-2bcb-e4c6-52ad-eb4896a9d4fe, L2_cache_size=40MB)
Loading checkpoint shards: 0%| | 0/12 [00:00<?, ?it/s] Loading checkpoint shards: 8%|▊ | 1/12 [00:00<00:03, 2.77it/s] Loading checkpoint shards: 17%|█▋ | 2/12 [00:00<00:04, 2.16it/s] Loading checkpoint shards: 25%|██▌ | 3/12 [00:01<00:04, 1.93it/s] Loading checkpoint shards: 33%|███▎ | 4/12 [00:02<00:04, 1.79it/s] Loading checkpoint shards: 42%|████▏ | 5/12 [00:02<00:03, 1.77it/s] Loading checkpoint shards: 50%|█████ | 6/12 [00:03<00:03, 1.80it/s] Loading checkpoint shards: 58%|█████▊ | 7/12 [00:03<00:02, 1.80it/s] Loading checkpoint shards: 67%|██████▋ | 8/12 [00:04<00:02, 1.76it/s] Loading checkpoint shards: 75%|███████▌ | 9/12 [00:04<00:01, 1.77it/s] Loading checkpoint shards: 83%|████████▎ | 10/12 [00:05<00:01, 1.82it/s] Loading checkpoint shards: 92%|█████████▏| 11/12 [00:05<00:00, 1.92it/s] Loading checkpoint shards: 100%|██████████| 12/12 [00:05<00:00, 2.02it/s]
_CudaDeviceProperties(name='NVIDIA A100-SXM4-80GB', major=8, minor=0, total_memory=81153MB, multi_processor_count=108, uuid=ee0bd2a7-af54-5f2e-c2d3-fcd3f57270c9, L2_cache_size=40MB)
Loading checkpoint shards: 0%| | 0/12 [00:00<?, ?it/s] Loading checkpoint shards: 8%|▊ | 1/12 [00:00<00:04, 2.72it/s] Loading checkpoint shards: 17%|█▋ | 2/12 [00:00<00:04, 2.27it/s] Loading checkpoint shards: 25%|██▌ | 3/12 [00:01<00:04, 2.12it/s] Loading checkpoint shards: 33%|███▎ | 4/12 [00:01<00:03, 2.12it/s] Loading checkpoint shards: 42%|████▏ | 5/12 [00:02<00:03, 1.96it/s] Loading checkpoint shards: 50%|█████ | 6/12 [00:02<00:03, 1.98it/s] Loading checkpoint shards: 58%|█████▊ | 7/12 [00:03<00:02, 1.87it/s] Loading checkpoint shards: 67%|██████▋ | 8/12 [00:03<00:02, 1.94it/s] Loading checkpoint shards: 75%|███████▌ | 9/12 [00:04<00:01, 1.91it/s] Loading checkpoint shards: 83%|████████▎ | 10/12 [00:05<00:01, 1.87it/s] Loading checkpoint shards: 92%|█████████▏| 11/12 [00:05<00:00, 2.05it/s] Loading checkpoint shards: 100%|██████████| 12/12 [00:05<00:00, 2.18it/s]
Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
This is a friendly reminder - the current text generation call will exceed the model's predefined maximum length (4096). Depending on the model, you may observe exceptions, performance degradation, or nothing at all.
Traceback (most recent call last):
File "/home/nws8519/git/mw-lifecycle-analysis/p2/quest/python_scripts/090425_batched_olmo_cat.py", line 62, in <module>
with open("/home/nws8519/git/mw-lifecycle-analysis/analysis_data/100325_unified_phab.csv", mode='r', newline='') as file:
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory: '/home/nws8519/git/mw-lifecycle-analysis/analysis_data/100325_unified_phab.csv'
unsupervised batched olmo categorization pau at Sat Oct 11 00:27:22 CDT 2025
File "/home/nws8519/git/mw-lifecycle-analysis/p2/quest/python_scripts/090425_batched_olmo_cat.py", line 66, in <module>
for row in reader:
_csv.Error: field larger than field limit (131072)
unsupervised batched olmo categorization pau at Sun Oct 12 14:11:43 CDT 2025

View File

@ -0,0 +1,11 @@
setting up the environment by loading in conda environment at Mon Oct 13 09:25:24 CDT 2025
running the batched olmo categorization job at Mon Oct 13 09:25:24 CDT 2025
[nltk_data] Downloading package punkt_tab to
[nltk_data] /home/nws8519/nltk_data...
[nltk_data] Package punkt_tab is already up-to-date!
cuda
NVIDIA A100-SXM4-80GB
_CudaDeviceProperties(name='NVIDIA A100-SXM4-80GB', major=8, minor=0, total_memory=81153MB, multi_processor_count=108, uuid=19efa4d6-01cd-d825-4cd9-637cc23cebd3, L2_cache_size=40MB)
Loading checkpoint shards: 0%| | 0/12 [00:00<?, ?it/s] Loading checkpoint shards: 8%|▊ | 1/12 [00:00<00:03, 3.50it/s] Loading checkpoint shards: 17%|█▋ | 2/12 [00:00<00:03, 2.64it/s] Loading checkpoint shards: 25%|██▌ | 3/12 [00:01<00:03, 2.39it/s] Loading checkpoint shards: 33%|███▎ | 4/12 [00:01<00:03, 2.19it/s] Loading checkpoint shards: 42%|████▏ | 5/12 [00:02<00:03, 2.19it/s] Loading checkpoint shards: 50%|█████ | 6/12 [00:02<00:02, 2.08it/s] Loading checkpoint shards: 58%|█████▊ | 7/12 [00:03<00:02, 2.04it/s] Loading checkpoint shards: 67%|██████▋ | 8/12 [00:03<00:01, 2.01it/s] Loading checkpoint shards: 75%|███████▌ | 9/12 [00:04<00:01, 2.03it/s] Loading checkpoint shards: 83%|████████▎ | 10/12 [00:04<00:00, 2.07it/s] Loading checkpoint shards: 92%|█████████▏| 11/12 [00:05<00:00, 2.19it/s] Loading checkpoint shards: 100%|██████████| 12/12 [00:05<00:00, 2.36it/s]
Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
This is a friendly reminder - the current text generation call will exceed the model's predefined maximum length (4096). Depending on the model, you may observe exceptions, performance degradation, or nothing at all.

File diff suppressed because one or more lines are too long

View File

@ -0,0 +1,239 @@
starting the job at: Tue Oct 14 15:08:48 CDT 2025
setting up the environment
running the neurobiber labeling script
[[13. ]
[14. ]
[11. ]
...
[10. ]
[14. ]
[12.5]]
Number of PCs explaining 90% variance: 15
Variance of each PCA component: [138.60156907 44.29951603 25.63179594 21.39857213 14.99271754
10.88014877 8.72969328 8.11497994 6.78712318 5.50912497
5.25006184 4.96444801 4.62359041 3.68257699 3.28506433]
PC1:
median_sentence_length: 0.994
normalized_CAP: -0.069
normalized_NNP: -0.050
normalized_NOMZ: -0.029
normalized_NUM: 0.026
normalized_DET: 0.024
normalized_ART: 0.020
normalized_PREP: 0.019
normalized_PIN: 0.019
normalized_RB: 0.016
PC2:
normalized_CAP: 0.555
normalized_NNP: 0.554
normalized_DET: -0.298
normalized_ART: -0.232
normalized_PREP: -0.220
normalized_PIN: -0.220
sentence_count: -0.189
normalized_RB: -0.125
normalized_PRP: -0.110
normalized_SBJP: -0.110
PC3:
normalized_NN: 0.509
normalized_PREP: 0.491
normalized_PIN: 0.491
normalized_CAP: 0.304
normalized_NNP: 0.279
normalized_DET: 0.143
sentence_count: -0.115
normalized_ART: 0.109
normalized_NOMZ: -0.098
normalized_INF: 0.095
PC4:
normalized_NN: 0.683
sentence_count: -0.412
normalized_NNP: -0.295
normalized_PIN: -0.217
normalized_PREP: -0.217
normalized_CAP: -0.174
normalized_PRP: -0.173
normalized_SBJP: -0.173
normalized_RB: -0.142
normalized_JJ: 0.117
PC5:
sentence_count: 0.718
normalized_NN: 0.358
normalized_DET: 0.228
normalized_PIN: -0.223
normalized_PREP: -0.223
normalized_ART: 0.221
normalized_NOMZ: -0.190
normalized_CAP: 0.186
normalized_INF: -0.137
normalized_JJ: -0.123
PC6:
normalized_DET: 0.538
normalized_ART: 0.483
sentence_count: -0.398
normalized_PREP: -0.216
normalized_PIN: -0.216
normalized_CAP: 0.206
normalized_VPRT: 0.204
normalized_INDA: 0.186
normalized_NN: -0.142
normalized_X: -0.132
PC7:
normalized_RB: 0.442
normalized_CAP: 0.343
normalized_PRP: 0.313
normalized_SBJP: 0.313
normalized_NNP: -0.278
normalized_VPRT: 0.234
normalized_ART: -0.232
normalized_NN: 0.229
normalized_DET: -0.208
normalized_NOMZ: -0.164
PC8:
normalized_JJ: 0.504
normalized_CAP: 0.502
normalized_NNP: -0.468
normalized_NOMZ: 0.296
sentence_count: 0.150
normalized_X: -0.146
normalized_QUOT: -0.145
normalized_NN: -0.142
normalized_VPRT: -0.131
normalized_RB: -0.128
PC9:
normalized_JJ: 0.637
normalized_VPRT: 0.357
normalized_NNP: 0.337
normalized_CAP: -0.265
normalized_INF: -0.258
normalized_QUOT: -0.224
normalized_RB: 0.204
normalized_X: -0.145
normalized_AUXB: 0.143
sentence_count: 0.135
PC10:
normalized_INF: 0.691
normalized_QUOT: -0.415
normalized_VPRT: -0.263
normalized_RB: 0.222
normalized_TO: 0.184
normalized_PRP: -0.155
normalized_SBJP: -0.155
normalized_CONT: -0.138
normalized_NNP: 0.122
normalized_PIN: -0.120
PC11:
normalized_QUOT: 0.714
normalized_JJ: 0.402
normalized_CONT: 0.295
normalized_INF: 0.294
normalized_NOMZ: -0.246
normalized_PRP: -0.126
normalized_SBJP: -0.126
normalized_X: 0.107
normalized_NUM: -0.076
normalized_CAP: 0.067
PC12:
normalized_RB: 0.521
normalized_PRP: -0.426
normalized_SBJP: -0.426
normalized_JJ: -0.236
normalized_INF: -0.229
normalized_FPP1: -0.211
normalized_NNP: -0.184
normalized_CONJ: 0.129
normalized_XX0: 0.125
normalized_TO: -0.124
PC13:
normalized_X: 0.808
normalized_NOMZ: -0.391
normalized_QUOT: -0.249
normalized_JJ: 0.163
sentence_count: -0.146
normalized_NNP: -0.126
normalized_CAP: 0.107
normalized_CONT: -0.097
normalized_VPRT: 0.096
normalized_RB: -0.071
PC14:
normalized_VPRT: 0.514
normalized_AUXB: 0.496
normalized_RB: -0.346
normalized_PASS: 0.221
normalized_INF: 0.218
normalized_NOMZ: 0.215
normalized_BEMA: 0.161
normalized_JJ: -0.161
normalized_VBD: -0.137
normalized_NUM: -0.136
PC15:
normalized_NOMZ: 0.554
normalized_NUM: -0.544
normalized_X: 0.438
normalized_RB: 0.239
sentence_count: 0.146
normalized_NNP: 0.135
normalized_AUXB: -0.116
normalized_NN: 0.105
normalized_CONT: 0.104
normalized_INF: -0.101
Top 10 PC1 values:
PC1 PC2 ... AuthorPHID date_created
16080 525.102703 48.280630 ... PHID-USER-zjzhrhmn36icnzbckqy4 1350678600
18859 77.466344 3.706703 ... PHID-USER-ll6tmaogat2b5q7tnqas 1405358040
20378 69.473292 -5.921977 ... PHID-USER-ynivjflmc2dcl6w5ut5v 1407551580
8874 67.305410 6.587019 ... PHID-USER-ydswvwhh5pm4lshahjje 1371667800
6468 52.113083 12.698065 ... PHID-USER-azy72hrp3tpetr52aob6 1378208100
18692 43.220624 8.230008 ... PHID-USER-arjqb24x4oae7awzpfp6 1411431840
5607 42.720768 1.581160 ... PHID-USER-ynivjflmc2dcl6w5ut5v 1360124400
19479 41.065047 8.286151 ... PHID-USER-ynivjflmc2dcl6w5ut5v 1406854860
13751 38.405351 6.445956 ... PHID-USER-v7vgzvvcw7v2umf737ri 1380947640
6503 37.060191 -2.635433 ... PHID-USER-qgqq35kbi5wss2tlgmhg 1377865740
[10 rows x 25 columns]
Bottom 10 PC1 values:
PC1 PC2 ... AuthorPHID date_created
19173 -14.819594 38.839843 ... PHID-USER-doeppszazlm3r7xah4il 1416964345
23533 -14.098760 31.956092 ... PHID-USER-sai77mtxmpqnm6pycyvz 1424498718
24553 -14.098553 31.953701 ... PHID-USER-sai77mtxmpqnm6pycyvz 1424498559
23532 -14.098346 31.951309 ... PHID-USER-sai77mtxmpqnm6pycyvz 1424498772
129 -13.767257 2.442547 ... PHID-USER-hyfm4swq76s4j642w46x 1375120080
22245 -12.327433 30.418183 ... PHID-USER-v7vgzvvcw7v2umf737ri 1438377936
752 -12.170613 17.171274 ... PHID-USER-sx63fwaih5kjt7bz4u6z 1380590700
2120 -11.607147 -10.509373 ... PHID-USER-xfe43w2lb5gpvglf4coa 1367008080
22153 -11.098587 7.351805 ... PHID-USER-a6p24cvyblhfzc7we7nc 1438982860
24847 -10.908633 15.377024 ... PHID-USER-srhlj2447vmpmrfhqnfa 1417632210
[10 rows x 25 columns]
Top 10 PC2 values:
PC1 PC2 ... AuthorPHID date_created
16080 525.102703 48.280630 ... PHID-USER-zjzhrhmn36icnzbckqy4 1350678600
19173 -14.819594 38.839843 ... PHID-USER-doeppszazlm3r7xah4il 1416964345
23127 -1.787399 32.727692 ... PHID-USER-myidf5vlkwvrgp2iwn76 1433839792
23533 -14.098760 31.956092 ... PHID-USER-sai77mtxmpqnm6pycyvz 1424498718
24553 -14.098553 31.953701 ... PHID-USER-sai77mtxmpqnm6pycyvz 1424498559
23532 -14.098346 31.951309 ... PHID-USER-sai77mtxmpqnm6pycyvz 1424498772
18500 13.647382 30.709395 ... PHID-USER-hbffue25ov3attlvclze 1387662960
22245 -12.327433 30.418183 ... PHID-USER-v7vgzvvcw7v2umf737ri 1438377936
22023 -7.400000 29.037196 ... PHID-USER-a6p24cvyblhfzc7we7nc 1440568477
14809 -2.186555 28.072103 ... PHID-USER-zjzhrhmn36icnzbckqy4 1379900100
[10 rows x 25 columns]
Bottom 10 PC2 values:
PC1 PC2 ... AuthorPHID date_created
23485 -1.065773 -15.903250 ... PHID-USER-u7udgblfyop6qd5wxot6 1425991276
22060 4.042133 -15.132236 ... PHID-USER-2nnm76h4ykalvvref2ye 1440412099
5792 -2.696513 -15.036399 ... PHID-USER-grpjkpfolt5gz4ljlbfg 1355334540
1436 -0.480107 -15.016999 ... PHID-USER-tyjmn7xcw6s2b6rqagj7 1373878680
22799 -5.139569 -14.977697 ... PHID-USER-fjve3gq5wsmaaccti7pb 1430752987
22845 -0.877723 -14.762675 ... PHID-USER-2nnm76h4ykalvvref2ye 1440085454
7451 9.897529 -14.392291 ... PHID-USER-ysftv67jxeaxdwcakvwo 1374347580
9423 10.013728 -14.381035 ... PHID-USER-zzvqlvm6i6kml4tfnqvq 1369411380
1228 -2.448487 -13.906291 ... PHID-USER-ysftv67jxeaxdwcakvwo 1374765240
2775 3.664323 -13.485623 ... PHID-USER-dw53c5cb2qfhyemej57o 1377068880
[10 rows x 25 columns]
job finished, cleaning up
job pau at: Tue Oct 14 15:09:18 CDT 2025

Binary file not shown.

File diff suppressed because one or more lines are too long

View File

@ -0,0 +1,352 @@
starting the job at: Tue Oct 14 15:54:24 CDT 2025
setting up the environment
running the neurobiber labeling script
1 [Change 86685 merged by jenkins-bot:\nFollow-u...
2 [*** Bug 54785 has been marked as a duplicate ...
3 [Change 86685 had a related patch set uploaded...
5 [**Wikifram** wrote:\n\nAllright, thanks to bo...
6 [(In reply to comment #4)\nQUOTE\n\nVE product...
...
25022 [Er... drag and drop from what?, Is there no n...
25023 [Could you attach a screenshot please?, Drag &...
25025 [Sorry for not reply-ing., I did a test and co...
25026 [SCREEN_NAME: Please answer.]
25027 [I cannot replicate this., What's the name of ...
Name: olmo_cleaned_sentences, Length: 21901, dtype: object
[[18. ]
[ 6.5]
[23. ]
...
[ 5.5]
[ 3. ]
[ 6. ]]
Number of PCs explaining 90% variance: 24
Variance of each PCA component: [273.55786883 135.16197459 82.94008657 63.12754897 60.39119505
38.84258991 32.35268417 26.32979149 21.57186105 18.691479
16.21404524 13.63887204 13.3960516 11.40372708 10.25820109
9.13513531 8.8549811 8.29863619 7.99933399 7.06165956
6.73377968 6.4742109 5.92152116 5.75533066]
PC1:
normalized_CAP: 0.670
normalized_NNP: 0.604
median_sentence_length: -0.283
normalized_DET: -0.142
normalized_PREP: -0.122
normalized_PIN: -0.122
normalized_ART: -0.089
normalized_VPRT: -0.082
normalized_RB: -0.077
normalized_PRP: -0.071
PC2:
median_sentence_length: 0.929
normalized_NNP: 0.319
normalized_RB: -0.074
normalized_VPRT: -0.070
normalized_DET: -0.066
normalized_AUXB: -0.055
normalized_PRP: -0.045
normalized_SBJP: -0.045
normalized_X: 0.038
normalized_CAP: 0.035
PC3:
normalized_NN: 0.750
normalized_NNP: -0.291
normalized_RB: -0.266
normalized_PRP: -0.232
normalized_SBJP: -0.232
normalized_CAP: 0.211
normalized_VPRT: -0.169
normalized_FPP1: -0.117
normalized_NUM: 0.106
normalized_INF: -0.097
PC4:
normalized_CAP: 0.577
normalized_PREP: 0.426
normalized_PIN: 0.426
normalized_NNP: -0.281
normalized_PRP: 0.187
normalized_SBJP: 0.187
median_sentence_length: 0.159
normalized_X: -0.148
normalized_RB: 0.141
normalized_INF: 0.128
PC5:
normalized_PIN: 0.507
normalized_PREP: 0.507
normalized_NNP: 0.435
normalized_CAP: -0.349
normalized_RB: -0.256
median_sentence_length: -0.147
normalized_CONJ: 0.125
normalized_SBJP: -0.120
normalized_PRP: -0.120
normalized_VPRT: -0.100
PC6:
normalized_DET: 0.618
normalized_ART: 0.383
normalized_X: -0.278
normalized_NN: 0.273
normalized_VPRT: 0.261
normalized_NNP: 0.246
normalized_AUXB: 0.215
normalized_NUM: -0.191
normalized_INF: -0.163
normalized_INDA: 0.156
PC7:
normalized_NN: 0.477
normalized_PRP: 0.459
normalized_SBJP: 0.459
normalized_NNP: 0.247
normalized_FPP1: 0.236
normalized_DET: -0.196
normalized_AUXB: -0.171
normalized_CAP: -0.163
normalized_PASS: -0.138
normalized_PIT: 0.126
PC8:
normalized_RB: 0.781
normalized_NN: 0.265
normalized_DET: -0.188
normalized_PRP: -0.187
normalized_SBJP: -0.186
normalized_JJ: -0.169
normalized_NNP: 0.154
normalized_X: -0.153
normalized_TIME: 0.139
normalized_ART: -0.136
PC9:
normalized_JJ: 0.672
normalized_INF: 0.353
normalized_VPRT: -0.324
normalized_PASS: -0.219
normalized_AUXB: -0.218
normalized_NUM: -0.214
normalized_ART: 0.214
normalized_CONJ: -0.147
normalized_RB: 0.132
normalized_PEAS: -0.117
PC10:
normalized_INF: 0.652
normalized_JJ: -0.543
normalized_VPRT: -0.298
normalized_DET: 0.248
normalized_TO: 0.131
normalized_ART: 0.128
normalized_PRIV: 0.108
normalized_NUM: 0.086
normalized_RB: -0.077
normalized_POMD: 0.072
PC11:
normalized_INF: 0.420
normalized_VPRT: 0.383
normalized_AUXB: 0.379
normalized_ART: -0.261
normalized_JJ: 0.251
normalized_RB: -0.249
normalized_VBD: -0.247
normalized_X: -0.223
normalized_DET: -0.212
normalized_PASS: 0.174
PC12:
sentence_count: 0.651
normalized_X: -0.619
normalized_VPRT: -0.180
normalized_PUBV: 0.169
normalized_RB: -0.115
normalized_CONJ: 0.114
normalized_INF: -0.104
normalized_CCONJ: 0.100
normalized_QUOT: 0.099
normalized_DET: -0.091
PC13:
sentence_count: 0.637
normalized_X: 0.496
normalized_VBD: -0.299
normalized_NUM: -0.287
normalized_PUBV: -0.223
normalized_JJ: -0.198
normalized_VPRT: 0.186
normalized_CONJ: -0.099
normalized_QUOT: 0.067
normalized_PASS: -0.061
PC14:
normalized_NUM: 0.714
normalized_VBD: -0.354
normalized_VPRT: 0.233
normalized_AUXB: -0.186
normalized_PASS: -0.171
normalized_ART: 0.153
normalized_UH: -0.150
normalized_RB: 0.141
normalized_INDA: 0.138
normalized_PUBV: -0.134
PC15:
normalized_QUOT: 0.422
normalized_VBD: -0.380
normalized_AUXB: -0.331
sentence_count: -0.322
normalized_CONT: 0.315
normalized_UH: 0.255
normalized_NUM: -0.221
normalized_PASS: -0.221
normalized_X: -0.206
normalized_VPRT: 0.154
PC16:
normalized_PUBV: 0.481
normalized_CONJ: -0.394
normalized_UH: -0.360
normalized_VBD: 0.317
normalized_QUOT: 0.267
normalized_VPRT: 0.248
normalized_CONT: 0.201
normalized_NUM: 0.151
normalized_PASS: -0.137
normalized_TO: 0.128
PC17:
normalized_QUOT: 0.520
normalized_CONT: 0.417
normalized_PUBV: -0.301
normalized_PGAS: -0.290
normalized_UH: -0.260
normalized_CONJ: 0.234
normalized_VBD: 0.200
normalized_NOMZ: -0.194
normalized_PASS: 0.193
normalized_AUXB: 0.175
PC18:
normalized_CONJ: 0.631
normalized_PUBV: 0.523
normalized_NUM: -0.253
normalized_PGAS: -0.211
normalized_VPRT: 0.168
normalized_X: 0.160
normalized_ART: 0.155
normalized_DEMP: -0.126
normalized_UH: -0.118
normalized_TIME: -0.106
PC19:
normalized_UH: 0.659
normalized_PGAS: -0.517
normalized_VBD: 0.237
normalized_CCONJ: -0.196
normalized_CONJ: -0.175
normalized_NOMZ: -0.153
normalized_VPRT: 0.149
normalized_ART: 0.109
normalized_INDA: 0.101
normalized_RB: 0.099
PC20:
normalized_ART: 0.461
normalized_DET: -0.342
normalized_DEMO: -0.294
normalized_INDA: 0.293
normalized_DEMP: -0.288
normalized_AUXB: 0.230
normalized_PIT: 0.222
normalized_FPP1: -0.215
normalized_PGAS: 0.208
normalized_CCONJ: 0.185
PC21:
normalized_PGAS: 0.594
normalized_CCONJ: -0.353
normalized_UH: 0.330
normalized_CONJ: 0.272
normalized_AUXB: 0.250
normalized_PRIV: 0.241
normalized_BEMA: 0.153
normalized_TIME: -0.141
normalized_PROD: -0.130
normalized_NUM: 0.125
PC22:
normalized_PRIV: 0.445
normalized_QUES: -0.422
normalized_CCONJ: 0.395
normalized_VPRT: 0.242
normalized_FPP1: 0.221
normalized_AUXB: -0.207
normalized_VBD: 0.200
normalized_BEMA: -0.178
normalized_PIT: -0.151
normalized_SPP2: -0.148
PC23:
normalized_NOMZ: 0.504
normalized_PRIV: 0.457
normalized_CCONJ: -0.327
normalized_PUBV: -0.283
normalized_NUM: -0.184
normalized_VBD: 0.180
normalized_SCONJ: 0.170
normalized_UH: -0.168
normalized_DEMP: -0.161
normalized_PGAS: -0.161
PC24:
normalized_CCONJ: 0.506
normalized_QUES: 0.414
normalized_CONJ: 0.251
normalized_PASS: -0.238
normalized_BEMA: 0.207
normalized_WH: 0.207
normalized_VBD: 0.186
normalized_DEMO: -0.180
normalized_PEAS: -0.164
normalized_SCONJ: 0.161
Top 10 PC1 values:
PC1 PC2 ... AuthorPHID date_created
23531 123.243897 22.112164 ... PHID-USER-arjqb24x4oae7awzpfp6 1424754141
707 123.226678 22.102265 ... PHID-USER-pun3sjvg3cemjzbgyo2t 1363132183
744 123.226678 22.102265 ... PHID-USER-fovtl67ew4l4cc3oeypc 1353551242
749 123.226678 22.102265 ... PHID-USER-fovtl67ew4l4cc3oeypc 1353384355
2243 123.226678 22.102265 ... PHID-USER-fovtl67ew4l4cc3oeypc 1356175107
5921 123.226678 22.102265 ... PHID-USER-fovtl67ew4l4cc3oeypc 1353366778
5933 123.226678 22.102265 ... PHID-USER-fovtl67ew4l4cc3oeypc 1353123761
5935 123.226678 22.102265 ... PHID-USER-fovtl67ew4l4cc3oeypc 1353386649
10080 123.226678 22.102265 ... PHID-USER-fovtl67ew4l4cc3oeypc 1366298361
10418 123.226678 22.102265 ... PHID-USER-fovtl67ew4l4cc3oeypc 1355363288
[10 rows x 34 columns]
Bottom 10 PC1 values:
PC1 PC2 ... AuthorPHID date_created
24812 -131.318535 438.637876 ... PHID-USER-fo56wm4wxiwpoofn2xdu 1463441072
24813 -131.130989 438.728132 ... PHID-USER-fo56wm4wxiwpoofn2xdu 1463441050
13983 -88.027511 274.892016 ... PHID-USER-v7vgzvvcw7v2umf737ri 1380947348
16510 -82.500013 294.909402 ... PHID-USER-izojihzr4ja3jsgzn5wv 1354470131
161 -68.446710 197.206426 ... PHID-USER-hyfm4swq76s4j642w46x 1374730027
24815 -60.440128 175.352637 ... PHID-USER-fo56wm4wxiwpoofn2xdu 1463439992
6163 -59.523505 195.679514 ... PHID-USER-4bjsher5mqcoikeqnnec 1379611711
22005 -59.492044 211.972278 ... PHID-USER-maceogqtxg4qfaefx7wd 1440633395
24010 -53.793798 153.114760 ... PHID-USER-lhtlnmkdbzlz6pbxaqdd 1428469742
24009 -53.614161 153.284397 ... PHID-USER-lhtlnmkdbzlz6pbxaqdd 1428538077
[10 rows x 34 columns]
Top 10 PC2 values:
PC1 PC2 ... AuthorPHID date_created
24813 -131.130989 438.728132 ... PHID-USER-fo56wm4wxiwpoofn2xdu 1463441050
24812 -131.318535 438.637876 ... PHID-USER-fo56wm4wxiwpoofn2xdu 1463441072
16510 -82.500013 294.909402 ... PHID-USER-izojihzr4ja3jsgzn5wv 1354470131
13983 -88.027511 274.892016 ... PHID-USER-v7vgzvvcw7v2umf737ri 1380947348
22005 -59.492044 211.972278 ... PHID-USER-maceogqtxg4qfaefx7wd 1440633395
161 -68.446710 197.206426 ... PHID-USER-hyfm4swq76s4j642w46x 1374730027
6163 -59.523505 195.679514 ... PHID-USER-4bjsher5mqcoikeqnnec 1379611711
20858 -52.549327 192.146265 ... PHID-USER-22bsa5u75jz3ci3wnplu 1441031208
24815 -60.440128 175.352637 ... PHID-USER-fo56wm4wxiwpoofn2xdu 1463439992
18294 -43.267655 159.973982 ... PHID-USER-vk6mlmacfhx77egryy5i 1394419981
[10 rows x 34 columns]
Bottom 10 PC2 values:
PC1 PC2 ... AuthorPHID date_created
17259 -12.413915 -20.310670 ... PHID-USER-6vzzsmi22zem6yttr6vp 1321220595
22246 2.436022 -19.030642 ... PHID-USER-2nnm76h4ykalvvref2ye 1461480989
24780 -8.420485 -18.295879 ... PHID-USER-lsveyqlsb4acoowxr5yj 1420344576
7427 12.144652 -18.033451 ... PHID-USER-wz5bw3q6zykhqbbeohzq 1375791780
7055 -1.553566 -17.924389 ... PHID-USER-cfsvvgbtlqnbt2yokfjf 1377020909
23122 9.656987 -17.642747 ... PHID-USER-2nnm76h4ykalvvref2ye 1467721812
16776 6.551795 -17.537527 ... PHID-USER-6vzzsmi22zem6yttr6vp 1317838205
7471 -0.812161 -17.516875 ... PHID-USER-wkpnidxoctuhawexig5p 1386166246
13670 3.270330 -17.516754 ... PHID-USER-5dwuaigmkz2vzg65lape 1401902866
20682 3.694061 -17.391146 ... PHID-USER-uciss2jl2e4ifxqqk7wk 1440083315
[10 rows x 34 columns]
job finished, cleaning up
job pau at: Tue Oct 14 15:54:56 CDT 2025

Binary file not shown.

View File

@ -2,7 +2,8 @@ from transformers import AutoModelForCausalLM, AutoTokenizer, OlmoForCausalLM
import torch
import csv
import pandas as pd
import re
import re
import sys
import nltk
nltk.download('punkt_tab')
@ -18,7 +19,7 @@ print(torch.cuda.get_device_properties(0))
olmo = AutoModelForCausalLM.from_pretrained("allenai/OLMo-2-1124-13B", cache_dir=cache_directory).to(device)
tokenizer = AutoTokenizer.from_pretrained("allenai/OLMo-2-1124-13B", padding_side='left')
priming = "For the **GIVEN SENTENCE**, please categorize it into one of the defined [[CATEGORIES]]. Each [[CATEGORY]] is described in the TYPOLOGY for reference.Your task is to match the **GIVEN SENTENCE** to the **[[CATEGORY]]** that most accurately describes the content of the comment. Only provide the sentence category as your output. Do not provide any text beyond the category name."
priming = "You will be provided with a sentence from a software engineering task discussions. For the **GIVEN SENTENCE**, please categorize it into one of the defined [[CATEGORIES]]. Each [[CATEGORY]] is described in the TYPOLOGY for reference.Your task is to match the **GIVEN SENTENCE** to the **[[CATEGORY]]** that most accurately describes the content of the comment. Only provide the sentence category as your output. Do not provide any text beyond the category name."
typology = """
TYPOLOGY:
@ -59,6 +60,7 @@ TYPOLOGY:
"""
instructions="The sentence's category is: "
csv.field_size_limit(sys.maxsize)
with open("/home/nws8519/git/mw-lifecycle-analysis/analysis_data/100325_unified_phab.csv", mode='r', newline='') as file:
reader = csv.reader(file)
array_of_categorizations = []
@ -107,7 +109,7 @@ with open("/home/nws8519/git/mw-lifecycle-analysis/analysis_data/100325_unified_
batch = comment_sentences[i:i+batch_size]
prompts = []
for sent in batch:
given_data = f"**GIVEN SENTENCE: \n ' Type -text_dict['task_title'] \n Text -{sent}**'\n"
given_data = f"**GIVEN SENTENCE: \n ' Task Title -text_dict['task_title'] \n Text -{sent}**'\n"
prompt = f"{priming}\n{typology}\n\n{given_data}\n{instructions}"
prompts.append(prompt)
inputs = tokenizer(prompts, return_tensors='pt', return_token_type_ids=False, padding=True, truncation=True).to(device)
@ -127,7 +129,7 @@ with open("/home/nws8519/git/mw-lifecycle-analysis/analysis_data/100325_unified_
array_of_categorizations.append(text_dict)
df = pd.DataFrame(array_of_categorizations)
#print(df.head())
df.to_csv('all_101025_olmo_batched_categorized.csv', index=False)
df.to_csv('all_101325_olmo_batched_categorized.csv', index=False)

View File

@ -7,6 +7,7 @@ import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import pickle
import ast
# List of the 96 features that Neurobiber can predict
BIBER_FEATURES = [
@ -25,25 +26,72 @@ BIBER_FEATURES = [
"BIN_DET","BIN_EMOJ","BIN_EMOT","BIN_EXCL","BIN_HASH","BIN_INF",
"BIN_UH","BIN_NUM","BIN_LAUGH","BIN_PRP","BIN_PREP","BIN_NNP",
"BIN_QUES","BIN_QUOT","BIN_AT","BIN_SBJP","BIN_URL","BIN_WH",
"BIN_INDA","BIN_ACCU","BIN_PGAS","BIN_CMADJ","BIN_SPADJ","BIN_X"
"BIN_INDA","BIN_ACCU","BIN_PGAS","BIN_CMADJ","BIN_SPADJ","BIN_X",
"sentence_count", "median_sentence_length"
]
selected_cols = [
"normalized_QUAN","normalized_QUPR","normalized_AMP","normalized_PASS","normalized_XX0","normalized_JJ",
"normalized_BEMA","normalized_CAUS","normalized_CONC","normalized_COND","normalized_CONJ","normalized_CONT",
"normalized_DPAR","normalized_DWNT","normalized_EX","normalized_FPP1","normalized_GER","normalized_RB",
"normalized_PIN","normalized_INPR","normalized_TO","normalized_NEMD","normalized_OSUB","normalized_PASTP",
"normalized_VBD","normalized_PHC","normalized_PIRE","normalized_PLACE","normalized_POMD","normalized_PRMD",
"normalized_WZPRES","normalized_VPRT","normalized_PRIV","normalized_PIT","normalized_PUBV","normalized_SPP2",
"normalized_SMP","normalized_SERE","normalized_STPR","normalized_SUAV","normalized_SYNE","normalized_TPP3",
"normalized_TIME","normalized_NOMZ","normalized_BYPA","normalized_PRED","normalized_TOBJ","normalized_TSUB",
"normalized_THVC","normalized_NN","normalized_DEMP","normalized_DEMO","normalized_WHQU","normalized_EMPH",
"normalized_HDG","normalized_WZPAST","normalized_THAC","normalized_PEAS","normalized_ANDC","normalized_PRESP",
"normalized_PROD","normalized_SPAU","normalized_SPIN","normalized_THATD","normalized_WHOBJ","normalized_WHSUB",
"normalized_WHCL","normalized_ART","normalized_AUXB","normalized_CAP","normalized_SCONJ","normalized_CCONJ",
"normalized_DET","normalized_EMOJ","normalized_EMOT","normalized_EXCL","normalized_HASH","normalized_INF",
"normalized_UH","normalized_NUM","normalized_LAUGH","normalized_PRP","normalized_PREP","normalized_NNP",
"normalized_QUES","normalized_QUOT","normalized_AT","normalized_SBJP","normalized_URL","normalized_WH",
"normalized_INDA","normalized_ACCU","normalized_PGAS","normalized_CMADJ","normalized_SPADJ","normalized_X",
"normalized_AWL", "normalized_TTR","sentence_count", "median_sentence_length"
]
def safe_parse(x):
# If NaN or float, treat as empty list
if isinstance(x, float) and np.isnan(x):
return []
if isinstance(x, str):
try:
return ast.literal_eval(x)
except Exception:
return []
if isinstance(x, list):
return x
return []
def format_df_data(df):
#this accounts for the somewhat idiosyncratic way that I saved my data
normalized_cols = [col for col in df.columns if col.startswith('normalized_')]
#selected_features = [col for col in df.columns if col in selected_cols]
x = df[normalized_cols].astype(float).values
#101325_additions to account for length
df['olmo_cleaned_sentences'] = df['olmo_cleaned_sentences'].apply(safe_parse)
print(df['olmo_cleaned_sentences'])
sentence_count = df['olmo_cleaned_sentences'].apply(len).values.reshape(-1, 1)
median_sentence_length = df['olmo_cleaned_sentences'].apply(
lambda sents: np.median([len(sent.split()) for sent in sents]) if len(sents) > 0 else 0
).values.reshape(-1, 1)
print(median_sentence_length)
x = np.hstack([x, sentence_count, median_sentence_length])
#x = np.vstack(df['features'].values)
return x
if __name__ == "__main__":
biber_vec_df = pd.read_csv("/home/nws8519/git/mw-lifecycle-analysis/analysis_data/092925_unified_phab.csv", low_memory=False)
biber_vec_df = pd.read_csv("/home/nws8519/git/mw-lifecycle-analysis/analysis_data/100325_unified_phab.csv", low_memory=False)
biber_vec_df = biber_vec_df[biber_vec_df['comment_type'] != 'task_description']
#biber_vec_df = biber_vec_df[biber_vec_df['AuthorPHID'] != "PHID-USER-idceizaw6elwiwm5xshb"]
#biber_vec_df = biber_vec_df[biber_vec_df['comment_text'] != 'nan']
biber_vecs = format_df_data(biber_vec_df)
#handoff to PCA model
pca_trial = PCA()
biber_vecs_pca_trial = pca_trial.fit_transform(biber_vecs)
@ -55,9 +103,9 @@ if __name__ == "__main__":
pca = PCA(n_components=argmax_components)
biber_vecs_pca = pca.fit_transform(biber_vecs)
with open('100125_subcomment_pca.pkl', 'wb') as f:
with open('101325_subcomment_pca.pkl', 'wb') as f:
pickle.dump(pca, f)
selected_axis = "closed_relevance"
selected_axis = "AuthorWMFAffil"
component_variances = np.var(biber_vecs_pca, axis=0)
print("Variance of each PCA component:", component_variances)
@ -66,28 +114,28 @@ if __name__ == "__main__":
print(f"PC{i+1}:")
indices = np.argsort(np.abs(component))[::-1]
for idx in indices[:10]: # Top 10
print(f" {BIBER_FEATURES[idx]}: {component[idx]:.3f}")
print(f"{selected_cols[idx]}: {component[idx]:.3f}")
#first looking at comment_type
le = LabelEncoder()
colors = le.fit_transform(biber_vec_df[selected_axis])
#le = LabelEncoder()
#colors = le.fit_transform(biber_vec_df[selected_axis])
pc_dict = {f"PC{i+1}": biber_vecs_pca[:, i] for i in range(18)}
pc_dict[selected_axis] = biber_vec_df[selected_axis].astype(str)
pc_dict = {f"PC{i+1}": biber_vecs_pca[:, i] for i in range(argmax_components)}
#pc_dict[selected_axis] = biber_vec_df[selected_axis].astype(str)
pc_dict["source"] = biber_vec_df['source'].astype(str)
pc_dict["phase"] = biber_vec_df['phase'].astype(str)
pc_dict["text"] = biber_vec_df['comment_text'].astype(str)
pc_dict['id'] = biber_vec_df['id']
pc_dict['week_index'] = biber_vec_df['week_index']
pc_dict['priority'] = biber_vec_df['priority']
pc_dict['closed_relevance'] = biber_vec_df['closed_relevance']
pc_dict['resolution_outcome'] = biber_vec_df['resolution_outcome']
pc_dict['TaskPHID'] = biber_vec_df['TaskPHID']
pc_dict['AuthorPHID'] = biber_vec_df['AuthorPHID']
pc_dict['date_created'] = biber_vec_df['date_created']
plot_df = pd.DataFrame(pc_dict)
plot_df.to_csv("100125_subcomment_PCA_df.csv", index=False)
plot_df.to_csv("101325_subcomment_PCA_df.csv", index=False)
print("Top 10 PC1 values:")
print(plot_df.nlargest(10, "PC1"))
@ -109,20 +157,7 @@ if __name__ == "__main__":
#plt.savefig("090225_biber_pca_plot.png", dpi=300)
'''
plot_df = pd.DataFrame({
"PC1": biber_vecs_pca[:, 0],
"PC2": biber_vecs_pca[:, 1],
selected_axis: biber_vec_df[selected_axis].astype(str)
})
plt.figure(figsize=(8,6))
sns.scatterplot(
data=plot_df, x="PC1", y="PC2", hue="source",
palette="tab10", s=40, alpha=0.7, edgecolor=None
)
plt.xlabel('component 1')
plt.ylabel('component 2')
plt.legend(title=selected_axis, bbox_to_anchor=(1.05, 1), loc=2)
'''
#g.fig.tight_layout()
#g.savefig(f"subcomment_{selected_axis}_100125_biber_pca_final.png", dpi=300)
#plt.show()
'''

View File

@ -9,7 +9,7 @@
#SBATCH --mem=64G
#SBATCH --cpus-per-task=4
#SBATCH --job-name=batched-MW-info-typology
#SBATCH --output=101025-batched-mw-olmo-info-cat.log
#SBATCH --output=101325-batched-mw-olmo-info-cat.log
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --mail-user=gaughan@u.northwestern.edu

View File

@ -8,7 +8,7 @@
#SBATCH --mem=64G
#SBATCH --cpus-per-task=4
#SBATCH --job-name=neurobiber-pca
#SBATCH --output=100125_subcomment_neurobiber-pca.log
#SBATCH --output=101325_subcomment_neurobiber-pca.log
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --mail-user=gaughan@u.northwestern.edu