updating PCA to account for sentence count and median length

2025-10-14 23:15:14 -05:00 · f60f3ef120
commit f60f3ef120
parent cb2fe737cd
12 changed files with 146695 additions and 40 deletions
--- a/p2/quest/101025-batched-mw-olmo-info-cat.log
+++ b/p2/quest/101025-batched-mw-olmo-info-cat.log
@@ -1,15 +1,16 @@
-setting up the environment by loading in conda environment at Sat Oct 11 00:24:37 CDT 2025
-running the batched olmo categorization job at Sat Oct 11 00:24:37 CDT 2025
+setting up the environment by loading in conda environment at Sat Oct 11 07:52:03 CDT 2025
+running the batched olmo categorization job at Sat Oct 11 07:52:03 CDT 2025
 [nltk_data] Downloading package punkt_tab to
 [nltk_data]     /home/nws8519/nltk_data...
 [nltk_data]   Package punkt_tab is already up-to-date!
 cuda
 NVIDIA A100-SXM4-80GB
-_CudaDeviceProperties(name='NVIDIA A100-SXM4-80GB', major=8, minor=0, total_memory=81153MB, multi_processor_count=108, uuid=393ab5c3-2bcb-e4c6-52ad-eb4896a9d4fe, L2_cache_size=40MB)
-
Loading checkpoint shards:   0%|          | 0/12 [00:00<?, ?it/s]
Loading checkpoint shards:   8%|▊         | 1/12 [00:00<00:03,  2.77it/s]
Loading checkpoint shards:  17%|█▋        | 2/12 [00:00<00:04,  2.16it/s]
Loading checkpoint shards:  25%|██▌       | 3/12 [00:01<00:04,  1.93it/s]
Loading checkpoint shards:  33%|███▎      | 4/12 [00:02<00:04,  1.79it/s]
Loading checkpoint shards:  42%|████▏     | 5/12 [00:02<00:03,  1.77it/s]
Loading checkpoint shards:  50%|█████     | 6/12 [00:03<00:03,  1.80it/s]
Loading checkpoint shards:  58%|█████▊    | 7/12 [00:03<00:02,  1.80it/s]
Loading checkpoint shards:  67%|██████▋   | 8/12 [00:04<00:02,  1.76it/s]
Loading checkpoint shards:  75%|███████▌  | 9/12 [00:04<00:01,  1.77it/s]
Loading checkpoint shards:  83%|████████▎ | 10/12 [00:05<00:01,  1.82it/s]
Loading checkpoint shards:  92%|█████████▏| 11/12 [00:05<00:00,  1.92it/s]
Loading checkpoint shards: 100%|██████████| 12/12 [00:05<00:00,  2.02it/s]
+_CudaDeviceProperties(name='NVIDIA A100-SXM4-80GB', major=8, minor=0, total_memory=81153MB, multi_processor_count=108, uuid=ee0bd2a7-af54-5f2e-c2d3-fcd3f57270c9, L2_cache_size=40MB)
+
Loading checkpoint shards:   0%|          | 0/12 [00:00<?, ?it/s]
Loading checkpoint shards:   8%|▊         | 1/12 [00:00<00:04,  2.72it/s]
Loading checkpoint shards:  17%|█▋        | 2/12 [00:00<00:04,  2.27it/s]
Loading checkpoint shards:  25%|██▌       | 3/12 [00:01<00:04,  2.12it/s]
Loading checkpoint shards:  33%|███▎      | 4/12 [00:01<00:03,  2.12it/s]
Loading checkpoint shards:  42%|████▏     | 5/12 [00:02<00:03,  1.96it/s]
Loading checkpoint shards:  50%|█████     | 6/12 [00:02<00:03,  1.98it/s]
Loading checkpoint shards:  58%|█████▊    | 7/12 [00:03<00:02,  1.87it/s]
Loading checkpoint shards:  67%|██████▋   | 8/12 [00:03<00:02,  1.94it/s]
Loading checkpoint shards:  75%|███████▌  | 9/12 [00:04<00:01,  1.91it/s]
Loading checkpoint shards:  83%|████████▎ | 10/12 [00:05<00:01,  1.87it/s]
Loading checkpoint shards:  92%|█████████▏| 11/12 [00:05<00:00,  2.05it/s]
Loading checkpoint shards: 100%|██████████| 12/12 [00:05<00:00,  2.18it/s]
+Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
+This is a friendly reminder - the current text generation call will exceed the model's predefined maximum length (4096). Depending on the model, you may observe exceptions, performance degradation, or nothing at all.
 Traceback (most recent call last):
-  File "/home/nws8519/git/mw-lifecycle-analysis/p2/quest/python_scripts/090425_batched_olmo_cat.py", line 62, in <module>
-    with open("/home/nws8519/git/mw-lifecycle-analysis/analysis_data/100325_unified_phab.csv", mode='r', newline='') as file:
-         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-FileNotFoundError: [Errno 2] No such file or directory: '/home/nws8519/git/mw-lifecycle-analysis/analysis_data/100325_unified_phab.csv'
-unsupervised batched olmo categorization pau at Sat Oct 11 00:27:22 CDT 2025
+  File "/home/nws8519/git/mw-lifecycle-analysis/p2/quest/python_scripts/090425_batched_olmo_cat.py", line 66, in <module>
+    for row in reader:
+_csv.Error: field larger than field limit (131072)
+unsupervised batched olmo categorization pau at Sun Oct 12 14:11:43 CDT 2025
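The `_csv.Error` in this log comes from the `csv` module's default per-field limit of 131072 characters. A minimal sketch of raising that limit before reading is shown here. The helper name `raise_csv_field_limit` and the halving fallback are illustrative assumptions, not part of the committed script; the fallback only matters on platforms where `sys.maxsize` does not fit in a C long and `csv.field_size_limit` raises `OverflowError`.

```python
import csv
import io
import sys


def raise_csv_field_limit() -> int:
    """Raise the csv per-field size limit as high as the platform allows."""
    limit = sys.maxsize
    while True:
        try:
            csv.field_size_limit(limit)
            return limit
        except OverflowError:
            # The limit is stored in a C long, which can be narrower
            # than sys.maxsize on some platforms; back off and retry.
            limit //= 2


raise_csv_field_limit()

# Hypothetical in-memory CSV with one field longer than the 131072 default.
big_field = "x" * 200_000
rows = list(csv.reader(io.StringIO(f"id,text\n1,{big_field}\n")))
print(len(rows[1][1]))
```

The limit is a module-wide setting, so it has to be raised before the first `csv.reader` iteration that might meet an oversized field.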
--- a/p2/quest/101325-batched-mw-olmo-info-cat.log
+++ b/p2/quest/101325-batched-mw-olmo-info-cat.log
@@ -0,0 +1,11 @@
+setting up the environment by loading in conda environment at Mon Oct 13 09:25:24 CDT 2025
+running the batched olmo categorization job at Mon Oct 13 09:25:24 CDT 2025
+[nltk_data] Downloading package punkt_tab to
+[nltk_data]     /home/nws8519/nltk_data...
+[nltk_data]   Package punkt_tab is already up-to-date!
+cuda
+NVIDIA A100-SXM4-80GB
+_CudaDeviceProperties(name='NVIDIA A100-SXM4-80GB', major=8, minor=0, total_memory=81153MB, multi_processor_count=108, uuid=19efa4d6-01cd-d825-4cd9-637cc23cebd3, L2_cache_size=40MB)
+
Loading checkpoint shards:   0%|          | 0/12 [00:00<?, ?it/s]
Loading checkpoint shards:   8%|▊         | 1/12 [00:00<00:03,  3.50it/s]
Loading checkpoint shards:  17%|█▋        | 2/12 [00:00<00:03,  2.64it/s]
Loading checkpoint shards:  25%|██▌       | 3/12 [00:01<00:03,  2.39it/s]
Loading checkpoint shards:  33%|███▎      | 4/12 [00:01<00:03,  2.19it/s]
Loading checkpoint shards:  42%|████▏     | 5/12 [00:02<00:03,  2.19it/s]
Loading checkpoint shards:  50%|█████     | 6/12 [00:02<00:02,  2.08it/s]
Loading checkpoint shards:  58%|█████▊    | 7/12 [00:03<00:02,  2.04it/s]
Loading checkpoint shards:  67%|██████▋   | 8/12 [00:03<00:01,  2.01it/s]
Loading checkpoint shards:  75%|███████▌  | 9/12 [00:04<00:01,  2.03it/s]
Loading checkpoint shards:  83%|████████▎ | 10/12 [00:04<00:00,  2.07it/s]
Loading checkpoint shards:  92%|█████████▏| 11/12 [00:05<00:00,  2.19it/s]
Loading checkpoint shards: 100%|██████████| 12/12 [00:05<00:00,  2.36it/s]
+Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
+This is a friendly reminder - the current text generation call will exceed the model's predefined maximum length (4096). Depending on the model, you may observe exceptions, performance degradation, or nothing at all.
--- a/p2/quest/101325_description_PCA_df.csv
+++ b/p2/quest/101325_description_PCA_df.csv
--- a/p2/quest/101325_description_neurobiber-pca.log
+++ b/p2/quest/101325_description_neurobiber-pca.log
@@ -0,0 +1,239 @@
+starting the job at: Tue Oct 14 15:08:48 CDT 2025
+setting up the environment
+running the neurobiber labeling script
+[[13. ]
+ [14. ]
+ [11. ]
+ ...
+ [10. ]
+ [14. ]
+ [12.5]]
+Number of PCs explaining 90% variance: 15
+Variance of each PCA component: [138.60156907  44.29951603  25.63179594  21.39857213  14.99271754
+  10.88014877   8.72969328   8.11497994   6.78712318   5.50912497
+   5.25006184   4.96444801   4.62359041   3.68257699   3.28506433]
+PC1:
+median_sentence_length: 0.994
+normalized_CAP: -0.069
+normalized_NNP: -0.050
+normalized_NOMZ: -0.029
+normalized_NUM: 0.026
+normalized_DET: 0.024
+normalized_ART: 0.020
+normalized_PREP: 0.019
+normalized_PIN: 0.019
+normalized_RB: 0.016
+PC2:
+normalized_CAP: 0.555
+normalized_NNP: 0.554
+normalized_DET: -0.298
+normalized_ART: -0.232
+normalized_PREP: -0.220
+normalized_PIN: -0.220
+sentence_count: -0.189
+normalized_RB: -0.125
+normalized_PRP: -0.110
+normalized_SBJP: -0.110
+PC3:
+normalized_NN: 0.509
+normalized_PREP: 0.491
+normalized_PIN: 0.491
+normalized_CAP: 0.304
+normalized_NNP: 0.279
+normalized_DET: 0.143
+sentence_count: -0.115
+normalized_ART: 0.109
+normalized_NOMZ: -0.098
+normalized_INF: 0.095
+PC4:
+normalized_NN: 0.683
+sentence_count: -0.412
+normalized_NNP: -0.295
+normalized_PIN: -0.217
+normalized_PREP: -0.217
+normalized_CAP: -0.174
+normalized_PRP: -0.173
+normalized_SBJP: -0.173
+normalized_RB: -0.142
+normalized_JJ: 0.117
+PC5:
+sentence_count: 0.718
+normalized_NN: 0.358
+normalized_DET: 0.228
+normalized_PIN: -0.223
+normalized_PREP: -0.223
+normalized_ART: 0.221
+normalized_NOMZ: -0.190
+normalized_CAP: 0.186
+normalized_INF: -0.137
+normalized_JJ: -0.123
+PC6:
+normalized_DET: 0.538
+normalized_ART: 0.483
+sentence_count: -0.398
+normalized_PREP: -0.216
+normalized_PIN: -0.216
+normalized_CAP: 0.206
+normalized_VPRT: 0.204
+normalized_INDA: 0.186
+normalized_NN: -0.142
+normalized_X: -0.132
+PC7:
+normalized_RB: 0.442
+normalized_CAP: 0.343
+normalized_PRP: 0.313
+normalized_SBJP: 0.313
+normalized_NNP: -0.278
+normalized_VPRT: 0.234
+normalized_ART: -0.232
+normalized_NN: 0.229
+normalized_DET: -0.208
+normalized_NOMZ: -0.164
+PC8:
+normalized_JJ: 0.504
+normalized_CAP: 0.502
+normalized_NNP: -0.468
+normalized_NOMZ: 0.296
+sentence_count: 0.150
+normalized_X: -0.146
+normalized_QUOT: -0.145
+normalized_NN: -0.142
+normalized_VPRT: -0.131
+normalized_RB: -0.128
+PC9:
+normalized_JJ: 0.637
+normalized_VPRT: 0.357
+normalized_NNP: 0.337
+normalized_CAP: -0.265
+normalized_INF: -0.258
+normalized_QUOT: -0.224
+normalized_RB: 0.204
+normalized_X: -0.145
+normalized_AUXB: 0.143
+sentence_count: 0.135
+PC10:
+normalized_INF: 0.691
+normalized_QUOT: -0.415
+normalized_VPRT: -0.263
+normalized_RB: 0.222
+normalized_TO: 0.184
+normalized_PRP: -0.155
+normalized_SBJP: -0.155
+normalized_CONT: -0.138
+normalized_NNP: 0.122
+normalized_PIN: -0.120
+PC11:
+normalized_QUOT: 0.714
+normalized_JJ: 0.402
+normalized_CONT: 0.295
+normalized_INF: 0.294
+normalized_NOMZ: -0.246
+normalized_PRP: -0.126
+normalized_SBJP: -0.126
+normalized_X: 0.107
+normalized_NUM: -0.076
+normalized_CAP: 0.067
+PC12:
+normalized_RB: 0.521
+normalized_PRP: -0.426
+normalized_SBJP: -0.426
+normalized_JJ: -0.236
+normalized_INF: -0.229
+normalized_FPP1: -0.211
+normalized_NNP: -0.184
+normalized_CONJ: 0.129
+normalized_XX0: 0.125
+normalized_TO: -0.124
+PC13:
+normalized_X: 0.808
+normalized_NOMZ: -0.391
+normalized_QUOT: -0.249
+normalized_JJ: 0.163
+sentence_count: -0.146
+normalized_NNP: -0.126
+normalized_CAP: 0.107
+normalized_CONT: -0.097
+normalized_VPRT: 0.096
+normalized_RB: -0.071
+PC14:
+normalized_VPRT: 0.514
+normalized_AUXB: 0.496
+normalized_RB: -0.346
+normalized_PASS: 0.221
+normalized_INF: 0.218
+normalized_NOMZ: 0.215
+normalized_BEMA: 0.161
+normalized_JJ: -0.161
+normalized_VBD: -0.137
+normalized_NUM: -0.136
+PC15:
+normalized_NOMZ: 0.554
+normalized_NUM: -0.544
+normalized_X: 0.438
+normalized_RB: 0.239
+sentence_count: 0.146
+normalized_NNP: 0.135
+normalized_AUXB: -0.116
+normalized_NN: 0.105
+normalized_CONT: 0.104
+normalized_INF: -0.101
+Top 10 PC1 values:
+              PC1        PC2  ...                      AuthorPHID  date_created
+16080  525.102703  48.280630  ...  PHID-USER-zjzhrhmn36icnzbckqy4    1350678600
+18859   77.466344   3.706703  ...  PHID-USER-ll6tmaogat2b5q7tnqas    1405358040
+20378   69.473292  -5.921977  ...  PHID-USER-ynivjflmc2dcl6w5ut5v    1407551580
+8874    67.305410   6.587019  ...  PHID-USER-ydswvwhh5pm4lshahjje    1371667800
+6468    52.113083  12.698065  ...  PHID-USER-azy72hrp3tpetr52aob6    1378208100
+18692   43.220624   8.230008  ...  PHID-USER-arjqb24x4oae7awzpfp6    1411431840
+5607    42.720768   1.581160  ...  PHID-USER-ynivjflmc2dcl6w5ut5v    1360124400
+19479   41.065047   8.286151  ...  PHID-USER-ynivjflmc2dcl6w5ut5v    1406854860
+13751   38.405351   6.445956  ...  PHID-USER-v7vgzvvcw7v2umf737ri    1380947640
+6503    37.060191  -2.635433  ...  PHID-USER-qgqq35kbi5wss2tlgmhg    1377865740
+
+[10 rows x 25 columns]
+
+Bottom 10 PC1 values:
+             PC1        PC2  ...                      AuthorPHID  date_created
+19173 -14.819594  38.839843  ...  PHID-USER-doeppszazlm3r7xah4il    1416964345
+23533 -14.098760  31.956092  ...  PHID-USER-sai77mtxmpqnm6pycyvz    1424498718
+24553 -14.098553  31.953701  ...  PHID-USER-sai77mtxmpqnm6pycyvz    1424498559
+23532 -14.098346  31.951309  ...  PHID-USER-sai77mtxmpqnm6pycyvz    1424498772
+129   -13.767257   2.442547  ...  PHID-USER-hyfm4swq76s4j642w46x    1375120080
+22245 -12.327433  30.418183  ...  PHID-USER-v7vgzvvcw7v2umf737ri    1438377936
+752   -12.170613  17.171274  ...  PHID-USER-sx63fwaih5kjt7bz4u6z    1380590700
+2120  -11.607147 -10.509373  ...  PHID-USER-xfe43w2lb5gpvglf4coa    1367008080
+22153 -11.098587   7.351805  ...  PHID-USER-a6p24cvyblhfzc7we7nc    1438982860
+24847 -10.908633  15.377024  ...  PHID-USER-srhlj2447vmpmrfhqnfa    1417632210
+
+[10 rows x 25 columns]
+Top 10 PC2 values:
+              PC1        PC2  ...                      AuthorPHID  date_created
+16080  525.102703  48.280630  ...  PHID-USER-zjzhrhmn36icnzbckqy4    1350678600
+19173  -14.819594  38.839843  ...  PHID-USER-doeppszazlm3r7xah4il    1416964345
+23127   -1.787399  32.727692  ...  PHID-USER-myidf5vlkwvrgp2iwn76    1433839792
+23533  -14.098760  31.956092  ...  PHID-USER-sai77mtxmpqnm6pycyvz    1424498718
+24553  -14.098553  31.953701  ...  PHID-USER-sai77mtxmpqnm6pycyvz    1424498559
+23532  -14.098346  31.951309  ...  PHID-USER-sai77mtxmpqnm6pycyvz    1424498772
+18500   13.647382  30.709395  ...  PHID-USER-hbffue25ov3attlvclze    1387662960
+22245  -12.327433  30.418183  ...  PHID-USER-v7vgzvvcw7v2umf737ri    1438377936
+22023   -7.400000  29.037196  ...  PHID-USER-a6p24cvyblhfzc7we7nc    1440568477
+14809   -2.186555  28.072103  ...  PHID-USER-zjzhrhmn36icnzbckqy4    1379900100
+
+[10 rows x 25 columns]
+
+Bottom 10 PC2 values:
+             PC1        PC2  ...                      AuthorPHID  date_created
+23485  -1.065773 -15.903250  ...  PHID-USER-u7udgblfyop6qd5wxot6    1425991276
+22060   4.042133 -15.132236  ...  PHID-USER-2nnm76h4ykalvvref2ye    1440412099
+5792   -2.696513 -15.036399  ...  PHID-USER-grpjkpfolt5gz4ljlbfg    1355334540
+1436   -0.480107 -15.016999  ...  PHID-USER-tyjmn7xcw6s2b6rqagj7    1373878680
+22799  -5.139569 -14.977697  ...  PHID-USER-fjve3gq5wsmaaccti7pb    1430752987
+22845  -0.877723 -14.762675  ...  PHID-USER-2nnm76h4ykalvvref2ye    1440085454
+7451    9.897529 -14.392291  ...  PHID-USER-ysftv67jxeaxdwcakvwo    1374347580
+9423   10.013728 -14.381035  ...  PHID-USER-zzvqlvm6i6kml4tfnqvq    1369411380
+1228   -2.448487 -13.906291  ...  PHID-USER-ysftv67jxeaxdwcakvwo    1374765240
+2775    3.664323 -13.485623  ...  PHID-USER-dw53c5cb2qfhyemej57o    1377068880
+
+[10 rows x 25 columns]
+job finished, cleaning up
+job pau at: Tue Oct 14 15:09:18 CDT 2025
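In the loadings above, PC1 is almost entirely `median_sentence_length` (0.994). That pattern is what PCA tends to produce when one input column has a much larger numeric spread than the others. Whether to standardize the columns first is a modelling choice for the analysis. The sketch below only illustrates the effect on synthetic data (not the Phabricator data), assuming scikit-learn's `StandardScaler` and `PCA`; the column names are hypothetical.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 500

# Two correlated small-scale columns and one independent large-scale column.
a = rng.normal(0.0, 1.0, n)
b = a + rng.normal(0.0, 0.3, n)
length = rng.normal(20.0, 50.0, n)  # stand-in for a raw length feature
x = np.column_stack([a, b, length])

# PCA on the raw matrix: the large-scale column dominates PC1.
raw_pc1 = PCA(n_components=1).fit(x).components_[0]

# PCA after z-scoring each column: PC1 follows the correlated pair instead.
x_scaled = StandardScaler().fit_transform(x)
scaled_pc1 = PCA(n_components=1).fit(x_scaled).components_[0]

print("raw PC1 loadings:   ", np.round(raw_pc1, 3))
print("scaled PC1 loadings:", np.round(scaled_pc1, 3))
```

If standardization were adopted, the fitted scaler would need to be saved alongside the pickled PCA so new rows can be projected consistently.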
--- a/p2/quest/101325_description_pca.pkl
+++ b/p2/quest/101325_description_pca.pkl
--- a/p2/quest/101325_subcomment_PCA_df.csv
+++ b/p2/quest/101325_subcomment_PCA_df.csv
--- a/p2/quest/101325_subcomment_neurobiber-pca.log
+++ b/p2/quest/101325_subcomment_neurobiber-pca.log
@@ -0,0 +1,352 @@
+starting the job at: Tue Oct 14 15:54:24 CDT 2025
+setting up the environment
+running the neurobiber labeling script
+1        [Change 86685 merged by jenkins-bot:\nFollow-u...
+2        [*** Bug 54785 has been marked as a duplicate ...
+3        [Change 86685 had a related patch set uploaded...
+5        [**Wikifram** wrote:\n\nAllright, thanks to bo...
+6        [(In reply to comment #4)\nQUOTE\n\nVE product...
+                               ...                        
+25022    [Er... drag and drop from what?, Is there no n...
+25023    [Could you attach a screenshot please?, Drag &...
+25025    [Sorry for not reply-ing., I did a test and co...
+25026                        [SCREEN_NAME: Please answer.]
+25027    [I cannot replicate this., What's the name of ...
+Name: olmo_cleaned_sentences, Length: 21901, dtype: object
+[[18. ]
+ [ 6.5]
+ [23. ]
+ ...
+ [ 5.5]
+ [ 3. ]
+ [ 6. ]]
+Number of PCs explaining 90% variance: 24
+Variance of each PCA component: [273.55786883 135.16197459  82.94008657  63.12754897  60.39119505
+  38.84258991  32.35268417  26.32979149  21.57186105  18.691479
+  16.21404524  13.63887204  13.3960516   11.40372708  10.25820109
+   9.13513531   8.8549811    8.29863619   7.99933399   7.06165956
+   6.73377968   6.4742109    5.92152116   5.75533066]
+PC1:
+normalized_CAP: 0.670
+normalized_NNP: 0.604
+median_sentence_length: -0.283
+normalized_DET: -0.142
+normalized_PREP: -0.122
+normalized_PIN: -0.122
+normalized_ART: -0.089
+normalized_VPRT: -0.082
+normalized_RB: -0.077
+normalized_PRP: -0.071
+PC2:
+median_sentence_length: 0.929
+normalized_NNP: 0.319
+normalized_RB: -0.074
+normalized_VPRT: -0.070
+normalized_DET: -0.066
+normalized_AUXB: -0.055
+normalized_PRP: -0.045
+normalized_SBJP: -0.045
+normalized_X: 0.038
+normalized_CAP: 0.035
+PC3:
+normalized_NN: 0.750
+normalized_NNP: -0.291
+normalized_RB: -0.266
+normalized_PRP: -0.232
+normalized_SBJP: -0.232
+normalized_CAP: 0.211
+normalized_VPRT: -0.169
+normalized_FPP1: -0.117
+normalized_NUM: 0.106
+normalized_INF: -0.097
+PC4:
+normalized_CAP: 0.577
+normalized_PREP: 0.426
+normalized_PIN: 0.426
+normalized_NNP: -0.281
+normalized_PRP: 0.187
+normalized_SBJP: 0.187
+median_sentence_length: 0.159
+normalized_X: -0.148
+normalized_RB: 0.141
+normalized_INF: 0.128
+PC5:
+normalized_PIN: 0.507
+normalized_PREP: 0.507
+normalized_NNP: 0.435
+normalized_CAP: -0.349
+normalized_RB: -0.256
+median_sentence_length: -0.147
+normalized_CONJ: 0.125
+normalized_SBJP: -0.120
+normalized_PRP: -0.120
+normalized_VPRT: -0.100
+PC6:
+normalized_DET: 0.618
+normalized_ART: 0.383
+normalized_X: -0.278
+normalized_NN: 0.273
+normalized_VPRT: 0.261
+normalized_NNP: 0.246
+normalized_AUXB: 0.215
+normalized_NUM: -0.191
+normalized_INF: -0.163
+normalized_INDA: 0.156
+PC7:
+normalized_NN: 0.477
+normalized_PRP: 0.459
+normalized_SBJP: 0.459
+normalized_NNP: 0.247
+normalized_FPP1: 0.236
+normalized_DET: -0.196
+normalized_AUXB: -0.171
+normalized_CAP: -0.163
+normalized_PASS: -0.138
+normalized_PIT: 0.126
+PC8:
+normalized_RB: 0.781
+normalized_NN: 0.265
+normalized_DET: -0.188
+normalized_PRP: -0.187
+normalized_SBJP: -0.186
+normalized_JJ: -0.169
+normalized_NNP: 0.154
+normalized_X: -0.153
+normalized_TIME: 0.139
+normalized_ART: -0.136
+PC9:
+normalized_JJ: 0.672
+normalized_INF: 0.353
+normalized_VPRT: -0.324
+normalized_PASS: -0.219
+normalized_AUXB: -0.218
+normalized_NUM: -0.214
+normalized_ART: 0.214
+normalized_CONJ: -0.147
+normalized_RB: 0.132
+normalized_PEAS: -0.117
+PC10:
+normalized_INF: 0.652
+normalized_JJ: -0.543
+normalized_VPRT: -0.298
+normalized_DET: 0.248
+normalized_TO: 0.131
+normalized_ART: 0.128
+normalized_PRIV: 0.108
+normalized_NUM: 0.086
+normalized_RB: -0.077
+normalized_POMD: 0.072
+PC11:
+normalized_INF: 0.420
+normalized_VPRT: 0.383
+normalized_AUXB: 0.379
+normalized_ART: -0.261
+normalized_JJ: 0.251
+normalized_RB: -0.249
+normalized_VBD: -0.247
+normalized_X: -0.223
+normalized_DET: -0.212
+normalized_PASS: 0.174
+PC12:
+sentence_count: 0.651
+normalized_X: -0.619
+normalized_VPRT: -0.180
+normalized_PUBV: 0.169
+normalized_RB: -0.115
+normalized_CONJ: 0.114
+normalized_INF: -0.104
+normalized_CCONJ: 0.100
+normalized_QUOT: 0.099
+normalized_DET: -0.091
+PC13:
+sentence_count: 0.637
+normalized_X: 0.496
+normalized_VBD: -0.299
+normalized_NUM: -0.287
+normalized_PUBV: -0.223
+normalized_JJ: -0.198
+normalized_VPRT: 0.186
+normalized_CONJ: -0.099
+normalized_QUOT: 0.067
+normalized_PASS: -0.061
+PC14:
+normalized_NUM: 0.714
+normalized_VBD: -0.354
+normalized_VPRT: 0.233
+normalized_AUXB: -0.186
+normalized_PASS: -0.171
+normalized_ART: 0.153
+normalized_UH: -0.150
+normalized_RB: 0.141
+normalized_INDA: 0.138
+normalized_PUBV: -0.134
+PC15:
+normalized_QUOT: 0.422
+normalized_VBD: -0.380
+normalized_AUXB: -0.331
+sentence_count: -0.322
+normalized_CONT: 0.315
+normalized_UH: 0.255
+normalized_NUM: -0.221
+normalized_PASS: -0.221
+normalized_X: -0.206
+normalized_VPRT: 0.154
+PC16:
+normalized_PUBV: 0.481
+normalized_CONJ: -0.394
+normalized_UH: -0.360
+normalized_VBD: 0.317
+normalized_QUOT: 0.267
+normalized_VPRT: 0.248
+normalized_CONT: 0.201
+normalized_NUM: 0.151
+normalized_PASS: -0.137
+normalized_TO: 0.128
+PC17:
+normalized_QUOT: 0.520
+normalized_CONT: 0.417
+normalized_PUBV: -0.301
+normalized_PGAS: -0.290
+normalized_UH: -0.260
+normalized_CONJ: 0.234
+normalized_VBD: 0.200
+normalized_NOMZ: -0.194
+normalized_PASS: 0.193
+normalized_AUXB: 0.175
+PC18:
+normalized_CONJ: 0.631
+normalized_PUBV: 0.523
+normalized_NUM: -0.253
+normalized_PGAS: -0.211
+normalized_VPRT: 0.168
+normalized_X: 0.160
+normalized_ART: 0.155
+normalized_DEMP: -0.126
+normalized_UH: -0.118
+normalized_TIME: -0.106
+PC19:
+normalized_UH: 0.659
+normalized_PGAS: -0.517
+normalized_VBD: 0.237
+normalized_CCONJ: -0.196
+normalized_CONJ: -0.175
+normalized_NOMZ: -0.153
+normalized_VPRT: 0.149
+normalized_ART: 0.109
+normalized_INDA: 0.101
+normalized_RB: 0.099
+PC20:
+normalized_ART: 0.461
+normalized_DET: -0.342
+normalized_DEMO: -0.294
+normalized_INDA: 0.293
+normalized_DEMP: -0.288
+normalized_AUXB: 0.230
+normalized_PIT: 0.222
+normalized_FPP1: -0.215
+normalized_PGAS: 0.208
+normalized_CCONJ: 0.185
+PC21:
+normalized_PGAS: 0.594
+normalized_CCONJ: -0.353
+normalized_UH: 0.330
+normalized_CONJ: 0.272
+normalized_AUXB: 0.250
+normalized_PRIV: 0.241
+normalized_BEMA: 0.153
+normalized_TIME: -0.141
+normalized_PROD: -0.130
+normalized_NUM: 0.125
+PC22:
+normalized_PRIV: 0.445
+normalized_QUES: -0.422
+normalized_CCONJ: 0.395
+normalized_VPRT: 0.242
+normalized_FPP1: 0.221
+normalized_AUXB: -0.207
+normalized_VBD: 0.200
+normalized_BEMA: -0.178
+normalized_PIT: -0.151
+normalized_SPP2: -0.148
+PC23:
+normalized_NOMZ: 0.504
+normalized_PRIV: 0.457
+normalized_CCONJ: -0.327
+normalized_PUBV: -0.283
+normalized_NUM: -0.184
+normalized_VBD: 0.180
+normalized_SCONJ: 0.170
+normalized_UH: -0.168
+normalized_DEMP: -0.161
+normalized_PGAS: -0.161
+PC24:
+normalized_CCONJ: 0.506
+normalized_QUES: 0.414
+normalized_CONJ: 0.251
+normalized_PASS: -0.238
+normalized_BEMA: 0.207
+normalized_WH: 0.207
+normalized_VBD: 0.186
+normalized_DEMO: -0.180
+normalized_PEAS: -0.164
+normalized_SCONJ: 0.161
+Top 10 PC1 values:
+              PC1        PC2  ...                      AuthorPHID  date_created
+23531  123.243897  22.112164  ...  PHID-USER-arjqb24x4oae7awzpfp6    1424754141
+707    123.226678  22.102265  ...  PHID-USER-pun3sjvg3cemjzbgyo2t    1363132183
+744    123.226678  22.102265  ...  PHID-USER-fovtl67ew4l4cc3oeypc    1353551242
+749    123.226678  22.102265  ...  PHID-USER-fovtl67ew4l4cc3oeypc    1353384355
+2243   123.226678  22.102265  ...  PHID-USER-fovtl67ew4l4cc3oeypc    1356175107
+5921   123.226678  22.102265  ...  PHID-USER-fovtl67ew4l4cc3oeypc    1353366778
+5933   123.226678  22.102265  ...  PHID-USER-fovtl67ew4l4cc3oeypc    1353123761
+5935   123.226678  22.102265  ...  PHID-USER-fovtl67ew4l4cc3oeypc    1353386649
+10080  123.226678  22.102265  ...  PHID-USER-fovtl67ew4l4cc3oeypc    1366298361
+10418  123.226678  22.102265  ...  PHID-USER-fovtl67ew4l4cc3oeypc    1355363288
+
+[10 rows x 34 columns]
+
+Bottom 10 PC1 values:
+              PC1         PC2  ...                      AuthorPHID  date_created
+24812 -131.318535  438.637876  ...  PHID-USER-fo56wm4wxiwpoofn2xdu    1463441072
+24813 -131.130989  438.728132  ...  PHID-USER-fo56wm4wxiwpoofn2xdu    1463441050
+13983  -88.027511  274.892016  ...  PHID-USER-v7vgzvvcw7v2umf737ri    1380947348
+16510  -82.500013  294.909402  ...  PHID-USER-izojihzr4ja3jsgzn5wv    1354470131
+161    -68.446710  197.206426  ...  PHID-USER-hyfm4swq76s4j642w46x    1374730027
+24815  -60.440128  175.352637  ...  PHID-USER-fo56wm4wxiwpoofn2xdu    1463439992
+6163   -59.523505  195.679514  ...  PHID-USER-4bjsher5mqcoikeqnnec    1379611711
+22005  -59.492044  211.972278  ...  PHID-USER-maceogqtxg4qfaefx7wd    1440633395
+24010  -53.793798  153.114760  ...  PHID-USER-lhtlnmkdbzlz6pbxaqdd    1428469742
+24009  -53.614161  153.284397  ...  PHID-USER-lhtlnmkdbzlz6pbxaqdd    1428538077
+
+[10 rows x 34 columns]
+Top 10 PC2 values:
+              PC1         PC2  ...                      AuthorPHID  date_created
+24813 -131.130989  438.728132  ...  PHID-USER-fo56wm4wxiwpoofn2xdu    1463441050
+24812 -131.318535  438.637876  ...  PHID-USER-fo56wm4wxiwpoofn2xdu    1463441072
+16510  -82.500013  294.909402  ...  PHID-USER-izojihzr4ja3jsgzn5wv    1354470131
+13983  -88.027511  274.892016  ...  PHID-USER-v7vgzvvcw7v2umf737ri    1380947348
+22005  -59.492044  211.972278  ...  PHID-USER-maceogqtxg4qfaefx7wd    1440633395
+161    -68.446710  197.206426  ...  PHID-USER-hyfm4swq76s4j642w46x    1374730027
+6163   -59.523505  195.679514  ...  PHID-USER-4bjsher5mqcoikeqnnec    1379611711
+20858  -52.549327  192.146265  ...  PHID-USER-22bsa5u75jz3ci3wnplu    1441031208
+24815  -60.440128  175.352637  ...  PHID-USER-fo56wm4wxiwpoofn2xdu    1463439992
+18294  -43.267655  159.973982  ...  PHID-USER-vk6mlmacfhx77egryy5i    1394419981
+
+[10 rows x 34 columns]
+
+Bottom 10 PC2 values:
+             PC1        PC2  ...                      AuthorPHID  date_created
+17259 -12.413915 -20.310670  ...  PHID-USER-6vzzsmi22zem6yttr6vp    1321220595
+22246   2.436022 -19.030642  ...  PHID-USER-2nnm76h4ykalvvref2ye    1461480989
+24780  -8.420485 -18.295879  ...  PHID-USER-lsveyqlsb4acoowxr5yj    1420344576
+7427   12.144652 -18.033451  ...  PHID-USER-wz5bw3q6zykhqbbeohzq    1375791780
+7055   -1.553566 -17.924389  ...  PHID-USER-cfsvvgbtlqnbt2yokfjf    1377020909
+23122   9.656987 -17.642747  ...  PHID-USER-2nnm76h4ykalvvref2ye    1467721812
+16776   6.551795 -17.537527  ...  PHID-USER-6vzzsmi22zem6yttr6vp    1317838205
+7471   -0.812161 -17.516875  ...  PHID-USER-wkpnidxoctuhawexig5p    1386166246
+13670   3.270330 -17.516754  ...  PHID-USER-5dwuaigmkz2vzg65lape    1401902866
+20682   3.694061 -17.391146  ...  PHID-USER-uciss2jl2e4ifxqqk7wk    1440083315
+
+[10 rows x 34 columns]
+job finished, cleaning up
+job pau at: Tue Oct 14 15:54:56 CDT 2025
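Both logs report a "Number of PCs explaining 90% variance" figure. The script's own selection code is not shown in this diff, so the following is only a sketch of one common way to get such a count: fit a full PCA, take the cumulative explained-variance ratio, and find the first index that reaches the threshold. The data is synthetic and the variable names are assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic stand-in for the feature matrix: 6 columns with decreasing spread.
x = rng.normal(size=(300, 6)) * np.array([10.0, 5.0, 2.0, 1.0, 0.5, 0.1])

# Fit with all components, then count how many reach 90% cumulative variance.
trial = PCA().fit(x)
cumulative = np.cumsum(trial.explained_variance_ratio_)
n_components = int(np.argmax(cumulative >= 0.90)) + 1

# scikit-learn can also choose the count itself when given a fraction.
auto = PCA(n_components=0.90, svd_solver="full").fit(x)

print(n_components, auto.n_components_)
```

Passing a fraction to `PCA` avoids the second fit that a trial-then-refit approach needs, at the cost of making the chosen count less visible in the code.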
--- a/p2/quest/101325_subcomment_pca.pkl
+++ b/p2/quest/101325_subcomment_pca.pkl
--- a/p2/quest/python_scripts/090425_batched_olmo_cat.py
+++ b/p2/quest/python_scripts/090425_batched_olmo_cat.py
@@ -2,7 +2,8 @@ from transformers import AutoModelForCausalLM, AutoTokenizer, OlmoForCausalLM
 import torch
 import csv 
 import pandas as pd 
-import re 
+import re
+import sys 

 import nltk 
 nltk.download('punkt_tab')
@@ -18,7 +19,7 @@ print(torch.cuda.get_device_properties(0))
 olmo = AutoModelForCausalLM.from_pretrained("allenai/OLMo-2-1124-13B", cache_dir=cache_directory).to(device)
 tokenizer = AutoTokenizer.from_pretrained("allenai/OLMo-2-1124-13B", padding_side='left')

-priming = "For the **GIVEN SENTENCE**, please categorize it into one of the defined [[CATEGORIES]]. Each [[CATEGORY]] is described in the TYPOLOGY for reference.Your task is to match the **GIVEN SENTENCE** to the **[[CATEGORY]]** that most accurately describes the content of the comment. Only provide the sentence category as your output. Do not provide any text beyond the category name."
+priming = "You will be provided with a sentence from a software engineering task discussion. For the **GIVEN SENTENCE**, please categorize it into one of the defined [[CATEGORIES]]. Each [[CATEGORY]] is described in the TYPOLOGY for reference. Your task is to match the **GIVEN SENTENCE** to the **[[CATEGORY]]** that most accurately describes the content of the comment. Only provide the sentence category as your output. Do not provide any text beyond the category name."

 typology = """
 TYPOLOGY: 
@@ -59,6 +60,7 @@ TYPOLOGY:
 """
 instructions="The sentence's category is: "

+csv.field_size_limit(sys.maxsize)
 with open("/home/nws8519/git/mw-lifecycle-analysis/analysis_data/100325_unified_phab.csv", mode='r', newline='') as file:
    reader = csv.reader(file)
    array_of_categorizations = []
@@ -107,7 +109,7 @@ with open("/home/nws8519/git/mw-lifecycle-analysis/analysis_data/100325_unified_
            batch = comment_sentences[i:i+batch_size]
            prompts = []
            for sent in batch:
-                given_data = f"**GIVEN SENTENCE: \n ' Type -text_dict['task_title']  \n Text -{sent}**'\n"
+                given_data = f"**GIVEN SENTENCE: \n ' Task Title -{text_dict['task_title']}  \n Text -{sent}**'\n"
                prompt = f"{priming}\n{typology}\n\n{given_data}\n{instructions}"
                prompts.append(prompt)
            inputs = tokenizer(prompts, return_tensors='pt', return_token_type_ids=False, padding=True, truncation=True).to(device)
@@ -127,7 +129,7 @@ with open("/home/nws8519/git/mw-lifecycle-analysis/analysis_data/100325_unified_
        array_of_categorizations.append(text_dict)
    df = pd.DataFrame(array_of_categorizations)
    #print(df.head())
-    df.to_csv('all_101025_olmo_batched_categorized.csv', index=False)
+    df.to_csv('all_101325_olmo_batched_categorized.csv', index=False)
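Inside an f-string, only expressions wrapped in braces are evaluated. Anything else, including text that looks like a dictionary lookup, is emitted literally. A minimal sketch of the difference for a `given_data`-style template follows; the values of `text_dict` and `sent` are hypothetical stand-ins.

```python
# Hypothetical stand-ins for one CSV row and one of its sentences.
text_dict = {"task_title": "VisualEditor: toolbar overlaps page title"}
sent = "Could you attach a screenshot please?"

# Without braces the lookup is plain text and reaches the model verbatim.
literal = f"Task Title -text_dict['task_title'] \n Text -{sent}"

# With braces the title is substituted into the prompt.
interpolated = f"Task Title -{text_dict['task_title']} \n Text -{sent}"

print(literal)
print(interpolated)
```

Printing one fully assembled prompt before launching a batched job is a cheap way to confirm the template contains the intended task title.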


 	    
--- a/p2/quest/python_scripts/neurobiber_PCA.py
+++ b/p2/quest/python_scripts/neurobiber_PCA.py
@@ -7,6 +7,7 @@ import pandas as pd
 import matplotlib.pyplot as plt
 import seaborn as sns
 import pickle
+import ast

 # List of the 96 features that Neurobiber can predict
 BIBER_FEATURES = [
@@ -25,25 +26,72 @@ BIBER_FEATURES = [
    "BIN_DET","BIN_EMOJ","BIN_EMOT","BIN_EXCL","BIN_HASH","BIN_INF",
    "BIN_UH","BIN_NUM","BIN_LAUGH","BIN_PRP","BIN_PREP","BIN_NNP",
    "BIN_QUES","BIN_QUOT","BIN_AT","BIN_SBJP","BIN_URL","BIN_WH",
-    "BIN_INDA","BIN_ACCU","BIN_PGAS","BIN_CMADJ","BIN_SPADJ","BIN_X"
+    "BIN_INDA","BIN_ACCU","BIN_PGAS","BIN_CMADJ","BIN_SPADJ","BIN_X", 
+    "sentence_count", "median_sentence_length" 
+]
+
+selected_cols = [
+    "normalized_QUAN","normalized_QUPR","normalized_AMP","normalized_PASS","normalized_XX0","normalized_JJ",
+    "normalized_BEMA","normalized_CAUS","normalized_CONC","normalized_COND","normalized_CONJ","normalized_CONT",
+    "normalized_DPAR","normalized_DWNT","normalized_EX","normalized_FPP1","normalized_GER","normalized_RB",
+    "normalized_PIN","normalized_INPR","normalized_TO","normalized_NEMD","normalized_OSUB","normalized_PASTP",
+    "normalized_VBD","normalized_PHC","normalized_PIRE","normalized_PLACE","normalized_POMD","normalized_PRMD",
+    "normalized_WZPRES","normalized_VPRT","normalized_PRIV","normalized_PIT","normalized_PUBV","normalized_SPP2",
+    "normalized_SMP","normalized_SERE","normalized_STPR","normalized_SUAV","normalized_SYNE","normalized_TPP3",
+    "normalized_TIME","normalized_NOMZ","normalized_BYPA","normalized_PRED","normalized_TOBJ","normalized_TSUB",
+    "normalized_THVC","normalized_NN","normalized_DEMP","normalized_DEMO","normalized_WHQU","normalized_EMPH",
+    "normalized_HDG","normalized_WZPAST","normalized_THAC","normalized_PEAS","normalized_ANDC","normalized_PRESP",
+    "normalized_PROD","normalized_SPAU","normalized_SPIN","normalized_THATD","normalized_WHOBJ","normalized_WHSUB",
+    "normalized_WHCL","normalized_ART","normalized_AUXB","normalized_CAP","normalized_SCONJ","normalized_CCONJ",
+    "normalized_DET","normalized_EMOJ","normalized_EMOT","normalized_EXCL","normalized_HASH","normalized_INF",
+    "normalized_UH","normalized_NUM","normalized_LAUGH","normalized_PRP","normalized_PREP","normalized_NNP",
+    "normalized_QUES","normalized_QUOT","normalized_AT","normalized_SBJP","normalized_URL","normalized_WH",
+    "normalized_INDA","normalized_ACCU","normalized_PGAS","normalized_CMADJ","normalized_SPADJ","normalized_X",
+    "normalized_AWL", "normalized_TTR","sentence_count", "median_sentence_length"
 ]


+def safe_parse(x):
+    # If NaN or float, treat as empty list
+    if isinstance(x, float) and np.isnan(x):
+        return []
+    if isinstance(x, str):
+        try:
+            return ast.literal_eval(x)
+        except Exception:
+            return []
+    if isinstance(x, list):
+        return x
+    return []

 def format_df_data(df):
    #this accounts for the somewhat idiosyncratic way that I saved my data 
    normalized_cols = [col for col in df.columns if col.startswith('normalized_')]
+    
+    #selected_features = [col for col in df.columns if col in selected_cols]
    x = df[normalized_cols].astype(float).values
+
+    # 101325 additions: append sentence count and median sentence length as length features
+    df['olmo_cleaned_sentences'] = df['olmo_cleaned_sentences'].apply(safe_parse)
+    print(df['olmo_cleaned_sentences'])
+    sentence_count = df['olmo_cleaned_sentences'].apply(len).values.reshape(-1, 1)
+    
+    median_sentence_length = df['olmo_cleaned_sentences'].apply(
+        lambda sents: np.median([len(sent.split()) for sent in sents]) if len(sents) > 0 else 0
+    ).values.reshape(-1, 1)
+    print(median_sentence_length)
+    x = np.hstack([x, sentence_count, median_sentence_length])
    #x = np.vstack(df['features'].values)
    return x

 if __name__ == "__main__":
-    biber_vec_df = pd.read_csv("/home/nws8519/git/mw-lifecycle-analysis/analysis_data/092925_unified_phab.csv", low_memory=False)
+    biber_vec_df = pd.read_csv("/home/nws8519/git/mw-lifecycle-analysis/analysis_data/100325_unified_phab.csv", low_memory=False)
    biber_vec_df = biber_vec_df[biber_vec_df['comment_type'] != 'task_description']
    #biber_vec_df = biber_vec_df[biber_vec_df['AuthorPHID'] != "PHID-USER-idceizaw6elwiwm5xshb"] 
    #biber_vec_df = biber_vec_df[biber_vec_df['comment_text'] != 'nan']
    biber_vecs = format_df_data(biber_vec_df)
    #handoff to PCA model
+    
    pca_trial = PCA()  
    biber_vecs_pca_trial = pca_trial.fit_transform(biber_vecs)

@ -55,9 +103,9 @@ if __name__ == "__main__":
    
    pca = PCA(n_components=argmax_components)
    biber_vecs_pca = pca.fit_transform(biber_vecs)
-    with open('100125_subcomment_pca.pkl', 'wb') as f:
+    with open('101325_subcomment_pca.pkl', 'wb') as f:
        pickle.dump(pca, f)
-    selected_axis = "closed_relevance"    
+    selected_axis = "AuthorWMFAffil"    
    
    component_variances = np.var(biber_vecs_pca, axis=0)
    print("Variance of each PCA component:", component_variances)
@ -66,28 +114,28 @@ if __name__ == "__main__":
        print(f"PC{i+1}:")
        indices = np.argsort(np.abs(component))[::-1]
        for idx in indices[:10]:  # Top 10
-            print(f"  {BIBER_FEATURES[idx]}: {component[idx]:.3f}")
+            print(f"{selected_cols[idx]}: {component[idx]:.3f}")
    
    #first looking at comment_type
-    le = LabelEncoder()
-    colors = le.fit_transform(biber_vec_df[selected_axis])
+    #le = LabelEncoder()
+    #colors = le.fit_transform(biber_vec_df[selected_axis])
    
-    pc_dict = {f"PC{i+1}": biber_vecs_pca[:, i] for i in range(18)}
-    pc_dict[selected_axis] = biber_vec_df[selected_axis].astype(str)
+    pc_dict = {f"PC{i+1}": biber_vecs_pca[:, i] for i in range(argmax_components)}
+    #pc_dict[selected_axis] = biber_vec_df[selected_axis].astype(str)
    pc_dict["source"] = biber_vec_df['source'].astype(str)
    pc_dict["phase"] = biber_vec_df['phase'].astype(str)
    pc_dict["text"] = biber_vec_df['comment_text'].astype(str)
    pc_dict['id'] = biber_vec_df['id']
    pc_dict['week_index'] = biber_vec_df['week_index']
    pc_dict['priority'] = biber_vec_df['priority']
-    pc_dict['closed_relevance'] = biber_vec_df['closed_relevance']
+    pc_dict['resolution_outcome'] = biber_vec_df['resolution_outcome']
    pc_dict['TaskPHID'] = biber_vec_df['TaskPHID']
    pc_dict['AuthorPHID'] = biber_vec_df['AuthorPHID']
    pc_dict['date_created'] = biber_vec_df['date_created'] 


    plot_df = pd.DataFrame(pc_dict)
-    plot_df.to_csv("100125_subcomment_PCA_df.csv", index=False)
+    plot_df.to_csv("101325_subcomment_PCA_df.csv", index=False)

    print("Top 10 PC1 values:")
    print(plot_df.nlargest(10, "PC1"))
@ -109,20 +157,7 @@ if __name__ == "__main__":

    #plt.savefig("090225_biber_pca_plot.png", dpi=300) 
    '''
-    plot_df = pd.DataFrame({
-        "PC1": biber_vecs_pca[:, 0],
-        "PC2": biber_vecs_pca[:, 1],
-        selected_axis: biber_vec_df[selected_axis].astype(str)
-    })
-    plt.figure(figsize=(8,6))
-    sns.scatterplot(
-        data=plot_df, x="PC1", y="PC2", hue="source",
-        palette="tab10", s=40, alpha=0.7, edgecolor=None
-    )
-    plt.xlabel('component 1')
-    plt.ylabel('component 2')
-    plt.legend(title=selected_axis, bbox_to_anchor=(1.05, 1), loc=2)
-    '''
    #g.fig.tight_layout()
    #g.savefig(f"subcomment_{selected_axis}_100125_biber_pca_final.png", dpi=300)
    #plt.show()
+    '''
--- a/p2/quest/slurm_jobs/090425_olmo_batched_cat.sh
+++ b/p2/quest/slurm_jobs/090425_olmo_batched_cat.sh
@ -9,7 +9,7 @@
 #SBATCH --mem=64G
 #SBATCH --cpus-per-task=4
 #SBATCH --job-name=batched-MW-info-typology
-#SBATCH --output=101025-batched-mw-olmo-info-cat.log
+#SBATCH --output=101325-batched-mw-olmo-info-cat.log
 #SBATCH --mail-type=BEGIN,END,FAIL
 #SBATCH --mail-user=gaughan@u.northwestern.edu

--- a/p2/quest/slurm_jobs/pca_run.sh
+++ b/p2/quest/slurm_jobs/pca_run.sh
@ -8,7 +8,7 @@
 #SBATCH --mem=64G
 #SBATCH --cpus-per-task=4
 #SBATCH --job-name=neurobiber-pca 
-#SBATCH --output=100125_subcomment_neurobiber-pca.log
+#SBATCH --output=101325_subcomment_neurobiber-pca.log
 #SBATCH --mail-type=BEGIN,END,FAIL
 #SBATCH --mail-user=gaughan@u.northwestern.edu