running PCA on subcomment values, adding new plot for closed_relevance

mgaughan 2025-09-25 10:11:47 -05:00
parent e29d4bf59c
commit b21ecb02c3
6 changed files with 113116 additions and 52 deletions

File diff suppressed because one or more lines are too long

@@ -1,4 +1,4 @@
-starting the job at: Thu Sep 25 09:36:43 CDT 2025
+starting the job at: Thu Sep 25 10:05:44 CDT 2025
setting up the environment
running the neurobiber labeling script
Variance of each PCA component: [44.08472997 25.31736287 20.0163717 11.80556907 8.85200058 8.36660391
@@ -203,62 +203,62 @@ PC18:
BIN_PRIV: -0.116
BIN_FPP1: 0.103
Top 10 PC1 values:
-PC1 PC2 ... priority closed_relevance
-19873 40.267200 26.528755 ... Medium False
-24120 34.012764 7.436658 ... Low False
-24529 33.020514 7.624464 ... Needs Triage False
-25549 33.018302 7.622737 ... Medium False
-24528 33.016089 7.621010 ... Needs Triage True
-23238 31.348286 5.402173 ... Medium False
-18729 29.627919 4.690955 ... Needs Triage True
-23016 29.595518 8.870229 ... Medium False
-14849 28.191116 6.625144 ... Low False
-21214 28.191116 6.625144 ... Low True
+PC1 PC2 PC3 ... id week_index priority
+19873 40.267200 26.528755 -11.406833 ... 75953 -32 Medium
+24120 34.012764 7.436658 -10.042571 ... 101814 -4 Low
+24529 33.020514 7.624464 0.683570 ... 90329 -19 Needs Triage
+25549 33.018302 7.622737 0.683000 ... 90328 -19 Medium
+24528 33.016089 7.621010 0.682431 ... 90330 -19 Needs Triage
+23238 31.348286 5.402173 -6.263101 ... 107627 4 Medium
+18729 29.627919 4.690955 -5.130935 ... 60818 -80 Needs Triage
+23016 29.595518 8.870229 -12.256826 ... 110277 7 Medium
+14849 28.191116 6.625144 -7.271761 ... 56457 3 Low
+21214 28.191116 6.625144 -7.271761 ... 56457 -93 Low
-[10 rows x 26 columns]
+[10 rows x 25 columns]
Bottom 10 PC1 values:
-PC1 PC2 ... priority closed_relevance
-24481 -16.862586 13.863453 ... Needs Triage True
-23053 -16.174624 12.133559 ... Medium False
-23838 -15.421295 13.308099 ... Low False
-25791 -15.127553 14.746424 ... Medium True
-7451 -14.574686 5.821303 ... Medium False
-24467 -13.905417 7.936462 ... Needs Triage True
-23436 -13.827143 7.507781 ... Medium False
-24293 -13.667374 0.891979 ... Unbreak Now! True
-11814 -13.418003 7.854756 ... Low False
-968 -13.358491 0.305388 ... Needs Triage True
+PC1 PC2 PC3 ... id week_index priority
+24481 -16.862586 13.863453 8.545495 ... 92256 -17 Needs Triage
+23053 -16.174624 12.133559 0.579284 ... 110020 7 Medium
+23838 -15.421295 13.308099 -0.838241 ... 109719 7 Low
+25791 -15.127553 14.746424 22.119623 ... 85189 -28 Medium
+7451 -14.574686 5.821303 4.386196 ... 53758 2 Medium
+24467 -13.905417 7.936462 2.001860 ... 92606 -16 Needs Triage
+23436 -13.827143 7.507781 -2.056608 ... 103919 -1 Medium
+24293 -13.667374 0.891979 -6.145701 ... 88897 -21 Unbreak Now!
+11814 -13.418003 7.854756 -1.595019 ... 52497 0 Low
+968 -13.358491 0.305388 -3.980203 ... 55409 8 Needs Triage
-[10 rows x 26 columns]
+[10 rows x 25 columns]
Top 10 PC2 values:
-PC1 PC2 ... priority closed_relevance
-25606 6.196829 29.809964 ... Medium True
-21956 27.542757 27.763075 ... Needs Triage True
-25078 -4.462216 27.186434 ... High False
-19873 40.267200 26.528755 ... Medium False
-25820 -3.022591 23.093162 ... Medium True
-25814 20.151634 22.681554 ... Medium True
-13345 6.035595 21.910339 ... Lowest NaN
-22013 6.861197 21.673434 ... Needs Triage True
-23022 0.808467 21.111863 ... Medium False
-21966 -7.056224 20.953599 ... Needs Triage True
+PC1 PC2 PC3 ... id week_index priority
+25606 6.196829 29.809964 23.877767 ... 88139 -22 Medium
+21956 27.542757 27.763075 7.924919 ... 105099 0 Needs Triage
+25078 -4.462216 27.186434 1.860348 ... 85326 -27 High
+19873 40.267200 26.528755 -11.406833 ... 75953 -32 Medium
+25820 -3.022591 23.093162 -0.361349 ... 78160 -30 Medium
+25814 20.151634 22.681554 3.346066 ... 78837 -29 Medium
+13345 6.035595 21.910339 -6.417684 ... 51999 -2 Lowest
+22013 6.861197 21.673434 -7.690901 ... 103771 -1 Needs Triage
+23022 0.808467 21.111863 12.735632 ... 110276 7 Medium
+21966 -7.056224 20.953599 3.715673 ... 104656 0 Needs Triage
-[10 rows x 26 columns]
+[10 rows x 25 columns]
Bottom 10 PC2 values:
-PC1 PC2 ... priority closed_relevance
-3134 5.606805 -12.562127 ... High True
-654 -0.797645 -12.364185 ... Unbreak Now! True
-16289 -0.897011 -12.328128 ... Medium False
-1207 4.714780 -12.127148 ... Needs Triage True
-1885 15.889004 -12.071062 ... Needs Triage True
-18211 6.521166 -11.920065 ... Needs Triage True
-2934 0.069845 -11.739971 ... High False
-25122 -1.657588 -11.388235 ... Medium True
-13276 15.441209 -11.380360 ... Lowest False
-2109 -2.166594 -11.371418 ... Needs Triage True
+PC1 PC2 PC3 ... id week_index priority
+3134 5.606805 -12.562127 3.104184 ... 54102 3 High
+654 -0.797645 -12.364185 4.558365 ... 49434 -11 Unbreak Now!
+16289 -0.897011 -12.328128 10.193536 ... 43224 -45 Medium
+1207 4.714780 -12.127148 4.095656 ... 54103 3 Needs Triage
+1885 15.889004 -12.071062 -5.011946 ... 52106 -1 Needs Triage
+18211 6.521166 -11.920065 4.671982 ... 73532 -40 Needs Triage
+2934 0.069845 -11.739971 3.000846 ... 54499 4 High
+25122 -1.657588 -11.388235 -4.465696 ... 97316 -10 Medium
+13276 15.441209 -11.380360 -3.592736 ... 52804 0 Lowest
+2109 -2.166594 -11.371418 -0.979264 ... 49816 -9 Needs Triage
-[10 rows x 26 columns]
+[10 rows x 25 columns]
job finished, cleaning up
-job pau at: Thu Sep 25 09:37:24 CDT 2025
+job pau at: Thu Sep 25 10:06:30 CDT 2025
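
The "Variance of each PCA component" line in this log comes from `np.var(biber_vecs_pca, axis=0)` in the script. As a rough illustration of what those numbers mean, here is a minimal sketch on synthetic data. It uses a plain NumPy SVD in place of whatever PCA class the script uses, and the array `X` and its shape are made up for the example.

```python
import numpy as np

# Synthetic stand-in for the feature matrix (assumption: not the real data).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5)) * np.array([6.0, 4.0, 3.0, 2.0, 1.0])

# PCA via SVD of the centered data; rows of Vt are the principal axes.
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = Xc @ Vt.T  # projection onto all components

# Same computation as the script's component_variances line.
component_variances = np.var(scores, axis=0)

# With every component kept, the per-component variances add up to the
# total variance of the original features, so each share is a fraction of 1.
total_variance = np.var(X, axis=0).sum()
variance_share = component_variances / total_variance

print("Variance of each PCA component:", component_variances)
print("Share of total variance:", variance_share)
```

Reading the variances as shares of the total makes runs easier to compare when the raw magnitudes differ a lot, as they do between the two log files in this commit.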

@@ -0,0 +1,265 @@
starting the job at: Thu Sep 25 09:52:47 CDT 2025
setting up the environment
running the neurobiber labeling script
Variance of each PCA component: [259.38215213 83.11803664 67.16301107 61.78747188 38.94875996
32.78688889 26.45592105 21.9280629 18.734197 16.29485568
13.48304855 11.50594609 10.77855857 9.30674176 8.96113511
8.35521401 8.17815209 7.13194427]
PC1:
BIN_CAP: 0.680
BIN_NNP: 0.647
BIN_DET: -0.151
BIN_PREP: -0.128
BIN_PIN: -0.128
BIN_VPRT: -0.091
BIN_ART: -0.090
BIN_RB: -0.086
BIN_PRP: -0.077
BIN_SBJP: -0.077
PC2:
BIN_NN: 0.744
BIN_NNP: -0.320
BIN_RB: -0.256
BIN_CAP: 0.242
BIN_PRP: -0.224
BIN_SBJP: -0.224
BIN_VPRT: -0.163
BIN_FPP1: -0.113
BIN_NUM: 0.104
BIN_INF: -0.092
PC3:
BIN_CAP: 0.661
BIN_NNP: -0.491
BIN_RB: 0.266
BIN_PRP: 0.223
BIN_SBJP: 0.223
BIN_VPRT: 0.137
BIN_X: -0.128
BIN_PIN: 0.125
BIN_PREP: 0.125
BIN_FPP1: 0.124
PC4:
BIN_PIN: 0.649
BIN_PREP: 0.649
BIN_NNP: 0.256
BIN_CONJ: 0.157
BIN_RB: -0.156
BIN_NN: -0.089
BIN_TO: 0.081
BIN_X: -0.078
BIN_VPRT: -0.057
BIN_INF: 0.052
PC5:
BIN_DET: 0.622
BIN_ART: 0.381
BIN_X: -0.273
BIN_VPRT: 0.264
BIN_NN: 0.262
BIN_NNP: 0.243
BIN_AUXB: 0.222
BIN_NUM: -0.187
BIN_INF: -0.164
BIN_INDA: 0.158
PC6:
BIN_NN: 0.486
BIN_PRP: 0.464
BIN_SBJP: 0.464
BIN_FPP1: 0.239
BIN_NNP: 0.238
BIN_DET: -0.179
BIN_AUXB: -0.174
BIN_PASS: -0.144
BIN_CAP: -0.135
BIN_PIT: 0.128
PC7:
BIN_RB: 0.787
BIN_NN: 0.266
BIN_PRP: -0.188
BIN_SBJP: -0.188
BIN_DET: -0.182
BIN_NNP: 0.154
BIN_JJ: -0.153
BIN_X: -0.150
BIN_TIME: 0.134
BIN_ART: -0.129
PC8:
BIN_JJ: 0.667
BIN_INF: 0.352
BIN_VPRT: -0.326
BIN_ART: 0.234
BIN_PASS: -0.220
BIN_AUXB: -0.219
BIN_NUM: -0.210
BIN_CONJ: -0.149
BIN_RB: 0.126
BIN_PEAS: -0.118
PC9:
BIN_INF: 0.633
BIN_JJ: -0.559
BIN_VPRT: -0.310
BIN_DET: 0.251
BIN_ART: 0.129
BIN_TO: 0.127
BIN_PRIV: 0.100
BIN_NUM: 0.084
BIN_RB: -0.075
BIN_POMD: 0.075
PC10:
BIN_INF: 0.443
BIN_AUXB: 0.372
BIN_VPRT: 0.368
BIN_ART: -0.256
BIN_RB: -0.249
BIN_JJ: 0.247
BIN_VBD: -0.246
BIN_X: -0.231
BIN_DET: -0.211
BIN_PASS: 0.171
PC11:
BIN_X: 0.793
BIN_PUBV: -0.266
BIN_VPRT: 0.258
BIN_VBD: -0.245
BIN_NUM: -0.211
BIN_CONJ: -0.157
BIN_JJ: -0.145
BIN_UH: -0.105
BIN_INF: 0.103
BIN_NOMZ: -0.079
PC12:
BIN_NUM: 0.765
BIN_VBD: -0.239
BIN_UH: -0.217
BIN_VPRT: 0.206
BIN_QUOT: -0.181
BIN_RB: 0.161
BIN_INDA: 0.145
BIN_PGAS: -0.145
BIN_JJ: 0.136
BIN_ART: 0.135
PC13:
BIN_VBD: 0.468
BIN_QUOT: -0.433
BIN_AUXB: 0.357
BIN_CONT: -0.324
BIN_PASS: 0.255
BIN_X: 0.220
BIN_VPRT: -0.214
BIN_UH: -0.185
BIN_TIME: 0.135
BIN_PUBV: 0.123
PC14:
BIN_UH: 0.499
BIN_QUOT: -0.460
BIN_VBD: -0.386
BIN_CONT: -0.361
BIN_PUBV: -0.265
BIN_CONJ: 0.243
BIN_NUM: -0.161
BIN_VPRT: -0.122
BIN_STPR: -0.108
BIN_DEMP: -0.087
PC15:
BIN_PUBV: 0.512
BIN_CONJ: -0.370
BIN_QUOT: -0.318
BIN_PGAS: 0.300
BIN_CONT: -0.268
BIN_VPRT: 0.244
BIN_PASS: -0.241
BIN_NOMZ: 0.229
BIN_AUXB: -0.177
BIN_TO: 0.108
PC16:
BIN_CONJ: 0.633
BIN_UH: -0.460
BIN_PUBV: 0.371
BIN_NUM: -0.258
BIN_VBD: -0.198
BIN_NOMZ: 0.158
BIN_SCONJ: -0.125
BIN_PREP: -0.111
BIN_PIN: -0.111
BIN_X: 0.106
PC17:
BIN_PGAS: 0.513
BIN_UH: -0.500
BIN_PUBV: -0.371
BIN_VPRT: -0.222
BIN_VBD: -0.204
BIN_CONJ: -0.176
BIN_CCONJ: 0.175
BIN_ART: -0.173
BIN_X: -0.134
BIN_INDA: -0.125
PC18:
BIN_ART: 0.456
BIN_DET: -0.342
BIN_DEMO: -0.306
BIN_DEMP: -0.285
BIN_INDA: 0.273
BIN_PIT: 0.221
BIN_FPP1: -0.220
BIN_CCONJ: 0.219
BIN_CONJ: -0.214
BIN_AUXB: 0.211
Top 10 PC1 values:
PC1 PC2 PC3 ... week_index priority closed_relevance
24527 124.414338 -16.859224 4.745394 ... -19 NaN NaN
707 124.395927 -16.870930 4.737835 ... -16 NaN NaN
744 124.395927 -16.870930 4.737835 ... -32 NaN NaN
749 124.395927 -16.870930 4.737835 ... -32 NaN NaN
2243 124.395927 -16.870930 4.737835 ... -28 NaN NaN
5921 124.395927 -16.870930 4.737835 ... -32 NaN NaN
5933 124.395927 -16.870930 4.737835 ... -33 NaN NaN
5935 124.395927 -16.870930 4.737835 ... -32 NaN NaN
10080 124.395927 -16.870930 4.737835 ... -11 NaN NaN
10418 124.395927 -16.870930 4.737835 ... -29 NaN NaN
[10 rows x 26 columns]
Bottom 10 PC1 values:
PC1 PC2 PC3 ... week_index priority closed_relevance
13752 -24.875039 3.698789 0.288475 ... 11 NaN NaN
18942 -24.875039 3.698789 0.288475 ... -85 NaN NaN
14276 -24.572975 0.683877 7.752763 ... 10 NaN NaN
19869 -24.572975 0.683877 7.752763 ... -86 NaN NaN
25556 -23.009401 -10.063628 2.942026 ... 6 NaN NaN
23477 -22.592972 -0.970333 -2.614535 ... 7 NaN NaN
13907 -22.489084 10.362266 -8.549736 ... -9 NaN NaN
14824 -22.001266 -17.807228 5.081855 ... 441 NaN NaN
21189 -22.001266 -17.807228 5.081855 ... 345 NaN NaN
24439 -21.740588 -10.174665 5.702323 ... 110 NaN NaN
[10 rows x 26 columns]
Top 10 PC2 values:
PC1 PC2 PC3 ... week_index priority closed_relevance
117 53.467507 89.625455 44.442476 ... 4 NaN NaN
2447 53.136124 89.414757 44.306417 ... 138 NaN NaN
2471 53.136124 89.414757 44.306417 ... 20 NaN NaN
22224 53.117714 89.403051 44.298858 ... 10 NaN NaN
2728 19.109757 77.304171 11.238337 ... 8 NaN NaN
5024 19.109757 77.304171 11.238337 ... 8 NaN NaN
5135 19.109757 77.304171 11.238337 ... 8 NaN NaN
17701 -14.842968 65.240407 -21.799507 ... -83 NaN NaN
17591 -14.861378 65.228702 -21.807066 ... -100 NaN NaN
24735 -14.916609 65.193586 -21.829743 ... 43 NaN NaN
[10 rows x 26 columns]
Bottom 10 PC2 values:
PC1 PC2 PC3 ... week_index priority closed_relevance
14558 56.232734 -41.162334 -61.443677 ... 10 NaN NaN
6321 56.251144 -41.150628 -61.436118 ... 302 NaN NaN
6322 56.251144 -41.150628 -61.436118 ... 139 NaN NaN
6770 56.251144 -41.150628 -61.436118 ... 120 NaN NaN
6771 56.251144 -41.150628 -61.436118 ... 120 NaN NaN
10442 56.251144 -41.150628 -61.436118 ... 383 NaN NaN
10443 56.251144 -41.150628 -61.436118 ... 383 NaN NaN
10528 56.251144 -41.150628 -61.436118 ... 93 NaN NaN
10529 56.251144 -41.150628 -61.436118 ... 93 NaN NaN
11837 56.251144 -41.150628 -61.436118 ... 133 NaN NaN
[10 rows x 26 columns]
job finished, cleaning up
job pau at: Thu Sep 25 09:53:27 CDT 2025
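
The `PC1:` ... `PC18:` blocks in this log list, for each component, the ten features with the largest absolute loadings. The code that prints them is not part of this diff, so the following is only a sketch of one way to produce such a listing. It assumes the loadings are the rows of the component matrix; `pca_svd` and `top_loadings` are hypothetical helpers, and the data is synthetic.

```python
import numpy as np

def pca_svd(X, n_components):
    """Center X and return (scores, components) from an SVD-based PCA."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:n_components]   # one unit-length row per component
    scores = Xc @ components.T
    return scores, components

def top_loadings(components, feature_names, k=10):
    """Map 'PCi' to its k features with the largest absolute loading."""
    result = {}
    for i, comp in enumerate(components, start=1):
        order = np.argsort(-np.abs(comp))[:k]
        result[f"PC{i}"] = [(feature_names[j], round(float(comp[j]), 3))
                            for j in order]
    return result

# Synthetic data; the feature names are borrowed from the log for flavor only.
rng = np.random.default_rng(0)
feature_names = ["BIN_CAP", "BIN_NNP", "BIN_DET", "BIN_NN"]
X = rng.normal(size=(200, 4)) * np.array([8.0, 4.0, 2.0, 1.0])

scores, components = pca_svd(X, n_components=3)
loadings = top_loadings(components, feature_names, k=2)

for name, pairs in loadings.items():
    print(f"{name}:")
    for feature, weight in pairs:
        print(f"{feature}: {weight:.3f}")
```

Sorting by absolute value keeps strongly negative loadings, such as `BIN_NNP: -0.491` on PC3 above, next to the strongly positive ones.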

Binary file not shown.

Size: 1.6 MiB

@@ -58,7 +58,7 @@ if __name__ == "__main__":
biber_vecs_pca = pca.fit_transform(biber_vecs)
with open('092525_description_pca.pkl', 'wb') as f:
pickle.dump(pca, f)
-selected_axis = "AuthorWMFAffil"
+selected_axis = "closed_relevance"
component_variances = np.var(biber_vecs_pca, axis=0)
print("Variance of each PCA component:", component_variances)
@@ -84,7 +84,7 @@ if __name__ == "__main__":
pc_dict['closed_relevance'] = biber_vec_df['closed_relevance']
plot_df = pd.DataFrame(pc_dict)
-plot_df.to_csv("092325_description_PCA_df.csv", index=False)
+#plot_df.to_csv("092325_subcomment_PCA_df.csv", index=False)
print("Top 10 PC1 values:")
print(plot_df.nlargest(10, "PC1"))
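
The hunk above prints the rows with the most extreme scores via `plot_df.nlargest(10, "PC1")`. A small self-contained sketch of that inspection step follows. The frame is synthetic, and the `value_counts` tally at the end is an illustrative addition, not something the script does.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for plot_df (assumption: the real frame has more columns).
rng = np.random.default_rng(1)
plot_df = pd.DataFrame({
    "PC1": rng.normal(scale=10.0, size=30),
    "PC2": rng.normal(scale=5.0, size=30),
    # Includes missing labels, as in the logged output above.
    "closed_relevance": [True, False, None] * 10,
})

selected_axis = "closed_relevance"

# nlargest / nsmallest return rows already sorted by the chosen column.
top10 = plot_df.nlargest(10, "PC1")
bottom10 = plot_df.nsmallest(10, "PC1")

# Tally the selected label among the extreme rows, keeping missing values
# visible rather than silently dropping them.
counts = top10[selected_axis].value_counts(dropna=False)

print("Top 10 PC1 values:")
print(top10)
print("Bottom 10 PC1 values:")
print(bottom10)
print(counts)
```

Passing `dropna=False` matters here: in the newly added log file, `priority` and `closed_relevance` are `NaN` for every listed row, which a default `value_counts` would hide.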

Binary file not shown.

Size: 2.4 MiB