
new batched OLMO labels

mgaughan 2025-10-24 10:03:28 -05:00
parent 0ed72af495
commit d6965a33cb
4 changed files with 146049 additions and 7 deletions

View File

@@ -0,0 +1,12 @@
setting up the environment by loading in conda environment at Tue Oct 21 22:09:37 CDT 2025
running the batched olmo categorization job at Tue Oct 21 22:09:38 CDT 2025
[nltk_data] Downloading package punkt_tab to
[nltk_data] /home/nws8519/nltk_data...
[nltk_data] Package punkt_tab is already up-to-date!
cuda
NVIDIA A100-SXM4-80GB
_CudaDeviceProperties(name='NVIDIA A100-SXM4-80GB', major=8, minor=0, total_memory=81153MB, multi_processor_count=108, uuid=f431a288-a223-5654-4599-c2a6f20abe8d, L2_cache_size=40MB)
Loading checkpoint shards: 100%|██████████| 12/12 [00:05<00:00, 2.36it/s]
Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
This is a friendly reminder - the current text generation call will exceed the model's predefined maximum length (4096). Depending on the model, you may observe exceptions, performance degradation, or nothing at all.
unsupervised batched olmo categorization pau at Thu Oct 23 16:52:15 CDT 2025
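
The two transformers warnings in this log suggest the generation call is made without an explicit truncation setting or length cap, so prompts near the 4096-token context window can silently degrade. Below is a minimal sketch of one way to bound both the prompt and the generated continuation; the checkpoint name, padding fallback, and example batch are illustrative assumptions, not the repository's actual code.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical checkpoint name; the log only shows a 12-shard OLMo model on an A100.
model_name = "allenai/OLMo-2-1124-13B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_name)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # common fallback so padded batching works

model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="cuda"
)

batch = ["Sentence one to categorize.", "Sentence two to categorize."]  # stand-in prompts
max_new = 32  # short categorical answers only

# Truncating the prompt to (context window - max_new) keeps prompt + generation
# inside the 4096-token limit that the warning above refers to.
inputs = tokenizer(
    batch, return_tensors="pt", padding=True, truncation=True, max_length=4096 - max_new
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=max_new)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))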

View File

@@ -38,11 +38,11 @@ TYPOLOGY:
[[CONTRIBUTION AND COMMITMENT]], in which participants call for contributors and/or voice willingness or unwillingness to contribute to resolving the issue. For example, one potential collaborator said: I will gladly contribute in any way I can, however, this is something I will not be able to do alone. Would be best if a few other people is interested as well...
-[[TASK PROGRESS]], in which stakeholders request or report progress of tasks and sub-tasks towards the solution of the issue. For example, I made an initial stab at it... - this is just a proof of concept that gets the version string into nodejs. I'll start working on adding the swig interfaces...
+[[TASK PROGRESS]], in which stakeholders request or report progress of tasks and sub-tasks towards the solution of the issue. This includes automated reports of merged code changes. For example, I made an initial stab at it... - this is just a proof of concept that gets the version string into nodejs. I'll start working on adding the swig interfaces...
[[TESTING]], in which participants discuss the testing procedure and results, as well as the system environment, code, data, and feedback involved in testing. For example, Tested on 0.101 and master - the issue seems to be fixed on master not just for the example document, but for the entire corpus...
-[[NA]], in which the sentence contents are entirely incomprehensible or only consist of punctuation or numerals. For example, "***", "ve-ce-protectedNode", or "T8597".
+[[NA]], only apply this category if the sentence consists of non-English terms or only consist of punctuation or numerals. For example, "***", "ve-ce-protectedNode", or "T8597".
[[FUTURE PLAN]], in which participants discuss the long-term plan related to the issue; such plans usually involve work/ideas that are not required to close the current issue. For example, For the futures, stay tuned, as we're prototyping something in this direction.
@@ -52,11 +52,11 @@ TYPOLOGY:
[[WORKAROUNDS]] focus on discussions about temporary or alternative solutions that can help overcome the issue until the official fix or enhancement is released. For example, in a discussion regarding memory growth for streamed data, one participant expressed his temporary solution: For now workaround with reloading / collecting nlp object works quite ok in production.
-[[ISSUE CONTENT MANAGEMENT]] focuses on redirecting the discussions and controlling the quality of the comments with respect to the issue. For example, We might want to move this discussion to here: [link to another issue]
+[[ISSUE CONTENT MANAGEMENT]] focuses on redirecting the discussions and controlling the quality of the comments with respect to the issue. For example, We might want to move this discussion to here: [link to another issue] or "This other issue [link to another issue] is a duplicate of this issue".
[[ACTION ON ISSUE]], in which participants comment on the proper actions to perform on the issue itself. For example, I'm going to close this issue because it's old and most of the information here is now out of date.
-[[SOCIAL CONVERSATION]], in which participants express emotions such as appreciation, disappointment, annoyance, regret, etc. or engage in small talk. For example, I'm so glad that this has received so much thought and attention!
+[[SOCIAL CONVERSATION]], in which participants express emotions such as appreciation, disappointment, annoyance, regret, etc. or engage in small talk. For example, I'm so glad that this has received so much thought and attention!, "My apologies." or "Thank you!"
"""
instructions="The sentence's category is: "
@@ -129,7 +129,7 @@ with open("/home/nws8519/git/mw-lifecycle-analysis/analysis_data/100325_unified_
array_of_categorizations.append(text_dict)
df = pd.DataFrame(array_of_categorizations)
#print(df.head())
-df.to_csv('all_101325_olmo_batched_categorized.csv', index=False)
+df.to_csv('all_102125_olmo_batched_categorized.csv', index=False)
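
The visible tail of 102125_batched_olmo_cat.py accumulates one dict per sentence and writes a single CSV at the end of the run. The sketch below shows how the typology block, the instructions string, and each sentence might be combined and recorded; the prompt-assembly format, the dict keys, the stand-in data, and the generate_label stub are assumptions for illustration, since the full loop is not shown in this hunk.

import pandas as pd

typology = "TYPOLOGY: ..."  # stands in for the full category descriptions above
instructions = "The sentence's category is: "

def generate_label(prompt):
    # Placeholder for the batched OLMo generation call; not shown in the diff.
    return "NA"

sentences = ["Tested on 0.101 and master.", "Thank you!"]  # stand-in data

array_of_categorizations = []
for sentence in sentences:
    prompt = f"{typology}\nSENTENCE: {sentence}\n{instructions}"  # assumed format
    text_dict = {"sentence": sentence, "olmo_category": generate_label(prompt)}  # assumed keys
    array_of_categorizations.append(text_dict)

# One DataFrame built at the end, matching the script's visible output step.
df = pd.DataFrame(array_of_categorizations)
df.to_csv("all_102125_olmo_batched_categorized.csv", index=False)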

File diff suppressed because one or more lines are too long

View File

@@ -9,7 +9,7 @@
#SBATCH --mem=64G
#SBATCH --cpus-per-task=4
#SBATCH --job-name=batched-MW-info-typology
-#SBATCH --output=101325-batched-mw-olmo-info-cat.log
+#SBATCH --output=102125-batched-mw-olmo-info-cat.log
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --mail-user=gaughan@u.northwestern.edu
@@ -23,7 +23,7 @@ conda activate olmo
echo "running the batched olmo categorization job at $(date)"
-python /home/nws8519/git/mw-lifecycle-analysis/p2/quest/python_scripts/090425_batched_olmo_cat.py
+python /home/nws8519/git/mw-lifecycle-analysis/p2/quest/python_scripts/102125_batched_olmo_cat.py
#python /home/nws8519/git/mw-lifecycle-analysis/p2/quest/python_scripts/label_sampling.py