trying to sample the human label rows again

mgaughan 2025-09-22 20:34:31 -05:00
parent bcfa688e11
commit b4f0c8f885
5 changed files with 4019 additions and 15 deletions

File diff suppressed because it is too large

@@ -1,15 +1,10 @@
-setting up the environment by loading in conda environment at Sun Sep 14 11:31:17 CDT 2025
-running the batched olmo categorization job at Sun Sep 14 11:31:17 CDT 2025
+setting up the environment by loading in conda environment at Mon Sep 22 20:07:56 CDT 2025
+running the batched olmo categorization job at Mon Sep 22 20:07:57 CDT 2025
 [nltk_data] Downloading package punkt_tab to
 [nltk_data]     /home/nws8519/nltk_data...
 [nltk_data]   Package punkt_tab is already up-to-date!
 cuda
 NVIDIA A100-SXM4-80GB
-_CudaDeviceProperties(name='NVIDIA A100-SXM4-80GB', major=8, minor=0, total_memory=81153MB, multi_processor_count=108, uuid=02d36cc8-5c30-554c-c2e0-f5fb530dcc7a, L2_cache_size=40MB)
-Loading checkpoint shards: 100%|██████████| 12/12 [00:05<00:00, 2.02it/s]
-Traceback (most recent call last):
-  File "/home/nws8519/git/mw-lifecycle-analysis/p2/quest/python_scripts/label_sampling.py", line 122, in <module>
-    random_df = df.sample(n=300, random_sample = 8)
-                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-TypeError: NDFrame.sample() got an unexpected keyword argument 'random_sample'
-unsupervised batched olmo categorization pau at Sun Sep 14 11:34:04 CDT 2025
+_CudaDeviceProperties(name='NVIDIA A100-SXM4-80GB', major=8, minor=0, total_memory=81153MB, multi_processor_count=108, uuid=fb10e36f-fd51-a123-6ae4-e318d24dbb3c, L2_cache_size=40MB)
+Loading checkpoint shards: 100%|██████████| 12/12 [00:06<00:00, 1.82it/s]
+Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
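The traceback from the Sep 14 run is a pandas keyword typo: sample() takes its seed as random_state, not random_sample. A minimal sketch of the corrected call, with a toy frame standing in for the real data:

    import pandas as pd

    df = pd.DataFrame({"comment_text": ["a", "b", "c", "d"]})

    # the seed goes in random_state; passing random_sample raises the
    # TypeError shown in the log above
    random_df = df.sample(n=2, random_state=8)

The "Asking to truncate to max_length" line in the new run is a separate Hugging Face warning: truncation was requested without an explicit max_length, and since the model config defines no maximum, nothing is truncated.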

@@ -72,7 +72,7 @@ with open("/home/nws8519/git/mw-lifecycle-analysis/p2/quest/072525_pp_biberplus_
     text_dict['comment_text'] = row[2]
     text_dict['comment_type'] = row[12]
     if text_dict['comment_type'] == "task_description":
-        raw_text = text_dict['task_title'] + "\n\n" + text_dict['comment_text']
+        raw_text = text_dict['task_title'] + ". \n\n" + text_dict['comment_text']
     else:
         raw_text = text_dict['comment_text']
@@ -122,7 +122,7 @@ with open("/home/nws8519/git/mw-lifecycle-analysis/p2/quest/072525_pp_biberplus_
     array_of_categorizations.append(text_dict)
 df = pd.DataFrame(array_of_categorizations)
 #print(df.head())
-df.to_csv('all_091625_olmo_batched_categorized.csv', index=False)
+df.to_csv('all_092225_olmo_batched_categorized.csv', index=False)
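The ". \n\n" change matters for sentence splitting: without terminal punctuation, NLTK's Punkt tokenizer tends to glue the task title onto the first sentence of the comment body. A small sketch of the effect, using a made-up title and body:

    import nltk
    nltk.download("punkt_tab", quiet=True)
    from nltk.tokenize import sent_tokenize

    title = "Fix login redirect"
    body = "The redirect loops forever. It started after the update."

    # no period after the title: Punkt typically merges it into the
    # first body sentence
    print(sent_tokenize(title + "\n\n" + body))

    # with the appended period, the title comes back as its own sentence
    print(sent_tokenize(title + ". \n\n" + body))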

@@ -74,7 +74,7 @@ with open("/home/nws8519/git/mw-lifecycle-analysis/p2/quest/072525_pp_biberplus_
     text_dict['TaskPHID'] = row[11]
     #making sure the comment title is included in things
     if text_dict['comment_type'] == "task_description":
-        raw_text = text_dict['task_title'] + "\n\n" + text_dict['comment_text']
+        raw_text = text_dict['task_title'] + ". \n\n" + text_dict['comment_text']
     else:
         raw_text = text_dict['comment_text']
@@ -127,9 +127,12 @@ with open("/home/nws8519/git/mw-lifecycle-analysis/p2/quest/072525_pp_biberplus_
 #taking a random sample of task discussions
 unique_tasks = df['TaskPHID'].unique()
-sampled_tasks = pd.Series(unique_tasks).sample(n=25, random_state=8)
+sampled_tasks = pd.Series(unique_tasks).sample(n=15, random_state=8)
 random_df = df[df['TaskPHID'].isin(sampled_tasks)]
-random_df.to_csv('091625_human_conversation_sample.csv', index=False)
+random_df = random_df.copy()
+random_df['focal_sentence'] = random_df['cleaned_sentences']
+exploded_df = random_df.explode('focal_sentence')
+exploded_df.to_csv('092225_human_conversation_sample.csv', index=False)
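The new tail of this script samples whole conversations by TaskPHID, then fans each row out to one row per sentence with DataFrame.explode. A compact sketch of the same pattern; column names follow the script, the data is invented:

    import pandas as pd

    df = pd.DataFrame({
        "TaskPHID": ["T1", "T1", "T2"],
        "cleaned_sentences": [["s1", "s2"], ["s3"], ["s4", "s5"]],
    })

    # draw task IDs first so entire conversations stay together
    sampled_tasks = pd.Series(df["TaskPHID"].unique()).sample(n=1, random_state=8)
    random_df = df[df["TaskPHID"].isin(sampled_tasks)].copy()

    # explode: one output row per list element, other columns repeated
    random_df["focal_sentence"] = random_df["cleaned_sentences"]
    exploded_df = random_df.explode("focal_sentence")
    print(exploded_df[["TaskPHID", "focal_sentence"]])

The .copy() before assigning focal_sentence avoids pandas' SettingWithCopyWarning on the filtered frame, which is presumably why the commit adds it.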

@@ -0,0 +1,10 @@
+setting up the environment by loading in conda environment at Mon Sep 22 20:07:58 CDT 2025
+running the sampling job at Mon Sep 22 20:07:58 CDT 2025
+[nltk_data] Downloading package punkt_tab to
+[nltk_data]     /home/nws8519/nltk_data...
+[nltk_data]   Package punkt_tab is already up-to-date!
+cuda
+NVIDIA A100-SXM4-80GB
+_CudaDeviceProperties(name='NVIDIA A100-SXM4-80GB', major=8, minor=0, total_memory=81153MB, multi_processor_count=108, uuid=290bcddd-9b2f-3a5b-4cbd-b17d9ec05044, L2_cache_size=40MB)
+Loading checkpoint shards: 100%|██████████| 12/12 [00:06<00:00, 1.82it/s]
+sampling pau at Mon Sep 22 20:11:48 CDT 2025