trying to sample the human label rows again

mgaughan 2025-09-22 20:34:31 -05:00
parent bcfa688e11
commit b4f0c8f885
5 changed files with 4019 additions and 15 deletions

File diff suppressed because it is too large

@@ -1,15 +1,10 @@
-setting up the environment by loading in conda environment at Sun Sep 14 11:31:17 CDT 2025
-running the batched olmo categorization job at Sun Sep 14 11:31:17 CDT 2025
+setting up the environment by loading in conda environment at Mon Sep 22 20:07:56 CDT 2025
+running the batched olmo categorization job at Mon Sep 22 20:07:57 CDT 2025
 [nltk_data] Downloading package punkt_tab to
 [nltk_data]     /home/nws8519/nltk_data...
 [nltk_data]   Package punkt_tab is already up-to-date!
 cuda
 NVIDIA A100-SXM4-80GB
-_CudaDeviceProperties(name='NVIDIA A100-SXM4-80GB', major=8, minor=0, total_memory=81153MB, multi_processor_count=108, uuid=02d36cc8-5c30-554c-c2e0-f5fb530dcc7a, L2_cache_size=40MB)
-Loading checkpoint shards: 100%|██████████| 12/12 [00:05<00:00, 2.02it/s]
-Traceback (most recent call last):
-  File "/home/nws8519/git/mw-lifecycle-analysis/p2/quest/python_scripts/label_sampling.py", line 122, in <module>
-    random_df = df.sample(n=300, random_sample = 8)
-                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-TypeError: NDFrame.sample() got an unexpected keyword argument 'random_sample'
-unsupervised batched olmo categorization pau at Sun Sep 14 11:34:04 CDT 2025
+_CudaDeviceProperties(name='NVIDIA A100-SXM4-80GB', major=8, minor=0, total_memory=81153MB, multi_processor_count=108, uuid=fb10e36f-fd51-a123-6ae4-e318d24dbb3c, L2_cache_size=40MB)
+Loading checkpoint shards: 100%|██████████| 12/12 [00:06<00:00, 1.82it/s]
+Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
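The traceback from the Sep 14 run is a pandas keyword typo: sample() takes its seed as random_state, not random_sample. A minimal sketch of the corrected call, with a toy frame standing in for the real data:

    import pandas as pd

    df = pd.DataFrame({"comment_text": ["a", "b", "c", "d"]})

    # the seed goes in random_state; passing random_sample raises the
    # TypeError shown in the log above
    random_df = df.sample(n=2, random_state=8)

The "Asking to truncate to max_length" line in the new run is a separate Hugging Face warning: truncation was requested without an explicit max_length, and since the model config defines no maximum, nothing is truncated.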

@@ -72,7 +72,7 @@ with open("/home/nws8519/git/mw-lifecycle-analysis/p2/quest/072525_pp_biberplus_
     text_dict['comment_text'] = row[2]
     text_dict['comment_type'] = row[12]
     if text_dict['comment_type'] == "task_description":
-        raw_text = text_dict['task_title'] + "\n\n" + text_dict['comment_text']
+        raw_text = text_dict['task_title'] + ". \n\n" + text_dict['comment_text']
     else:
         raw_text = text_dict['comment_text']
@@ -122,7 +122,7 @@ with open("/home/nws8519/git/mw-lifecycle-analysis/p2/quest/072525_pp_biberplus_
     array_of_categorizations.append(text_dict)
 df = pd.DataFrame(array_of_categorizations)
 #print(df.head())
-df.to_csv('all_091625_olmo_batched_categorized.csv', index=False)
+df.to_csv('all_092225_olmo_batched_categorized.csv', index=False)
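The ". \n\n" change matters for sentence splitting: without terminal punctuation, NLTK's Punkt tokenizer tends to glue the task title onto the first sentence of the comment body. A small sketch of the effect, using a made-up title and body:

    import nltk
    nltk.download("punkt_tab", quiet=True)
    from nltk.tokenize import sent_tokenize

    title = "Fix login redirect"
    body = "The redirect loops forever. It started after the update."

    # no period after the title: Punkt typically merges it into the
    # first body sentence
    print(sent_tokenize(title + "\n\n" + body))

    # with the appended period, the title comes back as its own sentence
    print(sent_tokenize(title + ". \n\n" + body))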

@@ -74,7 +74,7 @@ with open("/home/nws8519/git/mw-lifecycle-analysis/p2/quest/072525_pp_biberplus_
     text_dict['TaskPHID'] = row[11]
     #making sure the comment title is included in things
     if text_dict['comment_type'] == "task_description":
-        raw_text = text_dict['task_title'] + "\n\n" + text_dict['comment_text']
+        raw_text = text_dict['task_title'] + ". \n\n" + text_dict['comment_text']
     else:
         raw_text = text_dict['comment_text']
@@ -127,9 +127,12 @@ with open("/home/nws8519/git/mw-lifecycle-analysis/p2/quest/072525_pp_biberplus_
 #taking a random sample of task discussions
 unique_tasks = df['TaskPHID'].unique()
-sampled_tasks = pd.Series(unique_tasks).sample(n=25, random_state=8)
+sampled_tasks = pd.Series(unique_tasks).sample(n=15, random_state=8)
 random_df = df[df['TaskPHID'].isin(sampled_tasks)]
-random_df.to_csv('091625_human_conversation_sample.csv', index=False)
+random_df = random_df.copy()
+random_df['focal_sentence'] = random_df['cleaned_sentences']
+exploded_df = random_df.explode('focal_sentence')
+exploded_df.to_csv('092225_human_conversation_sample.csv', index=False)
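The new tail of this script samples whole conversations by TaskPHID, then fans each row out to one row per sentence with DataFrame.explode. A compact sketch of the same pattern; column names follow the script, the data is invented:

    import pandas as pd

    df = pd.DataFrame({
        "TaskPHID": ["T1", "T1", "T2"],
        "cleaned_sentences": [["s1", "s2"], ["s3"], ["s4", "s5"]],
    })

    # draw task IDs first so entire conversations stay together
    sampled_tasks = pd.Series(df["TaskPHID"].unique()).sample(n=1, random_state=8)
    random_df = df[df["TaskPHID"].isin(sampled_tasks)].copy()

    # explode: one output row per list element, other columns repeated
    random_df["focal_sentence"] = random_df["cleaned_sentences"]
    exploded_df = random_df.explode("focal_sentence")
    print(exploded_df[["TaskPHID", "focal_sentence"]])

The .copy() before assigning focal_sentence avoids pandas' SettingWithCopyWarning on the filtered frame, which is presumably why the commit adds it.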

@@ -0,0 +1,10 @@
+setting up the environment by loading in conda environment at Mon Sep 22 20:07:58 CDT 2025
+running the sampling job at Mon Sep 22 20:07:58 CDT 2025
+[nltk_data] Downloading package punkt_tab to
+[nltk_data]     /home/nws8519/nltk_data...
+[nltk_data]   Package punkt_tab is already up-to-date!
+cuda
+NVIDIA A100-SXM4-80GB
+_CudaDeviceProperties(name='NVIDIA A100-SXM4-80GB', major=8, minor=0, total_memory=81153MB, multi_processor_count=108, uuid=290bcddd-9b2f-3a5b-4cbd-b17d9ec05044, L2_cache_size=40MB)
+Loading checkpoint shards: 100%|██████████| 12/12 [00:06<00:00, 1.82it/s]
+sampling pau at Mon Sep 22 20:11:48 CDT 2025