trying to sample the human label rows again
This commit is contained in:
parent: bcfa688e11
commit: b4f0c8f885
3996 p2/quest/092225_human_conversation_sample.csv (new file)
File diff suppressed because it is too large
@@ -1,15 +1,10 @@
-setting up the environment by loading in conda environment at Sun Sep 14 11:31:17 CDT 2025
-running the batched olmo categorization job at Sun Sep 14 11:31:17 CDT 2025
+setting up the environment by loading in conda environment at Mon Sep 22 20:07:56 CDT 2025
+running the batched olmo categorization job at Mon Sep 22 20:07:57 CDT 2025
 [nltk_data] Downloading package punkt_tab to
 [nltk_data]     /home/nws8519/nltk_data...
 [nltk_data]   Package punkt_tab is already up-to-date!
 cuda
 NVIDIA A100-SXM4-80GB
-_CudaDeviceProperties(name='NVIDIA A100-SXM4-80GB', major=8, minor=0, total_memory=81153MB, multi_processor_count=108, uuid=02d36cc8-5c30-554c-c2e0-f5fb530dcc7a, L2_cache_size=40MB)
-Loading checkpoint shards: 100%|██████████| 12/12 [00:05<00:00,  2.02it/s]
-Traceback (most recent call last):
-  File "/home/nws8519/git/mw-lifecycle-analysis/p2/quest/python_scripts/label_sampling.py", line 122, in <module>
-    random_df = df.sample(n=300, random_sample = 8)
-                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-TypeError: NDFrame.sample() got an unexpected keyword argument 'random_sample'
-unsupervised batched olmo categorization pau at Sun Sep 14 11:34:04 CDT 2025
+_CudaDeviceProperties(name='NVIDIA A100-SXM4-80GB', major=8, minor=0, total_memory=81153MB, multi_processor_count=108, uuid=fb10e36f-fd51-a123-6ae4-e318d24dbb3c, L2_cache_size=40MB)
+Loading checkpoint shards: 100%|██████████| 12/12 [00:06<00:00,  1.82it/s]
+Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
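The `TypeError` in the old log comes from a typo in the keyword: pandas' `DataFrame.sample` takes `random_state` (the seed), not `random_sample`. A minimal sketch of the corrected call, using a toy dataframe in place of the script's real one:

```python
import pandas as pd

# Toy stand-in for the script's dataframe of labeled comment rows.
df = pd.DataFrame({"comment_text": [f"comment {i}" for i in range(500)]})

# `random_state` (not `random_sample`) seeds the sampler, so the same
# 300 rows come back on every run.
random_df = df.sample(n=300, random_state=8)
```

Because the seed is fixed, rerunning the sampling step reproduces the same sample, which matters when the output CSV is handed off for human labeling.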
@@ -72,7 +72,7 @@ with open("/home/nws8519/git/mw-lifecycle-analysis/p2/quest/072525_pp_biberplus_
         text_dict['comment_text'] = row[2]
         text_dict['comment_type'] = row[12]
         if text_dict['comment_type'] == "task_description":
-            raw_text = text_dict['task_title'] + "\n\n" + text_dict['comment_text']
+            raw_text = text_dict['task_title'] + ". \n\n" + text_dict['comment_text']
         else:
             raw_text = text_dict['comment_text']
 
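The one-line change above swaps the plain `"\n\n"` joiner for `". \n\n"`, so the task title ends with a sentence terminator before the comment body is appended; without the period, a sentence tokenizer tends to fuse the title into the first sentence of the comment. A toy illustration (the title and comment strings are made up):

```python
# Hypothetical title/comment pair showing the new separator.
task_title = "Fix login redirect"
comment_text = "The redirect loops on mobile."

# Period + blank line marks the title as its own sentence.
raw_text = task_title + ". \n\n" + comment_text
```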
@@ -122,7 +122,7 @@ with open("/home/nws8519/git/mw-lifecycle-analysis/p2/quest/072525_pp_biberplus_
     array_of_categorizations.append(text_dict)
 df = pd.DataFrame(array_of_categorizations)
 #print(df.head())
-df.to_csv('all_091625_olmo_batched_categorized.csv', index=False)
+df.to_csv('all_092225_olmo_batched_categorized.csv', index=False)
 
 
 
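The output step accumulates one dict per categorized comment, builds a dataframe, and writes a date-stamped CSV. A self-contained sketch, with field names assumed from the surrounding diff:

```python
import pandas as pd

# One dict per categorized comment (keys are assumptions based on the diff).
array_of_categorizations = [
    {"TaskPHID": "T1", "comment_text": "first comment", "category": "info"},
    {"TaskPHID": "T2", "comment_text": "second comment", "category": "other"},
]

# pd.DataFrame infers columns from the dict keys; index=False keeps the
# row index out of the CSV.
df = pd.DataFrame(array_of_categorizations)
df.to_csv("all_092225_olmo_batched_categorized.csv", index=False)
```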
@@ -74,7 +74,7 @@ with open("/home/nws8519/git/mw-lifecycle-analysis/p2/quest/072525_pp_biberplus_
         text_dict['TaskPHID'] = row[11]
         #making sure the comment title is included in things
         if text_dict['comment_type'] == "task_description":
-            raw_text = text_dict['task_title'] + "\n\n" + text_dict['comment_text']
+            raw_text = text_dict['task_title'] + ". \n\n" + text_dict['comment_text']
         else:
             raw_text = text_dict['comment_text']
 
@@ -127,9 +127,12 @@ with open("/home/nws8519/git/mw-lifecycle-analysis/p2/quest/072525_pp_biberplus_
 
 #taking a random sample of 50 task discussions
 unique_tasks = df['TaskPHID'].unique()
-sampled_tasks = pd.Series(unique_tasks).sample(n=25, random_state=8)
+sampled_tasks = pd.Series(unique_tasks).sample(n=15, random_state=8)
 random_df = df[df['TaskPHID'].isin(sampled_tasks)]
-random_df.to_csv('091625_human_conversation_sample.csv', index=False)
+random_df = random_df.copy()
+random_df['focal_sentence'] = random_df['cleaned_sentences']
+exploded_df = random_df.explode('focal_sentence')
+exploded_df.to_csv('092225_human_conversation_sample.csv', index=False)
 
 
 
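The new sampling path draws whole task discussions (unique `TaskPHID`s) rather than individual rows, then explodes each comment's sentence list into one row per focal sentence. A runnable sketch with a toy dataframe (the column contents are assumptions; the real `cleaned_sentences` column holds tokenized sentences per comment):

```python
import pandas as pd

# Toy dataframe: one row per comment, each with a list of cleaned sentences.
df = pd.DataFrame({
    "TaskPHID": ["T1", "T1", "T2", "T3"],
    "cleaned_sentences": [["a", "b"], ["c"], ["d", "e"], ["f"]],
})

# Sample at the task level so every sampled discussion stays intact.
unique_tasks = df["TaskPHID"].unique()
sampled_tasks = pd.Series(unique_tasks).sample(n=2, random_state=8)

# .copy() avoids SettingWithCopyWarning when adding a column to the slice.
random_df = df[df["TaskPHID"].isin(sampled_tasks)].copy()

# Duplicate the list column, then explode it: one output row per sentence,
# with the other columns repeated for each sentence of the comment.
random_df["focal_sentence"] = random_df["cleaned_sentences"]
exploded_df = random_df.explode("focal_sentence")
```

`explode` is what turns the per-comment sample into the per-sentence CSV that human labelers see, while `TaskPHID` still groups sentences back into their discussion.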
10 p2/quest/sampling-mw-olmo-info-cat.log (new file)
@@ -0,0 +1,10 @@
+setting up the environment by loading in conda environment at Mon Sep 22 20:07:58 CDT 2025
+running the sampling job at Mon Sep 22 20:07:58 CDT 2025
+[nltk_data] Downloading package punkt_tab to
+[nltk_data]     /home/nws8519/nltk_data...
+[nltk_data]   Package punkt_tab is already up-to-date!
+cuda
+NVIDIA A100-SXM4-80GB
+_CudaDeviceProperties(name='NVIDIA A100-SXM4-80GB', major=8, minor=0, total_memory=81153MB, multi_processor_count=108, uuid=290bcddd-9b2f-3a5b-4cbd-b17d9ec05044, L2_cache_size=40MB)
+Loading checkpoint shards: 100%|██████████| 12/12 [00:06<00:00,  1.82it/s]
+sampling pau at Mon Sep 22 20:11:48 CDT 2025