
trying to run olmo cat distributed, also running kernelPCA.

mgaughan 2025-09-04 09:35:41 -05:00
parent a36226eab9
commit a3c1a48dc7
9 changed files with 450 additions and 19 deletions

Binary file not shown.

After

Width:  |  Height:  |  Size: 1.4 MiB

Binary file not shown.

Before

Width:  |  Height:  |  Size: 434 KiB

View File

@ -1,8 +0,0 @@
setting up the environment by loading in conda environment at Fri Jul 25 21:20:22 CDT 2025
running the bertopic job at Fri Jul 25 21:20:23 CDT 2025
cuda
NVIDIA A100-SXM4-80GB
_CudaDeviceProperties(name='NVIDIA A100-SXM4-80GB', major=8, minor=0, total_memory=81153MB, multi_processor_count=108, uuid=6e26de77-c067-13c4-e9e0-8200eb5a348f, L2_cache_size=40MB)
Loading checkpoint shards: 0%| | 0/12 [00:00<?, ?it/s] Loading checkpoint shards: 8%|▊ | 1/12 [00:00<00:03, 2.82it/s] Loading checkpoint shards: 17%|█▋ | 2/12 [00:00<00:04, 2.13it/s] Loading checkpoint shards: 25%|██▌ | 3/12 [00:01<00:04, 1.96it/s] Loading checkpoint shards: 33%|███▎ | 4/12 [00:02<00:04, 1.86it/s] Loading checkpoint shards: 42%|████▏ | 5/12 [00:02<00:03, 1.86it/s] Loading checkpoint shards: 50%|█████ | 6/12 [00:03<00:03, 1.76it/s] Loading checkpoint shards: 58%|█████▊ | 7/12 [00:03<00:02, 1.74it/s] Loading checkpoint shards: 67%|██████▋ | 8/12 [00:04<00:02, 1.68it/s] Loading checkpoint shards: 75%|███████▌ | 9/12 [00:04<00:01, 1.71it/s] Loading checkpoint shards: 83%|████████▎ | 10/12 [00:05<00:01, 1.73it/s] Loading checkpoint shards: 92%|█████████▏| 11/12 [00:06<00:00, 1.83it/s] Loading checkpoint shards: 100%|██████████| 12/12 [00:06<00:00, 1.98it/s]
This is a friendly reminder - the current text generation call will exceed the model's predefined maximum length (4096). Depending on the model, you may observe exceptions, performance degradation, or nothing at all.
unsupervised olmo categorization pau at Sat Jul 26 12:23:56 CDT 2025

View File

@ -1,5 +1,5 @@
starting the job at: Tue Sep 2 16:02:22 CDT 2025
starting the job at: Wed Sep 3 18:53:34 CDT 2025
setting up the environment
running the neurobiber labeling script
job finished, cleaning up
job pau at: Tue Sep 2 16:02:32 CDT 2025
job pau at: Wed Sep 3 18:53:58 CDT 2025

View File

@ -0,0 +1,167 @@
setting up the environment by loading in conda environment at Wed Sep 3 19:04:03 CDT 2025
running the bertopic job at Wed Sep 3 19:04:03 CDT 2025
----------------------------------------
srun job start: Wed Sep 3 19:04:03 CDT 2025
Job ID: 3220869
Username: nws8519
Queue: gengpu
Account: p32852
----------------------------------------
The following variables are not
guaranteed to be the same in the
prologue and the job run script
----------------------------------------
PATH (in prologue) : /home/nws8519/.conda/envs/olmo/bin:/software/miniconda3/4.12.0/condabin:/home/nws8519/.local/bin:/home/nws8519/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/usr/lpp/mmfs/bin:/hpc/usertools
WORKDIR is: /home/nws8519
----------------------------------------
/home/nws8519/.conda/envs/olmo/bin/python3.11: can't open file '/gpfs/home/nws8519/git/mw-lifecycle-analysis/p2/quest/nnodes': [Errno 2] No such file or directory
/home/nws8519/.conda/envs/olmo/bin/python3.11: can't open file '/gpfs/home/nws8519/git/mw-lifecycle-analysis/p2/quest/nnodes': [Errno 2] No such file or directory
Traceback (most recent call last):
File "/home/nws8519/.conda/envs/olmo/bin/torchrun", line 8, in <module>
sys.exit(main())
^^^^^^
File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
Traceback (most recent call last):
File "/home/nws8519/.conda/envs/olmo/bin/torchrun", line 8, in <module>
sys.exit(main())
^^^^^^
File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
return f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/run.py", line 892, in main
return f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/run.py", line 892, in main
run(args)
File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/run.py", line 883, in run
run(args)
File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/run.py", line 883, in run
elastic_launch(
File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 139, in __call__
elastic_launch(
File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 139, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
return launch_agent(self._config, self._entrypoint, list(args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 261, in launch_agent
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 261, in launch_agent
result = agent.run()
result = agent.run()
^^^^^^^^^^^
File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/elastic/metrics/api.py", line 138, in wrapper
^^^^^^^^^^^
File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/elastic/metrics/api.py", line 138, in wrapper
result = f(*args, **kwargs)
result = f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/elastic/agent/server/api.py", line 711, in run
^^^^^^^^^^^^^^^^^^
File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/elastic/agent/server/api.py", line 711, in run
result = self._invoke_run(role)
result = self._invoke_run(role)
^^^^^^^^^^^^^^^^^^^^^^
File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/elastic/agent/server/api.py", line 864, in _invoke_run
^^^^^^^^^^^^^^^^^^^^^^
File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/elastic/agent/server/api.py", line 864, in _invoke_run
self._initialize_workers(self._worker_group)
self._initialize_workers(self._worker_group)
File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/elastic/metrics/api.py", line 138, in wrapper
File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/elastic/metrics/api.py", line 138, in wrapper
result = f(*args, **kwargs)
result = f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/elastic/agent/server/api.py", line 683, in _initialize_workers
^^^^^^^^^^^^^^^^^^
File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/elastic/agent/server/api.py", line 683, in _initialize_workers
self._rendezvous(worker_group)
self._rendezvous(worker_group)
File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/elastic/metrics/api.py", line 138, in wrapper
File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/elastic/metrics/api.py", line 138, in wrapper
result = f(*args, **kwargs)
result = f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/elastic/agent/server/api.py", line 500, in _rendezvous
^^^^^^^^^^^^^^^^^^
File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/elastic/agent/server/api.py", line 500, in _rendezvous
rdzv_info = spec.rdzv_handler.next_rendezvous()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/elastic/rendezvous/static_tcp_rendezvous.py", line 67, in next_rendezvous
self._store = TCPStore( # type: ignore[call-arg]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch.distributed.DistNetworkError: The server socket has failed to listen on any local network address. port: 29500, useIpv6: false, code: -98, name: EADDRINUSE, message: address already in use
rdzv_info = spec.rdzv_handler.next_rendezvous()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/elastic/rendezvous/static_tcp_rendezvous.py", line 67, in next_rendezvous
self._store = TCPStore( # type: ignore[call-arg]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch.distributed.DistNetworkError: The server socket has failed to listen on any local network address. port: 29500, useIpv6: false, code: -98, name: EADDRINUSE, message: address already in use
E0903 19:04:19.236000 1488504 /gpfs/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/api.py:874] failed (exitcode: 2) local_rank: 0 (pid: 1488524) of binary: /home/nws8519/.conda/envs/olmo/bin/python3.11
E0903 19:04:19.236000 2554912 /gpfs/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/api.py:874] failed (exitcode: 2) local_rank: 0 (pid: 2554950) of binary: /home/nws8519/.conda/envs/olmo/bin/python3.11
Traceback (most recent call last):
File "/home/nws8519/.conda/envs/olmo/bin/torchrun", line 8, in <module>
sys.exit(main())
^^^^^^
File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
Traceback (most recent call last):
File "/home/nws8519/.conda/envs/olmo/bin/torchrun", line 8, in <module>
return f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/run.py", line 892, in main
sys.exit(main())
^^^^^^
File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
return f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/run.py", line 892, in main
run(args)
File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/run.py", line 883, in run
run(args)
File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/run.py", line 883, in run
elastic_launch(
File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 139, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 270, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
nnodes FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2025-09-03_19:04:19
host : qgpu2013
rank : 0 (local_rank: 0)
exitcode : 2 (pid: 1488524)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
elastic_launch(
File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 139, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/nws8519/.conda/envs/olmo/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 270, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
nnodes FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2025-09-03_19:04:19
host : qgpu2014
rank : 0 (local_rank: 0)
exitcode : 2 (pid: 2554950)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
srun: error: qgpu2013: tasks 0-1: Exited with exit code 1
srun: error: qgpu2014: tasks 2-3: Exited with exit code 1
unsupervised olmo categorization pau at Wed Sep 3 19:04:19 CDT 2025
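Note on the log above: `EADDRINUSE` on port 29500 means more than one launcher on the same node tried to bind the same rendezvous port. A minimal sketch, using only the Python standard library, of how one might check a port or ask the OS for an unused one before launching. `find_free_port` and `port_in_use` are hypothetical helpers for illustration, not part of torchrun or Slurm.

```python
import socket


def find_free_port() -> int:
    """Ask the OS for a currently unused TCP port (hypothetical helper)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        # Binding to port 0 lets the kernel choose any free port.
        s.bind(("", 0))
        return s.getsockname()[1]


def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if binding host:port fails, i.e. something already holds it."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((host, port))
        except OSError:
            return True
        return False


if __name__ == "__main__":
    print("29500 in use:", port_in_use(29500))
    print("candidate rendezvous port:", find_free_port())
```

The port returned by `find_free_port` is only known to be free at the moment of the call; another process can still take it before the launcher binds, so this narrows the race rather than removing it.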

View File

@ -3,6 +3,9 @@ import torch
import csv
import pandas as pd
import re
import concurrent.futures
import nltk
nltk.download('punkt')
cache_directory = "/projects/p32852/cache/"
@ -17,7 +20,7 @@ olmo = AutoModelForCausalLM.from_pretrained("allenai/OLMo-2-1124-13B", cache_dir
tokenizer = AutoTokenizer.from_pretrained("allenai/OLMo-2-1124-13B")
#TODO: text_preprocessing per https://arxiv.org/pdf/1902.07093
#preprocessing per https://arxiv.org/pdf/1902.07093
priming = "For the **GIVEN SENTENCE**, please categorize it into one of the defined [[CATEGORIES]]. Each [[CATEGORY]] is described in the TYPOLOGY for reference. Your task is to match the **GIVEN SENTENCE** to the **[[CATEGORY]]** that most accurately describes the content of the comment. Only provide the category as your output. Do not provide any text beyond the category name."
#the typology descriptions are taken straight from https://arxiv.org/pdf/1902.07093
@ -60,6 +63,60 @@ TYPOLOGY:
#instructions="Only respond with the GIVEN COMMENT's [[CATEGORY]] classification. Do not provide any more information."
instructions="The sentence's category is: "
def preprocess_comment(raw_text):
    # 1. replace code with CODE (block code first, so the inline pattern cannot consume part of a fence)
    comment_text = re.sub(r'```[\s\S]+?```', 'CODE', raw_text) # Block code
    comment_text = re.sub(r'`[^`]+`', 'CODE', comment_text) # Inline code
    # 2. replace quotes with QUOTE
    lines = comment_text.split('\n')
    lines = ['QUOTE' if line.strip().startswith('>') else line for line in lines]
    comment_text = '\n'.join(lines)
    # 3. replace Gerrit URLs with GERRIT_URL
    gerrit_url_pattern = r'https://gerrit\.wikimedia\.org/r/\d+'
    comment_text = re.sub(gerrit_url_pattern, 'GERRIT_URL', comment_text)
    # replace any remaining URL with URL
    url_pattern = r'https?://[^\s]+'
    comment_text = re.sub(url_pattern, 'URL', comment_text)
    # 4. if possible, replace @ with SCREEN_NAME (keeping the whitespace before the mention)
    cleaned_text = re.sub(r'(^|\s)@\w+', r'\1SCREEN_NAME', comment_text)
    return cleaned_text
def categorize_sentences(sentences, comment_type):
    results = []
    batch_size = 4
    for i in range(0, len(sentences), batch_size):
        batch = sentences[i:i+batch_size]
        prompts = []
        for sent in batch:
            given_data = f"**GIVEN SENTENCE: \n ' Type -{comment_type} \n Text -{sent}**'\n"
            prompt = f"{priming}\n{typology}\n\n{given_data}\n{instructions}"
            prompts.append(prompt)
        # padding=True lets prompts of different lengths share one tensor (assumes the tokenizer defines a pad token)
        inputs = tokenizer(prompts, return_tensors='pt', padding=True, return_token_type_ids=False).to(device)
        with torch.no_grad():
            outputs = olmo.generate(**inputs, max_new_tokens=256, do_sample=False)
        # decode every generated sequence in the batch
        decoded = tokenizer.batch_decode(outputs, skip_special_tokens=True)
        for response_txt in decoded:
            match = re.search(r"The sentence's category is: \s*(.*)", response_txt)
            if match:
                category = match.group(1).strip("[]*")
            else:
                category = "NO CATEGORY"
            results.append(category)
    return results
def split_comment(cleaned_comment):
return nltk.sent_tokenize(cleaned_comment)
if __name__ == "__main__":
#loading and preprocessing data
df = pd.read_csv("/home/nws8519/git/mw-lifecycle-analysis/p2/quest/072325_biberplus_labels.csv")
df['cleaned_comment_text'] = df['comment_text'].apply(preprocess_comment)
df['comment_sentences'] = df['cleaned_comment_text'].apply(split_comment)
#running the classification task
with concurrent.futures.ThreadPoolExecutor(max_workers=1) as executor:
'''
with open("/home/nws8519/git/mw-lifecycle-analysis/p2/quest/072325_biberplus_labels.csv", mode='r', newline='') as file:
reader = csv.reader(file)
array_of_categorizations = []
@ -76,7 +133,7 @@ with open("/home/nws8519/git/mw-lifecycle-analysis/p2/quest/072325_biberplus_lab
text_dict['comment_type'] = row[12]
raw_text = text_dict['comment_text']
#print(raw_text)
#print(raw_text)i
# comment_text preprocessing per https://arxiv.org/pdf/1902.07093
# 1. replace code with CODE
comment_text = re.sub(r'`[^`]+`', 'CODE', raw_text) # Inline code
@ -123,17 +180,17 @@ with open("/home/nws8519/git/mw-lifecycle-analysis/p2/quest/072325_biberplus_lab
# TODO: collate olmo categories back together into an ordered list
# TODO: add the list of sentence-level olmo categories into dictionary
text_dict['olmo_category'] = following_text
'''
for item in result.strip(";").split(";"):
key_value = item.strip().split('. ')
if len(key_value) == 2:
key = key_value[0]
value = key_value[1]
cite_dict[key] = value
'''
array_of_categorizations.append(text_dict)
#CSV everything
df = pd.DataFrame(array_of_categorizations)
#df.to_csv('072525_olmo_messages_categorized.csv', index=False)
'''
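Note on the code-masking step in `preprocess_comment`: the order of the two substitutions changes the result, because the inline pattern can match the innermost backticks of a fenced block. A minimal sketch using the same two regular expressions as the script; `mask_code` and `mask_code_inline_first` are hypothetical names used only for this comparison.

```python
import re

BLOCK = r'```[\s\S]+?```'   # fenced block code
INLINE = r'`[^`]+`'         # inline code


def mask_code(text: str) -> str:
    """Mask fenced blocks first, then inline code."""
    text = re.sub(BLOCK, 'CODE', text)
    return re.sub(INLINE, 'CODE', text)


def mask_code_inline_first(text: str) -> str:
    """Same patterns in the opposite order, for comparison."""
    text = re.sub(INLINE, 'CODE', text)
    return re.sub(BLOCK, 'CODE', text)


sample = "Run `make test` then:\n```\nprint('hi')\n```"
print(repr(mask_code(sample)))
print(repr(mask_code_inline_first(sample)))
```

With the inline pattern applied first, the fence is left behind as stray backticks around `CODE`, and the block pattern no longer finds anything to match.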

View File

@ -1,4 +1,4 @@
from sklearn.decomposition import PCA
from sklearn.decomposition import PCA, KernelPCA
from sklearn.preprocessing import LabelEncoder
import pandas as pd
#import torch
@ -20,7 +20,7 @@ if __name__ == "__main__":
biber_vec_df = biber_vec_df[biber_vec_df['comment_type'] == 'task_description']
biber_vecs = format_df_data(biber_vec_df)
#handoff to PCA model
pca = PCA(2)
pca = KernelPCA(n_components=2, kernel="rbf")
biber_vecs_pca = pca.fit_transform(biber_vecs)
#first looking at comment_type
@ -33,7 +33,7 @@ if __name__ == "__main__":
plt.ylabel('component 2')
plt.colorbar()
plt.savefig("090225_biber_pca_plot.png", dpi=300)
#plt.savefig("090225_biber_pca_plot.png", dpi=300)
plot_df = pd.DataFrame({
"PC1": biber_vecs_pca[:, 0],
@ -49,5 +49,5 @@ if __name__ == "__main__":
plt.ylabel('component 2')
plt.legend(title='AuthorWMFAffil', bbox_to_anchor=(1.05, 1), loc=2)
plt.tight_layout()
plt.savefig("biber_pca_affil.png", dpi=300)
plt.savefig("biber_kernelpca_affil.png", dpi=300)
plt.show()
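Note on the change from `PCA(2)` to `KernelPCA(n_components=2, kernel="rbf")`: the projection is now derived from pairwise RBF similarities between the Biber vectors rather than from their linear covariance. A minimal pure-Python sketch of that kernel matrix, assuming scikit-learn's documented default of `gamma = 1 / n_features` when `gamma` is not passed. `rbf_kernel_matrix` is a hypothetical helper for illustration, not the library's implementation.

```python
import math


def rbf_kernel_matrix(rows, gamma=None):
    """Return K with K[i][j] = exp(-gamma * ||rows[i] - rows[j]||^2)."""
    n_features = len(rows[0])
    if gamma is None:
        # Assumed default, matching what KernelPCA documents for gamma=None.
        gamma = 1.0 / n_features

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [[math.exp(-gamma * sq_dist(a, b)) for b in rows] for a in rows]


# Toy vectors standing in for rows of biber_vecs.
vecs = [[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]]
K = rbf_kernel_matrix(vecs)
for row in K:
    print(["%.4f" % v for v in row])
```

Because the kernel depends on raw Euclidean distances, features on larger scales dominate the similarities; standardizing the vectors or tuning `gamma` may be worth trying before reading much into the plot.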

View File

@ -0,0 +1,182 @@
import torch
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader
#from datautils import MyTrainDataset  # not used in this script
import torch.multiprocessing as mp
import torch.distributed as dist
from torch.utils.data.distributed import DistributedSampler
from torch.nn.parallel import DistributedDataParallel as DDP
import os
from transformers import AutoModelForCausalLM, AutoTokenizer, OlmoForCausalLM
import csv
import pandas as pd
import re
import nltk
# ----------------- prompts for LLM
priming = "For the **GIVEN SENTENCE**, please categorize it into one of the defined [[CATEGORIES]]. Each [[CATEGORY]] is described in the TYPOLOGY for reference. Your task is to match the **GIVEN SENTENCE** to the **[[CATEGORY]]** that most accurately describes the content of the comment. Only provide the category as your output. Do not provide any text beyond the category name."
#the typology descriptions are taken straight from https://arxiv.org/pdf/1902.07093
typology = """
TYPOLOGY:
[[EXPECTED BEHAVIOR]], in which stakeholders discuss, from the users perspective, the expected or ideal situation affected by the issue. For example, a participant commented: My suggestion/request in the near term would be to have an option to make the vocabulary read only so that users who want to be able to leave spacy alone to do streaming data processing dont need to worry about changing memory requirements.
[[MOTIVATION]], in which stakeholders elaborate on why the issue needs to be fixed or a feature needs to be added. For example, in support of redesigning TensorFlow's input pipeline one participant wrote: “Right now, this method starves my GPU all the time, which is a shame because most other [deep learning] frameworks manage to make this much more performantly.”
[[OBSERVED BUG BEHAVIOR]], which only appears in bug reports and focuses on describing the observed behaviour of the bug. For example, one participant commented: I found strange behavior using the pipe() method, then started to describe this behavior.
[[BUG REPRODUCTION]], which also only appears in bug reports and focuses on any report, request, and/or question regarding the reproduction of the bug. For example, one participant commented that a bug was reproducible: Same problem here, working on Windows 10 with German text.
[[INVESTIGATION AND EXPLORATION]], in which OSS stakeholders discuss their exploration of ideas about the problem that was thought to have caused the issue. For example, This result confirms my hypothesis but also shows that the memory increase really isnt all that significant... But it still points to a potential flaw in the design of the library.
[[SOLUTION DISCUSSION]] is framed around the solution space from the developers point of view, in which participants discuss design ideas and implementation details, as well as suggestions, constraints, challenges, and useful references around such topics. For example, I know there are multiple ways of approaching this however I strongly recommend node-gyp for performance.
[[CONTRIBUTION AND COMMITMENT]], in which participants call for contributors and/or voice willingness or unwillingness to contribute to resolving the issue. For example, one potential collaborator said: I will gladly contribute in any way I can, however, this is something I will not be able to do alone. Would be best if a few other people is interested as well...
[[TASK PROGRESS]], in which stakeholders request or report progress of tasks and sub-tasks towards the solution of the issue. For example, I made an initial stab at it... - this is just a proof of concept that gets the version string into nodejs. Ill start working on adding the swig interfaces...
[[TESTING]], in which participants discuss the testing procedure and results, as well as the system environment, code, data, and feedback involved in testing. For example, Tested on 0.101 and master - the issue seems to be fixed on master not just for the example document, but for the entire corpus...
[[FUTURE PLAN]], in which participants discuss the long-term plan related to the issue; such plans usually involve work/ideas that are not required to close the current issue. For example, For the futures, stay tuned, as were prototyping something in this direction.
[[POTENTIAL NEW ISSUES AND REQUESTS]], in which participants identify and discuss new bugs or needed features while investigating and addressing the current issue. For example, when discussing a bug in scikit-learn about parallel execution that causes process hanging, one participant said: As a side point, I note there seems to be a lot more joblib parallelisation overhead in master... that wasnt there in 0.14.
[[SOLUTION USAGE]] was usually discussed once a full or partial solution of the issue was released and stakeholders asked questions or provided suggestions about how to use the library with the new solution update. For example, Please help me how to continue training the model [with the new release].
[[WORKAROUNDS]] focus on discussions about temporary or alternative solutions that can help overcome the issue until the official fix or enhancement is released. For example, in a discussion regarding memory growth for streamed data, one participant expressed his temporary solution: For now workaround with reloading / collecting nlp object works quite ok in production.
[[ISSUE CONTENT MANAGEMENT]] focuses on redirecting the discussions and controlling the quality of the comments with respect to the issue. For example, We might want to move this discussion to here: [link to another issue]
[[ACTION ON ISSUE]], in which participants comment on the proper actions to perform on the issue itself. For example, Im going to close this issue because its old and most of the information here is now out of date.
[[SOCIAL CONVERSATION]], in which participants express emotions such as appreciation, disappointment, annoyance, regret, etc. or engage in small talk. For example, Im so glad that this has received so much thought and attention!
"""
instructions="The sentence's category is: "
# ----------------- distributed setup
def setup_ddp():
rank = int(os.environ['RANK'])
world_size = int(os.environ['WORLD_SIZE'])
local_rank = int(os.environ['LOCAL_RANK'])
dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)
torch.cuda.set_device(local_rank)
return rank, world_size, local_rank
#cleanup is dist.destroy_process_group()
# ----------------- distributed data set
class SentenceDataset(Dataset):
    def __init__(self, comments, comment_types, priming, typology, instructions):
        self.samples = []
        for idx, comment in enumerate(comments):
            # type label of the comment these sentences come from
            comment_type = comment_types[idx]
            cleaned_comment = preprocess_comment(comment)
            sentences = split_to_sentences(cleaned_comment)
            for sentence in sentences:
                given_data = f"**GIVEN SENTENCE: \n ' Type -{comment_type} \n Text -{sentence}**'\n"
                prompt = f"{priming}\n{typology}\n\n{given_data}\n{instructions}"
                self.samples.append((idx, sentence, prompt))
    def __len__(self):
        return len(self.samples)
    def __getitem__(self, idx):
        return self.samples[idx]
# ----------------- data handling functions
def preprocess_comment(raw_text):
    # 1. replace code with CODE (block code first, so the inline pattern cannot consume part of a fence)
    comment_text = re.sub(r'```[\s\S]+?```', 'CODE', raw_text) # Block code
    comment_text = re.sub(r'`[^`]+`', 'CODE', comment_text) # Inline code
    # 2. replace quotes with QUOTE
    lines = comment_text.split('\n')
    lines = ['QUOTE' if line.strip().startswith('>') else line for line in lines]
    comment_text = '\n'.join(lines)
    # 3. replace Gerrit URLs with GERRIT_URL
    gerrit_url_pattern = r'https://gerrit\.wikimedia\.org/r/\d+'
    comment_text = re.sub(gerrit_url_pattern, 'GERRIT_URL', comment_text)
    # replace any remaining URL with URL
    url_pattern = r'https?://[^\s]+'
    comment_text = re.sub(url_pattern, 'URL', comment_text)
    # 4. if possible, replace @ with SCREEN_NAME (keeping the whitespace before the mention)
    cleaned_text = re.sub(r'(^|\s)@\w+', r'\1SCREEN_NAME', comment_text)
    return cleaned_text
def split_to_sentences(text):
return nltk.sent_tokenize(text)
# ----------------- distributed inference
def main():
# https://github.com/nuitrcs/examplejobs/blob/master/python/pytorch_ddp/multinode_torchrun.py
#prep ddp setting
rank, world_size, local_rank = setup_ddp()
device = torch.device(f"cuda:{local_rank}")
#load in data
df = pd.read_csv("/home/nws8519/git/mw-lifecycle-analysis/p2/quest/072525_pp_biberplus_labels.csv")
# TODO comment out below
df = df.iloc[:5].copy()
comment_texts = df['comment_text'].tolist()
comment_types = df['comment_type'].tolist()
dataset = SentenceDataset(comment_texts, comment_types, priming, typology, instructions)
#split data up across processes
batch_size = 4
sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank, shuffle=False)
dataloader = DataLoader(dataset, batch_size=batch_size, sampler=sampler)
#load model and wrap in DDP
cache_directory="/projects/p32852/cache/"
olmo = AutoModelForCausalLM.from_pretrained("allenai/OLMo-2-1124-13B", cache_dir=cache_directory).to(device)
tokenizer = AutoTokenizer.from_pretrained("allenai/OLMo-2-1124-13B", cache_dir=cache_directory)
ddp_olmo = DDP(olmo, device_ids=[local_rank])
#prepare to collect results as dictionary
results = dict()
with torch.no_grad():
for batch in dataloader:
comment_idxs, sentences, prompts = batch
# categorize the batch
inputs = tokenizer(list(prompts), return_tensors='pt', padding=True, return_token_type_ids=False).to(device)
outputs = ddp_olmo.module.generate(**inputs, max_new_tokens=256, do_sample=False)
decoded = tokenizer.batch_decode(outputs, skip_special_tokens=True)
for idx, response in enumerate(decoded):
match = re.search(r"The sentence's category is: \s*(.*)", response)
if match:
category = match.group(1).strip("[]*")
else:
category = "NO CATEGORY"
comment_idx = int(comment_idxs[idx])
sentence = sentences[idx]
results.setdefault(comment_idx, []).append((sentence, category))
#bring all together
gathered = [None for _ in range(world_size)]
dist.all_gather_object(gathered, results)
if rank == 0:
merged = dict()
for partial in gathered:
for k,v in partial.items():
merged.setdefault(k, []).extend(v)
out_rows = []
for comment_idx, sentence_labels in merged.items():
out_rows.append({
'id': df['id'].iloc[comment_idx],
'task_title': df['task_title'].iloc[comment_idx],
'comment_text': df['comment_text'].iloc[comment_idx],
'AuthorPHID': df['AuthorPHID'].iloc[comment_idx],
'sentence_labels': sentence_labels
})
out_df = pd.DataFrame(out_rows)
print(out_df.head())
#TODO out_df.to_csv("090325_olmo_sentence_categorized.csv")
dist.destroy_process_group()
if __name__ == "__main__":
main()
print('all pau; internal to the script')
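Note on the gather step in `main()`: with `shuffle=False`, `DistributedSampler` hands out sample indices round-robin across ranks and, by default, repeats a few samples so every rank gets the same count. Sentences of one comment can therefore arrive out of order and occasionally twice. A minimal sketch of a merge that restores order and drops repeats, under the assumption that each result also carries the sentence's position within its comment. That extra `sent_pos` field and the `merge_rank_results` helper are hypothetical and not in the script as committed.

```python
from collections import defaultdict


def merge_rank_results(gathered):
    """Merge per-rank results into {comment_idx: [(sentence, category), ...]}.

    gathered: one dict per rank, shaped
        {comment_idx: [(sent_pos, sentence, category), ...]}
    where sent_pos is the sentence's index within its comment (assumed field).
    """
    by_comment = defaultdict(dict)
    for partial in gathered:
        for comment_idx, labeled in partial.items():
            for sent_pos, sentence, category in labeled:
                # Keying on sent_pos keeps the first copy of any repeated sample.
                by_comment[comment_idx].setdefault(sent_pos, (sentence, category))
    return {
        comment_idx: [by_pos[pos] for pos in sorted(by_pos)]
        for comment_idx, by_pos in sorted(by_comment.items())
    }


# Toy stand-in for the list filled by dist.all_gather_object on two ranks.
rank0 = {0: [(0, "It crashes.", "OBSERVED BUG BEHAVIOR")],
         1: [(0, "Thanks!", "SOCIAL CONVERSATION")]}
rank1 = {0: [(1, "I will fix it.", "CONTRIBUTION AND COMMITMENT"),
             (0, "It crashes.", "OBSERVED BUG BEHAVIOR")]}  # repeated sample
merged = merge_rank_results([rank0, rank1])
print(merged)
```

Using this would mean storing `sent_pos` alongside each sample in `SentenceDataset`, which is a small change to the tuple appended to `self.samples`.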

View File

@ -0,0 +1,33 @@
#!/bin/bash
#SBATCH -A p32852
#SBATCH -p gengpu
#SBATCH --gres=gpu:a100:2
#SBATCH --nodes=2
# one torchrun launcher per node; torchrun starts the per-GPU worker processes itself
#SBATCH --ntasks-per-node=1
#SBATCH --time=48:00:00
#SBATCH --mem=64G
#SBATCH --cpus-per-task=4
#SBATCH --job-name=MW-info-typology
#SBATCH --output=parallel-mw-olmo-info-cat.log
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --mail-user=gaughan@u.northwestern.edu
module purge
eval "$(conda shell.bash hook)"
echo "setting up the environment by loading in conda environment at $(date)"
conda activate olmo
echo "running the bertopic job at $(date)"
srun torchrun \
--nnodes 2 \
--nproc_per_node 2 \
--rdzv_id $RANDOM \
--rdzv_backend c10d \
--rdzv_endpoint "$SLURMD_NODENAME:29502" \
/home/nws8519/git/mw-lifecycle-analysis/p2/quest/python_scripts/info_labeling.py 10000 100
echo "unsupervised olmo categorization pau at $(date)"