building out olmo classification pipeline

mgaughan 2025-07-25 14:18:27 -05:00
parent a08a49d04e
commit 862643d5df
6 changed files with 223 additions and 151730 deletions

File diff suppressed because one or more lines are too long

View File

@@ -1,36 +0,0 @@
starting the job at: Wed Jul 23 14:49:04 CDT 2025
setting up the environment
running the biberplus labeling script
26024
26024
id ... http_flag
0 56791 ... NaN
1 269631 ... NaN
2 269628 ... NaN
3 269622 ... NaN
4 56737 ... NaN
... ... ... ...
26019 403186 ... True
26020 78646 ... True
26021 429163 ... True
26022 429137 ... True
26023 418783 ... True
[26024 rows x 22 columns]
id ... message
0 56791 ... pawn character editing\n\nseen on master branc...
1 269631 ... Change 86685 merged by jenkins-bot:\nFollow-up...
2 269628 ... *** Bug 54785 has been marked as a duplicate o...
3 269622 ... Change 86685 had a related patch set uploaded ...
4 56737 ... **Author:** `Wikifram`\n\n**Description:**\nAf...
... ... ... ...
26019 403186 ... Could you attach a screenshot please? Drag & d...
26020 78646 ... Hi,\n\nWe have a wiki which has a part which c...
26021 429163 ... Sorry for not reply-ing. I did a test and coul...
26022 429137 ... @DikkieDick: Please answer.
26023 418783 ... I cannot replicate this. What's the name of th...
[26024 rows x 121 columns]
biberplus labeling pau
job finished, cleaning up
job pau at: Wed Jul 23 14:58:09 CDT 2025

View File

@@ -0,0 +1,88 @@
setting up the environment by loading in conda environment at Fri Jul 25 14:09:58 CDT 2025
running the bertopic job at Fri Jul 25 14:09:58 CDT 2025
cuda
NVIDIA A100-SXM4-80GB
_CudaDeviceProperties(name='NVIDIA A100-SXM4-80GB', major=8, minor=0, total_memory=81153MB, multi_processor_count=108, uuid=393ab5c3-2bcb-e4c6-52ad-eb4896a9d4fe, L2_cache_size=40MB)
Loading checkpoint shards: 100%|██████████| 12/12 [00:05<00:00, 2.04it/s]
this is the response:::: ----------------------------
task_description
this is the response:::: ----------------------------
task_subcomment
this is the response:::: ----------------------------
task_subcomment
this is the response:::: ----------------------------
task_subcomment
this is the response:::: ----------------------------
task_description
this is the response:::: ----------------------------
task_subcomment
this is the response:::: ----------------------------
task_subcomment
this is the response:::: ----------------------------
task_subcomment
this is the response:::: ----------------------------
task_subcomment
this is the response:::: ----------------------------
task_subcomment
this is the response:::: ----------------------------
task_subcomment
this is the response:::: ----------------------------
task_subcomment
this is the response:::: ----------------------------
ACTION ON ISSUE
this is the response:::: ----------------------------
task_subcomment
this is the response:::: ----------------------------
this is the response:::: ----------------------------
task_subcomment
this is the response:::: ----------------------------
task_subcomment
this is the response:::: ----------------------------
task_subcomment
this is the response:::: ----------------------------
task_subcomment
this is the response:::: ----------------------------
task_subcomment
this is the response:::: ----------------------------
task_subcomment
this is the response:::: ----------------------------
task_subcomment
this is the response:::: ----------------------------
task_subcomment
this is the response:::: ----------------------------
task_subcomment
this is the response:::: ----------------------------
task_subcomment
this is the response:::: ----------------------------
task_subcomment
this is the response:::: ----------------------------
task_subcomment
this is the response:::: ----------------------------
task_subcomment
this is the response:::: ----------------------------
task_subcomment
this is the response:::: ----------------------------
task_subcomment
this is the response:::: ----------------------------
task_subcomment
this is the response:::: ----------------------------
task_subcomment
this is the response:::: ----------------------------
task_subcomment
this is the response:::: ----------------------------
task_subcomment
this is the response:::: ----------------------------
task_subcomment
this is the response:::: ----------------------------
task_subcomment
this is the response:::: ----------------------------
task_description
this is the response:::: ----------------------------
task_subcomment
this is the response:::: ----------------------------
task_subcomment
this is the response:::: ----------------------------
task_subcomment
this is the response:::: ----------------------------
task_subcomment
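
The log above prints one marker line per comment, followed by the model's answer on the next line. A minimal sketch, assuming exactly that layout, for tallying the answers so that label drift (for example, the model echoing the comment type instead of a category) is easy to spot. The function name `tally_responses` is hypothetical and not part of the commit.

```python
from collections import Counter

# Prefix of the marker line printed by the labeling script before each answer.
MARKER = "this is the response::::"


def tally_responses(log_text):
    """Count the line following each marker line; a missing answer counts as ''."""
    counts = Counter()
    lines = log_text.splitlines()
    for i, line in enumerate(lines):
        if not line.startswith(MARKER):
            continue
        answer = lines[i + 1].strip() if i + 1 < len(lines) else ""
        # A marker directly followed by another marker means the answer was empty.
        if answer.startswith(MARKER):
            answer = ""
        counts[answer] += 1
    return counts


sample = (
    "this is the response:::: ---\n"
    "task_subcomment\n"
    "this is the response:::: ---\n"
    "this is the response:::: ---\n"
    "ACTION ON ISSUE\n"
)
counts = tally_responses(sample)
print(counts.most_common())
```

Running this over the real log file would only require reading it into a string first; the sample here is a shortened stand-in.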

View File

@@ -1,6 +0,0 @@
starting the job at: Tue Jul 15 14:09:10 CDT 2025
setting up the environment
running the neurobiber labeling script
neurobiber labeling pau
job finished, cleaning up
job pau at: Tue Jul 15 14:12:26 CDT 2025

View File

@@ -0,0 +1,109 @@
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
import csv
import pandas as pd
import re
cache_directory = "/projects/p32852/cache/"
#load in the different models
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)
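#only query GPU details when a GPU is present; these calls raise on CPU-only nodes
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))
    print(torch.cuda.get_device_properties(0))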
#olmo = AutoModelForCausalLM.from_pretrained("allenai/OLMo-2-0325-32B", cache_dir=cache_directory).to(device)
#tokenizer = AutoTokenizer.from_pretrained("allenai/OLMo-2-0325-32B", cache_dir=cache_directory)
olmo = AutoModelForCausalLM.from_pretrained("allenai/OLMo-2-1124-13B", cache_dir=cache_directory).to(device)
tokenizer = AutoTokenizer.from_pretrained("allenai/OLMo-2-1124-13B", cache_dir=cache_directory)
#TODO: text_preprocessing per https://arxiv.org/pdf/1902.07093
priming = "For the GIVEN COMMENT, please categorize it into one of the following [[CATEGORIES]] of information; please categorize by matching the GIVEN COMMENT to the [[CATEGORY]] that best describes the GIVEN COMMENT. Below is a list of categories to label the GIVEN COMMENT with. Categories are formatted as '[[CATEGORY]], short description, example of comment that matches the category'."
#the typology descriptions are taken straight from https://arxiv.org/pdf/1902.07093
typology = """
[[EXPECTED BEHAVIOR]], in which stakeholders discuss, from the user's perspective, the expected or ideal situation affected by the issue. This discussion sometimes relies on the personal preferences and opinions from the OSS participants. For example, a participant commented: My suggestion/request in the near term would be to have an option to make the vocabulary read only so that users who want to be able to leave spacy alone to do streaming data processing don't need to worry about changing memory requirements.
[[MOTIVATION]], in which stakeholders elaborate on why the issue needs to be fixed or a feature needs to be added. To strengthen their arguments, they usually described use cases involving the requested feature and/or cited competitors who implemented the requested feature. For example, in support of redesigning TensorFlow's input pipeline one participant wrote: “Right now, this method starves my GPU all the time, which is a shame because most other [deep learning] frameworks manage to make this much more performantly.”
[[OBSERVED BUG BEHAVIOR]], which only appears in bug reports and focuses on describing the observed behaviour of the bug. For example, one participant commented: I found strange behavior using the pipe() method, then started to describe this behavior.
[[BUG REPRODUCTION]], which also only appears in bug reports and focuses on any report, request, and/or question regarding the reproduction of the bug. For example, one participant commented that a bug was reproducible: Same problem here, working on Windows 10 with German text.
[[INVESTIGATION AND EXPLORATION]], in which OSS stakeholders discuss their exploration of ideas about the problem that was thought to have caused the issue. Sometimes participants provide suggestions on how or what to investigate. For example, This result confirms my hypothesis but also shows that the memory increase really isn't all that significant... But it still points to a potential flaw in the design of the library.
[[SOLUTION DISCUSSION]] is framed around the solution space from the developers' point of view, in which participants discuss design ideas and implementation details, as well as suggestions, constraints, challenges, and useful references around such topics. For example, I know there are multiple ways of approaching this however I strongly recommend node-gyp for performance.
[[CONTRIBUTION AND COMMITMENT]], in which participants call for contributors and/or voice willingness or unwillingness to contribute to resolving the issue. For example, one potential collaborator said: I will gladly contribute in any way I can, however, this is something I will not be able to do alone. Would be best if a few other people is interested as well...
[[TASK PROGRESS]], in which stakeholders request or report progress of tasks and sub-tasks towards the solution of the issue. Participants sometimes also mention their plan of actions. For example, I made an initial stab at it... - this is just a proof of concept that gets the version string into nodejs. I'll start working on adding the swig interfaces...
[[TESTING]], in which participants discuss the testing procedure and results, as well as the system environment, code, data, and feedback involved in testing. For example, Tested on 0.101 and master - the issue seems to be fixed on master not just for the example document, but for the entire corpus...
[[FUTURE PLAN]], in which participants discuss the long-term plan related to the issue; such plans usually involve work/ideas that are not required to close the current issue. For example, For the futures, stay tuned, as we're prototyping something in this direction.
[[POTENTIAL NEW ISSUES AND REQUESTS]], in which participants identify and discuss new bugs or needed features while investigating and addressing the current issue. They are out of the scope of the discussion of the current issue but may lead to new issue reports. For example, when discussing a bug in scikit-learn about parallel execution that causes process hanging, one participant said: As a side point, I note there seems to be a lot more joblib parallelisation overhead in master... that wasn't there in 0.14.
[[SOLUTION USAGE]] was usually discussed once a full or partial solution of the issue was released and stakeholders asked questions or provided suggestions about how to use the library with the new solution update. For example, Please help me how to continue training the model [with the new release].
[[WORKAROUNDS]] focus on discussions about temporary or alternative solutions that can help overcome the issue until the official fix or enhancement is released. In a discussion regarding memory growth for streamed data, one participant expressed his temporary solution: For now workaround with reloading / collecting nlp object works quite ok in production.
[[ISSUE CONTENT MANAGEMENT]] focuses on redirecting the discussions and controlling the quality of the comments with respect to the issue. For example, We might want to move this discussion to here: [link to another issue]
[[ACTION ON ISSUE]], in which participants comment on the proper actions to perform on the issue itself. For example, I'm going to close this issue because it's old and most of the information here is now out of date.
[[SOCIAL CONVERSATION]], in which participants express emotions such as appreciation, disappointment, annoyance, regret, etc. or engage in small talk. For example, I'm so glad that this has received so much thought and attention!
"""
instructions="Only respond with the GIVEN COMMENT's [[CATEGORY]] classification. Do not provide any more information."
with open("/home/nws8519/git/mw-lifecycle-analysis/p2/quest/072325_biberplus_labels.csv", mode='r', newline='') as file:
    reader = csv.reader(file)
    array_of_categorizations = []
    index = -1
    for row in reader:
        index += 1
        #skip the header row
        if index <= 0:
            continue
        text_dict = {}
        #organizing the data from each comment
        text_dict['id'] = row[0]
        text_dict['task_title'] = row[1]
        text_dict['comment_text'] = row[2]
        text_dict['comment_type'] = row[12]
        #TODO: build out prompt construction; more specificity in data provided
        given_data = f"GIVEN COMMENT: \n ' Type -{text_dict['comment_type']} \n Text -{text_dict['comment_text']}'\n"
        prompt_question="What do you think about this message? What are they saying?"
        #prompt = f"{prompt_1}\n\n{example_1}\n\n{example_2}\n\n{example_3}\n\n{example_4}\n\n{given_data}\n"
        prompt = f"{priming}\n{typology}\n{instructions}\n\n{given_data}\n\n What is the above comment's [[CATEGORY]]?"
        #handoff to the model
        inputs = tokenizer(prompt, return_tensors='pt', return_token_type_ids=False).to(device)
        #greedy (deterministic) decoding and getting the response back
        response = olmo.generate(**inputs, max_new_tokens=256, do_sample=False)
        response_txt = tokenizer.batch_decode(response, skip_special_tokens=True)[0]
        print("this is the response:::: ----------------------------")
        #print(response_txt)
        #getting the resulting codes
        #codes_id = response_txt.rfind("CATEGORIES:")
        #the decoded output repeats the prompt, so capture only the text after the final question
        match = re.search(r"What is the above comment's \[\[CATEGORY\]\]\?\s*(.*)", response_txt)
        if match:
            following_text = match.group(1)
        else:
            following_text = "NO CATEGORY"
        print(following_text)
        #store the label alongside the comment so it reaches the DataFrame
        text_dict['category'] = following_text
        '''
        for item in result.strip(";").split(";"):
            key_value = item.strip().split('. ')
            if len(key_value) == 2:
                key = key_value[0]
                value = key_value[1]
                cite_dict[key] = value
        '''
        array_of_categorizations.append(text_dict)
        #only label the first rows while the pipeline is being built out
        if index > 40:
            break
#CSV everything
df = pd.DataFrame(array_of_categorizations)
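
The regex in the script captures whatever text follows the final question, and the job log shows that this text is often not one of the sixteen typology labels (for example `task_subcomment`). A minimal sketch of a normalization step that maps the raw answer onto a known label, falling back to the script's own "NO CATEGORY" value. The helper name `normalize_category` and the `CATEGORIES` list are assumptions introduced here; the matching is a plain substring test, not a validated parser.

```python
import re

# The sixteen labels that appear in the typology string of the script.
CATEGORIES = [
    "EXPECTED BEHAVIOR", "MOTIVATION", "OBSERVED BUG BEHAVIOR",
    "BUG REPRODUCTION", "INVESTIGATION AND EXPLORATION",
    "SOLUTION DISCUSSION", "CONTRIBUTION AND COMMITMENT", "TASK PROGRESS",
    "TESTING", "FUTURE PLAN", "POTENTIAL NEW ISSUES AND REQUESTS",
    "SOLUTION USAGE", "WORKAROUNDS", "ISSUE CONTENT MANAGEMENT",
    "ACTION ON ISSUE", "SOCIAL CONVERSATION",
]


def normalize_category(raw, categories=CATEGORIES, fallback="NO CATEGORY"):
    """Return the first known label found in the raw answer, else the fallback."""
    # Drop the [[ ]] delimiters and compare case-insensitively.
    cleaned = re.sub(r"[\[\]]", "", raw).strip().upper()
    # Check longer labels first so a long label is never shadowed by a
    # shorter one that happens to be contained in it.
    for label in sorted(categories, key=len, reverse=True):
        if label in cleaned:
            return label
    return fallback


print(normalize_category("[[ACTION ON ISSUE]]"))
print(normalize_category("task_subcomment"))
```

Because the test is a substring match, an answer that mentions a label inside a longer sentence would also be accepted; a stricter variant could require the cleaned answer to equal the label exactly.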

View File

@@ -0,0 +1,26 @@
#!/bin/bash
#SBATCH -A p32852
#SBATCH -p gengpu
#SBATCH --gres=gpu:a100:1
#SBATCH --constraint=sxm
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --time=24:00:00
#SBATCH --mem=64G
#SBATCH --cpus-per-task=4
#SBATCH --job-name=MW-info-typology
#SBATCH --output=mw-olmo-info-cat.log
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --mail-user=gaughan@u.northwestern.edu
module purge
eval "$(conda shell.bash hook)"
echo "setting up the environment by loading in conda environment at $(date)"
conda activate olmo
echo "running the olmo info labeling job at $(date)"
python /home/nws8519/git/mw-lifecycle-analysis/p2/quest/python_scripts/info_labeling.py