adding new batch of olmo labels

2025-11-09 16:29:01 -06:00 · 7ace52a559
commit 7ace52a559
parent 43984fb605
6 changed files with 146098 additions and 4307 deletions
--- a/p2/quest/python_scripts/olmo_labeling/110525-batched-mw-olmo-info-cat.log
+++ b/p2/quest/python_scripts/olmo_labeling/110525-batched-mw-olmo-info-cat.log
--- a/p2/quest/python_scripts/olmo_labeling/all_110525_olmo_batched_categorized.csv
+++ b/p2/quest/python_scripts/olmo_labeling/all_110525_olmo_batched_categorized.csv
--- a/p2/quest/python_scripts/olmo_labeling/batched_olmo_cat.py
+++ b/p2/quest/python_scripts/olmo_labeling/batched_olmo_cat.py
@@ -23,11 +23,11 @@ print(torch.cuda.get_device_properties(0))
 #olmo = AutoModelForCausalLM.from_pretrained("allenai/OLMo-2-0325-32B", torch_dtype=torch.float16, load_in_8bit=True, cache_dir=cache_directory).to(device)
 #olmo = AutoModelForCausalLM.from_pretrained("allenai/OLMo-2-0325-32B-Instruct-GGUF", cache_dir=cache_directory).to(device)
 #tokenizer = AutoTokenizer.from_pretrained("allenai/OLMo-2-0325-32B-Instruct-GGUF", cache_dir=cache_directory)
-olmo = AutoModelForCausalLM.from_pretrained("allenai/OLMo-2-1124-13B", cache_dir=cache_directory).to(device)
+olmo = AutoModelForCausalLM.from_pretrained("allenai/OLMo-2-1124-13B-Instruct", cache_dir=cache_directory).to(device)
-tokenizer = AutoTokenizer.from_pretrained("allenai/OLMo-2-1124-13B", padding_side='left')
+tokenizer = AutoTokenizer.from_pretrained("allenai/OLMo-2-1124-13B-Instruct", padding_side='left')
 information_types = Path('/home/nws8519/git/mw-lifecycle-analysis/p2/quest/python_scripts/olmo_labeling/info_definitions.txt').read_text(encoding="utf-8")
-prompt_template = Path('/home/nws8519/git/mw-lifecycle-analysis/p2/quest/python_scripts/olmo_labeling/prompt_template.txt').read_text(encoding="utf-8")                                                                                                 
+prompt_template = Path('/home/nws8519/git/mw-lifecycle-analysis/p2/quest/python_scripts/olmo_labeling/prompt_template_nofs.txt').read_text(encoding="utf-8")                                                                                                 
 csv.field_size_limit(sys.maxsize)
 with open("/home/nws8519/git/mw-lifecycle-analysis/analysis_data/102725_unified.csv", mode='r', newline='') as file:
@@ -89,20 +89,22 @@ with open("/home/nws8519/git/mw-lifecycle-analysis/analysis_data/102725_unified.
                match = re.search(r"Response: \s*(.*)", response_txt)
                #print(match)
                if match:
-                    category = re.sub(r"[(),\d]", "", match.group(1)).strip()
+                    category = re.sub(r'[()",\d]', "", match.group(1)).strip()
                    category = category.replace("_", " ")
                else:
                    category = "NO CATEGORY"
                results.append(category)
            torch.cuda.empty_cache()
        #print(comment_sentences)
        text_dict['sentence_categories']=results
-        print(results)
+        #print(results)
        array_of_categorizations.append(text_dict)
-        if index == 20:
+        #if index == 200:
-            break
+        #    break
    df = pd.DataFrame(array_of_categorizations)
    #print(df.head())
-    #df.to_csv('all_110525_olmo_batched_categorized.csv', index=False)
+    df.to_csv('all_110525_olmo_batched_categorized.csv', index=False)
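Note on the regex change in this hunk: adding the double quote to the character class likely matters because the new template asks the model for a quoted tuple such as ("LABEL", 8). Below is a minimal sketch of the post-processing step, reusing the two regexes shown in the hunk. The function name extract_category and the sample string are hypothetical and only serve as illustration; they are not taken from the script.

```python
import re

def extract_category(response_txt: str) -> str:
    """Mirror the cleanup in the hunk: find the reply, strip tuple syntax."""
    match = re.search(r"Response: \s*(.*)", response_txt)
    if not match:
        return "NO CATEGORY"
    # Drop parentheses, double quotes, commas, and digits (the confidence score).
    category = re.sub(r'[()",\d]', "", match.group(1)).strip()
    # Labels may come back with underscores instead of spaces.
    return category.replace("_", " ")

# Hypothetical decoded model output, for illustration only.
sample = 'Sentence: Thank you!\nResponse: ("SOCIAL_CONVERSATION", 9)'
print(extract_category(sample))
```

With the previous character class (no double quote), the same input would keep the surrounding quotes on the label. Note that this cleanup discards the confidence score, since all digits are removed.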
--- a/p2/quest/python_scripts/olmo_labeling/info_definitions.txt
+++ b/p2/quest/python_scripts/olmo_labeling/info_definitions.txt
@@ -4,21 +4,21 @@ Sentences in software engineering task discussions often contain different types
 Each sentence often has only one primary information type. 
 Below are the different kinds of information types found in task discussion sentences: 
-EXPECTED BEHAVIOR: A sentence in which stakeholders discuss, from the user’s perspective, the expected or ideal situation affected by the issue. Such as “My suggestion/request in the near term would be to have an option to make the vocabulary read only so that users who want to be able to leave spacy alone to do streaming data processing don’t need to worry about changing memory requirements.” 
+EXPECTED BEHAVIOR: A sentence focused on, from the user’s perspective, the expected or ideal situation affected by the issue. Such as “My suggestion/request in the near term would be to have an option to make the vocabulary read only so that users who want to be able to leave spacy alone to do streaming data processing don’t need to worry about changing memory requirements.” 
-MOTIVATION: A sentence in which stakeholders elaborate on why the issue needs to be fixed or a feature needs to be added. Such as “Right now, this method starves my GPU all the time, which is a shame because most other [deep learning] frameworks manage to make this much more performantly.”
+MOTIVATION: A sentence focused on why the issue needs to be fixed or a feature needs to be added. Such as “Right now, this method starves my GPU all the time, which is a shame because most other [deep learning] frameworks manage to make this much more performantly.”
-OBSERVED BUG BEHAVIOR: A sentence which appears in bug reports and focuses on describing the observed behaviour of the bug. Such as one participant commented: “I found strange behavior using the ‘pipe()’ method”, then started to describe this behavior.
+OBSERVED BUG BEHAVIOR: A sentence focused on describing the observed behaviour of the bug. Such as one participant commented: “I found strange behavior using the ‘pipe()’ method”, then started to describe this behavior.
 BUG REPRODUCTION: A sentence focused on any report, request, and/or question regarding the reproduction of the bug. Such as “Same problem here, working on Windows 10 with German text.”
-INVESTIGATION AND EXPLORATION: A sentence where OSS stakeholders discuss their exploration of ideas about the problem that was thought to have caused the issue. Such as “This result confirms my hypothesis but also shows that the memory increase really isn’t all that significant... But it still points to a potential flaw in the design of the library.”
+INVESTIGATION AND EXPLORATION: A sentence focused on the exploration of ideas about the problem that was thought to have caused the issue. Such as “This result confirms my hypothesis but also shows that the memory increase really isn’t all that significant... But it still points to a potential flaw in the design of the library.”
-SOLUTION DISCUSSION: A sentence that is framed around the solution space from the developers’ point of view, in which participants discuss design ideas and implementation details, as well as suggestions, constraints, challenges, and useful references around such topics. Such as “I know there are multiple ways of approaching this however I strongly recommend node-gyp for performance.”
+SOLUTION DISCUSSION: A sentence focused on the solution space from the developers’ point of view, in which participants discuss design ideas and implementation details, as well as suggestions, constraints, challenges, and useful references around such topics. Such as “I know there are multiple ways of approaching this however I strongly recommend node-gyp for performance.”
-CONTRIBUTION AND COMMITMENT: A sentence in which participants call for contributors and/or voice willingness or unwillingness to contribute to resolving the issue. Such as “I will gladly contribute in any way I can, however, this is something I will not be able to do alone. Would be best if a few other people is interested as well...”
+CONTRIBUTION AND COMMITMENT: A sentence focused on calls for contributors and/or voicing willingness or unwillingness to contribute to resolving the issue. Such as “I will gladly contribute in any way I can, however, this is something I will not be able to do alone. Would be best if a few other people is interested as well...”
-NA: A sentence which contains only non-English terms or consists entirely of punctuation and numerals. Such as "***", "ve-ce-protectedNode", or "T8597".
+NA: A sentence which contains only non-English terms, special characters, or numerals. Such as "***", "ve-ce-protectedNode", or "T8597".
-TASK PROGRESS: A sentence in which stakeholders request or report progress of tasks and sub-tasks towards the solution of the issue. This includes automated reports of merged code changes. Such as “I made an initial stab at it... - this is just a proof of concept that gets the version string into nodejs. I’ll start working on adding the swig interfaces...”
+TASK PROGRESS: A sentence focused on requesting or reporting progress of tasks and sub-tasks towards the solution of the issue. This includes automated reports of merged code changes. Such as “I made an initial stab at it... - this is just a proof of concept that gets the version string into nodejs. I’ll start working on adding the swig interfaces...”
-TESTING: A sentence in which participants discuss the testing procedure and results, as well as the system environment, code, data, and feedback involved in testing. Such as  “Tested on ‘0.101’ and ‘master’ - the issue seems to be fixed on ‘master’ not just for the example document, but for the entire corpus...”
+TESTING: A sentence focused on the testing procedure and results, as well as the system environment, code, data, and feedback involved in testing. Such as  “Tested on ‘0.101’ and ‘master’ - the issue seems to be fixed on ‘master’ not just for the example document, but for the entire corpus...”
-FUTURE PLAN: A sentence in which participants discuss the long-term plan related to the issue; such plans usually involve work/ideas that are not required to close the current issue. Such as “For the futures, stay tuned, as we’re prototyping something in this direction.”
+FUTURE PLAN: A sentence focused on the long-term plan related to the issue; usually involving work/ideas that are not required to close the current issue. Such as “For the futures, stay tuned, as we’re prototyping something in this direction.”
-POTENTIAL NEW ISSUES AND REQUESTS: A sentence in which participants identify and discuss new bugs or needed features while investigating and addressing the current issue. Such as “As a side point, I note there seems to be a lot more joblib parallelisation overhead in master... that wasn’t there in 0.14.”
+POTENTIAL NEW ISSUES AND REQUESTS: A sentence focused on new bugs or needed features identified while investigating and addressing the current issue. Such as “As a side point, I note there seems to be a lot more joblib parallelisation overhead in master... that wasn’t there in 0.14.”
-SOLUTION USAGE: A sentence in which stakeholders asked questions or provided suggestions about how to use the library with the new solution update. Such as “Please help me how to continue training the model [with the new release].”
+SOLUTION USAGE: A sentence focused on asking questions or providing suggestions about how to use the library with the new solution update. Such as “Please help me how to continue training the model [with the new release].”
-WORKAROUNDS: A sentence in which stakeholders discussed temporary or alternative solutions that can help overcome the issue until the official fix or enhancement is released. Such as “For now workaround with reloading / collecting nlp object works quite ok in production.”
+WORKAROUNDS: A sentence focused on temporary or alternative solutions that can help overcome the issue until the official fix or enhancement is released. Such as “For now workaround with reloading / collecting nlp object works quite ok in production.”
-ISSUE CONTENT MANAGEMENT: A sentence in which a stakeholder focuses on redirecting the discussions and controlling the quality of the comments with respect to the issue. Such as “We might want to move this discussion to here: [link to another issue]” or "This other issue [link to another issue] is a duplicate of this issue".
+ISSUE CONTENT MANAGEMENT: A sentence focused on redirecting the discussions and controlling the quality of the comments with respect to the issue. This includes marking other issues as duplicates. Such as “We might want to move this discussion to here: [link to another issue]” or "This other issue [link to another issue] is a duplicate of this issue".
-ACTION ON ISSUE: A sentence in which participants comment on the proper actions to perform on the issue itself. Such as “I’m going to close this issue because it’s old and most of the information here is now out of date.”
+ACTION ON ISSUE: A sentence focused on the proper actions to perform on the issue itself. Such as “I’m going to close this issue because it’s old and most of the information here is now out of date.”
-SOCIAL CONVERSATION: A sentence in which participants express emotions such as appreciation, disappointment, annoyance, regret, etc. or engage in small talk. Such as “I’m so glad that this has received so much thought and attention!”, "My apologies." or "Thank you!"
+SOCIAL CONVERSATION: A sentence focused on expressing emotions such as appreciation, disappointment, annoyance, regret, etc. or engaging in small talk. Such as “I’m so glad that this has received so much thought and attention!”, "My apologies." or "Thank you!"
--- a/p2/quest/python_scripts/olmo_labeling/prompt_template_nofs.txt
+++ b/p2/quest/python_scripts/olmo_labeling/prompt_template_nofs.txt
@@ -0,0 +1,20 @@
 {info_definitions}
 ---
 Task 
 Given the title of a software engineering task discussion and a sentence from within that discussion, identify the primary information type from the list above that applies to the sentence. 
 For each sentence: 
 1. Provide the information type label (exactly as named above)
 2. Provide a confidence score from 1-10, where 10 means you are highly confident this information type applies to this sentence. 
 Output format (valid tuple only):
 ("INFORMATION_TYPE", CONFIDENCE_SCORE)
 --- 
 Now label this sentence 
 Discussion Title: {task_title}
 Sentence: {sent}
 Response: 
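Note on using this template: the sketch below shows one way the three placeholders could be filled and the requested tuple parsed with the confidence score kept. It assumes plain str.format substitution, which this diff does not confirm. TEMPLATE is an abbreviated stand-in for the file above, and build_prompt and parse_label are hypothetical helper names.

```python
import ast
from typing import Optional, Tuple

# Abbreviated stand-in for prompt_template_nofs.txt (same three placeholders).
TEMPLATE = (
    "{info_definitions}\n"
    "---\n"
    "Now label this sentence\n"
    "Discussion Title: {task_title}\n"
    "Sentence: {sent}\n"
    "Response: "
)

def build_prompt(info_definitions: str, task_title: str, sent: str) -> str:
    # Assumption: plain str.format substitution; the real script may differ.
    return TEMPLATE.format(
        info_definitions=info_definitions, task_title=task_title, sent=sent
    )

def parse_label(reply: str) -> Optional[Tuple[str, int]]:
    """Parse a reply such as ("TESTING", 8); return None if it is malformed."""
    try:
        value = ast.literal_eval(reply.strip())
    except (ValueError, SyntaxError):
        return None
    if (
        isinstance(value, tuple)
        and len(value) == 2
        and isinstance(value[0], str)
        and isinstance(value[1], int)
        and not isinstance(value[1], bool)
        and 1 <= value[1] <= 10
    ):
        return value[0], value[1]
    return None

prompt = build_prompt("TESTING: ...", "Example title", "Tested on master.")
print(prompt.endswith("Response: "))
print(parse_label('("TESTING", 8)'))
print(parse_label("TESTING 8"))
```

Parsing with ast.literal_eval accepts only Python literals, so a malformed reply yields None instead of a partially cleaned string.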
--- a/p2/quest/slurm_jobs/110525_olmo_batched_cat.sh
+++ b/p2/quest/slurm_jobs/110525_olmo_batched_cat.sh
@@ -0,0 +1,30 @@
 #!/bin/bash
 #SBATCH -A p32852
 #SBATCH -p gengpu
 #SBATCH --gres=gpu:h100:1
 #SBATCH --constraint=sxm
 #SBATCH --nodes=2
 #SBATCH --ntasks-per-node=1
 #SBATCH --time=48:00:00
 #SBATCH --mem=64G
 #SBATCH --cpus-per-task=4
 #SBATCH --job-name=batched-MW-info-typology
 #SBATCH --output=110525-batched-mw-olmo-info-cat.log
 #SBATCH --mail-type=BEGIN,END,FAIL
 #SBATCH --mail-user=gaughan@u.northwestern.edu
 module purge
 eval "$(conda shell.bash hook)"
 echo "setting up the environment by loading in conda environment at $(date)"
 conda activate olmo
 echo "running the batched olmo categorization job at $(date)"
 python /home/nws8519/git/mw-lifecycle-analysis/p2/quest/python_scripts/olmo_labeling/batched_olmo_cat.py
 #python /home/nws8519/git/mw-lifecycle-analysis/p2/quest/python_scripts/label_sampling.py         
 echo "unsupervised batched olmo categorization pau at $(date)"