
new batched OLMO labels

mgaughan 2025-10-24 10:03:28 -05:00
parent 0ed72af495
commit d6965a33cb
4 changed files with 146049 additions and 7 deletions

View File

@@ -0,0 +1,12 @@
setting up the environment by loading in conda environment at Tue Oct 21 22:09:37 CDT 2025
running the batched olmo categorization job at Tue Oct 21 22:09:38 CDT 2025
[nltk_data] Downloading package punkt_tab to
[nltk_data] /home/nws8519/nltk_data...
[nltk_data] Package punkt_tab is already up-to-date!
cuda
NVIDIA A100-SXM4-80GB
_CudaDeviceProperties(name='NVIDIA A100-SXM4-80GB', major=8, minor=0, total_memory=81153MB, multi_processor_count=108, uuid=f431a288-a223-5654-4599-c2a6f20abe8d, L2_cache_size=40MB)
Loading checkpoint shards: 100%|██████████| 12/12 [00:05<00:00, 2.36it/s]
Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
This is a friendly reminder - the current text generation call will exceed the model's predefined maximum length (4096). Depending on the model, you may observe exceptions, performance degradation, or nothing at all.
unsupervised batched olmo categorization pau at Thu Oct 23 16:52:15 CDT 2025
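
The two transformers warnings in this log suggest the generation call is made without an explicit truncation setting or length cap, so prompts near the 4096-token context window can silently degrade. Below is a minimal sketch of one way to bound both the prompt and the generated continuation; the checkpoint name, padding fallback, and example batch are illustrative assumptions, not the repository's actual code.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical checkpoint name; the log only shows a 12-shard OLMo model on an A100.
model_name = "allenai/OLMo-2-1124-13B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_name)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # common fallback so padded batching works

model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="cuda"
)

batch = ["Sentence one to categorize.", "Sentence two to categorize."]  # stand-in prompts
max_new = 32  # short categorical answers only

# Truncating the prompt to (context window - max_new) keeps prompt + generation
# inside the 4096-token limit that the warning above refers to.
inputs = tokenizer(
    batch, return_tensors="pt", padding=True, truncation=True, max_length=4096 - max_new
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=max_new)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))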

View File

@@ -38,11 +38,11 @@ TYPOLOGY:
[[CONTRIBUTION AND COMMITMENT]], in which participants call for contributors and/or voice willingness or unwillingness to contribute to resolving the issue. For example, one potential collaborator said: I will gladly contribute in any way I can, however, this is something I will not be able to do alone. Would be best if a few other people is interested as well...
-[[TASK PROGRESS]], in which stakeholders request or report progress of tasks and sub-tasks towards the solution of the issue. For example, I made an initial stab at it... - this is just a proof of concept that gets the version string into nodejs. I'll start working on adding the swig interfaces...
+[[TASK PROGRESS]], in which stakeholders request or report progress of tasks and sub-tasks towards the solution of the issue. This includes automated reports of merged code changes. For example, I made an initial stab at it... - this is just a proof of concept that gets the version string into nodejs. I'll start working on adding the swig interfaces...
[[TESTING]], in which participants discuss the testing procedure and results, as well as the system environment, code, data, and feedback involved in testing. For example, Tested on 0.101 and master - the issue seems to be fixed on master not just for the example document, but for the entire corpus...
-[[NA]], in which the sentence contents are entirely incomprehensible or only consist of punctuation or numerals. For example, "***", "ve-ce-protectedNode", or "T8597".
+[[NA]], only apply this category if the sentence consists of non-English terms or only consist of punctuation or numerals. For example, "***", "ve-ce-protectedNode", or "T8597".
[[FUTURE PLAN]], in which participants discuss the long-term plan related to the issue; such plans usually involve work/ideas that are not required to close the current issue. For example, For the futures, stay tuned, as we're prototyping something in this direction.
@@ -52,11 +52,11 @@ TYPOLOGY:
[[WORKAROUNDS]] focus on discussions about temporary or alternative solutions that can help overcome the issue until the official fix or enhancement is released. For example, in a discussion regarding memory growth for streamed data, one participant expressed his temporary solution: For now workaround with reloading / collecting nlp object works quite ok in production.
-[[ISSUE CONTENT MANAGEMENT]] focuses on redirecting the discussions and controlling the quality of the comments with respect to the issue. For example, We might want to move this discussion to here: [link to another issue]
+[[ISSUE CONTENT MANAGEMENT]] focuses on redirecting the discussions and controlling the quality of the comments with respect to the issue. For example, We might want to move this discussion to here: [link to another issue] or "This other issue [link to another issue] is a duplicate of this issue".
[[ACTION ON ISSUE]], in which participants comment on the proper actions to perform on the issue itself. For example, I'm going to close this issue because it's old and most of the information here is now out of date.
-[[SOCIAL CONVERSATION]], in which participants express emotions such as appreciation, disappointment, annoyance, regret, etc. or engage in small talk. For example, I'm so glad that this has received so much thought and attention!
+[[SOCIAL CONVERSATION]], in which participants express emotions such as appreciation, disappointment, annoyance, regret, etc. or engage in small talk. For example, I'm so glad that this has received so much thought and attention!, "My apologies." or "Thank you!"
"""
instructions="The sentence's category is: "
@@ -129,7 +129,7 @@ with open("/home/nws8519/git/mw-lifecycle-analysis/analysis_data/100325_unified_
array_of_categorizations.append(text_dict)
df = pd.DataFrame(array_of_categorizations)
#print(df.head())
-df.to_csv('all_101325_olmo_batched_categorized.csv', index=False)
+df.to_csv('all_102125_olmo_batched_categorized.csv', index=False)
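
The visible tail of 102125_batched_olmo_cat.py accumulates one dict per sentence and writes a single CSV at the end of the run. The sketch below shows how the typology block, the instructions string, and each sentence might be combined and recorded; the prompt-assembly format, the dict keys, the stand-in data, and the generate_label stub are assumptions for illustration, since the full loop is not shown in this hunk.

import pandas as pd

typology = "TYPOLOGY: ..."  # stands in for the full category descriptions above
instructions = "The sentence's category is: "

def generate_label(prompt):
    # Placeholder for the batched OLMo generation call; not shown in the diff.
    return "NA"

sentences = ["Tested on 0.101 and master.", "Thank you!"]  # stand-in data

array_of_categorizations = []
for sentence in sentences:
    prompt = f"{typology}\nSENTENCE: {sentence}\n{instructions}"  # assumed format
    text_dict = {"sentence": sentence, "olmo_category": generate_label(prompt)}  # assumed keys
    array_of_categorizations.append(text_dict)

# One DataFrame built at the end, matching the script's visible output step.
df = pd.DataFrame(array_of_categorizations)
df.to_csv("all_102125_olmo_batched_categorized.csv", index=False)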

File diff suppressed because one or more lines are too long

View File

@@ -9,7 +9,7 @@
#SBATCH --mem=64G
#SBATCH --cpus-per-task=4
#SBATCH --job-name=batched-MW-info-typology
-#SBATCH --output=101325-batched-mw-olmo-info-cat.log
+#SBATCH --output=102125-batched-mw-olmo-info-cat.log
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --mail-user=gaughan@u.northwestern.edu
@@ -23,7 +23,7 @@ conda activate olmo
echo "running the batched olmo categorization job at $(date)"
-python /home/nws8519/git/mw-lifecycle-analysis/p2/quest/python_scripts/090425_batched_olmo_cat.py
+python /home/nws8519/git/mw-lifecycle-analysis/p2/quest/python_scripts/102125_batched_olmo_cat.py
#python /home/nws8519/git/mw-lifecycle-analysis/p2/quest/python_scripts/label_sampling.py