pulling new olmocr image and new categorization stuff

mgaughan 2025-06-02 21:24:53 -05:00
parent 1e4db14a65
commit 020b3090d6
6 changed files with 3005 additions and 38 deletions

File diff suppressed because it is too large

@@ -0,0 +1,9 @@
starting the job at: Mon Jun 2 11:43:44 CDT 2025
setting up the environment
running the p1 categorization script
cuda
NVIDIA A100-SXM4-80GB
_CudaDeviceProperties(name='NVIDIA A100-SXM4-80GB', major=8, minor=0, total_memory=81153MB, multi_processor_count=108, uuid=af8da8da-1900-3762-4351-d9c80d33463b, L2_cache_size=40MB)
Loading checkpoint shards: 0%| | 0/6 [00:00<?, ?it/s] Loading checkpoint shards: 17%|█▋ | 1/6 [00:00<00:03, 1.42it/s] Loading checkpoint shards: 33%|███▎ | 2/6 [00:01<00:03, 1.20it/s] Loading checkpoint shards: 50%|█████ | 3/6 [00:02<00:02, 1.07it/s] Loading checkpoint shards: 67%|██████▋ | 4/6 [00:03<00:01, 1.05it/s] Loading checkpoint shards: 83%|████████▎ | 5/6 [00:04<00:00, 1.06it/s] Loading checkpoint shards: 100%|██████████| 6/6 [00:05<00:00, 1.17it/s] Loading checkpoint shards: 100%|██████████| 6/6 [00:05<00:00, 1.14it/s]
job finished, cleaning up
job pau at: Mon Jun 2 13:20:49 CDT 2025

@@ -1,27 +0,0 @@
INFO: Converting OCI blobs to SIF format
INFO: Starting build...
Getting image source signatures
Copying blob sha256:96d54c3075c9eeaed5561fd620828fd6bb5d80ecae7cb25f9ba5f7d88ea6e15c
Copying blob sha256:09d415c238d76b32a7ea4a6e6add9542db9a5641f7f183af70aae185d0709e58
Copying blob sha256:9fe6e2e61518cba6844870c03b285737daec35e62baf25ae7744629ed3a7b470
Copying blob sha256:41f16248e682693ff20b3032c1d5e5541cc87c5af898ae2ff9b24d2940e59100
Copying blob sha256:95d7b781703928cf3c4eece39d800cccb76728c375fedf51ecd83833fb25e458
Copying blob sha256:8f6c9048534734f4c873935293b7296225846ceb31c1a158400a67ea170dde7f
Copying blob sha256:ab17245097e491b9368790714f9d90ed447bf0973bd677cfe6f2456d62b72a13
Copying blob sha256:dfecd7e9912b76ed460b8edd5a85f1943666e38a973ab5458177cf2c7c3110e3
Copying blob sha256:464a8f74544589bf7b57f9a4cadcb6681e5ed00758f6c35025e691df4e88e890
Copying blob sha256:61d26dce6d4129f40549457a063c82aca2c606d73ef156d5ac7e495e1d52530a
Copying blob sha256:227a9906e6cccfa3aee837559aeb3fdcbf4409286dd4dd0a37287cfd483c37f6
Copying blob sha256:c826e867602d3c7a5d3b8a552e49d51c58cccf42c31d016a660a50b7f451ef09
Copying blob sha256:d40507eacecbbd8647bcee51d03f8b8cc86044d73cb72448112d49a08b8feaac
Copying blob sha256:93c7cb8303f3b8ca1165c92b4b55a08973e8bd1a1360dd7bc3cb8bd18804d2a8
Copying blob sha256:7f2d4a3887cae1984105738d5887b3ed325095939dfe31e89d5b47212b7f6479
Copying blob sha256:167c57c419bc5ef23ffe823e05c1cac741246ef69352f36a2724e2f4c276f52b
Copying blob sha256:1b456af08bb7c15512a9be77ce1ed44ce87f2c52315c99cbb2a99dd786adb4cb
Copying blob sha256:054fcf1bbe967cf874bfa40161b3c559f8cf03ca1e05a532a69a8edca4d8d0e5
Copying blob sha256:e26ee59fb49e43a8046b8c3812c52cc62bb8e5772e3323ff84c76a5715668c36
Copying blob sha256:4f4fb700ef54461cfa02571ae0db9a0dc1e0cdb5577484a6d75e68dc38e8acc1
Copying blob sha256:da54e4cf5022248356962228df4c357d330f5f87e2ebc0fcff2b766400721cef
Copying blob sha256:684176763e41ba50c8aa61c6e6eb6aec1ac35eea61710971f410dd1a5a2953a8
Copying blob sha256:78c24341e0f9d5ae00c21f5dd0a35adf62f5c1ba2618b5e0c7e45994eb69f6b5
Copying blob sha256:435f630eb19ee65f1b1e2db0d34b278037511d4344ca482b720c6bb1f70b8f58

@@ -17,5 +17,5 @@ module load singularity
export SINGULARITY_CACHEDIR=$TMPDIR
-singularity pull olmocr.sif docker://alleninstituteforai/olmocr:latest
+singularity pull olmocr_container.sif docker://alleninstituteforai/olmocr:latest

@@ -12,7 +12,7 @@ olmo = AutoModelForCausalLM.from_pretrained("allenai/OLMo-2-1124-7B").to(device)
tokenizer = AutoTokenizer.from_pretrained("allenai/OLMo-2-1124-7B")
#priming prompt
-prompt_1 = "For the GIVEN DATA, Please categorize it based on the following numbered characteristics: \n\n 1: YES/NO (Characteristic 1. This is an English language empirical study. English language empirical studies discuss data or observations.) \n 2: YES/NO (Characteristic 2. This focuses on free and open source software (FOSS). The focus of this paper is on free or open source software projects and ecosystems.) \n 3: YES/NO (Characteristic 3. This focuses on FOSS project evolution. FOSS project evolution describes changes to free and open source projects and ecosystems.) \n 4: YES/NO (Characteristic 4. This focuses on FOSS project adaptation. FOSS project adaptation describes the intentional changes made by projects to better align with the project's broader environment.) \n\n Only respond with the appropriate number followed by 'YES' if the characteristic is present in the provided data or 'NO' if it is not (e.g. '1. YES; 2. NO;'). Do not provide any additional information."
+prompt_1 = "For the GIVEN DATA, Please categorize it based on the following numbered characteristics: \n\n 1: YES/NO (Characteristic 1. This is an English language empirical study. Empirical studies discuss data or observations.) \n 2: YES/NO (Characteristic 2. This discusses free and open source software (FOSS or OSS). The focus of the GIVEN DATA is on free or open source software projects or ecosystems.) \n 3: YES/NO (Characteristic 3. The GIVEN DATA discusses FOSS project evolution. FOSS project evolution describes any changes to free and open source projects.) \n 4: YES/NO (Characteristic 4. This GIVEN DATA discusses FOSS project adaptation. FOSS project adaptation describes the intentional strategic changes made by projects to better align with the project's broader environment.) \n\n Only respond with the appropriate number followed by 'YES' if the characteristic is present in the provided data or 'NO' if it is not (e.g. '1. YES; 2. NO;'). Do not provide any additional information."
example_4 = "Example 4: TITLE - Analysis of Open Source Software Evolution Using Evolution Curve Method \n ABSTRACT - Design and evolution of modem information systems is influenced by many factors: technical, organizational, social, and psychological. This is especially true for open source software systems (OSSS), when many developers from different backgrounds interact, share their ideas and contribute towards the development and improvement of a software product. The evolution of all OSSS is a continuous process of source code development, adaptation, improvement and maintenance. Studying changes to the various characteristics of source code can help us understand the evolution of a software system. In this paper, the software evolution process is analyzed using a proposed Evolution curve (E-curve) method, which is based on information theoretic metrics of source code. The method allows identifying major evolution stages and transition points of an analyzed software system. The application of the E-curves is demonstrated for the eMule system. .\n CATEGORIES: 1. YES; 2. YES; 3.YES; 4. NO"
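
Note on the reply format: prompt_1 asks the model to answer in the form '1. YES; 2. NO;', and the few-shot examples end with lines such as "CATEGORIES: 1. YES; 2. YES; 3.YES; 4. NO". A minimal sketch of how such a reply could be turned into a dictionary is shown here. The helper name `parse_categories` and its regular expression are assumptions for illustration and are not part of this commit; the script's actual post-processing is not shown in the diff.

```python
import re

# Hypothetical helper, not part of the commit. It accepts an optional space
# after the number separator, so both "3. YES" and "3.YES" are matched.
_PAIR = re.compile(r"([1-4])\s*[.:]\s*(YES|NO)\b", re.IGNORECASE)


def parse_categories(reply: str) -> dict[int, bool]:
    """Map each characteristic number found in the reply to True (YES) or False (NO)."""
    result: dict[int, bool] = {}
    for number, answer in _PAIR.findall(reply):
        # Keep the first answer per characteristic; later repeats are ignored.
        result.setdefault(int(number), answer.upper() == "YES")
    return result


if __name__ == "__main__":
    print(parse_categories("CATEGORIES: 1. YES; 2. YES; 3.YES; 4. NO"))
```

Characteristics missing from the reply are simply absent from the returned dictionary, so a caller can check `len(result) == 4` to flag incomplete generations.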

@@ -1,9 +0,0 @@
starting the job at: Mon Jun 2 09:24:01 CDT 2025
setting up the environment
running the p1 categorization script
cuda
NVIDIA A100-PCIE-40GB
_CudaDeviceProperties(name='NVIDIA A100-PCIE-40GB', major=8, minor=0, total_memory=40442MB, multi_processor_count=108, uuid=c91b110a-9eb1-15b6-ff0a-7aeb47b26ff0, L2_cache_size=40MB)
Loading checkpoint shards: 0%| | 0/6 [00:00<?, ?it/s] Loading checkpoint shards: 17%|█▋ | 1/6 [00:00<00:04, 1.24it/s] Loading checkpoint shards: 33%|███▎ | 2/6 [00:01<00:03, 1.08it/s] Loading checkpoint shards: 50%|█████ | 3/6 [00:02<00:02, 1.05it/s] Loading checkpoint shards: 67%|██████▋ | 4/6 [00:03<00:01, 1.01it/s] Loading checkpoint shards: 83%|████████▎ | 5/6 [00:04<00:01, 1.02s/it] Loading checkpoint shards: 100%|██████████| 6/6 [00:05<00:00, 1.12it/s] Loading checkpoint shards: 100%|██████████| 6/6 [00:05<00:00, 1.08it/s]
job finished, cleaning up
job pau at: Mon Jun 2 11:08:43 CDT 2025