
Compare commits

17 Commits

Author SHA1 Message Date
mgaughan c50a3b57ff updating new git organization to remove sif file 2025-06-03 09:52:20 -05:00
mgaughan ff8ca0b46e updating with new container, collected categorizations 2025-06-03 09:43:25 -05:00
mgaughan c6f4a244f4 updating (and failing) to plot categorization with sankey diagram 2025-06-02 22:40:48 -05:00
mgaughan 9403c79c44 pulling new olmocr image and new categorization stuff 2025-06-02 21:24:53 -05:00
mgaughan c5df6cb6c6 removing ill categorizations 2025-06-02 11:35:45 -05:00
mgaughan 63450ba7ef now with updated categorizations 2025-06-02 11:29:59 -05:00
mgaughan 5ed797e971 trying to get olmocr to run, updated categorization values 2025-06-02 11:27:23 -05:00
mgaughan d8b9ca9dea updating with docker images and categorized citations 2025-06-02 09:01:18 -05:00
mgaughan c7448f2fc2 trying to load-balance the few-shot a bit more 2025-05-30 21:45:30 -05:00
mgaughan 225d7f53c8 bad categorization data, some restructuring of the repo 2025-05-30 21:36:18 -05:00
mgaughan 9985e190e7 updated with preliminary categorization 2025-05-30 21:20:36 -05:00
mgaughan c3bb0801a2 ~final~ update to categorization script 2025-05-30 16:39:24 -05:00
mgaughan 86e2cd3ed8 updating with manual dedup of citations 2025-05-30 16:37:03 -05:00
mgaughan 1d63537027 redoing the dedup csv, something wrong with the other one 2025-05-30 13:52:13 -05:00
mgaughan 9d86f24c41 updating scripts and models for classification; errors in citation csv 2025-05-30 13:33:41 -05:00
mgaughan 17c69a6c92 updating prompts for categorization trial 2025-05-20 23:12:11 -05:00
mgaughan 7aedc1edbb mid-point on setting up the olmo models on quest; updating the organization of different scripts 2025-05-20 21:32:44 -05:00
5 changed files with 2975 additions and 2 deletions

.gitignore

@@ -0,0 +1 @@
*.sif

File diff suppressed because it is too large


@@ -0,0 +1,9 @@
starting the job at: Mon Jun 2 22:58:46 CDT 2025
setting up the environment
running the p1 categorization script
cuda
NVIDIA A100-PCIE-40GB
_CudaDeviceProperties(name='NVIDIA A100-PCIE-40GB', major=8, minor=0, total_memory=40442MB, multi_processor_count=108, uuid=a48cfab5-6d74-8479-c725-d4a6e53059e3, L2_cache_size=40MB)
Loading checkpoint shards: 0%| | 0/6 [00:00<?, ?it/s] Loading checkpoint shards: 17%|█▋ | 1/6 [00:01<00:06, 1.36s/it] Loading checkpoint shards: 33%|███▎ | 2/6 [00:02<00:05, 1.33s/it] Loading checkpoint shards: 50%|█████ | 3/6 [00:03<00:03, 1.28s/it] Loading checkpoint shards: 67%|██████▋ | 4/6 [00:05<00:02, 1.46s/it] Loading checkpoint shards: 83%|████████▎ | 5/6 [00:07<00:01, 1.47s/it] Loading checkpoint shards: 100%|██████████| 6/6 [00:08<00:00, 1.30s/it] Loading checkpoint shards: 100%|██████████| 6/6 [00:08<00:00, 1.35s/it]
job finished, cleaning up
job pau at: Tue Jun 3 00:46:04 CDT 2025
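
Reviewer note: the two timestamps in this log give the wall-clock time of the categorization run. A minimal sketch of that arithmetic follows. The helper name `parse_job_timestamp` is hypothetical and not part of the repository. The sketch assumes both stamps share one timezone, which holds here since both are CDT.

```python
from datetime import datetime


def parse_job_timestamp(stamp: str) -> datetime:
    """Parse a `date`-style stamp such as 'Mon Jun 2 22:58:46 CDT 2025'.

    The timezone abbreviation is dropped before parsing because
    strptime's %Z does not portably recognize names like 'CDT'.
    """
    parts = stamp.split()
    del parts[-2]  # remove the timezone token (second-to-last field)
    return datetime.strptime(" ".join(parts), "%a %b %d %H:%M:%S %Y")


start = parse_job_timestamp("Mon Jun 2 22:58:46 CDT 2025")
end = parse_job_timestamp("Tue Jun 3 00:46:04 CDT 2025")
elapsed = end - start
print(elapsed)  # 1:47:18
```

So the job ran for roughly one hour and 47 minutes on the A100.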


@@ -12,7 +12,7 @@ olmo = AutoModelForCausalLM.from_pretrained("allenai/OLMo-2-1124-7B").to(device)
tokenizer = AutoTokenizer.from_pretrained("allenai/OLMo-2-1124-7B")
#priming prompt
-prompt_1 = "For the GIVEN DATA, Please categorize it based on the following numbered characteristics: \n\n 1: YES/NO (Characteristic 1. This is an English language empirical study. Empirical studies discuss data or observations.) \n 2: YES/NO (Characteristic 2. This discusses free and open source software (FOSS or OSS). The focus of the GIVEN DATA is on free or open source software projects or ecosystems.) \n 3: YES/NO (Characteristic 3. The GIVEN DATA discusses FOSS project evolution. FOSS project evolution describes any changes to free and open source projects.) \n 4: YES/NO (Characteristic 4. This GIVEN DATA discusses FOSS project adaptation. FOSS project adaptation describes the intentional strategic changes made by projects to better align with the project's broader environment.) \n\n Only respond with the appropriate number followed by 'YES' if the characteristic is present in the provided data or 'NO' if it is not (e.g. '1. YES; 2. NO;'). Do not provide any additional information."
+prompt_1 = "For the GIVEN DATA, Please categorize it based on the following numbered characteristics: \n\n 1: YES/NO (Characteristic 1. This is an English language empirical study. Empirical studies discuss data or observations.) \n 2: YES/NO (Characteristic 2. This discusses free and open source software (FOSS or OSS). The focus of the GIVEN DATA is on free or open source software projects or ecosystems.) \n 3: YES/NO (Characteristic 3. The GIVEN DATA discusses FOSS project evolution. FOSS project evolution describes any changes to free and open source projects.) \n 4: YES/NO (Characteristic 4. This GIVEN DATA discusses FOSS project adaptation. FOSS project adaptation describes the intentional strategic changes made by projects to better align with the project's broader environment.) \n\n Characteristics 2, 3, and 4 can only be YES if the preceding characteristic was also a YES. \n\n Only respond with the appropriate number followed by 'YES' if the characteristic is present in the provided data or 'NO' if it is not (e.g. '1. YES; 2. NO;'). Do not provide any additional information."
example_4 = "Example 4: TITLE - Analysis of Open Source Software Evolution Using Evolution Curve Method \n ABSTRACT - Design and evolution of modem information systems is influenced by many factors: technical, organizational, social, and psychological. This is especially true for open source software systems (OSSS), when many developers from different backgrounds interact, share their ideas and contribute towards the development and improvement of a software product. The evolution of all OSSS is a continuous process of source code development, adaptation, improvement and maintenance. Studying changes to the various characteristics of source code can help us understand the evolution of a software system. In this paper, the software evolution process is analyzed using a proposed Evolution curve (E-curve) method, which is based on information theoretic metrics of source code. The method allows identifying major evolution stages and transition points of an analyzed software system. The application of the E-curves is demonstrated for the eMule system. .\n CATEGORIES: 1. YES; 2. YES; 3.YES; 4. NO"
@@ -62,4 +62,4 @@ with open("cites/053025_man_filtered_dedup.csv", mode='r', newline='') as file:
array_of_categorizations.append(cite_dict)
#CSV everything
df = pd.DataFrame(array_of_categorizations)
-df.to_csv('060225_olmo_categorized_citations.csv', index=False)
+df.to_csv('060325_olmo_categorized_citations.csv', index=False)
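
Reviewer note: the new prompt sentence asks the model to answer YES to characteristics 2, 3, and 4 only when the preceding characteristic is also YES, but the model may not always comply. Below is a minimal sketch of how the response could be parsed and that rule checked after generation. The function names `parse_categories` and `enforce_cascade` are hypothetical and do not appear in the script. The sketch assumes the response follows the `'1. YES; 2. NO;'` pattern from the prompt, and it treats a missing characteristic as NO.

```python
import re


def parse_categories(response: str, n: int = 4) -> dict[int, bool]:
    """Turn '1. YES; 2. NO; ...' into {1: True, 2: False, ...}.

    Tolerates missing spaces such as '3.YES'. Only the first answer
    for each number is kept; a missing number is treated as NO.
    """
    found: dict[int, bool] = {}
    for num, answer in re.findall(r"(\d+)\s*[.:]\s*(YES|NO)", response, flags=re.IGNORECASE):
        found.setdefault(int(num), answer.upper() == "YES")
    return {i: found.get(i, False) for i in range(1, n + 1)}


def enforce_cascade(categories: dict[int, bool]) -> dict[int, bool]:
    """Force every characteristic after the first NO to NO."""
    result: dict[int, bool] = {}
    allowed = True
    for i in sorted(categories):
        allowed = allowed and categories[i]
        result[i] = allowed
    return result


parsed = parse_categories("1. YES; 2. YES; 3.YES; 4. NO")
inconsistent = parse_categories("1. YES; 2. NO; 3. YES; 4. YES")
cleaned = enforce_cascade(inconsistent)
```

Comparing `inconsistent` with `cleaned` would also show how often the model violates the rule, which may be worth logging alongside the CSV output.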