updating new git organization to remove sif file

updating with new container, collected categorizations
updating (and failing) to plot categorization with sankey diagram
5 changed files with 2975 additions and 2 deletions
--- a/.gitignore
+++ b/.gitignore
@@ -0,0 +1 @@
+*.sif
--- a/cites/060325_olmo_categorized_citations.csv
+++ b/cites/060325_olmo_categorized_citations.csv
--- a/cites/second-p1-categorization.log
+++ b/cites/second-p1-categorization.log
@@ -0,0 +1,9 @@
+starting the job at: Mon Jun  2 22:58:46 CDT 2025
+setting up the environment
+running the p1 categorization script
+cuda
+NVIDIA A100-PCIE-40GB
+_CudaDeviceProperties(name='NVIDIA A100-PCIE-40GB', major=8, minor=0, total_memory=40442MB, multi_processor_count=108, uuid=a48cfab5-6d74-8479-c725-d4a6e53059e3, L2_cache_size=40MB)
+
Loading checkpoint shards:   0%|          | 0/6 [00:00<?, ?it/s]
Loading checkpoint shards:  17%|█▋        | 1/6 [00:01<00:06,  1.36s/it]
Loading checkpoint shards:  33%|███▎      | 2/6 [00:02<00:05,  1.33s/it]
Loading checkpoint shards:  50%|█████     | 3/6 [00:03<00:03,  1.28s/it]
Loading checkpoint shards:  67%|██████▋   | 4/6 [00:05<00:02,  1.46s/it]
Loading checkpoint shards:  83%|████████▎ | 5/6 [00:07<00:01,  1.47s/it]
Loading checkpoint shards: 100%|██████████| 6/6 [00:08<00:00,  1.30s/it]
Loading checkpoint shards: 100%|██████████| 6/6 [00:08<00:00,  1.35s/it]
+job finished, cleaning up
+job pau at: Tue Jun  3 00:46:04 CDT 2025
--- a/containers/olmocr-pull.log
+++ b/containers/olmocr-pull.log
--- a/models/p1-categorization.py
+++ b/models/p1-categorization.py
@@ -12,7 +12,7 @@ olmo = AutoModelForCausalLM.from_pretrained("allenai/OLMo-2-1124-7B").to(device)
 tokenizer = AutoTokenizer.from_pretrained("allenai/OLMo-2-1124-7B")

 #priming prompt
-prompt_1 = "For the GIVEN DATA, Please categorize it based on the following numbered characteristics: \n\n 1: YES/NO (Characteristic 1. This is an English language empirical study. Empirical studies discuss data or observations.) \n 2: YES/NO (Characteristic 2. This discusses free and open source software (FOSS or OSS). The focus of the GIVEN DATA is on free or open source software projects or ecosystems.) \n 3: YES/NO (Characteristic 3. The GIVEN DATA discusses FOSS project evolution. FOSS project evolution describes any changes to free and open source projects.) \n 4: YES/NO (Characteristic 4. This GIVEN DATA discusses FOSS project adaptation. FOSS project adaptation describes the intentional strategic changes made by projects to better align with the project's broader environment.) \n\n Only respond with the appropriate number followed by 'YES' if the characteristic is present in the provided data or 'NO' if it is not (e.g. '1. YES; 2. NO;'). Do not provide any additional information."
+prompt_1 = "For the GIVEN DATA, Please categorize it based on the following numbered characteristics: \n\n 1: YES/NO (Characteristic 1. This is an English language empirical study. Empirical studies discuss data or observations.) \n 2: YES/NO (Characteristic 2. This discusses free and open source software (FOSS or OSS). The focus of the GIVEN DATA is on free or open source software projects or ecosystems.) \n 3: YES/NO (Characteristic 3. The GIVEN DATA discusses FOSS project evolution. FOSS project evolution describes any changes to free and open source projects.) \n 4: YES/NO (Characteristic 4. This GIVEN DATA discusses FOSS project adaptation. FOSS project adaptation describes the intentional strategic changes made by projects to better align with the project's broader environment.) \n\n Characteristics 2, 3, and 4 can only be YES if the preceding characteristic was also a YES. \n\n Only respond with the appropriate number followed by 'YES' if the characteristic is present in the provided data or 'NO' if it is not (e.g. '1. YES; 2. NO;'). Do not provide any additional information."

 example_4 = "Example 4: TITLE - Analysis of Open Source Software Evolution Using Evolution Curve Method \n ABSTRACT - Design and evolution of modem information systems is influenced by many   factors: technical, organizational, social, and psychological. This is   especially true for open source software systems (OSSS), when many   developers from different backgrounds interact, share their ideas and   contribute towards the development and improvement of a software   product. The evolution of all OSSS is a continuous process of source   code development, adaptation, improvement and maintenance. Studying   changes to the various characteristics of source code can help us   understand the evolution of a software system. In this paper, the   software evolution process is analyzed using a proposed Evolution curve   (E-curve) method, which is based on information theoretic metrics of   source code. The method allows identifying major evolution stages and   transition points of an analyzed software system. The application of the   E-curves is demonstrated for the eMule system. .\n CATEGORIES: 1. YES; 2. YES; 3.YES; 4. NO"
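The sentence added to prompt_1 in this hunk asks the model to answer YES for characteristics 2, 3, and 4 only when the preceding characteristic was also YES. The model may not always follow that instruction, so the same rule could also be applied to the parsed output. The following is a minimal sketch of that idea, not the parsing used in p1-categorization.py (which this diff does not show). The function names `parse_categories` and `enforce_cascade` are hypothetical, and treating a missing characteristic as NO is an assumption.

```python
import re


def parse_categories(response: str, n: int = 4) -> dict[int, bool]:
    """Parse a reply such as '1. YES; 2. NO; 3.YES; 4. NO' into {1: True, ...}.

    Hypothetical helper: characteristics missing from the reply are
    treated as NO, which is an assumption rather than the script's behavior.
    """
    pairs = re.findall(r"\b(\d+)\s*[.:]\s*(YES|NO)\b", response, flags=re.IGNORECASE)
    found = {int(num): answer.upper() == "YES" for num, answer in pairs}
    return {i: found.get(i, False) for i in range(1, n + 1)}


def enforce_cascade(categories: dict[int, bool]) -> dict[int, bool]:
    """Force every characteristic after the first NO to NO as well."""
    result = {}
    still_yes = True
    for i in sorted(categories):
        still_yes = still_yes and categories[i]
        result[i] = still_yes
    return result


if __name__ == "__main__":
    raw = parse_categories("1. YES; 2. NO; 3. YES; 4. YES")
    print(enforce_cascade(raw))
```

Keeping parsing and the cascade rule as separate steps makes it possible to count how often the model violates the rule before the correction is applied.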

@@ -62,4 +62,4 @@ with open("cites/053025_man_filtered_dedup.csv", mode='r', newline='') as file:
        array_of_categorizations.append(cite_dict)
    #CSV everything
    df = pd.DataFrame(array_of_categorizations)
-    df.to_csv('060225_olmo_categorized_citations.csv', index=False)
+    df.to_csv('060325_olmo_categorized_citations.csv', index=False)
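This hunk only bumps the hard-coded MMDDYY prefix of the output file. One way to avoid editing the script for every run is to derive the prefix from the current date. The following is a hedged sketch under stated assumptions: `dated_csv_path` is a hypothetical helper, and writing into `cites/` is an assumption based on where the categorized CSV is committed (the script as shown writes to the working directory).

```python
from datetime import date
from pathlib import Path
from typing import Optional


def dated_csv_path(
    suffix: str = "olmo_categorized_citations.csv",
    directory: str = "cites",  # assumption: output belongs next to the committed CSVs
    today: Optional[date] = None,
) -> Path:
    """Build an output path with an MMDDYY prefix, e.g. 060325_<suffix>."""
    today = today or date.today()
    return Path(directory) / f"{today.strftime('%m%d%y')}_{suffix}"


if __name__ == "__main__":
    # In the script this could replace the literal file name:
    # df.to_csv(dated_csv_path(), index=False)
    print(dated_csv_path(today=date(2025, 6, 3)))
```

Note that a job that starts before midnight and finishes after it (as the run logged in second-p1-categorization.log did) would pick up the finish date if the path is built at write time; building the path once at startup avoids that ambiguity.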
Author	SHA1	Message	Date
mgaughan	c50a3b57ff	updating new git organization to remove sif file	2025-06-03 09:52:20 -05:00
mgaughan	ff8ca0b46e	updating with new container, collected categorizations	2025-06-03 09:43:25 -05:00
mgaughan	c6f4a244f4	updating (and failing) to plot categorization with sankey diagram	2025-06-02 22:40:48 -05:00
mgaughan	9403c79c44	pulling new olmocr image and new categorization stuff	2025-06-02 21:24:53 -05:00
mgaughan	c5df6cb6c6	removing ill categorizations	2025-06-02 11:35:45 -05:00
mgaughan	63450ba7ef	now with updated categorizations	2025-06-02 11:29:59 -05:00
mgaughan	5ed797e971	trying to get olmocr to run, updated categorization values	2025-06-02 11:27:23 -05:00
mgaughan	d8b9ca9dea	updating with docker images and categorized citations	2025-06-02 09:01:18 -05:00
mgaughan	c7448f2fc2	trying to load-balance the few-shot a bit more	2025-05-30 21:45:30 -05:00
mgaughan	225d7f53c8	bad categorization data, some restructuring of the repo	2025-05-30 21:36:18 -05:00
mgaughan	9985e190e7	updated with preliminary categorization	2025-05-30 21:20:36 -05:00
mgaughan	c3bb0801a2	~final~ update to categorization script	2025-05-30 16:39:24 -05:00
mgaughan	86e2cd3ed8	updating with manual dedup of citations	2025-05-30 16:37:03 -05:00
mgaughan	1d63537027	redoing the dedup csv, something wrong with the other one	2025-05-30 13:52:13 -05:00
mgaughan	9d86f24c41	updating scripts and models for classification; errors in citation csv	2025-05-30 13:33:41 -05:00
mgaughan	17c69a6c92	updating prompts for categorization trial	2025-05-20 23:12:11 -05:00
mgaughan	7aedc1edbb	mid-point on setting up the olmo models on quest; updating the organization of different scripts	2025-05-20 21:32:44 -05:00