updating scripts and models for classification; errors in citation csv

2025-05-30 13:33:41 -05:00 · 2025-05-30 13:33:41 -05:00 · d7e3d54e0f
commit d7e3d54e0f
parent c1f1d81f29
2 changed files with 53 additions and 14 deletions
--- a/models/p1-categorization.py
+++ b/models/p1-categorization.py
@@ -1,27 +1,64 @@
 from transformers import AutoModelForCausalLM, AutoTokenizer, OlmoForCausalLM
 import torch
+import csv 
+import pandas as pd 

 #load in the different models 
 device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
-olmo = AutoModelForCausalLM.from_pretrained("allenai/OLMo-2-0425-1B-Instruct").to(device)
-tokenizer = AutoTokenizer.from_pretrained("allenai/OLMo-2-0425-1B-Instruct")
+print(device)
+olmo = AutoModelForCausalLM.from_pretrained("allenai/OLMo-2-1124-7B").to(device)
+tokenizer = AutoTokenizer.from_pretrained("allenai/OLMo-2-1124-7B")

 #priming prompt
-first_sentence = "Given the following data:"
+prompt_1 = "For the GIVEN DATA, Please categorize it based on the following numbered characteristics: \n\n 1: YES/NO (Characteristic 1. This is an English language empirical study. English language empirical studies are academic papers written in English that study or analyze evidence. Literature reviews are not empirical studies.) \n 2: YES/NO (Characteristic 2. This focuses on free and open source software (FOSS). The focus of this paper is on FOSS projects and ecosystems.) \n 3: YES/NO (Characteristic 3. This focuses on FOSS project evolution. FOSS project evolution is the study of longitudinal changes to the characteristics of free and open source projects.) \n 4: YES/NO (Characteristic 4. This focuses on FOSS project adaptation. FOSS project adaptation describes the intentional changes made to the characteristics of FOSS projects to better align with the project's broader environment.) \n\n Only respond with the appropriate number followed by 'YES' if the characteristic is present in the provided data or 'NO' if it is not (e.g. '1. NO; 2. YES;'). Do not provide any additional information."

-data_prompt = "'Title - Underproduction: An Approach for Measuring Risk in Open Source Software \n Abstract - The widespread adoption of Free/Libre and Open Source Software (FLOSS) means that the ongoing maintenance of many widely used software components relies on the collaborative effort of volunteers who set their own priorities and choose their own tasks. We argue that this has created a new form of risk that we call 'underproduction' which occurs when the supply of software engineering labor becomes out of alignment with the demand of people who rely on the software produced. We present a conceptual framework for identifying relative underproduction in software as well as a statistical method for applying our framework to a comprehensive dataset from the Debian GNU/Linux distribution that includes 21,902 source packages and the full history of 461,656 bugs. We draw on this application to present two experiments: (1) a demonstration of how our technique can be used to identify at-risk software packages in a large FLOSS repository and (2) a validation of these results using an alternate indicator of package risk. Our analysis demonstrates both the utility of our approach and reveals the existence of widespread underproduction in a range of widely-installed software components in Debian.'"
+example_1 = "Example 1: TITLE - Understanding the OSS Communities of Deep Learning Frameworks: A Comparative Case Study of PYTORCH and TENSORFLOW \n ABSTRACT - Over the past two decades, deep learning has received tremendous success in developing software systems across various domains. Deep learning frameworks have been proposed to facilitate the development of such software systems, among which, PYTORCH and TENSORFLOW stand out as notable examples. Considerable attention focuses on exploring software engineering practices and addressing diverse technical aspects in developing and deploying deep learning frameworks and software systems. Despite these efforts, little is known about the open source software communities involved in the development of deep learning frameworks. In this article, we perform a comparative investigation into the open source software communities of the two representative deep learning frameworks, PYTORCH and TENSORFLOW. To facilitate the investigation, we compile a dataset of 2,792 and 3,288 code commit authors, along with 9,826 and 19,750 participants engaged in issue events on GITHUB, from the two communities, respectively. With the dataset, we first characterize the structures of the two communities by employing four operationalizations to classify contributors into various roles and inspect the contributions made by common contributors across the two communities. We then conduct a longitudinal analysis to characterize the evolution of the two communities across various releases, in terms of the numbers of contributors with various roles and role transitions among contributors. Finally, we explore the causal effects between community characteristics and the popularity of the two frameworks. We find that the TENSORFLOW community harbors a larger base of contributors, encompassing a higher proportion of core developers and a more extensive cohort of active users compared to the PYTORCH community. In terms of the technical background of the developers, 64.4% and 56.1% developers in the PYTORCH and TENSORFLOW communities are employed by the leading companies of the corresponding open source software projects, Meta and Google, respectively; 25.9% and 21.9% core developers in the PYTORCH and TENSORFLOW communities possess Ph.D. degrees, while 77.2% and 77.7% contribute to other machine learning or deep learning open source projects, respectively. Developers contributing to both communities demonstrate spatial and temporal similarities to some extent in their pull requests across the respective projects. The evolution of contributors with various roles exhibits a consistent upward trend over time in the PYTORCH community. Conversely, a noticeable turning point in the growth of contributors characterizes the evolution of the TENSORFLOW community. Both communities show a statistically significant decreasing trend in the inflow rates of core developers. Furthermore, we observe statistically significant causal effects between the expansion of communities and retention of core developers and the popularity of deep learning frameworks. Based on our findings, we discuss implications, provide recommendations for sustaining open source software communities of deep learning frameworks, and outline directions for future research.\n CATEGORIES: 1. YES; 2. YES; 3. YES; 4. NO"

+example_4 = "Example 4: TITLE - “Needle” hidden in silk floss: Inactivation effect and mechanism of melamine sponge loaded bismuth oxide composite copper-metal organic framework (MS/Bi2O3@Cu-MOF) as floating photocatalyst on Microcystis aeruginosa \n ABSTRACT - Photocatalytic technology showed significant potential for addressing the issue of cyanobacterial blooms resulting from eutrophication in bodies of water. However, the traditional powder materials were easy to agglomerate and settle, which led to the decrease of photocatalytic activity. The emergence of floating photocatalyst was important for the practical application of controlling harmful algal blooms. This study was based on the efficient powder photocatalyst bismuth oxide composite copper-metal organic framework (Bi2O3 @Cu-MOF), which was successfully loaded onto melamine sponge (MS) by sodium alginate immobilization to prepare a floating photocatalyst MS/Bi2O3 @Cu-MOF for the inactivation of Microcystis aeruginosa (M. aeruginosa) under visible light. When the capacity was 0.4 g (CA0.4), MS/Bi2O3 @Cu-MOF showed good photocatalytic activity, and the inactivation rate of M. aeruginosa reached 74.462% after 120 h. MS/Bi2O3 @Cu-MOF-CA0.4 showed a large specific surface area of 30.490 m2/g and an average pore size of 22.862 nm, belonging to mesoporous materials. After 120 h of treatment, the content of soluble protein in the MS/Bi2O3 @Cu-MOF-CA0.4 treatment group decreased to 0.365 mg/L, the content of chlorophyll a (chla) was 0.023 mg/L, the content of malondialdehyde (MDA) increased to 3.168 nmol/mgprot, and the contents of various antioxidant enzymes experienced drastic changes, first increasing and then decreasing. The photocatalytic process generated center dot OH and center dot O2-, which played key role in inactivating the algae cells. Additionally, the release of Cu2+ and adsorption of the material also contributed to the process. \n CATEGORIES: 1. YES; 2. NO; 3. NO; 4. NO;"

-third_prompt="please categorize it based on the following numbered characteristics: \n\n 1: YES/NO (Characteristic 1. This is an English language empirical study, this an academic papers written in Egnlish that studies or analyzes evidence. Literature reviews are not empirical studies.)  \n 2: YES/NO (Characteristic 2. This focuses on FOSS projects, the focus of the research work is on the domain of free and open source software projects.) \n 3: YES/NO (Characteristic 3. This studies FOSS evolution, the data focuses on longitudinal changes to free and open source projects over time.) \n 4: YES/NO (Characteristic 4. This studies FOSS adaptation, the data focuses on intentional changes made by free and open source software projects to better align themselves with their broader environment.) \n\n Only respond with the appropriate number followed by 'YES' if the characteristic is present in the provided data or 'NO' if it is not (e.g. '1: NO; 2: YES;'. Do not provide any additional information."
+example_3 = "Example 3: TITLE - Open Source Software Community Inclusion Initiatives to Support Women Participation \n ABSTRACT - This paper focuses on the inclusion initiatives of Open Source Software (OSS) Communities to support women who participate in their online communities. In recent years, media and research has highlighted the negative experiences of women in OSS and we believe that could be detrimental to the women of OSS. Therefore, in this research, we built upon the research that demonstrates the value of Codes of Conduct for minorities in an online community. Additionally, we focus on women only spaces in OSS, because past research on women and IT shows that women perform better when they can build connections and mentoring networks with other women. We investigated 355 OSS websites for presence of women only spaces and searched for, collected and analyzed the Codes of Conduct on the websites of these OSS. Qualitative content analysis of the websites show that only 12 out of 355 websites have women only sections. Less than ten percent (28) of the analyzed websites had a code of conduct.\n CATEGORIES: 1. YES; 2. YES; 3. NO; 4. NO"

-prompt = f"{first_sentence}\n{data_prompt}\n{third_prompt}"
+example_2 = "Example 2: TITLE - An Exploratory Mixed-methods Study on General Data Protection Regulation (GDPR) Compliance in Open-Source Software \n ABSTRACT - Background: Governments worldwide are considering data privacy regulations. These laws, such as the European Union’s General Data Protection Regulation (GDPR), require software developers to meet privacy-related requirements when interacting with users’ data. Prior research describes the impact of such laws on software development, but only for commercial software. Although open-source software is commonly integrated into regulated software, and thus must be engineered or adapted for compliance, we do not know how such laws impact open-source software development. Aims: To understand how data privacy laws affect open-source software (OSS) development, we focus on the European Union’s GDPR, as it is the most prominent such law. We investigated how GDPR compliance activities influence OSS developer activity (RQ1), how OSS developers perceive fulfilling GDPR requirements (RQ2), the most challenging GDPR requirements to implement (RQ3), and how OSS developers assess GDPR compliance (RQ4). Method: We distributed an online survey to explore perceptions of GDPR implementations from open-source developers (N=56). To augment this analysis, we further conducted a repository mining study to analyze development metrics on pull requests (N=31,462) submitted to open-source GitHub repositories. Results: Our results suggest GDPR policies complicate OSS development and introduce challenges, primarily regarding the management of users’ data, implementation costs and time, and assessments of compliance. Moreover, we observed negative perceptions of the GDPR from OSS developers and significant increases in development activity, in particular metrics related to coding and reviewing, on GitHub pull requests related to GDPR compliance. Conclusions: Our findings provide future research directions and implications for improving data privacy policies, motivating the need for relevant resources and automated tools to support data privacy regulation implementation and compliance efforts in OSS. \n CATEGORIES: 1. YES; 2. YES; 3. YES; 4. YES;"

-inputs = tokenizer(prompt, return_tensors='pt', return_token_type_ids=False).to(device)
-
-#deterministic sampling 
-response = olmo.generate(**inputs, max_new_tokens=256, do_sample=False)
-response_txt = tokenizer.batch_decode(response, skip_special_tokens=True)[0]
-
-with open('/home/nws8519/git/adaptation-slr/trial-output.txt', 'w') as file:
-    file.write(response_txt)
+with open("cites/auto-dedup-cites.csv", mode='r', newline='') as file:
+    reader = csv.reader(file)
+    array_of_categorizations = []
+    index = -1
+    for row in reader:
+        index += 1
+        if index == 0:  #skip the header row (assumes the CSV has one)
+            continue
+        cite_dict = {}
+        #organizing the data from each citation
+        cite_dict['key'] = row[0]
+        cite_dict['title'] = row[4]
+        cite_dict['abstract'] = row[10]
+        #prompt construction
+        given_data = f"GIVEN DATA: Title - {cite_dict['title']} \n Abstract - {cite_dict['abstract']}"
+        prompt = f"{prompt_1}\n\n{example_1}\n\n{example_2}\n\n{example_3}\n\n{example_4}\n\n{given_data}\n"
+        #handoff to the model
+        inputs = tokenizer(prompt, return_tensors='pt', return_token_type_ids=False).to(device)
+        #greedy (deterministic) decoding; keep only the newly generated tokens, not the echoed prompt
+        response = olmo.generate(**inputs, max_new_tokens=256, do_sample=False)
+        response_txt = tokenizer.batch_decode(response[:, inputs['input_ids'].shape[1]:], skip_special_tokens=True)[0]
+        #getting the resulting codes from the first CATEGORIES marker the model emits
+        codes_id = response_txt.find("CATEGORIES:")
+        if codes_id != -1:
+            result = response_txt[codes_id + len("CATEGORIES:"):].strip()
+        else:
+            #no CATEGORIES marker in the response: nothing to parse, fall back to NULL codes
+            result = ""
+            for code in ("1", "2", "3", "4"):
+                cite_dict[code] = "NULL"
+        #writing them to the citation_dict
+        for item in result.strip(";").split(";"):
+            key_value = item.strip().split('. ')
+            if len(key_value) == 2:
+                key = key_value[0]
+                value = key_value[1]
+                cite_dict[key] = value
+        array_of_categorizations.append(cite_dict)

+    #CSV everything
+    df = pd.DataFrame(array_of_categorizations)
+    df.to_csv('053025_olmo_categorized_citations.csv', index=False)
--- a/scripts/p1-cat.sh
+++ b/scripts/p1-cat.sh
@@ -12,6 +12,8 @@
 #SBATCH --mail-type=BEGIN,END,FAIL
 #SBATCH --mail-user=gaughan@u.northwestern.edu

+echo "starting the job at: $(date)"
+
 echo "setting up the environment"

 module purge
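
Review note on the code-parsing step in models/p1-categorization.py: splitting each item on '. ' drops any code the model writes without a space (e.g. '3.YES') or with a colon (e.g. '4: NO'). A minimal sketch of a more tolerant parser follows, assuming the response still carries a "CATEGORIES:" marker and that only the first line after it matters. The names `parse_codes` and `CODE_PATTERN` are hypothetical and not part of this commit.

```python
import re

# Hypothetical helper (not in the commit): tolerant parsing of strings such as
# "CATEGORIES: 1. YES; 2. NO; 3.YES; 4: no".
CODE_PATTERN = re.compile(r"([1-4])\s*[.:]\s*(YES|NO)\b", re.IGNORECASE)


def parse_codes(text):
    """Map each characteristic number ("1".."4") to "YES", "NO", or "NULL"."""
    codes = {key: "NULL" for key in ("1", "2", "3", "4")}
    marker = text.find("CATEGORIES:")
    if marker == -1:
        # No marker at all: leave every code as NULL.
        return codes
    # Assumption: the answer sits on the first line after the marker, so any
    # further text the model generates is ignored.
    tail = text[marker + len("CATEGORIES:"):].strip()
    first_line = tail.splitlines()[0] if tail else ""
    for number, answer in CODE_PATTERN.findall(first_line):
        # Keep the first answer given for each characteristic.
        if codes[number] == "NULL":
            codes[number] = answer.upper()
    return codes


if __name__ == "__main__":
    print(parse_codes(" CATEGORIES: 1. YES; 2. NO; 3.YES; 4: no\nExample 5: TITLE - ..."))
```

In the loop over citations, `cite_dict.update(parse_codes(response_txt))` could stand in for the manual split, which also covers the case where no marker is found.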