From c18c9401e9f54f04aae3a945e3d1b817d73630fd Mon Sep 17 00:00:00 2001 From: Carl Colglazier Date: Tue, 21 May 2024 15:36:08 -0500 Subject: [PATCH] Submitted revisions. --- acm.qmd | 42 +++++++++++++++++++++++++----------------- 1 file changed, 25 insertions(+), 17 deletions(-) diff --git a/acm.qmd b/acm.qmd index ae8d2a7..1af62f6 100644 --- a/acm.qmd +++ b/acm.qmd @@ -397,25 +397,23 @@ Based on these findings, we suggest a need for better ways for newcomers to find ## Constraints and Evaluation -One challenge in building such a system is the decentralized nature of the system. A single, central actor which collects data from servers and then distributes recommendations would be antithetical to the decentralized nature of Mastodon. Instead, we propose a system where servers can report the top hashtags by the number of unique accounts on the server using them during the last three months. Such a system would be opt-in and require few additional server resources since tags already have their own database table. +The decentralized web presents unique challenges for recommendation systems. Centralized recommendation systems can collect data from all users and use this data to make recommendations. However, this is less desirable on the decentralized web, where data is spread across many servers and users may not want to share their data with a central authority. Instead, I propose a system where servers can report the top hashtags by the number of unique accounts on the server using them during the last three months. Such a system would be opt-in and require few additional server resources since tags already have their own database table. Because each server only reports aggregated counts of publicly posted hashtags, this also reduces the risk of privacy violations. -We evaluate the system in part using the accounts which moved between servers. Based on their posting history (e.g. hashtags), can the recommendations system predict where they will move to? 
+In the Mastodon context, the cold start problem has two possible facets: there is no information on new servers and there is also no information on new users. New servers are thus likely prone to falling for popularity bias: there is simply more data on larger servers. A common strategy to deal with new users is to ask for some initial preferences to build a workable user profile. In the case of this system, we ask the user to provide a set of tags which they are interested in. We then use these tags to find the top servers which match them. -In the Mastodon context, the cold start problem has two possible facets: there is no information on new servers and there is also no information on new users. New servers are thus likely prone to falling for popularity bias: there is simply more data on larger servers. A common strategy to deal with new users is to ask for some intitial preferences to create an initial workable user profile. -The decentralized web presents unique challenges for recommendation systems. Centralized recommendation systems can collect data from all users and use this data to make recommendations. However, this is less desirable on the decentralized web, where data is spread across many servers and users may not want to share their data with a central authority. +I plan to evaluate the system in part using the accounts which moved between servers. Based on their posting history (e.g. hashtags), can the recommendation system predict where they will move to? As my recommender system operates under the assumption that smaller, more topic-focused servers are better, it follows that a diverse set of niche results which only match a few tags is more helpful than a set of results which match a larger and broader set of tags. The system therefore presents results sorted in a manner which encourages a higher diversity of results. One current limitation of my system is that it does not account for the relationship between tags, e.g.
“union” and “unions” are essentially the same tag and “furry” and “fursuit” are highly related tags which are in similar areas of embedded space. In future revisions, I hope to account for the relationship between similar tags and pull the top servers from clusters of highly related tags with the top priority going to clusters based on their number of selected tags. This system could be implemented efficiently in O(nt) time given a minimum cluster size of $t$. + ## Recommendation System Design -### Similarity Matrix - We use Okapi BM25 to construct a term frequency-inverse document frequency (TF-IDF) model to associate the top tags with each server using counts of tag-account pairs from each server for the term frequency and the number of servers that use each tag for the inverse document frequency. We then L2 normalize the vectors for each tag and calculate the cosine similarity between the tag vectors for each server. $$ @@ -434,14 +432,11 @@ $$ tfidf = \frac{tf \cdot idf}{\| tf \cdot idf \|_2} $$ -### Recommendation System -We then used the normalized TF-IDF matrix to produce recommendations using SVD. +We then used the normalized TF-IDF matrix to produce recommendations using SVD where the relationship between tags and servers can be presented as $A = U \Sigma V^{T}$. We then use the similarity matrix to find the top servers which match the user's selected tags. We can also suggest related tags to users based on the similarity between tags, $U \Sigma$. ## Applications - - ### Server Similarity Neighborhoods Mastodon provides two feeds in addition to a user's home timeline populated by accounts they follow: a local timeline with all public posts from their local server and a federated timeline which includes all posts from users followed by other users on their server. We suggest a third kind of timeline, a *neighborhood timeline*, which filters the federated timeline by topic. 
@@ -452,6 +447,8 @@ $$ \text{similarity}(A, B) = \frac{A \cdot B}{\|A\| \|B\|} $$ +For an example of how this might work in practice, consider a use-case for someone who is a researcher in the field of human-computer interaction. They might be situated with an account on `hci.social`, but also interested in discovering posts from accounts on similar servers. We can use the similarity matrix to find the top five servers most similar to `hci.social`, which are shown in @tbl-sim-servers. + ::: {#tbl-sim-servers} ```{r} @@ -515,9 +512,19 @@ Given a set of popular tags and a list of servers, we build a recommendation sys ## Evaluation -### Server Recommendations for Users +### Server Recommendations for Users (Offline) -For evaluation, we use data from posts on accounts during a different time period from the one we used to train the recommender system. The goal of the system is to suggest the best servers for these accounts. +#### Time-based + +For evaluation, we plan to use data from posts on accounts during a different time period from the one we used to train the recommender system. The goal of the system is to suggest the best servers for these accounts. + +#### Movement-based + +In parallel with the analysis of server survival, we also take an interest in users who moved servers since we can assume that these users found a server they liked better than their original server. We can use the recommender system to predict where these users will move to and use these predictions to evaluate the system. + +### Online Evaluation + +_I have also given some thought to online evaluation.
Could we use an alternative version of the front-end to produce recommendations for interesting servers from existing accounts?_ ### Rubustness to Limited Data @@ -550,12 +557,13 @@ We simulated various scenarios that limit both servers that report data and the # Discussion +This work provides a first step toward building a recommendation system for finding servers on the Fediverse based on empirical evidence of trace data from thousands of Fediverse newcomers. We find that servers matter and that users tend to move from larger servers to smaller servers. Our recommender system considers constraints in a novel context where data is decentralized and privacy is a major concern. We propose a federated recommendation system which can be implemented with minimal resources and which can provide value to users by helping them find servers which match their interests. + The analysis can also be improved by additionally focusing on factors lead to accounts remaining active or dropping out, which a particular focus on the actual activity of accounts over time. For instance, do accounts that interact with other users more remain active longer? Are there particular markers of activity that are more predictive of account retention? Future work could use these to provide suggests for ways to helps newcomers during the onboarding process. The observational nature of the data limit some of the causal claims we can make. It is unclear, for instance, if accounts on general servers are less likely to remain active because of the server itself or because of the type of users who join such servers. For example, it is conceivable that the kind of person who spends more time researching which server to join is more invested in their Mastodon experience than one who simply joins the first server they find. - ## Future Work Future work is necessary to determine the how well the recommendation system is at helping users find servers that match their interests.
This may involve user studies and interviews to determine how well the system works in practice. @@ -609,8 +617,8 @@ library(ggrepel) We also illustrate the potential value of such a system with three user stories: -**User Story 1**: Juan is a human-computer interaction researcher looking for a server to connect with colleagues and also share about his projects. +**User Story 1**: Juan is a human-computer interaction researcher looking for a server to connect with colleagues and also share about his projects. He is interested in finding a server with a focus on research and technology. Juan inputs the tags "research", "academia", and "technology" into the system and receives a list of servers which match his interests: `sciences.social`, `mathstodon.xyz`, `mas.to`, `synapse.cafe`. -**User Story 2** (Arthur) just wants to connect with friends and family. +**User Story 2** (Arthur) just wants to connect with friends and family. For some reason, Arthur clicks every single major category and gets the suggestions: `mas.to`, `mstdn.social`, `mastodon.world`, `mastodon.social`. -**User Story 3** (Tracy) has run a niche aesthetic blog on Tumblr for the last eight years and is curious about migrating to the Fediverse. \ No newline at end of file +**User Story 3** (Tracy) has run a niche fandom blog on Tumblr for the last eight years and is curious about migrating to the Fediverse. She inputs the tags "doctorwho", "fanart", and "fanfiction" and gets the suggestions: `blorbo.social`, `sakurajima.moe`, `toot.kif.rocks`, `socel.net`. \ No newline at end of file