---
title: "Do Servers Matter on Mastodon? Data-driven Design for Decentralized Social Media"
author: Carl Colglazier
bibliography: ../../references.bib
format:
  ic2s2-pdf:
    include-in-header:
      - text: |
          \usepackage{tabularray}
execute:
  freeze: true
  echo: false
  error: false
  warning: false
  message: false
  cache: false
knitr:
  opts_knit:
    verbose: true
---

```{r, cache.extra = tools::md5sum("codebase/R/helpers.R")}
#| label: fig-account-timeline
#| fig-cap: "Accounts in the dataset created between January 2022 and March 2023. The top panels show the proportion of accounts still active 45 days after creation, the proportion of accounts that have moved, and the proportion of accounts that have been suspended. The bottom panel shows the count of accounts created each week. The dashed vertical lines in the bottom panel mark the day Elon Musk's acquisition of Twitter was announced, the day the acquisition closed, a day when Twitter suspended a number of prominent journalists, and a day when Twitter experienced an outage and started rate limiting accounts."
#| fig-height: 2.75
#| fig-width: 6.75
#| fig-env: figure*
#| fig-pos: tb!

library(here)
source(here("codebase/R/helpers.R"))
get_here <- here::here
account_timeline_plot()
```

Following the 2022 acquisition of Twitter, Mastodon, an open-source, decentralized social network and microblogging community, saw an increase in activity and attention as a potential Twitter alternative [@heFlockingMastodonTracking2023; @lacavaDriversSocialInfluence2023]. While millions of people set up new accounts and significantly increased the size of the network (@fig-account-timeline), many newcomers and potential newcomers found the process confusing, and many accounts did not remain active. Unlike centralized social media platforms, Mastodon is a network of independent servers, each with its own rules and norms [@nicholsonMastodonRulesCharacterizing2023]. Servers communicate with one another using the shared ActivityPub protocol, and accounts can move between Mastodon servers, but the local experience can vary widely from server to server.

Although attracting and retaining newcomers is a key challenge for online communities [@krautBuildingSuccessfulOnline2011, p. 182], Mastodon's onboarding process has not always been straightforward. Variation among servers can also present a challenge for newcomers, who may not be aware of the specific rules, norms, or general topics of interest on the server they are joining [@diazUsingMastodonWay2022]. Further, many Mastodon servers have specific norms that people coming from Twitter may find confusing, such as local norms around content warnings [@nicholsonMastodonRulesCharacterizing2023]. Guides and resources for people trying to join Mastodon offer mixed advice on choosing a server. Some suggest that the most important thing is to simply join any server and work from there [@krasnoffMastodon101How2022; @silberlingBeginnerGuideMastodon2023], while others have created tools and guides to help people find potential servers of interest by size and location [@rousseauMastodonInstances2017; @kingMastodonMe2024].

Mastodon's approach to onboarding has also changed over time. For much of 2020 and early 2021, the Mastodon developers closed signups to their flagship server and linked to an alternative server, which saw increased sign-ups during this period. They also linked to a list of servers on the Join Mastodon webpage [@mastodonggmbhServers], where all servers are pre-approved and follow the Mastodon Server Covenant, which guarantees certain content moderation standards and data protections. Starting in 2023, the Mastodon developers shifted toward making the flagship server the default when people sign up on the official Mastodon Android and iOS apps [@rochkoNewOnboardingExperience2023; @rothItGettingEasier2023].

We first ask the question: *Does server choice matter for Mastodon newcomers?* To answer it, we use profile data from over a million Mastodon accounts collected from public timelines and profile directories between October 1, 2020 and August 15, 2023. With the subset of these accounts created from May 1 to June 30, 2023, we fit a Kaplan–Meier estimator for account activity in the 91 days after creation (@fig-survival). We find that accounts on the 12 largest general instances featured at the top of the Join Mastodon webpage (which include the flagship server) are less likely to remain active than accounts created on other Join Mastodon servers.

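As a rough illustration of the estimator named above, the following base R sketch computes Kaplan-Meier survival probabilities by hand. It is not the analysis code behind @fig-survival: the function name `km_estimate`, the toy data, and the coding of the event (an account going inactive, with accounts still active at day 91 treated as censored) are all assumptions made for this example.

```r
# Hand-rolled Kaplan-Meier estimator on toy data (all values are made up).
# time:  days from account creation to the last observed activity
# event: 1 if the account went inactive, 0 if it was still active when
#        observation ended (censored) -- an assumed coding for illustration
km_estimate <- function(time, event) {
  event_times <- sort(unique(time[event == 1]))
  # Accounts still under observation just before each event time
  at_risk <- vapply(event_times, function(d) sum(time >= d), numeric(1))
  # Accounts that went inactive exactly at each event time
  events <- vapply(event_times, function(d) sum(time == d & event == 1), numeric(1))
  data.frame(
    time = event_times,
    at_risk = at_risk,
    events = events,
    survival = cumprod(1 - events / at_risk)
  )
}

toy <- data.frame(
  time  = c(5, 12, 12, 30, 45, 91, 91, 91),
  event = c(1, 1, 0, 1, 1, 0, 0, 0)
)
km <- km_estimate(toy$time, toy$event)
print(km)
```

Comparing two groups of servers, as in the analysis above, would amount to running the same estimator separately on each group's accounts.
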
To corroborate this model, we also use data from thousands of accounts that moved between Mastodon servers, taking advantage of the data portability of the platform. We treat these moves as a weighted directed network in which nodes represent servers, edges represent account moves from one server to another, and edge weights count the accounts that made each move. On this network, we fit an exponential family random graph model (ERGM) with terms for server size, open registrations, and language match between servers. We find that accounts are more likely to move from larger servers to smaller servers.

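The network construction described above can be sketched in base R as follows. This only shows how individual moves might be collapsed into weighted directed edges; it does not fit the ERGM itself. The server names and the `moves` data frame are hypothetical.

```r
# Toy move records: one row per account that moved (server names are hypothetical).
moves <- data.frame(
  from = c("big.example", "big.example", "big.example", "mid.example", "small.example"),
  to   = c("small.example", "small.example", "mid.example", "small.example", "big.example"),
  stringsAsFactors = FALSE
)

# Collapse individual moves into weighted directed edges:
# weight = number of accounts that moved from `from` to `to`.
edges <- aggregate(weight ~ from + to, data = transform(moves, weight = 1), FUN = sum)
edges <- edges[order(-edges$weight), ]

# Equivalent weighted adjacency matrix (rows = origin, columns = destination).
servers <- sort(unique(c(moves$from, moves$to)))
adjacency <- table(
  factor(moves$from, levels = servers),
  factor(moves$to, levels = servers)
)

print(edges)
```
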
```{=html}
<!--
We found that users who sign up on large, general topic servers are less likely to remain active than those who sign up on smaller servers. We also found that many users who move their accounts between servers tend to gravitate toward smaller servers over time.
-->
```

Based on these findings, we suggest a need for better ways for potential newcomers to find servers and propose a viable way to create server and tag recommendations on Mastodon, which could both help newcomers find servers that match their interests and help established accounts discover "neighborhoods" of related servers. One challenge in building such a system is the decentralized nature of the network. A single, central actor that collects data from servers and then distributes recommendations would be antithetical to the decentralized nature of Mastodon. Instead, we propose a system where servers can report their top hashtags, ranked by the number of unique accounts on the server that used them during the last three months. Such a system would be opt-in and require few additional server resources since tags already have their own database table.

In our proposal, after collecting these top tags, each server uses Okapi BM25 to construct a term frequency-inverse document frequency (TF-IDF) matrix that associates the top tags with each server in its known network. We suggest first filtering to consider only tags used by a minimum number of accounts on a server and on a minimum number of servers. The counts of tag-account pairs from each server make up the term frequency, and the number of servers that use each tag determines the inverse document frequency. The system can then apply L2 normalization to each server's vector of tag weights and calculate the cosine similarity between servers. To find similar tags, the system could instead calculate the cosine similarity between each tag's vector of server weights.

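The weighting and similarity steps above can be sketched in base R as follows. This is an illustration under stated assumptions rather than the implementation used for the results: the function names, the toy counts, the server and tag names, the parameter values `k1 = 1.2` and `b = 0.75`, and the smoothed, non-negative IDF variant are all choices made for this example.

```r
# Rows are servers ("documents"), columns are tags ("terms").
# Each cell holds the number of accounts on that server using that tag.
bm25_weights <- function(counts, k1 = 1.2, b = 0.75) {
  n_servers <- nrow(counts)
  doc_len <- rowSums(counts)
  # Length normalization, one value per server (recycled down the rows)
  len_norm <- 1 - b + b * doc_len / mean(doc_len)
  # Number of servers using each tag, and a smoothed non-negative IDF
  n_using <- colSums(counts > 0)
  idf <- log((n_servers - n_using + 0.5) / (n_using + 0.5) + 1)
  # Saturating term frequency, then scale each tag column by its IDF
  tf <- counts * (k1 + 1) / (counts + k1 * len_norm)
  sweep(tf, 2, idf, `*`)
}

# L2-normalize each row, then take pairwise dot products.
cosine_similarity <- function(m) {
  unit <- m / sqrt(rowSums(m^2))
  unit %*% t(unit)
}

# Toy data: hypothetical servers and tags.
counts <- matrix(
  c(40, 25,  0,  0,
    30, 20,  5,  0,
     0,  0, 35, 20),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    c("hci.example", "research.example", "art.example"),
    c("chi2023", "academia", "mastoart", "illustration")
  )
)

weights <- bm25_weights(counts)
server_sim <- cosine_similarity(weights)   # server-by-server similarity
tag_sim <- cosine_similarity(t(weights))   # tag-by-tag similarity
print(round(server_sim, 3))
```

In this toy example, the two servers that share tags score as more similar to each other than either does to the third. Transposing the weight matrix before computing similarities gives tag-to-tag rather than server-to-server scores.
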
To determine the viability of the recommendation system, we simulated scenarios that limit both the servers that report data and the number of tags they report. We then used rank-biased overlap (RBO) to compare the outputs from these simulations to a baseline with more complete information from all tags on all servers. @fig-simulations-rbo shows how the average agreement with the baseline scales linearly with the logarithm of the tag count.

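For readers unfamiliar with RBO, the following base R sketch computes a truncated form of the measure for two ranked lists. It is illustrative only: the function name `rbo_truncated`, the persistence parameter `p = 0.9`, and the toy rankings are assumptions, and the simulations may use a different variant (for example, an extrapolated RBO). The truncated form sums the weighted agreement only down to the depth of the shorter list, so even two identical lists score `1 - p^depth` rather than exactly 1.

```r
# Truncated rank-biased overlap between two ranked lists.
# p controls how strongly the top of the rankings is weighted (assumed value).
rbo_truncated <- function(a, b, p = 0.9) {
  depth <- min(length(a), length(b))
  # Agreement at depth d: share of items common to both top-d prefixes
  agreement <- vapply(seq_len(depth), function(d) {
    length(intersect(a[seq_len(d)], b[seq_len(d)])) / d
  }, numeric(1))
  (1 - p) * sum(p^(seq_len(depth) - 1) * agreement)
}

# Toy rankings of similar servers (letters stand in for server names).
baseline <- c("a", "b", "c", "d", "e")
limited  <- c("a", "c", "b", "e", "d")

rbo_same    <- rbo_truncated(baseline, baseline)
rbo_limited <- rbo_truncated(baseline, limited)
rbo_none    <- rbo_truncated(baseline, c("v", "w", "x", "y", "z"))
```
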
Based on an analysis of trace data from over a million Mastodon accounts, we find evidence that servers matter and that users tend to move from larger servers to smaller ones. We then propose a recommendation system that can help new Mastodon users find servers that are likely to match their interests. Using simulations, we demonstrate that such a tool can be effectively deployed in a federated manner, even with limited data on each local server.

```{r, cache.extra = tools::md5sum("codebase/R/survival.R")}
#| cache: true
#| label: fig-survival
#| fig-cap: "Survival probabilities for accounts created during May and June 2023 on servers featured on Join Mastodon. Groups represent whether the account is on one of the 12 largest and most prominently featured servers or another Join Mastodon server."
library(here)
source(here("codebase/R/survival.R"))
plot_km
```

::: {#tbl-ergm-table}
```{r}
#| label: table-ergm-table
#| echo: false
#| warning: false
#| message: false
#| error: false

library(here)
library(modelsummary)
library(kableExtra)
library(purrr)
library(stringr)
# Load the two fitted ERGM objects (model.early and model.late)
load(file = here("data/scratch/ergm-model-early.rda"))
load(file = here("data/scratch/ergm-model-late.rda"))

# Each model is listed twice so that the coefficient and the standard error
# (with significance stars) appear in separate columns.
x <- modelsummary(
  list("Coef." = model.early, "Std.Error" = model.early, "Coef." = model.late, "Std.Error" = model.late),
  estimate = c("{estimate}", "{stars}{std.error}", "{estimate}", "{stars}{std.error}"),
  statistic = NULL,
  gof_omit = ".*",
  coef_rename = c(
    "sum" = "Sum",
    "nonzero" = "Nonzero",
    "diff.sum0.h-t.accounts" = "Smaller server",
    "nodeocov.sum.accounts" = "Server size\n(outgoing)",
    "nodeifactor.sum.registrations.TRUE" = "Open registrations\n(incoming)",
    "nodematch.sum.language" = "Languages match"
  ),
  align="lrrrr",
  stars = c('*' = .05, '**' = 0.01, '***' = .001),
  output = "latex_tabular"
  #output = "markdown",
  #table.envir='table*',
  #table.env="table*"
) #%>% add_header_above(c(" " = 1, "Model A" = 2, "Model B" = 2))

x
```

Exponential family random graph models for account movement between Mastodon servers. Accounts in Model A were created in May 2022 and moved to another server at some later point. Accounts in Model B were created at some earlier point and moved after October 2023.
:::

::: {#tbl-sim-servers}
```{r}
#| label: table-sim-servers
library(tidyverse)
library(arrow)

sim_servers <- "data/scratch/server_similarity.feather" %>%
  here::here() %>%
  arrow::read_ipc_file()
server_of_interest <- "hci.social"
# Keep the five most similar pairs that involve the server of interest,
# then report the other server in each pair.
server_table <- sim_servers %>%
  arrange(desc(Similarity)) %>%
  filter(Source == server_of_interest | Target == server_of_interest) %>%
  head(5) %>%
  pivot_longer(cols=c(Source, Target)) %>%
  filter(value != server_of_interest) %>%
  select(value, Similarity) %>%
  rename("Server" = "value")

if (knitr::is_latex_output()) {
  server_table %>% knitr::kable(format="latex", booktabs=TRUE, digits=3)
} else {
  server_table %>% knitr::kable(digits = 3)
}
```

Top five servers most similar to hci.social, a Mastodon server focused on human-computer interaction research. Each of these servers relates to computer science, academia, or technology.
:::

```{r}
#| label: fig-simulations-rbo
#| fig-env: figure*
#| cache: true
#| fig-width: 6.75
#| fig-height: 3
#| fig-pos: tb
#| fig-cap: "Rank-biased overlap between simulated server similarity rankings and the baseline rankings, varied by the number of tags reported by each server and the number of servers that report data. The baseline uses 256 tags."
library(tidyverse)
library(arrow)
simulations <- arrow::read_ipc_file(here::here("data/scratch/simulation_rbo.feather"))

# Average RBO within each simulation run, then plot by log2 of the tag count.
simulations %>%
  group_by(servers, tags, run) %>% summarize(rbo=mean(rbo), .groups="drop") %>%
  mutate(ltags = as.integer(log2(tags))) %>%
  ggplot(aes(x = factor(ltags), y = rbo, fill = factor(ltags))) +
  geom_boxplot() +
  facet_wrap(~servers, nrow=1) +
  #scale_y_continuous(limits = c(0, 1)) +
  labs(x = "Tags (log2)", y = "RBO", title = "Rank Biased Overlap with Baseline Rankings by Number of Servers") +
  theme_minimal() + theme(legend.position = "none")
```