---
title: Recommending Servers on Mastodon
short-title: Mastodon Recommendations
authors:
  - name: Carl Colglazier
    affiliation:
      name: Northwestern University
      city: Evanston
      state: Illinois
      country: United States
    corresponding: true
bibliography: references.bib
pdf-engine: pdflatex
format:
  html: default
  pdf+icwsm:
    fig-pos: 'ht!bp'
    cite-method: natbib
    template: template.tex
    keep-md: true
    link-citations: false
  acm-pdf:
    output-file: mastodon-recommendations-acm.pdf
acm-metadata:
  # comment this out to make submission anonymous
  anonymous: true
  # comment this out to build a draft version
  #final: true

  # comment this out to specify detailed document options
  # acmart-options: sigconf, review

  # acm preamble information
  copyright-year: 2018
  acm-year: 2018
  copyright: acmcopyright
  doi: XXXXXXX.XXXXXXX
  conference-acronym: "Conference acronym 'XX"
  conference-name: |
    Make sure to enter the correct
    conference title from your rights confirmation email
  conference-date: June 03--05, 2018
  conference-location: Woodstock, NY
  price: "15.00"
  isbn: 978-1-4503-XXXX-X/18/06

  # if present, replaces the list of authors in the page header.
  shortauthors: Colglazier

  # The code below is generated by the tool at http://dl.acm.org/ccs.cfm.
  # Please copy and paste the code instead of the example below.
  ccs: |
    \begin{CCSXML}
    <ccs2012>
     <concept>
      <concept_id>10010520.10010553.10010562</concept_id>
      <concept_desc>Computer systems organization~Embedded systems</concept_desc>
      <concept_significance>500</concept_significance>
     </concept>
     <concept>
      <concept_id>10010520.10010575.10010755</concept_id>
      <concept_desc>Computer systems organization~Redundancy</concept_desc>
      <concept_significance>300</concept_significance>
     </concept>
     <concept>
      <concept_id>10010520.10010553.10010554</concept_id>
      <concept_desc>Computer systems organization~Robotics</concept_desc>
      <concept_significance>100</concept_significance>
     </concept>
     <concept>
      <concept_id>10003033.10003083.10003095</concept_id>
      <concept_desc>Networks~Network reliability</concept_desc>
      <concept_significance>100</concept_significance>
     </concept>
    </ccs2012>
    \end{CCSXML}

    \ccsdesc[500]{Computer systems organization~Embedded systems}
    \ccsdesc[300]{Computer systems organization~Redundancy}
    \ccsdesc{Computer systems organization~Robotics}
    \ccsdesc[100]{Networks~Network reliability}

  keywords:
    - decentralized online social networks
abstract: |
  When trying to join the Fediverse, a decentralized collection of interoperable social networking websites, new users face the dilemma of choosing a home server. Using trace data from millions of new Fediverse accounts, we show that new accounts on the flagship server are less likely to remain active and that accounts that move between servers tend to move from larger servers to smaller servers. We then use the insights from our analysis to build a tool that can help new Fediverse users find servers with a high probability of being a good match based on their interests. Based on simulations, we demonstrate that such a tool can be effective even with limited data on each local server.
execute:
  echo: false
  error: false
  warning: false
  message: false
  freeze: auto
fig-width: 6.75
knitr:
  opts_knit:
    verbose: true
#filters:
#  - parse-latex
---

# Introduction

The Fediverse has emerged as a viable alternative to corporate, centralized social media such as Twitter and Reddit. Over the course of the last two years, millions of people have set up new accounts, significantly increasing the size of the network. In the wake of Elon Musk's Twitter acquisition, Mastodon, a popular Fediverse software package which offers a Twitter-like experience, saw an increase in activity and scrutiny.

We show how the onboarding process for Mastodon has changed over time with a particular focus on the largest, flagship Mastodon server. Users who sign up to this server are less likely to remain active. Based on data from over a million Mastodon accounts, we also find that many users who move accounts tend to gravitate toward smaller, more niche servers over time.

We design a potential way to create server and tag recommendations on Mastodon, which could both help newcomers find servers that match their interests and help established accounts discover "neighborhoods" of related servers.

# Background

## Empirical Setting

The Fediverse is a set of decentralized online social networks which interoperate using shared protocols like ActivityPub. Mastodon is a software program used by many Fediverse servers and offers a user experience similar to the Tweetdeck client for Twitter. It was first created in late 2016 and saw a surge in interest in 2022 during and after Elon Musk's Twitter acquisition.

Discovery has been challenging on Mastodon. The developers and user base tend to be skeptical of algorithmic intrusions, instead opting for timelines which only show posts in reverse chronological order. Search is also difficult. Public hashtags are searchable, but most servers have traditionally not supported searching keywords or simple strings. Accounts can only be searched using their full `username@server` form.

Mastodon features a "local" timeline which shows all public posts from accounts that share the same home server. On larger servers, this timeline can be unwieldy; on smaller servers, however, it presents an opportunity to discover new posts and users of potential interest.

Mastodon offers its users high levels of data portability. Users can move their accounts across instances while retaining their follows (their post data, however, does not move with the new account). Consequently, the choice of an initial instance is not irreversible.

## Newcomers in Online Communities

Onboarding newcomers is an important part of the lifecycle of online communities. Any community can expect a certain amount of turnover, and so it is important for the long-term health and longevity of the community to be able to bring in new members [@krautBuildingSuccessfulOnline2011, p. 182]. However, the process of onboarding newcomers is not always straightforward. Newcomers may have difficulty finding the community, understanding the norms and expectations, and finding a place for themselves within the community. This can lead to high rates of attrition among newcomers.

## The Mastodon Migrations

```{r}
#| label: fig-account-timeline
#| fig-cap: "Accounts in the dataset created between January 2022 and March 2023. The top panels show the proportion of accounts still active 45 days after creation, the proportion of accounts that have moved, and the proportion of accounts that have been suspended. The bottom panel shows the count of accounts created each week. The dashed vertical lines in the bottom panel represent the announcement day of the Elon Musk Twitter acquisition, the acquisition closing day, a day when Twitter suspended a number of prominent journalists, and a day when Twitter experienced an outage and started rate limiting accounts."
#| fig-height: 2.75
#| fig-width: 6.75
#| fig-env: figure*
#| fig-pos: tb!

library(here)
source(here("code/helpers.R"))
account_timeline_plot()
```

Mastodon saw a surge in interest in 2022 and 2023, particularly after Elon Musk's Twitter acquisition. In particular, four events of interest drove measurable increases in new users to the network: the announcement of the acquisition (April 14, 2022), the closing of the acquisition (October 27, 2022), a day when Twitter suspended a number of prominent journalists (December 15, 2022), and a day when Twitter experienced an outage and started rate limiting accounts (July 1, 2023). Many Twitter accounts announced they were setting up Mastodon accounts and linked their new accounts to their followers, often using tags like #TwitterMigration [@heFlockingMastodonTracking2023] and driving interest in Mastodon in a process @cavaDriversSocialInfluence2023 found consistent with social influence theory.

The series of migrations of new users into Mastodon in many ways reflects folk stories of "Eternal Septembers" on previous communication networks, where a large influx of newcomers challenged the existing norms [@driscollWeMisrememberEternal2023]. Many Mastodon servers do have specific norms which people coming from Twitter may find confusing, such as local norms around content warnings [@nicholsonMastodonRulesCharacterizing2023]. Variation among servers can also present a challenge for newcomers who may not even be aware of the specific rules, norms, or general topics of interest on the server they are joining [@diazUsingMastodonWay2022].

Some media outlets have framed reports on Mastodon [@hooverMastodonBumpNow2023] through what @zulliRethinkingSocialSocial2020 calls the "Killer Hype Cycle", whereby the media finds a new alternative social media platform, declares it a potential killer of some established platform, and later calls it a failure if it does not displace the existing platform. Such framing fails to take systems like the Fediverse seriously for their own merits: completely replacing existing commercial systems is not the only way to measure success, nor does it account for the real value the Fediverse provides for its millions of active users.

# Data

**Mastodon Profiles**: We identified accounts using data previously collected from posts on public Mastodon timelines from October 2020 to August 2023. We then queried for up-to-date information on those accounts, including their most recent status and whether the account had moved, as of February 2024. This gave us a total of N accounts. Note that because we retrieved updated information on each account, we include only accounts on servers which still exist and which returned records for the account.

**Moved Profiles**: We found a subset of N accounts which had moved from one server to another.

**Tags**: We collected N posts which contained between 2 and 5 hashtags.

# Analysis and Results

## Survival Model

*Are accounts on suggested general servers less likely to remain active than accounts on other servers?*

```{r, cache.extra = tools::md5sum("code/survival.R")}
#| cache: true
#| label: fig-survival
#| fig-env: figure
#| fig-cap: "Survival probabilities for accounts created during May 2023."
#| fig-width: 3.375
#| fig-height: 2.5
#| fig-pos: h!

library(here)
source(here("code/survival.R"))
plot_km
```

```{r}
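#| label: table-coxme
library(dplyr)      # provides %>%, mutate(), case_when(), and select()
library(ehahelper)  # tidy() support for coxme models
library(broom)

# `cxme` is expected to be defined by code/survival.R, which is sourced above.
cxme_table <- tidy(cxme) %>%
  # Exponentiate the confidence bounds so they are on the hazard-ratio scale.
  mutate(conf.low = exp(conf.low), conf.high = exp(conf.high)) %>%
  # Replace raw model terms with readable labels.
  mutate(term = case_when(
    term == "factor(group)1" ~ "Join Mastodon",
    term == "factor(group)2" ~ "General Servers",
    term == "small_serverTRUE" ~ "Small Server",
    TRUE ~ term
  )) %>%
  # Format the exponentiated confidence interval as "(low, high)".
  mutate(exp.coef = paste0("(", round(conf.low, 2), ", ", round(conf.high, 2), ")")) %>%
  select(term, estimate, exp.coef, p.value)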
```

::: {#tbl-cxme .column-body}
```{r}
if (knitr::is_latex_output()) {
  cxme_table %>% knitr::kable(format="latex", booktabs=TRUE, digits=3)
} else {
  cxme_table %>% knitr::kable(digits = 3)
}
```

Coefficients for the Cox Proportional Hazard Model with Mixed Effects. The model includes a random effect for the server.
:::

Using accounts created from May 1 to June 30, 2023, we create a Kaplan–Meier estimator for the probability that an account will remain active based on whether the account is on one of the largest general instances (`r paste(general_servers, collapse=", ")`) featured at the top of the Join Mastodon webpage or otherwise if it is on a server in the Join Mastodon list. Accounts are considered active if they have made at least one post after the censoring period of `r active_period` days after account creation.

We also construct a Mixed Effects Cox Proportional Hazard Model with coefficients for whether the account is on a small server (fewer than a hundred accounts) and whether the account's server is featured on Join Mastodon or is one of the largest general instances. We again find that accounts on the largest general instances are less likely to remain active than accounts on other servers, while accounts created on smaller servers are more likely to remain active.
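
The following is a minimal sketch of the two estimators described above, run on simulated data rather than on the accounts analyzed here. The column names, the simulated hazards, and the 90-day follow-up window are all hypothetical, and the sketch fits an ordinary Cox model with `survival::coxph` rather than the mixed-effects model with a server-level random effect reported in the table.

```r
library(survival)

set.seed(1)
n <- 2000

# Hypothetical covariates: 1 = account on a large general server / on a small server.
general <- rbinom(n, 1, 0.5)
small_server <- rbinom(n, 1, 0.3)

# Assumed data-generating process (illustration only): accounts on general
# servers become inactive faster, accounts on small servers more slowly.
rate <- 0.02 * exp(0.5 * general - 0.4 * small_server)
days_to_inactive <- rexp(n, rate)

# Right-censor accounts that are still active at the end of follow-up.
followup <- 90
accounts <- data.frame(
  time = pmin(days_to_inactive, followup),
  status = as.integer(days_to_inactive <= followup),  # 1 = became inactive
  general = general,
  small_server = small_server
)

# Kaplan-Meier survival curves, one per server group.
km <- survfit(Surv(time, status) ~ general, data = accounts)

# Cox proportional hazards model; exp(coef) is a hazard ratio.
cox <- coxph(Surv(time, status) ~ general + small_server, data = accounts)
hazard_ratios <- exp(coef(cox))
print(round(hazard_ratios, 2))
```

In this setup, a hazard ratio above one indicates that accounts in that group become inactive sooner, and a ratio below one indicates that they tend to remain active longer.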

## Moved Accounts

*Do accounts tend to move to larger or smaller servers?*

Mastodon users can move their accounts to another server while retaining their connections (but not their posts) to other Mastodon accounts. This feature, built into the Mastodon software, offers data portability and helps avoid lock-in.

```{r}
#| label: ergm-table
#| echo: false
#| warning: false
#| message: false
#| error: false

library(here)
library(modelsummary)
library(kableExtra)
library(purrr)
library(stringr)
load(file = here("data/scratch/ergm-model-early.rda"))
load(file = here("data/scratch/ergm-model-late.rda"))

if (knitr::is_latex_output()) {
  format <- "latex"
} else {
  format <- "html"
}

x <- modelsummary(
  list("Coef." = model.early, "Std.Error" = model.early, "Coef." = model.late, "Std.Error" = model.late),
  estimate = c("{estimate}", "{stars}{std.error}", "{estimate}", "{stars}{std.error}"),
  statistic = NULL,
  gof_omit = ".*",
  coef_rename = c(
    "sum" = "(Sum)",
    "diff.sum0.h-t.accounts" = "Smaller server",
    "nodeocov.sum.accounts" = "Server size\n(outgoing)",
    "nodeifactor.sum.registrations.TRUE" = "Open registrations\n(incoming)",
    "nodematch.sum.language" = "Languages match"
  ),
  align="lrrrr",
  stars = c('*' = .05, '**' = 0.01, '***' = .001),
  output = format
  #output = "markdown",
  #table.envir='table*',
  #table.env="table*"
  ) %>% add_header_above(c(" " = 1, "Model A" = 2, "Model B" = 2))

if (knitr::is_latex_output()) {
  x %>% reduce(str_c, capture.output(.), sep="\n") %>% gsub("table", "table*", .) %>% knitr::raw_latex()
} else {
  x
}
```

# Proposed Recommendation System

*How can we build an opt-in, low-resource recommendation system for finding Fediverse servers?*

Tailored servers focused on a particular topic and community have advantages for onboarding newcomers; however, it may be difficult for new and existing Mastodon users to discover these communities. To address this gap, we propose a recommendation system for finding new servers. This system would be opt-in and low-resource, requiring only a small amount of data from each server.

First, we construct the ideal system based on observed data. That is, we use the data from all posts we collected from all servers to construct an ideal recommender. We then simulate various scenarios that limit both the servers that report data and the number of tags they report. Finally, we use rank-biased overlap (RBO) to compare the outputs from these simulations to the baseline with more complete information from all tags on all servers.

## Recommendation System Design

We use Okapi BM25 to construct a term frequency-inverse document frequency (tf-idf) model that associates the top tags with each server. The term frequency uses counts of tag-account pairs from each server, and the inverse document frequency uses the number of servers that use each tag. We then L2 normalize the vectors for each tag and calculate the cosine similarity between the tag vectors for each server.

$$
tf = \frac{f_{t,d} \cdot (k_1 + 1)}{f_{t,d} + k_1 \cdot (1 - b + b \cdot \frac{|d|}{avgdl})}
$$

where $f_{t,d}$ is the frequency of term $t$ in document $d$, $k_1$ and $b$ are tuning parameters, and $avgdl$ is the average document length.

$$
idf = \log \frac{N - n + 0.5}{n + 0.5}
$$

where $N$ is the total number of documents and $n$ is the number of documents containing the term.

$$
\text{similarity}(A, B) = \frac{A \cdot B}{\|A\| \|B\|}
$$
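
To make the weighting concrete, the three formulas above can be sketched in base R on a toy matrix. The server names, the tag counts, and the parameter values $k_1 = 1.2$ and $b = 0.75$ are assumptions chosen for illustration (the latter are commonly used defaults, not values reported here), and the sketch assumes that each server is represented by a vector of BM25 tag weights.

```r
# Toy counts of tag-account pairs: rows are servers (documents), columns are tags (terms).
counts <- matrix(
  c(10, 3, 0, 0, 0,
     8, 1, 0, 0, 0,
     0, 0, 7, 5, 0,
     0, 0, 6, 2, 0,
     0, 0, 0, 0, 4),
  nrow = 5, byrow = TRUE,
  dimnames = list(
    c("hci.example", "ux.example", "art.example", "draw.example", "news.example"),
    c("hci", "research", "art", "music", "news")
  )
)

bm25_weights <- function(counts, k1 = 1.2, b = 0.75) {
  doc_len <- rowSums(counts)        # |d| for each server
  avgdl <- mean(doc_len)            # average document length
  n_docs <- nrow(counts)            # N
  doc_freq <- colSums(counts > 0)   # n: number of servers using each tag
  idf <- log((n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
  # Length normalization, one value per server; recycling applies
  # len_norm[i] to every entry in row i.
  len_norm <- k1 * (1 - b + b * doc_len / avgdl)
  tf <- counts * (k1 + 1) / (counts + len_norm)
  sweep(tf, 2, idf, `*`)            # scale each tag column by its idf
}

cosine_similarity <- function(weights) {
  unit <- weights / sqrt(rowSums(weights^2))  # L2-normalize each server vector
  unit %*% t(unit)                            # pairwise dot products
}

weights <- bm25_weights(counts)
similarity <- cosine_similarity(weights)
print(round(similarity, 3))
```

Note that this form of the idf becomes negative for tags used by more than half of the servers; the toy data avoid that case so that all weights stay non-negative.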

## Applications

```{r}
#| eval: false
library(tidyverse)
library(igraph)
library(arrow)

sim_servers <- "data/scratch/server_similarity.feather" %>% arrow::read_ipc_file() %>% rename("weight" = "Similarity")
#sim_net <- as.network(sim_servers)
g <- graph_from_data_frame(sim_servers, directed = FALSE)

g_strength <- log(sort(strength(g)))
normalized_strength <- (g_strength - min(g_strength)) / (max(g_strength) - min(g_strength))

server_centrality <- enframe(normalized_strength, name="server", value="strength")
server_centrality %>% arrow::write_ipc_file("data/scratch/server_centrality.feather")
```

::: {#tbl-sim-servers}

```{r}
#| label: table-sim-servers
library(tidyverse)
library(arrow)

sim_servers <- "data/scratch/server_similarity.feather" %>% arrow::read_ipc_file()
server_of_interest <- "hci.social"
server_table <- sim_servers %>%
    arrange(desc(Similarity)) %>%
    filter(Source == server_of_interest | Target == server_of_interest) %>%
    head(5) %>%
    pivot_longer(cols=c(Source, Target)) %>%
    filter(value != server_of_interest) %>%
    select(value, Similarity) %>%
    rename("Server" = "value")

if (knitr::is_latex_output()) {
  server_table %>% knitr::kable(format="latex", booktabs=TRUE, digits=3)
} else {
  server_table %>% knitr::kable(digits = 3)
}
```

Top five servers most similar to hci.social

:::

### Server Discovery

This system can empower users to find other servers of potential interest to them. For instance, we can build a system which recommends potential server matches to a new user.

### Server Neighborhoods

Mastodon provides two feeds in addition to a user's home timeline populated by accounts they follow: a local timeline with all public posts from their local server and a federated timeline which includes all posts from users followed by other users on their server. We suggest a third kind of timeline, a *neighborhood timeline*, which filters the federated timeline by topic.

## Robustness to Limited Data

```{r}
#| label: fig-simulations-rbo
#| fig-env: figure*
#| cache: true
#| fig-width: 6.75
#| fig-height: 3
#| fig-pos: tb
library(tidyverse)
library(arrow)
simulations <- arrow::read_ipc_file("data/scratch/simulation_rbo.feather")

simulations %>%
  group_by(servers, tags, run) %>% summarize(rbo=mean(rbo), .groups="drop") %>%
  mutate(ltags = as.integer(log2(tags))) %>%
  ggplot(aes(x = factor(ltags), y = rbo, fill = factor(ltags))) +
  geom_boxplot() +
  facet_wrap(~servers, nrow=1) +
  #scale_y_continuous(limits = c(0, 1)) +
  labs(x = "Tags (log2)", y = "RBO", title = "Rank Biased Overlap with Baseline Rankings by Number of Servers") +
  theme_minimal() + theme(legend.position = "none")
```

We simulated various scenarios that limit both the servers that report data and the number of tags they report. We then used rank-biased overlap (RBO) to compare the outputs from these simulations to the baseline with more complete information from all tags on all servers. @fig-simulations-rbo shows how the average agreement with the baseline scales linearly with the logarithm of the tag count.
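
For readers unfamiliar with the metric, the following sketch computes an extrapolated form of RBO for two equal-length rankings without duplicates. The function name, the persistence parameter `p = 0.9`, and the example server names are assumptions for illustration and are not necessarily the settings used in the simulations.

```r
# Extrapolated rank-biased overlap for two rankings of equal length.
# p in (0, 1) controls how strongly agreement at the top ranks is weighted.
rbo_ext <- function(x, y, p = 0.9) {
  stopifnot(length(x) == length(y), length(x) > 0,
            !anyDuplicated(x), !anyDuplicated(y), p > 0, p < 1)
  k <- length(x)
  depths <- seq_len(k)
  # Size of the overlap between the top-d prefixes of both rankings.
  overlap <- vapply(
    depths,
    function(d) length(intersect(x[seq_len(d)], y[seq_len(d)])),
    integer(1)
  )
  agreement <- overlap / depths
  # Observed agreement down to depth k, plus the assumption that the agreement
  # at depth k continues for the unseen remainder of the rankings.
  (1 - p) / p * sum(agreement * p^depths) + agreement[k] * p^k
}

baseline  <- c("a.example", "b.example", "c.example", "d.example")
simulated <- c("b.example", "a.example", "c.example", "d.example")

print(rbo_ext(baseline, baseline))
print(rbo_ext(baseline, simulated))
```

Under this variant, identical rankings score 1 and disjoint rankings score 0. Swapping only the top two items lowers the score because disagreements at the top ranks carry the most weight.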

# Conclusion

Based on analysis of trace data from millions of new Fediverse accounts, we find evidence that the choice of server matters for whether accounts remain active and that users tend to move from larger servers to smaller servers. We then propose a recommendation system that can help new Fediverse users find servers with a high probability of being a good match based on their interests. Based on simulations, we demonstrate that such a tool can be effectively deployed in a federated manner, even with limited data on each local server.