---
title: Best Practices for Onboarding on the Fediverse
short-title: Onboarding Fediverse
authors:
  - name: Carl Colglazier
    affiliation:
      name: Northwestern University
      city: Evanston
      state: Illinois
      country: United States
    #roles: writing
    corresponding: true
bibliography: references.bib
acm-metadata:
  final: false
  copyright-year: 2024
  acm-year: 2024
  copyright: rightsretained
  doi: XXXXXXX.XXXXXXX
  conference-acronym: "PACMHCI"
  #conference-name: |
  # Make sure to enter the correct
  # conference title from your rights confirmation email
  #conference-date: June 03--05, 2018
  #conference-location: Woodstock, NY
  #price: "15.00"
  #isbn: 978-1-4503-XXXX-X/18/06
format:
  acm-pdf:
    keep-tex: true
    documentclass: acmart
    classoption: [acmsmall,manuscript,screen,authorversion,nonacm,timestamp]
abstract: |
  When trying to join the Fediverse, a decentralized collection of interoperable social networking websites, new users face the dilemma of choosing a home server. Using trace data from thousands of new Fediverse accounts, we show that this choice matters and significantly affects the probability that the account remains active in the future. We then use insights from this relationship to build a tool that can help new Fediverse users find a server with a high probability of being a good match based on their interests.
execute:
  echo: false
  error: false
  freeze: auto
fig-width: 6.75
---

```{r}
#| label: r-setup
#| output: false
#| error: false
#| warning: false
library(reticulate)
library(tidyverse)
library(arrow)
library(statnet)
library(network)
library(survival)
library(ggsurvfit)
library(modelsummary)
library(randomForestSRC)
library(grid)
library(scales)

options(arrow.skip_nul = TRUE)
```

We first explore the extent to which server choice matters. We find that accounts that join smaller, more interest-based servers are more likely to continue posting six months after their creation.

Using these findings, we then propose a tool that can help users find servers that match their interests.
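
To make the idea concrete, the following is a minimal sketch of such a matcher, not the tool itself: it assumes we already have per-server hashtag counts (all names and inputs below are hypothetical) and ranks servers by the cosine similarity between a user's stated interest tags and each server's tag distribution.

```python
# Illustrative sketch only (not the deployed tool): rank servers by cosine
# similarity between a user's interest tags and each server's tag counts.
# `server_tag_counts` maps server -> {tag: count}; inputs are hypothetical.
import math

def recommend_servers(interests, server_tag_counts, k=5):
    interests = set(interests)
    scores = []
    for server, tag_counts in server_tag_counts.items():
        # Dot product between a binary interest vector and the server's counts.
        dot = sum(c for t, c in tag_counts.items() if t in interests)
        norm = math.sqrt(sum(c * c for c in tag_counts.values())) * math.sqrt(len(interests))
        scores.append((server, dot / norm if norm else 0.0))
    return sorted(scores, key=lambda s: s[1], reverse=True)[:k]

# Hypothetical usage:
# recommend_servers(["histodons", "rstats"],
#                   {"hcommons.social": {"histodons": 120, "art": 30},
#                    "fosstodon.org": {"rstats": 80, "linux": 400}})
```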

# Background

## Newcomers in Online Communities

Onboarding newcomers is essential for online communities. Any community can expect a certain amount of turnover, so the ability to bring in new members is important for its long-term health and longevity.

RQ: What server attributes correspond with better newcomer retention?

## Migrations in Online Communities

All online communities and accounts trend toward death.

Online fandom communities, for instance...

On Reddit, @newellUserMigrationOnline found that the news aggregator had an advantage over potential competitors because of its catalogue of niche communities: people who migrated to alternative platforms tended, proportionally, to post most often in popular communities.

+ Fiesler on online fandom communities [@fieslerMovingLandsOnline2020]
+ TeBlunthuis on competition and mutualism [@teblunthuisIdentifyingCompetitionMutualism2022]
+ Work on "alt-tech" communities.
# Empirical Setting

The Fediverse is a set of decentralized online social networks which interoperate using shared protocols such as ActivityPub.

Mastodon, first created in late 2016, is a software program used by many Fediverse servers and offers a user experience similar to the Tweetdeck client for Twitter.

Discovery has been challenging on Mastodon. The developers and user base tend to be skeptical of algorithmic intrusions, instead opting for timelines which only show posts in reverse chronological order. Search is also limited: public hashtags are searchable, but most servers have traditionally not supported searching keywords or arbitrary strings, and accounts can only be searched using their full `username@server` form.

Mastodon features a "local" timeline which shows all public posts from accounts that share the same home server. On larger servers, this timeline can be unwieldy; on smaller servers, however, it presents an opportunity to discover new posts and accounts of potential interest.

Mastodon offers its users a high level of data portability. Users can move their accounts across instances while retaining their follows (their post data, however, does not move with the new account). The choice of an initial instance is consequently not irreversible.
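
When an account has moved, the record returned for it includes a `moved` field pointing to the new account. As a minimal illustration (the record below is made up), the target server can be recovered the same way the preprocessing in the next section does: split the `acct` value on `@`, falling back to the home server when no domain is given.

```python
# Illustrative only: a made-up account record with the `moved` field present.
record = {
    "acct": "alice",
    "url": "https://mastodon.social/@alice",
    "moved": {"acct": "alice@hci.social"},
}

def moved_target_server(record: dict, home_server: str):
    """Return the server an account moved to, or None if it has not moved."""
    moved = record.get("moved")
    if moved is None:
        return None
    acct = moved["acct"]
    # A bare username means the target account lives on the same server.
    return acct.split("@", 1)[1] if "@" in acct else home_server

moved_target_server(record, "mastodon.social")  # "hci.social"
```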

# Data

```{python}
#| label: py-preprocess-data
#| cache: true
#| output: false

from code.load_accounts import *
from urllib.parse import urlparse

#accounts = pl.concat(
#    read_accounts_file("data/accounts.feather"),
#    read_accounts_file("data/account_lookup_2023.feather")
#)
accounts = read_accounts_file(
    "data/account_lookup_compressed.feather"
).unique(["account", "server"])
# Write a parsed accounts file for R to use
a = accounts.with_columns(
    pl.col("url").map_elements(
        lambda x: urlparse(x).netloc.encode().decode('idna')
    ).alias("host"),
    pl.col("data_string").str.contains("""\"moved\": \{""").alias("has_moved"),
    pl.col("data").struct.field("suspended"),
)

a_save = a.drop(["data", "data_string"])
a_save.select(
    sorted(a_save.columns)
).write_ipc("data/scratch/accounts.feather")

# Do this again now we know the rows are all moved accounts
moved_accounts = a.filter(pl.col("has_moved")).with_columns(
    pl.col("data_string").str.json_decode().alias("data")
).with_columns(
    pl.col("data").struct.field("moved")
).drop_nulls("moved").with_columns(
    pl.col("moved").struct.field("acct").alias("moved_acct"),
).with_columns(
    pl.when(
        pl.col("moved_acct").str.contains('@')
    ).then(
        pl.col("moved_acct").str.split('@').list.get(1)
    ).otherwise(
        pl.col("server")
    ).alias("moved_server")
)

number_of_accounts = len(a)

popular_servers = a.group_by("server").count().sort("count", descending=True)

common_moves = moved_accounts.group_by(
    ["server", "moved_server"]
).count().sort("count", descending=True)

common_moves.write_ipc("data/scratch/moved_accounts.feather")
common_moves.rename({
    "server": "Source",
    "moved_server": "Target",
}).write_csv("data/scratch/moved_accounts.csv")

maccounts = moved_accounts.select(["account", "server", "moved_server"])

popular_servers.write_ipc("data/scratch/popular_servers.feather")

jm = pl.read_json("data/joinmastodon.json")
jm.write_ipc("data/scratch/joinmastodon.feather")

read_metadata_file("data/metadata_2023-10-01.feather").drop(
    ["data", "data_string"]
).write_ipc("data/scratch/metadata.feather")
```

```{python}
#| label: py-preprocess-data2
#| cache: true
#| output: false

from code.load_accounts import read_accounts_file
from urllib.parse import urlparse
import polars as pl

profile_accounts = read_accounts_file("data/profiles_local.feather")
p = profile_accounts.with_columns(
    pl.col("url").map_elements(lambda x: urlparse(x).netloc.encode().decode('idna')).alias("host"),
    pl.col("username").alias("account"),
    pl.lit(False).alias("has_moved"),
    pl.lit(False).alias("suspended")
).drop(
    ["data", "data_string"]
)
p.select(sorted(p.columns)).write_ipc("data/scratch/accounts_processed_profiles.feather")
all_accounts = pl.scan_ipc(
    [
        "data/scratch/accounts.feather",
        #"data/scratch/accounts_processed_recent.feather",
        "data/scratch/accounts_processed_profiles.feather"
    ]).collect()
all_accounts.filter(pl.col("host").eq(pl.col("server"))).unique(["account", "server"]).write_ipc("data/scratch/all_accounts.feather")
```

```{r}
#| eval: false
arrow::read_feather(
  "data/scratch/accounts_processed_profiles.feather",
  col_select = c(
    "server", "username", "created_at",
    "last_status_at", "statuses_count",
    "has_moved", "bot", "suspended"
  )) %>%
  mutate(suspended = replace_na(suspended, FALSE)) %>%
  filter(!bot) %>%
  # TODO: what's going on here?
  filter(!is.na(last_status_at)) %>%
  # sanity check
  filter(created_at >= "2022-01-01") %>%
  filter(created_at < "2024-03-01") %>%
  # We don't want accounts that were created
  # and then immediately stopped being active
  filter(statuses_count > 1) %>%
  filter(!suspended) %>%
  filter(!has_moved) %>%
  filter(server == "mastodon.social") %>%
  #filter(last_status_at >= created_at) %>%
  mutate(created_month = format(created_at, "%Y-%m")) %>%
  group_by(created_month) %>%
  summarize(count=n()) %>%
  distinct(created_month, count) %>%
  ggplot(aes(x=created_month, y=count)) +
  geom_bar(stat="identity", fill="black") +
  labs(y="Count", x="Created Month") +
  theme_bw() + theme(axis.text.x = element_text(angle = 90, hjust = 1))
```

```{r}
#| label: fig-account-timeline
#| fig-cap: "Accounts in the dataset created between January 2022 and March 2023. The top panel shows the proportion of accounts still active 45 days after creation, the proportion of accounts that have moved, and the proportion of accounts that have been suspended. The bottom panel shows the count of accounts created each week. The dashed vertical lines in the bottom panel represent the announcement day of the Elon Musk Twitter acquisition, the acquisition closing day, a day when Twitter suspended a number of prominent journalists, and a day when Twitter experienced an outage and started rate limiting accounts."
#| fig-height: 3
#| fig-width: 6.75

jm <- arrow::read_feather("data/scratch/joinmastodon.feather")
accounts_unfilt <- arrow::read_feather(
  "data/scratch/all_accounts.feather",
  col_select=c(
    "server", "username", "created_at", "last_status_at",
    "statuses_count", "has_moved", "bot", "suspended",
    "following_count", "followers_count"
  ))
accounts <- accounts_unfilt %>%
  filter(!bot) %>%
  # TODO: what's going on here?
  filter(!is.na(last_status_at)) %>%
  mutate(suspended = replace_na(suspended, FALSE)) %>%
  # sanity check
  filter(created_at >= "2020-10-01") %>%
  filter(created_at < "2024-01-01") %>%
  # We don't want accounts that were created and then immediately stopped being active
  filter(statuses_count >= 1) %>%
  filter(last_status_at >= created_at) %>%
  mutate(active = last_status_at >= "2024-01-01") %>%
  mutate(last_status_at = ifelse(active, lubridate::ymd_hms("2024-01-01 00:00:00", tz = "UTC"), last_status_at)) %>%
  mutate(active_time = difftime(last_status_at, created_at, units="days")) #%>%
  #filter(!has_moved)
acc_data <- accounts %>%
  #filter(!has_moved) %>%
  mutate(created_month = format(created_at, "%Y-%m")) %>%
  mutate(created_week = floor_date(created_at, unit = "week")) %>%
  mutate(active_now = active) %>%
  mutate(active = active_time >= 45) %>%
  mutate("Is mastodon.social" = server == "mastodon.social") %>%
  mutate(jm = server %in% jm$domain) %>%
  group_by(created_week) %>%
  summarize(
    `JoinMastodon Server` = sum(jm) / n(),
    `Is mastodon.social` = sum(`Is mastodon.social`)/n(),
    Suspended = sum(suspended)/n(),
    Active = (sum(active)-sum(has_moved)-sum(suspended))/(n()-sum(has_moved)-sum(suspended)),
    active_now = (sum(active_now)-sum(has_moved)-sum(suspended))/(n()-sum(has_moved)-sum(suspended)),
    Moved=sum(has_moved)/n(),
    count=n()) %>%
  pivot_longer(cols=c("JoinMastodon Server", "active_now", "Active", "Moved", "Is mastodon.social"), names_to="Measure", values_to="value") # "Suspended"
theme_bw_small_labels <- function(base_size = 9) {
  theme_bw(base_size = base_size) %+replace%
    theme(
      plot.title = element_text(size = base_size * 0.8),
      plot.subtitle = element_text(size = base_size * 0.75),
      plot.caption = element_text(size = base_size * 0.7),
      axis.title = element_text(size = base_size * 0.9),
      axis.text = element_text(size = base_size * 0.8),
      legend.title = element_text(size = base_size * 0.9),
      legend.text = element_text(size = base_size * 0.8)
    )
}

p1 <- acc_data %>%
  ggplot(aes(x=as.Date(created_week), group=1)) +
  geom_line(aes(y=value, group=Measure, color=Measure)) +
  geom_point(aes(y=value, color=Measure), size=0.7) +
  scale_y_continuous(limits = c(0, 1.0)) +
  labs(y="Proportion") + scale_x_date(labels=date_format("%Y-%U"), breaks = "4 week") +
  theme_bw_small_labels() +
  theme(axis.title.x = element_blank(), axis.text.x = element_blank(), axis.ticks.x = element_blank())
p2 <- acc_data %>%
  distinct(created_week, count) %>%
  ggplot(aes(x=as.Date(created_week), y=count)) +
  geom_bar(stat="identity", fill="black") +
  geom_vline(
    aes(xintercept = as.numeric(as.Date("2022-10-27"))),
    linetype="dashed", color = "black") +
  #geom_text(
  #  aes(x=as.Date("2022-10-27"),
  #      y=max(count),
  #      label=" Elon Musk Twitter Acquisition Completed"),
  #  vjust=-1, hjust=0, color="black") +
  geom_vline(
    aes(xintercept = as.numeric(as.Date("2022-04-14"))),
    linetype="dashed", color = "black") +
  # https://twitter.com/elonmusk/status/1675187969420828672
  geom_vline(
    aes(xintercept = as.numeric(as.Date("2022-12-15"))),
    linetype="dashed", color = "black") +
  geom_vline(
    aes(xintercept = as.numeric(as.Date("2023-07-01"))),
    linetype="dashed", color = "black") +
  #scale_y_continuous(limits = c(0, max(acc_data$count) + 100000)) +
  scale_y_continuous(labels = scales::comma) +
  labs(y="Count", x="Created Week") +
  theme_bw_small_labels() + theme(axis.text.x = element_text(angle = 45, hjust = 1)) + scale_x_date(labels=date_format("%Y-%U"), breaks = "4 week")
#grid.draw(rbind(ggplotGrob(p1), ggplotGrob(p2), size = "last"))
library(patchwork)
p1 + p2 + plot_layout(ncol = 1)
```

**Mastodon Profiles**: We identified accounts from posts previously collected from public Mastodon timelines between October 2020 and January 2024. We then queried for up-to-date information on those accounts, including their most recent status and whether the account had moved. This gave us a total of `r nrow(accounts)` accounts.

**Moved Profiles**: We found a subset of `r accounts %>% filter(has_moved) %>% nrow` accounts which had moved from one server to another.
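
As an illustration of the refresh step, the following is a minimal sketch assuming the public Mastodon `/api/v1/accounts/lookup` endpoint; it is illustrative only and omits the rate limiting and error handling a real crawl requires.

```python
# Sketch: refresh one account's public record from its home server.
# Assumes the public Mastodon REST API; not the exact collection pipeline.
import requests

def lookup_account(server: str, username: str) -> dict:
    resp = requests.get(
        f"https://{server}/api/v1/accounts/lookup",
        params={"acct": username},
        timeout=30,
    )
    resp.raise_for_status()
    data = resp.json()
    return {
        "account": data["acct"],
        "server": server,
        "created_at": data["created_at"],
        "last_status_at": data.get("last_status_at"),
        "statuses_count": data["statuses_count"],
        "has_moved": "moved" in data,
        "suspended": data.get("suspended", False),
    }
```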

# Results

## Activity By Server Size

```{r}
#| label: fig-active-accounts
#| eval: false
#library(betareg)
library(lme4)
activity <- arrow::read_feather(
  "data/scratch/activity.feather",
  col_select = c("server", "logins")
) %>%
  arrange(desc(logins)) %>%
  mutate(server_count = logins)

account_data <- inner_join(accounts, activity, by="server") %>%
  mutate(active = active_time >= 45)

a_data <- account_data %>%
  #mutate(active = active_time >= 45) %>%
  group_by(server) %>%
  summarize(active_prop = sum(active)/n(), active_count = sum(active), count=n()) %>%
  inner_join(., activity, by="server")

a_model <- glmer(active ~ log1p(logins) + (1|server), data=account_data, family=binomial)
#betareg(active_prop ~ log10(count), data = a_data)

logins_seq <- seq(min(log1p(account_data$logins)), max(log1p(account_data$logins)), length.out = 100)
a_pred <- predict(
  a_model,
  # logins_seq is already on the log1p scale; convert back so the model's
  # log1p() transformation is only applied once during prediction
  newdata = data.frame(logins = expm1(logins_seq), server = factor(1)),
  type = "response",
  re.form = NA)
pred_data <- data.frame(logins = logins_seq, active_prop = a_pred)

a_data %>%
  mutate(logins = log1p(logins)) %>%
  ggplot(aes(y=active_prop, x=logins)) +
  geom_point(alpha=0.1) +
  #geom_line(aes(y = a_pred)) +
  geom_line(data = pred_data, aes(x = logins, y = active_prop), color = "red") + # Use pred_data for line
  labs(
    y = "Active after 45 Days",
    x = "Accounts"
  ) +
  scale_x_continuous(labels = scales::comma) +
  #scale_y_log10() +
  theme_bw_small_labels()
```

```{r}
#| eval: false
library(fable)
#library(fable.binary)
library(tsibble)
library(lubridate)

ad_time <- account_data |>
  mutate(created_at = yearweek(created_at)) |>
  group_by(server, created_at) |>
  summarize(count = n(), active = sum(active)) |>
  as_tsibble(key="server", index=created_at)
```

```{r}
#| eval: false
fit <- ad_time |>
  model(
    logistic = LOGISTIC(active ~ fourier(K = 5, period = "year"))
  )
```

```{r}
#| eval: false
ad_time |>
  filter(server == "mastodon.social") |>
  sample_n(100) |>
  autoplot(active)
```

```{r}
#| label: fig-account-activity-prop
#| fig-cap: "Account Activity Over Time"
#| fig-height: 4
#| eval: false
study_period <- 45
last_day <- "2024-01-15"
#formerly accounts_processed_recent
#server_counts <- arrow::read_feather(
#    "data/scratch/accounts.feather",
#    col_select=c("server", "username", "created_at", "bot")
#  ) %>%
#  filter(created_at <= "2023-03-01") %>%
#  filter(!bot) %>%
#  group_by(server) %>%
#  summarize(server_count = n()) %>%
#  arrange(desc(server_count)) %>%
#  mutate(server_count_bin = floor(log10(server_count)))

metadata <- arrow::read_feather("data/scratch/metadata.feather", col_select=c("server", "user_count")) %>%
  arrange(desc(user_count)) %>%
  mutate(server_count = user_count) %>%
  mutate(server_count_bin = floor(log10(server_count))) %>%
  mutate(server_count_bin = ifelse(server_count_bin >= 4, 4, server_count_bin)) %>%
  mutate(server_count_bin = ifelse(server_count_bin <= 2, 2, server_count_bin))

activity <- arrow::read_feather(
  "data/scratch/activity.feather",
  col_select = c("server", "logins")
) %>%
  arrange(desc(logins)) %>%
  mutate(server_count = logins) %>%
  mutate(server_count_bin = floor(log10(server_count))) %>%
  # Merge 4 and 5
  #mutate(server_count_bin = ifelse(server_count_bin >= 5, 4, server_count_bin)) %>%
  # Merge below 2
  #mutate(server_count_bin = ifelse((server_count_bin <= 2) & (server_count_bin >= 1), 2, server_count_bin)) %>%
  mutate(server_count_bin = ifelse(server_count == 0, -1, server_count_bin))

jm <- arrow::read_feather("data/scratch/joinmastodon.feather")

a <- accounts %>%
  filter(!has_moved) %>%
  #filter(created_at >= "2023-06-01") %>%
  #filter(created_at < "2023-08-01") %>%
  filter(created_at >= "2023-10-15") %>%
  filter(created_at < "2023-12-01") %>%
  inner_join(activity, by="server") %>%
  filter(created_at < last_status_at) %>%
  #mutate(large_server = server_count > 1000) %>%
  mutate(active_time = as.integer(active_time)) %>%
  mutate(active_time_weeks = active_time) %>%
  mutate(status = ifelse(active, 0, 1)) %>%
  mutate(jm = server %in% jm$domain) #%>% filter(server_count > 0)

survfit2(Surv(active_time_weeks, status) ~ strata(server_count_bin) + 1, data = a) %>%
  ggsurvfit() +
  add_confidence_interval() +
  scale_y_continuous(limits = c(0, 1)) +
  labs(
    y = "Overall survival probability",
    x = "Time (days)",
  ) +
  #scale_x_continuous(
  #  breaks = seq(0, max(a$active_time_weeks, na.rm = TRUE), by = 4),
  #  labels = seq(0, max(a$active_time_weeks, na.rm = TRUE), by = 4)
  #) +
  theme_bw_small_labels() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
```

```{r}
a %>% filter(jm) %>% inner_join(., jm, by=c("server"="domain")) %>%
  mutate(is_general = category=="general") %>%
  mutate(is_en = language == "en") %>%
  mutate(is_large = last_week_users >= 585) %>% #filter(following_count < 10) %>%
  survfit2(Surv(active_time_weeks, status) ~ is_general + is_large, data = .) %>%
  ggsurvfit(linetype_aes=TRUE, type = "survival") +
  add_confidence_interval() +
  scale_y_continuous(limits = c(0, 1)) +
  labs(
    y = "Overall survival probability",
    x = "Time (days)",
  ) +
  #facet_wrap(~strata, nrow = 3) +
  #scale_x_continuous(
  #  breaks = seq(0, max(a$active_time_weeks, na.rm = TRUE), by = 4),
  #  labels = seq(0, max(a$active_time_weeks, na.rm = TRUE), by = 4)
  #) +
  add_censor_mark() +
  theme_bw_small_labels() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
```

```{r}
#| eval: false
library(coxme)

sel_a <- a %>%
  filter(jm) %>% inner_join(., jm, by=c("server"="domain")) %>%
  #mutate(is_general = category=="general") %>%
  rowwise() %>%
  mutate(is_regional = "regional" %in% categories) %>%
  mutate(is_general = "general" %in% categories) %>%
  mutate(is_neither = !(is_regional | is_general)) %>%
  mutate(is_en = language == "en") %>%
  rowwise() %>%
  mutate(n_categories = length(categories) - is_regional - is_general) %>%
  mutate(many_categories = n_categories > 0) %>%
  mutate(is_large = last_week_users >= 585) %>%
  mutate(follows_someone = followers_count > 0) %>% filter(server_count > 0) %>%
  ungroup
#cx <- coxme(Surv(active_time_weeks, status) ~ is_large + is_general + approval_required + (1|server), data = sel_a, x=TRUE)
cx <- coxph(Surv(active_time_weeks, status) ~ many_categories + is_general*is_regional + is_general:log1p(server_count), data = sel_a, x=TRUE)
coxme(Surv(active_time_weeks, status) ~ is_neither + is_general:log1p(server_count) + (1|server), data = sel_a, x=TRUE)
cx <- coxph(Surv(active_time_weeks, status) ~ is_neither + many_categories + is_general:log10(server_count), data = sel_a, x=TRUE)
cz <- cox.zph(cx)
#plot(cz)
cz
```

```{r}
#| eval: false
options(rf.cores=2, mc.cores=2)
for_data <- sel_a #%>% slice_sample(n=2500)
obj <- rfsrc.fast(Surv(active_time_weeks, status) ~ is_neither + is_general*server_count, data = for_data, ntree=100, forest=TRUE)
#predictions <- predict(obj, newdata = newData)$predicted
#plot(get.tree(obj, 1))
reg.smp.o <- subsample(obj, B = 10, verbose = TRUE)#, subratio = .5)
plot.subsample(reg.smp.o)
```

## Moved Accounts

```{r}
#| label: fig-moved-accounts
#| fig-height: 4
#| eval: false
moved_accounts <- arrow::read_feather("data/scratch/moved_accounts.feather")
popular_servers <- arrow::read_feather("data/scratch/popular_servers.feather")
server_movement_data <- left_join(
  (moved_accounts %>% group_by(server) %>% summarize(out_count = sum(count)) %>% select(server, out_count)),
  (moved_accounts %>% group_by(moved_server) %>% summarize(in_count = sum(count)) %>% select(moved_server, in_count) %>% rename(server=moved_server)),
  by="server"
) %>% replace_na(list(out_count = 0, in_count = 0)) %>%
  mutate(diff = in_count - out_count) %>%
  arrange(diff) %>%
  left_join(., popular_servers, by="server") %>%
  rename(user_count = count) %>% arrange(desc(user_count))
server_movement_data %>%
  ggplot(aes(x=user_count, y=diff)) +
  geom_point() + scale_x_log10() + theme_bw_small_labels()
```

If there were no relationship between server size and migration, we would expect these moves to be random with respect to server size.

```{r}
popular_servers <-
  arrow::read_feather("data/scratch/popular_servers.feather")
moved_accounts <-
  arrow::read_feather("data/scratch/moved_accounts.feather") %>%
  # Remove loops
  filter(server != moved_server)
activity <-
  arrow::read_feather("data/scratch/activity.feather",
                      col_select = c("server", "logins")) %>%
  arrange(desc(logins))
popular_and_large_servers <-
  popular_servers %>% filter(count >= 1) %>%
  mutate(count = log10(count))
jm <- arrow::read_feather("data/scratch/joinmastodon.feather")
ma <- moved_accounts %>%
  filter(server %in% popular_and_large_servers$server) %>%
  filter(moved_server %in% popular_and_large_servers$server)
# Construct network
edgeNet <- network(ma, matrix.type = "edgelist")
edgeNet %v% "user_count" <-
  left_join((as_tibble(edgeNet %v% 'vertex.names') %>% rename(server = value)),
            popular_and_large_servers,
            by = "server") %>%
  select(count) %>%
  unlist()
edgeNet %v% "in_jm" <-
  as_tibble(edgeNet %v% 'vertex.names') %>%
  mutate(in_jm = value %in% jm$domain) %>%
  select(in_jm) %>% unlist()
```

We construct an exponential family random graph model (ERGM) in which nodes represent servers and weighted, directed edges represent the number of accounts that moved between servers. Edge counts are modeled as a function of the difference in (log) user counts between the origin and destination servers, the combined (log) user counts of both servers, and whether both or neither server is listed in the JoinMastodon directory:

$$
\begin{aligned}
\text{Sum}_{i,j} = \; & \beta_0 + \\
& \beta_1 \left(\log_{10}(\text{user count}_j) - \log_{10}(\text{user count}_i)\right) + \\
& \beta_2 \left(\log_{10}(\text{user count}_i) + \log_{10}(\text{user count}_j)\right) + \\
& \beta_3 \, \mathrm{1}\{\text{both } i \text{ and } j \text{ in JoinMastodon}\} + \\
& \beta_4 \, \mathrm{1}\{\text{neither } i \text{ nor } j \text{ in JoinMastodon}\}
\end{aligned}
$$

```{r}
#| label: ergm-model
#| cache: true
m1 <-
  ergm(
    edgeNet ~ sum +
      diff("user_count", pow = 1, form = "sum") +
      nodecov("user_count", form = "sum") +
      nodematch("in_jm", diff = TRUE, form = "sum"),
    response = "count",
    reference = ~ Binomial(3),
    control=control.ergm(parallel=4, parallel.type="PSOCK")
  )

save(m1, file = "data/scratch/ergm-model.rda")
```

```{r}
#| label: tag-ergm-result
#| output: asis
ergm_model <- load("data/scratch/ergm-model.rda")

modelsummary(
  m1,
  escape = FALSE,
  coef_rename = c(
    "sum" = "$\\beta_0$ Intercept",
    "diff.sum.t-h.user_count" = "$\\beta_1$ User Count Difference",
    "nodecov.sum.user_count" = "$\\beta_2$ User Count (Node Covariate)",
    "nodematch.sum.in_jm.TRUE" = "$\\beta_3$ In JoinMastodon (Both True)",
    "nodematch.sum.in_jm.FALSE" = "$\\beta_4$ In JoinMastodon (Both False)"
  ),
)
```

We find a strong preference for accounts to move from large servers to smaller servers.

```{python}
#| eval: false
#| include: false
import random

def simulate_account_moves(origin: str, servers: dict, n: int):
    server_list = list(set(servers.keys()) - {origin})
    weights = [servers[x] for x in server_list]
    return pl.DataFrame({
        "simulation": list(range(n)),
        "server": [origin] * n,
        "moved_server": random.choices(server_list, weights=weights, k=n)
    })

simulations = pl.concat([simulate_account_moves(row["server"], {x["server"]: x["count"] for x in popular_servers.iter_rows(named=True)}, 1000) for row in maccounts.iter_rows(named=True)])
m_counts = maccounts.join(popular_servers, how="inner", on="server").rename({"count": "origin_count"}).join(popular_servers.rename({"server": "moved_server"}), how="inner", on="moved_server").rename({"count": "target_count"})
sim_counts = simulations.join(popular_servers, how="inner", on="server").rename({"count": "origin_count"}).join(popular_servers.rename({"server": "moved_server"}), how="inner", on="moved_server").rename({"count": "target_count"})
```

## Tag Clusters

We found _number_ posts which contained between two and five tags.
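
A minimal sketch of this filtering step, assuming a hypothetical posts table with a list-typed `tags` column (the path and schema below are placeholders, not the actual dataset):

```python
# Sketch: keep posts with between two and five hashtags (inclusive).
import polars as pl

posts = pl.scan_ipc("data/scratch/posts.feather")  # hypothetical path
tagged = (
    posts
    .with_columns(pl.col("tags").list.len().alias("n_tags"))
    .filter(pl.col("n_tags").is_between(2, 5))
    .collect()
)
number_of_tagged_posts = len(tagged)
```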

# References {#references}