---
title: Onboarding The Fediverse (working title)
short-title: Onboarding Fediverse
authors:
  - name: Carl Colglazier
    affiliation:
      name: Northwestern University
      city: Evanston
      state: Illinois
      country: United States
    #roles: writing
    corresponding: true
bibliography: references.bib
acm-metadata:
  final: false
  copyright-year: 2024
  acm-year: 2024
  copyright: rightsretained
  doi: XXXXXXX.XXXXXXX
  conference-acronym: "PACMHCI"
  #conference-name: |
  #  Make sure to enter the correct
  #  conference title from your rights confirmation email
  #conference-date: June 03--05, 2018
  #conference-location: Woodstock, NY
  #price: "15.00"
  #isbn: 978-1-4503-XXXX-X/18/06
format:
  acm-pdf:
    documentclass: acmart
    classoption: [acmsmall,manuscript,screen,authorversion,nonacm,timestamp]
abstract: |
  When trying to join the Fediverse, a decentralized collection of interoperable social networking websites, new users face the dilemma of choosing a home server. Using trace data from thousands of new Fediverse accounts, we show that this choice matters and significantly affects the probability that an account remains active in the future. We then use insights from this relationship to build a tool that can help new Fediverse users find a server with a high probability of being a good match based on their interests.
execute:
  echo: false
  error: false
  freeze: auto
---
```{r}
#| label: r-setup
#| output: false
library(reticulate)
library(tidyverse)
library(arrow)
library(statnet)
library(network)
library(survival)
library(ggsurvfit)
library(modelsummary)
options(arrow.skip_nul = TRUE)
```
We first explore the extent to which server choice matters. We find that accounts that join smaller, more interest-based servers are more likely to continue posting six months after their creation.
Using these findings, we then propose a tool that can help users find servers that match their interests.
# Background
## Newcomers in Online Communities
Onboarding newcomers is important for online communities. Any community can expect a certain amount of turnover, so the ability to bring in new members is essential to a community's long-term health and longevity.
RQ: What server attributes correspond with better newcomer retention?
## Migrations in Online Communities
All online communities and accounts trend toward death.
+ Fiesler on online fandom communities [@fieslerMovingLandsOnline2020]
+ TeBlunthuis on competition and mutualism [@teblunthuisIdentifyingCompetitionMutualism2022]
+ Work on "alt-tech" communities.
# Empirical Setting
The Fediverse is a set of decentralized online social networks that interoperate using shared protocols like ActivityPub. Mastodon is a software program used by many Fediverse servers and offers a user experience similar to the TweetDeck client for Twitter.
Discovery has been challenging on Mastodon. The developers and user base tend to be skeptical of algorithmic intrusions, instead opting for timelines that show posts only in reverse chronological order. Search is also limited: public hashtags are searchable, but most servers have traditionally not supported searching keywords or arbitrary strings, and accounts can only be searched using their full `username@server` form.
Mastodon features a "local" timeline which shows all public posts from accounts that share the same home server. On larger servers, this timeline can be unwieldy; on smaller servers, however, it presents an opportunity to discover new posts and users of potential interest.
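For reference, the sketch below shows one way this server-wide timeline is exposed through Mastodon's public REST API; the server name is illustrative and this is not the collection code used for this paper.

```python
# Minimal sketch: read a server's local timeline via the public Mastodon API.
# The server name is a placeholder; most servers expose this without auth.
import requests

resp = requests.get(
    "https://mastodon.example/api/v1/timelines/public",
    params={"local": "true", "limit": 5},  # only posts originating on this server
    timeout=30,
)
resp.raise_for_status()
for status in resp.json():
    print(status["account"]["acct"], status["created_at"])
```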
Mastodon offers its users a high degree of data portability. Users can move their accounts across instances while retaining their follows (their post data, however, does not move to the new account). The choice of an initial instance is consequently not irreversible.
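When an account has moved, its public record carries a `moved` field that points at the destination account; the preprocessing below relies on this field. A small illustrative example (all names hypothetical):

```python
# Illustrative (hypothetical) account record after a move. The preprocessing
# below extracts `moved.acct` and splits on '@' to recover the destination server.
account = {
    "acct": "alice",
    "url": "https://origin.example/@alice",
    "moved": {"acct": "alice@destination.example"},
}

destination_server = account["moved"]["acct"].split("@")[1]
print(destination_server)  # destination.example
```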
# Data
```{python}
#| label: py-preprocess-data
#| cache: true
#| output: false
from code.load_accounts import *
import polars as pl  # `pl` is used directly in the transformations below

accounts = read_accounts_file("data/accounts.feather")
# Write a parsed accounts file for R to use
accounts.with_columns(
pl.col("data").struct.field("moved").is_not_null().alias("has_moved")
).drop(
["data", "data_string"]
).write_ipc("data/scratch/accounts.feather")
# Accounts that have moved: unpack the `moved` field and parse the
# destination server from the `user@server` handle.
moved_accounts = accounts.with_columns(
pl.col("data").struct.field("moved")
).drop_nulls("moved").with_columns(
pl.col("moved").struct.field("acct").alias("moved_acct"),
).filter(
pl.col("moved_acct").str.contains('@')
).with_columns(
pl.col("moved_acct").str.split('@').list.get(1).alias("moved_server")
)
number_of_accounts = len(accounts)
popular_servers = accounts.group_by("server").count().sort("count", descending=True)
common_moves = moved_accounts.group_by(
["server", "moved_server"]
).count().sort("count", descending=True)
common_moves.write_ipc("data/scratch/moved_accounts.feather")
common_moves.rename({
"server": "Source",
"moved_server": "Target",
}).write_csv("data/scratch/moved_accounts.csv")
maccounts = moved_accounts.select(["account", "server", "moved_server"])
popular_servers.write_ipc("data/scratch/popular_servers.feather")
jm = pl.read_json("data/joinmastodon.json")
jm.write_ipc("data/scratch/joinmastodon.feather")
read_metadata_file("data/metadata_2023-10-01.feather").drop(
["data", "data_string"]
).write_ipc("data/scratch/metadata.feather")
```
```{r}
#| label: r-load-accounts
#| output: false
accounts <- arrow::read_feather("data/scratch/accounts.feather", col_select=c("server", "username", "created_at", "last_status_at", "statuses_count", "has_moved", "bot")) %>%
filter(!has_moved) %>%
filter(!bot) %>%
# TODO: what's going on here?
filter(!is.na(last_status_at)) %>%
# sanity check
filter(created_at >= "2022-01-01") %>%
filter(created_at < "2023-03-01") %>%
# We don't want accounts that were created and then immediately stopped being active
filter(statuses_count >= 5) %>%
filter(last_status_at >= created_at) %>%
mutate(active = last_status_at >= "2023-09-01") %>%
# cap last_status_at at the end of the observation window (2023-09-01)
mutate(last_status_at = ifelse(active, lubridate::ymd_hms("2023-09-01 00:00:00", tz = "UTC"), last_status_at)) %>%
mutate(active_time = difftime(last_status_at, created_at, units="secs"))
```
**Mastodon Profiles**: We built our sample from data previously collected from posts on public Mastodon timelines between October 2020 and February 2023. We then queried for up-to-date information on those accounts, including their most recent status and whether the account had moved. This gave us a total of `r nrow(accounts)` accounts.
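A minimal sketch of this re-query step, assuming the standard Mastodon REST API lookup endpoint; the actual collection pipeline differs, and `server` and `acct` are placeholders.

```python
# Sketch of re-querying an account's current public record (not the exact
# collection code). The fields returned are the ones used in the analysis below.
import requests

def fetch_account(server: str, acct: str) -> dict:
    resp = requests.get(
        f"https://{server}/api/v1/accounts/lookup",
        params={"acct": acct},
        timeout=30,
    )
    resp.raise_for_status()
    data = resp.json()
    return {
        "created_at": data.get("created_at"),
        "last_status_at": data.get("last_status_at"),
        "statuses_count": data.get("statuses_count"),
        "moved": data.get("moved"),  # present only if the account migrated
    }
```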
```{r}
adv_server_counts <- arrow::read_feather("data/scratch/accounts.feather", col_select=c("server", "username", "created_at", "bot")) %>%
filter(!bot) %>%
filter(created_at > "2017-01-01") %>%
filter(created_at <= "2023-01-01") %>%
group_by(server) %>%
arrange(created_at) %>%
mutate(r = row_number()) %>%
arrange(desc(r)) %>%
distinct(server, created_at, .keep_all=TRUE) %>%
select(server, created_at, r) %>%
ungroup() %>%
mutate(server_date = as.Date(created_at))
```
```{r}
#| label: fig-mastodon-social-growth
#| fig-cap: "Cumulative count of collected accounts on mastodon.social, by creation date"
adv_server_counts %>% filter(server == "mastodon.social") %>%
ggplot(aes(x=server_date, y=r)) +
geom_line() + theme_minimal()
```
```{r}
#| label: fig-account-activity-prop
#| fig-cap: "Survival probability of account activity over time, by server size (log10 bins)"
#| fig-height: 4
server_counts <- arrow::read_feather("data/scratch/accounts.feather", col_select=c("server", "username", "created_at", "bot")) %>%
filter(created_at <= "2023-01-01") %>%
group_by(server) %>%
summarize(server_count = n()) %>%
arrange(desc(server_count)) %>%
mutate(server_count_bin = floor(log10(server_count)))
metadata <- arrow::read_feather("data/scratch/metadata.feather", col_select=c("server", "user_count")) %>%
arrange(desc(user_count)) %>%
mutate(server_count = user_count) %>%
mutate(server_count_bin = floor(log10(server_count)))
jm <- arrow::read_feather("data/scratch/joinmastodon.feather")
a <- accounts %>%
inner_join(metadata, by="server") %>%
mutate(metadata = server_count > 500) %>%
mutate(active_time = as.integer(active_time)) %>%
mutate(active_time_weeks = active_time / 60 / 60 / 24 / 7) %>%
mutate(status = ifelse(active, 0, 1)) %>% mutate(jm = server %in% jm$domain)# %>% filter(jm)
survfit2(Surv(active_time_weeks, status) ~ server_count_bin, data = a) %>%
ggsurvfit() +
add_confidence_interval() +
scale_y_continuous(limits = c(0, 1)) +
labs(
y = "Overall survival probability",
x = "Time (weeks)",
colour = "Server Size (log10)",
fill = "Server Size (log10)",
) +
add_risktable() +
scale_x_continuous(
breaks = seq(0, max(a$active_time_weeks, na.rm = TRUE), by = 52),
labels = seq(0, max(a$active_time_weeks, na.rm = TRUE), by = 52)
) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
```
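The curves in @fig-account-activity-prop are Kaplan–Meier estimates (the default computed by `survfit2`). Writing $t_i$ for the distinct times at which accounts stopped posting, $d_i$ for the number of accounts whose last post was at $t_i$, and $n_i$ for the number of accounts still at risk just before $t_i$, the estimated probability of remaining active past time $t$ is

$$
\hat{S}(t) = \prod_{i:\, t_i \le t} \left(1 - \frac{d_i}{n_i}\right).
$$

Accounts that were still active at the end of the observation window (September 2023) are treated as right-censored.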
To determine the relationship between server size and user retention, we...
## Moved Accounts
```{r}
#| label: fig-moved-accounts
#| fig-cap: "Net account moves (in-moves minus out-moves) by server size"
#| fig-height: 4
moved_accounts <- arrow::read_feather("data/scratch/moved_accounts.feather")
popular_servers <- arrow::read_feather("data/scratch/popular_servers.feather")
server_movement_data <- left_join(
(moved_accounts %>% group_by(server) %>% summarize(out_count = sum(count)) %>% select(server, out_count)),
(moved_accounts %>% group_by(moved_server) %>% summarize(in_count = sum(count)) %>% select(moved_server, in_count) %>% rename(server=moved_server)),
by="server"
) %>% replace_na(list(out_count = 0, in_count = 0)) %>%
mutate(diff = in_count - out_count) %>%
arrange(diff) %>%
left_join(., popular_servers, by="server") %>%
rename(user_count = count) %>% arrange(desc(user_count))
server_movement_data %>%
ggplot(aes(x=user_count, y=diff)) +
geom_point() + scale_x_log10() + theme_minimal()
```
If there were no relationship, we would expect these moves to be random with respect to server size.
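One way to make this null model explicit (and the one implemented in the simulation at the end of this section) is to draw each moving account's destination with probability proportional to server size: if $n_j$ denotes the number of accounts observed on server $j$, then for an account leaving server $i$,

$$
\Pr(\text{destination} = j \mid \text{origin} = i) = \frac{n_j}{\sum_{k \neq i} n_k}, \qquad j \neq i.
$$

Departures from this baseline would indicate that moves are not random with respect to size.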
```{r}
popular_servers <- arrow::read_feather("data/scratch/popular_servers.feather")
moved_accounts <- arrow::read_feather("data/scratch/moved_accounts.feather")
activity <- arrow::read_feather("data/scratch/activity.feather", col_select=c("server", "logins")) %>% arrange(desc(logins))
popular_and_large_servers <- popular_servers %>% filter(count >= 1) %>%
mutate(count = log10(count))
jm <- arrow::read_feather("data/scratch/joinmastodon.feather")
ma <- moved_accounts %>% filter(server %in% popular_and_large_servers$server) %>% filter(moved_server %in% popular_and_large_servers$server)
# Build a valued network from the server-to-server move edgelist and attach
# vertex attributes: (log10) user count and JoinMastodon listing.
edgeNet<-network(ma,matrix.type="edgelist")
edgeNet%v%"user_count" <- left_join((as_tibble(edgeNet%v%'vertex.names') %>% rename(server=value)), popular_and_large_servers, by="server") %>% select(count) %>% unlist()
edgeNet%v%"in_jm" <- as_tibble(edgeNet%v%'vertex.names') %>% mutate(in_jm = value %in% jm$domain) %>% select(in_jm) %>% unlist()
```
```{r}
#| label: ergm-model
#| cache: true
m1 <- ergm(edgeNet ~ sum + diff("user_count", pow=1, form="sum") + nodecov("user_count", form="sum") + nodematch("in_jm", diff=TRUE, form="sum"), response="count", reference=~Binomial(3))
save(m1, file="data/scratch/ergm-model.rda")
```
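For reference, the model fit above is a valued ERGM on the matrix of between-server move counts $y$, with the general form

$$
\Pr(Y = y) = \frac{h(y)\,\exp\{\theta^{\top} g(y)\}}{\kappa(\theta)},
$$

where $g(y)$ collects the statistics in the formula (an overall sum term, the dyadic difference and node covariate in log10 user count, and matching on JoinMastodon listing), $h(y)$ is the binomial reference measure specified by `reference=~Binomial(3)`, and $\kappa(\theta)$ normalizes over all count-valued networks on the same set of servers.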
```{r}
#| label: tag-ergm-result
#| output: asis
load("data/scratch/ergm-model.rda") # loads the fitted model object `m1`
modelsummary(
m1,
coef_rename = c(
"sum" = "Intercept",
"diff.sum.t-h.user_count " = "User Count Difference",
"nodecov.sum.user_count " = "User Count (Node Covariate)",
"nodematch.sum.in_jm.TRUE" = "In JoinMastodon (Both True)",
"nodematch.sum.in_jm.FALSE" = "In JoinMastodon (Both False)"
),
)
```
```{python}
#| eval: false
#| include: false
import random
def simulate_account_moves(origin: str, servers: dict, n: int):
    # Simulate n moves out of `origin`, drawing each destination with
    # probability proportional to the destination server's account count.
    server_list = list(set(servers.keys()) - {origin})
    weights = [servers[x] for x in server_list]
    return pl.DataFrame({
        "simulation": list(range(n)),
        "server": [origin] * n,
        "moved_server": random.choices(server_list, weights=weights, k=n),
    })
simulations = pl.concat([simulate_account_moves(row["server"], {x["server"]: x["count"] for x in popular_servers.iter_rows(named=True)}, 1000) for row in maccounts.iter_rows(named=True)])
m_counts = maccounts.join(popular_servers, how="inner", on="server").rename({"count": "origin_count"}).join(popular_servers.rename({"server": "moved_server"}), how="inner", on="moved_server").rename({"count": "target_count"})
sim_counts = simulations.join(popular_servers, how="inner", on="server").rename({"count": "origin_count"}).join(popular_servers.rename({"server": "moved_server"}), how="inner", on="moved_server").rename({"count": "target_count"})
```
## Tag Clusters
We found _number_ posts which contained between two and five tags.