datasets/README.md: document srun workflow, PYTHON var, container notes

Update the add_months and Step 6 sections with lessons learned from the
first run attempt:
- Replace salloc with srun (releases node automatically on completion)
- Document the PYTHON variable override needed for parallel/venv
- Note that .zst decompression uses the zstandard library due to
  Singularity container restrictions on the system zstd binary
- Add full srun invocation with bash -l, tee logging, and tmux guidance
- Update Step 6 walkthrough to use srun instead of salloc

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-25 19:05:45 -07:00
parent 4854d4f537
commit 6c6e05c360

@@ -73,6 +73,22 @@ read all layers together correctly. At a yearly update cadence the number
of layers stays small; use `merge_layers.sh` to collapse them when
needed.
#### Environment setup
The Python environment runs inside a Singularity container. Set `PYTHON`
to the full path of the venv interpreter so that `parallel` jobs use the
right Python (fresh shells spawned by `parallel` don't inherit the active
venv):
```sh
PYTHON=/gscratch/comdata/users/makohill/cdsc_reddit/venv/bin/python3
```
The `.zst` decompression uses the `zstandard` Python library rather than
the system `zstd` binary, which is inaccessible from inside the container.
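
A minimal sketch of how a wrapper script could honor the `PYTHON` override, assuming it falls back to whatever `python3` is on `PATH` when the variable is unset. This illustrates the general pattern only; it is not taken from `add_months.sh`, whose actual handling of `PYTHON` may differ.

```sh
# Use the caller's PYTHON if it is set and non-empty; otherwise fall
# back to python3 on PATH (assumed default, not confirmed by the script).
PYTHON="${PYTHON:-python3}"

# Warn early if the chosen interpreter cannot be found.
if command -v "$PYTHON" >/dev/null 2>&1; then
    echo "using interpreter: $PYTHON"
else
    echo "warning: interpreter not found: $PYTHON" >&2
fi
```

Resolving the interpreter to a full path up front means every child process started later sees the same Python, whether or not a venv is active in that shell.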
#### Dump directory
The new `.zst` dump files must be accessible at `COMMENTS_DUMPDIR` and
`SUBMISSIONS_DUMPDIR`. Override the defaults (which match `dumps_helper.py`)
via environment variables if the files are not in the standard locations:
@@ -80,18 +96,37 @@ via environment variables if the files are not in the standard locations:
```sh
COMMENTS_DUMPDIR=/path/to/new/comments \
SUBMISSIONS_DUMPDIR=/path/to/new/submissions \
./add_months.sh 2025-01 2025-02 2025-03
```
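
Before starting a long run, it can help to confirm that both dump directories exist. The following is a hedged pre-flight sketch, not part of `add_months.sh`. The fallback paths are the same placeholders used above (the real defaults live in `dumps_helper.py`), and the `missing` counter is a name introduced here for illustration.

```sh
# Placeholder fallbacks so the sketch runs anywhere; substitute real paths.
COMMENTS_DUMPDIR="${COMMENTS_DUMPDIR:-/path/to/new/comments}"
SUBMISSIONS_DUMPDIR="${SUBMISSIONS_DUMPDIR:-/path/to/new/submissions}"

# Count how many of the two directories are absent.
missing=0
for dir in "$COMMENTS_DUMPDIR" "$SUBMISSIONS_DUMPDIR"; do
    if [ ! -d "$dir" ]; then
        echo "missing dump directory: $dir" >&2
        missing=$((missing + 1))
    fi
done
echo "missing directories: $missing"
```

A non-zero count means the run would fail partway through, so it is cheaper to catch it here than after a node has been allocated.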
-Part 1 runs directly on a compute node. For Part 2 there are two options:
-- **Single fat node** (simpler, often faster for smaller sorts): `salloc`
-a `cpu-g2` node (128 cores, ~1 TB RAM) and run the Part 2 script
-directly with `spark-submit` or `python3`. See Step 6 of the walkthrough
-below for the `salloc` invocation.
-- **Multi-node Spark cluster**: use `start_spark_and_run.sh` from a login
-node. It allocates nodes via `salloc` and handles cluster coordination.
-Pass the number of nodes as the first argument.
#### Running as a Slurm job
The recommended way to run `add_months.sh` is via `srun` on a fat
`cpu-g2` node. Using `srun` (rather than `salloc`) means the node is
released automatically as soon as the script finishes, regardless of the
walltime. Run from a login node inside a `tmux` session so the terminal
survives disconnections:

```sh
tmux new -s add_months
srun -p cpu-g2 -A comdata --nodes=1 --time=72:00:00 -c 112 --mem=400G \
bash -l -c "
cd /mmfs1/gscratch/comdata/users/makohill/cdsc_reddit && \
PYTHON=/gscratch/comdata/users/makohill/cdsc_reddit/venv/bin/python3 \
COMMENTS_DUMPDIR=/path/to/new/comments \
SUBMISSIONS_DUMPDIR=/path/to/new/submissions \
./datasets/add_months.sh 2025-01 2025-02 ... YYYY-MM
" 2>&1 | tee /gscratch/comdata/users/makohill/add_months_run.log
```
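The `-l` flag makes `bash` a login shell, so it reads your login startup
files on the compute node (`~/.bash_profile`, which typically sources
`~/.bashrc`) and the Spark environment is available. The `tee` command
writes output to both the terminal and a log file so you can review it
later.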
Detach from tmux with `Ctrl-b d` and reattach with `tmux attach -t add_months`.
For a multi-node Spark cluster instead, use `add_months_multinode.sh`
from a login node — it takes the number of nodes as its first argument.
### Merge layers — `merge_layers.sh`
@@ -296,25 +331,25 @@ continuous with the most recent data we already have.
### Step 6: Part 2 — sorting the `.parquet` files by author and subreddit via Spark
If the `.parquet` files reasonably appear to be complete, we can now
-sort them by author and subreddit. The most efficient way to do so is by
-using one node on `cpu-g2` with 128 CPUs and 994G memory. This one node
-splits into up to six slices (four in our current case) so the tasks
-will still be parallelized (`hyakalloc` or [this Hyak blog][hyak-blog]
-are good resources for further information). Run `tmux` on a login
-node, then grab the whole node for up to a week with:
sort them by author and subreddit. The most efficient way to do so is via
`srun` on a `cpu-g2` node (128 CPUs, ~1 TB RAM). Using `srun` releases
the node automatically when the job finishes. Run from a login node
inside `tmux`:

```sh
-salloc -p cpu-g2 -A comdata --nodes=1 --time=168:00:00 -c 128 --mem=994G
srun -p cpu-g2 -A comdata --nodes=1 --time=72:00:00 -c 112 --mem=400G \
bash -l -c "
cd /path/to/cdsc_reddit/datasets && \
source \$SPARK_CONF_DIR/spark-env.sh && \
start_spark_cluster.sh && \
spark-submit --master spark://\$(hostname):\$SPARK_MASTER_PORT submissions_part2.py && \
spark-submit --master spark://\$(hostname):\$SPARK_MASTER_PORT comments_part2.py && \
stop-all.sh
"
```
[hyak-blog]: https://hyak.uw.edu/blog/g1-vs-g2/
-Once Slurm drops you onto the compute node, run
-```sh
-./start_spark_and_run.sh submissions_part2.py
-```
Monitor via `htop` (as described in Step 4); the CPUs may not always
show high usage but you should see that memory is being used. Repeat
for the comments. Successful jobs will result in