datasets/README.md: document srun workflow, PYTHON var, container notes

Update the add_months and Step 6 sections with lessons learned from the
first run attempt:

- Replace salloc with srun (releases node automatically on completion)
- Document the PYTHON variable override needed for parallel/venv
- Note that .zst decompression uses the zstandard library due to
  Singularity container restrictions on the system zstd binary
- Add full srun invocation with bash -l, tee logging, and tmux guidance
- Update Step 6 walkthrough to use srun instead of salloc

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@@ -73,6 +73,22 @@ read all layers together correctly. At a yearly update cadence the number
 of layers stays small; use `merge_layers.sh` to collapse them when
 needed.
 
+#### Environment setup
+
+The Python environment runs inside a Singularity container. Set `PYTHON`
+to the full path of the venv interpreter so that `parallel` jobs use the
+right Python (fresh shells spawned by `parallel` don't inherit the active
+venv):
+
+```sh
+PYTHON=/gscratch/comdata/users/makohill/cdsc_reddit/venv/bin/python3
+```
+
+The `.zst` decompression uses the `zstandard` Python library rather than
+the system `zstd` binary, which is inaccessible from inside the container.
+
+#### Dump directory
+
 The new `.zst` dump files must be accessible at `COMMENTS_DUMPDIR` and
 `SUBMISSIONS_DUMPDIR`. Override the defaults (which match `dumps_helper.py`)
 via environment variables if the files are not in the standard locations:
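
The `PYTHON=...` line added in the hunk above is a plain assignment, and whether a child process such as a `parallel` job sees it depends on how the variable reaches the child's environment. The sketch below only illustrates that shell behavior; it is not the logic of `add_months.sh`. The `${PYTHON:-python3}` fallback and the `/opt/venv` path are assumptions made up for the example.

```sh
# Start from a clean slate so an inherited, already-exported PYTHON
# does not affect the demonstration.
unset PYTHON

# Plain assignment: a shell variable only, not part of the environment.
PYTHON=/opt/venv/bin/python3   # hypothetical path
plain=$(sh -c 'echo "${PYTHON:-python3}"')

# After export, child processes inherit the value.
export PYTHON
exported=$(sh -c 'echo "${PYTHON:-python3}"')

echo "plain assignment: child sees $plain"
echo "after export:     child sees $exported"
```

The `srun` invocation later in this diff prefixes the assignment to the `./datasets/add_months.sh` command, which also places the value in that command's environment without a separate `export`.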
@@ -80,18 +96,37 @@ via environment variables if the files are not in the standard locations:
 ```sh
 COMMENTS_DUMPDIR=/path/to/new/comments \
 SUBMISSIONS_DUMPDIR=/path/to/new/submissions \
-./add_months.sh 2025-01 2025-02 2025-03
 ```
 
-Part 1 runs directly on a compute node. For Part 2 there are two options:
+#### Running as a Slurm job
 
-- **Single fat node** (simpler, often faster for smaller sorts): `salloc`
-  a `cpu-g2` node (128 cores, ~1 TB RAM) and run the Part 2 script
-  directly with `spark-submit` or `python3`. See Step 6 of the walkthrough
-  below for the `salloc` invocation.
-- **Multi-node Spark cluster**: use `start_spark_and_run.sh` from a login
-  node. It allocates nodes via `salloc` and handles cluster coordination.
-  Pass the number of nodes as the first argument.
+The recommended way to run `add_months.sh` is via `srun` on a fat
+`cpu-g2` node. Using `srun` (rather than `salloc`) means the node is
+released automatically as soon as the script finishes, regardless of the
+walltime. Run from a login node inside a `tmux` session so the terminal
+survives disconnections:
+
+```sh
+tmux new -s add_months
+
+srun -p cpu-g2 -A comdata --nodes=1 --time=72:00:00 -c 112 --mem=400G \
+    bash -l -c "
+        cd /mmfs1/gscratch/comdata/users/makohill/cdsc_reddit && \
+        PYTHON=/gscratch/comdata/users/makohill/cdsc_reddit/venv/bin/python3 \
+        COMMENTS_DUMPDIR=/path/to/new/comments \
+        SUBMISSIONS_DUMPDIR=/path/to/new/submissions \
+        ./datasets/add_months.sh 2025-01 2025-02 ... YYYY-MM
+    " 2>&1 | tee /gscratch/comdata/users/makohill/add_months_run.log
+```
+
+The `bash -l` flag sources `.bashrc` on the compute node so the Spark
+environment is available. The `tee` command writes output to both the
+terminal and a log file so you can review it later.
+
+Detach from tmux with `Ctrl-b d` and reattach with `tmux attach -t add_months`.
+
+For a multi-node Spark cluster instead, use `add_months_multinode.sh`
+from a login node — it takes the number of nodes as its first argument.
+
 ### Merge layers — `merge_layers.sh`
 
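
One property of the `srun ... 2>&1 | tee ...` pipeline in the hunk above: in a shell without `pipefail`, the pipeline's exit status is that of `tee`, so a failed job can still look successful to whatever launched it. Below is a minimal POSIX `sh` sketch of one way to keep both the log and the real exit status. The helper name `run_logged` and the temporary status file are assumptions for illustration, not part of the repository.

```sh
# Hypothetical helper: run a command, tee its combined stdout/stderr to a
# log file, and return the command's own exit status rather than tee's.
run_logged() {
    rl_log=$1
    shift
    rl_status_file=$(mktemp) || return 1
    {
        rl_rc=0
        "$@" 2>&1 || rl_rc=$?
        # The braces run in a pipeline subshell, so hand the status back
        # through a file instead of a variable.
        echo "$rl_rc" > "$rl_status_file"
    } | tee "$rl_log"
    rl_rc=$(cat "$rl_status_file")
    rm -f "$rl_status_file"
    return "$rl_rc"
}
```

It would be called as `run_logged /path/to/run.log srun ... bash -l -c "..."`. In bash specifically, `set -o pipefail` before the original pipeline has a similar effect with less machinery.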
@@ -296,25 +331,25 @@ continuous with the most recent data we already have.
 ### Step 6: Part 2 — sorting the `.parquet` files by author and subreddit via Spark
 
 If the `.parquet` files reasonably appear to be complete, we can now
-sort them by author and subreddit. The most efficient way to do so is by
-using one node on `cpu-g2` with 128 CPUs and 994G memory. This one node
-splits into up to six slices (four in our current case) so the tasks
-will still be parallelized (`hyakalloc` or [this Hyak blog][hyak-blog]
-are good resources for further information). Run `tmux` on a login
-node, then grab the whole node for up to a week with:
+sort them by author and subreddit. The most efficient way to do so is via
+`srun` on a `cpu-g2` node (128 CPUs, ~1 TB RAM). Using `srun` releases
+the node automatically when the job finishes. Run from a login node
+inside `tmux`:
 
 ```sh
-salloc -p cpu-g2 -A comdata --nodes=1 --time=168:00:00 -c 128 --mem=994G
+srun -p cpu-g2 -A comdata --nodes=1 --time=72:00:00 -c 112 --mem=400G \
+    bash -l -c "
+        cd /path/to/cdsc_reddit/datasets && \
+        source \$SPARK_CONF_DIR/spark-env.sh && \
+        start_spark_cluster.sh && \
+        spark-submit --master spark://\$(hostname):\$SPARK_MASTER_PORT submissions_part2.py && \
+        spark-submit --master spark://\$(hostname):\$SPARK_MASTER_PORT comments_part2.py && \
+        stop-all.sh
+    "
 ```
 
 [hyak-blog]: https://hyak.uw.edu/blog/g1-vs-g2/
 
-Once Slurm drops you onto the compute node, run
-
-```sh
-./start_spark_and_run.sh submissions_part2.py
-```
-
 Monitor via `htop` (as described in Step 4); the CPUs may not always
 show high usage but you should see that memory is being used. Repeat
 for the comments. Successful jobs will result in
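
The backslashes before `$` in the Step 6 command matter because the whole script is passed inside double quotes. An unescaped `$VAR` is expanded by the login-node shell before `srun` starts, while `\$VAR` and `\$(hostname)` are left for the shell on the compute node. The following is a small self-contained demonstration, using a throwaway variable `WHERE` and a local `sh -c` as a stand-in for the remote `bash -l -c`; both are illustrative assumptions.

```sh
WHERE=outer

# The outer shell expands $WHERE immediately; \$WHERE is passed through as a
# literal $WHERE and expanded by the inner shell after it sets its own value.
result=$(sh -c "WHERE=inner; echo unescaped=$WHERE escaped=\$WHERE")

echo "$result"   # prints: unescaped=outer escaped=inner
```

The same reasoning explains why `\$(hostname)` in the `spark-submit` lines resolves to the compute node's hostname rather than the login node's.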