Refactor revision parsing logic to be columnar #1

Merged
beason merged 27 commits from test-parquet into parquet_support 2025-06-17 18:22:26 +00:00
Owner

Use a configurable table of columns which encapsulate per-field logic.

Use a configurable table of columns which encapsulate per-field logic.
groceryheist was assigned by beason 2025-06-03 22:17:05 +00:00
beason added 19 commits 2025-06-03 22:17:06 +00:00
Changed logic for handling anonymous edits so that wikiq handles
the type for editor ids consistently. Parquet can mix int64 and
None, but not int64 and strings - previously the code used the empty
string to denote anonymous editors.

Tests failing. Don't merge yet.

Signed-off-by: Will Beason <willbeason@gmail.com>
Signed-off-by: Will Beason <willbeason@gmail.com>
This way we can ensure that the parquet code outputs equivalent output.

Signed-off-by: Will Beason <willbeason@gmail.com>
This requires some data smoothing to get read_table and read_parquet
DataFrames to look close enough, but the test now passes and validates
that the data match.

Signed-off-by: Will Beason <willbeason@gmail.com>
Also add test to ensure functionality works.

Signed-off-by: Will Beason <willbeason@gmail.com>
Use simple loop for when we aren't collapsing users.
Add test which covers case when users are deleted.

Signed-off-by: Will Beason <willbeason@gmail.com>
Use groupby so we don't have to deal with edge cases and compare
revisions directly.

Signed-off-by: Will Beason <willbeason@gmail.com>
Tests broken due to url encoding, which can likely now be removed.

Signed-off-by: Will Beason <willbeason@gmail.com>
Signed-off-by: Will Beason <willbeason@gmail.com>
Surprisingly replacing list<str> with str doesn't break anything,
even baselines.

Signed-off-by: Will Beason <willbeason@gmail.com>
This way we're not adding temporary fields to objects that don't
normally have these fields.

Signed-off-by: Will Beason <willbeason@gmail.com>
This is optional, and doesn't impact existing users as preexisting
behavior when users specify an output directory is unchanged.

This makes tests not need to copy large files as part of their
execution, as they can ask files to be written to explicit
locations.

Signed-off-by: Will Beason <willbeason@gmail.com>
This will allow making columns optional, as desired, and make
adding new columns straightforward without impacting existing
behavior.

Signed-off-by: Will Beason <willbeason@gmail.com>
Noargs works, now to do persistence.

Signed-off-by: Will Beason <willbeason@gmail.com>
Now all columns are tested in the parquet test.

Signed-off-by: Will Beason <willbeason@gmail.com>
Signed-off-by: Will Beason <willbeason@gmail.com>
Signed-off-by: Will Beason <willbeason@gmail.com>
Signed-off-by: Will Beason <willbeason@gmail.com>
beason added 1 commit 2025-06-03 22:20:28 +00:00
This should help PR readability.

There is likely still some unused code, but that should be the
bulk of it.

Signed-off-by: Will Beason <willbeason@gmail.com>
groceryheist reviewed 2025-06-03 23:10:53 +00:00
wikiq Outdated
@ -505,3 +360,3 @@
# skip namespaces not in the filter
if self.namespace_filter is not None:
if namespace not in self.namespace_filter:
if page.mwpage.namespace not in self.namespace_filter:
Owner

Curious why this is page.mwpage.namespace now instead of the old logic

Curious why this is page.mwpage.namespace now instead of the old logic
Author
Owner

This keeps all data about the page contained to one object rather than in both the "page" and "mwpage" fields. This means all of the columnar functions can use an interface that requires minimal information and simplifies the need to propagate information from mwpage to page.

This keeps all data about the page contained to one object rather than in both the "page" and "mwpage" fields. This means all of the columnar functions can use an interface that requires minimal information and simplifies the need to propagate information from mwpage to page.
groceryheist reviewed 2025-06-04 01:29:20 +00:00
wikiq Outdated
@ -616,0 +398,4 @@
regex_matches[k] = []
regex_matches[k].append(v)
buffer = table.pop()
Owner

can we give buffer a more descriptive name?

can we give `buffer` a more descriptive name?
Author
Owner

Done!

Done!
beason marked this conversation as resolved
groceryheist reviewed 2025-06-04 01:35:57 +00:00
wikiq Outdated
@ -545,3 +382,1 @@
if not rev.text:
rev.text = ""
# if text exists, we'll check for a sha1 and generate one otherwise
revs[-1] = rev
Owner

This logic is to 1. replace None with "" and then fix some edits from Fandom that didn't come with sha1s. Could we move this to a function so it looks like revs = repair_revs(revs). I'd like the mutation of revs to be super clear.

This logic is to 1. replace `None` with "" and then fix some edits from Fandom that didn't come with sha1s. Could we move this to a function so it looks like `revs = repair_revs(revs)`. I'd like the mutation of revs to be super clear.
Author
Owner

Done!

Done!
beason marked this conversation as resolved
groceryheist reviewed 2025-06-04 01:36:16 +00:00
wikiq Outdated
@ -572,3 +384,1 @@
# wrap user-defined editors in quotes for fread
rev_data.editor = rev.user.text
rev_data.anon = rev.user.id is None
table.add(page.mwpage, list(revs))
Owner

Don't think it's necessary to call list(revs) here.

Don't think it's necessary to call `list(revs)` here.
beason marked this conversation as resolved
groceryheist reviewed 2025-06-04 01:38:05 +00:00
wikiq Outdated
@ -90,0 +88,4 @@
self.__revisions: Generator[list[mwxml.Revision]] = self.rev_list()
@staticmethod
def user_text(rev) -> str | None:
Owner

Love adding types where you have so far. Its kinda awkward to use the types inconsistently. Let's use them everywhere eventually.

Love adding types where you have so far. Its kinda awkward to use the types inconsistently. Let's use them everywhere eventually.
beason marked this conversation as resolved
groceryheist reviewed 2025-06-13 21:20:45 +00:00
tables.py Outdated
@ -0,0 +92,4 @@
class RevisionEditorId(RevisionField[int | None]):
field = pa.field("editorid", pa.int64(), nullable=True)
def extract(self, page: mwtypes.Page, revisions: list[mwxml.Revision]) -> int | None:
Owner

I want to support python 3.9 as that's installed on TACC. I don't think it supports this syntax for Union types.

I want to support python 3.9 as that's installed on TACC. I don't think it supports this syntax for Union types.
beason marked this conversation as resolved
groceryheist reviewed 2025-06-13 21:22:06 +00:00
groceryheist reviewed 2025-06-13 21:23:30 +00:00
@ -0,0 +29,4 @@
@abstractmethod
def extract(self, page: mwtypes.Page, revisions: list[mwxml.Revision]) -> T:
"""
:param page: The page for this set of revisions.
Owner

lets note that when the collapse-revs behavior isn't enabled that we're only passing lists of one revision.

lets note that when the collapse-revs behavior isn't enabled that we're only passing lists of one revision.
beason marked this conversation as resolved
groceryheist reviewed 2025-06-13 21:26:59 +00:00
wikiq Outdated
@ -30,3 +33,3 @@
from dataclasses import dataclass
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.csv as pc
Owner

I usually use pc for pyarrow.compute. Maybe pacsv instead?

I usually use pc for `pyarrow.compute`. Maybe `pacsv` instead?
beason marked this conversation as resolved
groceryheist approved these changes 2025-06-13 21:29:21 +00:00
groceryheist left a comment
Owner

Thank you Will!

Thank you Will!
beason added 4 commits 2025-06-17 16:42:01 +00:00
Signed-off-by: Will Beason <willbeason@gmail.com>
Signed-off-by: Will Beason <willbeason@gmail.com>
Since our execution environment requires this

Signed-off-by: Will Beason <willbeason@gmail.com>
Signed-off-by: Will Beason <willbeason@gmail.com>
beason added 1 commit 2025-06-17 16:46:38 +00:00
Signed-off-by: Will Beason <willbeason@gmail.com>
beason added 2 commits 2025-06-17 17:46:27 +00:00
beason merged commit 0d9ab003f0 into parquet_support 2025-06-17 18:22:26 +00:00
beason deleted branch test-parquet 2025-06-17 18:22:27 +00:00
Sign in to join this conversation.
No reviewers
No Label
No Milestone
No Assignees
2 Participants
Notifications
Due Date
The due date is invalid or out of range. Please use the format 'yyyy-mm-dd'.

No due date set.

Dependencies

No dependencies set.

Reference: collective/mediawiki_dump_tools#1
No description provided.