Refactor revision parsing logic to be columnar #1
2 Participants
Reference: collective/mediawiki_dump_tools#1
Use a configurable table of columns which encapsulate per-field logic.
@ -505,3 +360,3 @@
# skip namespaces not in the filter
if self.namespace_filter is not None:
if namespace not in self.namespace_filter:
if page.mwpage.namespace not in self.namespace_filter:
Curious why this is `page.mwpage.namespace` now instead of the old logic.
This keeps all data about the page in one object rather than split across both the "page" and "mwpage" fields. All of the columnar functions can then use an interface that requires minimal information, and there is less need to propagate information from mwpage to page.
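For readers following along, here is a rough sketch of what a "table of columns" with a minimal `(page, revisions)` interface could look like. It is not the PR's code: `Page`, `Revision`, `Column`, and the three column classes are hypothetical stand-ins, and the internals of `Table.add` and `Table.pop` are assumed (only the method names appear in the diff).

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Any, Dict, Generic, List, Optional, TypeVar

T = TypeVar("T")


@dataclass
class Page:
    """Hypothetical stand-in for mwtypes.Page."""
    id: int
    title: str
    namespace: int


@dataclass
class Revision:
    """Hypothetical stand-in for mwxml.Revision."""
    id: int
    user_id: Optional[int]


class Column(ABC, Generic[T]):
    """One output column: owns its name and its per-field extraction logic."""
    name: str

    @abstractmethod
    def extract(self, page: Page, revisions: List[Revision]) -> T:
        ...


class RevId(Column[int]):
    name = "revid"

    def extract(self, page: Page, revisions: List[Revision]) -> int:
        # Assumption: the last revision in the list represents the group.
        return revisions[-1].id


class Namespace(Column[int]):
    name = "namespace"

    def extract(self, page: Page, revisions: List[Revision]) -> int:
        # Page-level data comes from the single page object.
        return page.namespace


class EditorId(Column[Optional[int]]):
    name = "editorid"

    def extract(self, page: Page, revisions: List[Revision]) -> Optional[int]:
        return revisions[-1].user_id


class Table:
    """Configurable list of columns plus one buffer per column."""

    def __init__(self, columns: List[Column]) -> None:
        self.columns = columns
        self.data: Dict[str, List[Any]] = {c.name: [] for c in columns}

    def add(self, page: Page, revisions: List[Revision]) -> None:
        # Each column pulls its own value; the caller knows nothing about fields.
        for column in self.columns:
            self.data[column.name].append(column.extract(page, revisions))

    def pop(self) -> Dict[str, List[Any]]:
        # Hand back the buffered rows and start a fresh buffer.
        out = self.data
        self.data = {c.name: [] for c in self.columns}
        return out


def demo() -> Dict[str, List[Any]]:
    table = Table([RevId(), Namespace(), EditorId()])
    page = Page(id=1, title="Main", namespace=0)
    table.add(page, [Revision(id=10, user_id=7)])
    table.add(page, [Revision(id=11, user_id=None)])
    return table.pop()
```

Adding or removing an output field in this sketch only means changing the list passed to `Table`, which is the property the reply above is describing.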
@ -616,0 +398,4 @@
regex_matches[k] = []
regex_matches[k].append(v)
buffer = table.pop()
Can we give `buffer` a more descriptive name?
Done!
@ -545,3 +382,1 @@
if not rev.text:
rev.text = ""
# if text exists, we'll check for a sha1 and generate one otherwise
revs[-1] = rev
This logic is to 1. replace `None` with "" and 2. fix some edits from Fandom that didn't come with sha1s. Could we move this to a function so it looks like `revs = repair_revs(revs)`? I'd like the mutation of revs to be super clear.
Done!
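A minimal sketch of the kind of helper being asked for, assuming a simplified revision type. `Rev` is a hypothetical stand-in for `mwxml.Revision`, and the hex-digest SHA-1 is an assumption about how a missing hash would be generated; the PR's actual `repair_revs` is not shown on this page.

```python
import hashlib
from dataclasses import dataclass, replace
from typing import List, Optional


@dataclass
class Rev:
    """Hypothetical stand-in for mwxml.Revision with only the fields used here."""
    id: int
    text: Optional[str]
    sha1: Optional[str]


def repair_revs(revs: List[Rev]) -> List[Rev]:
    """Return repaired copies: empty text becomes "", missing sha1 is computed."""
    repaired = []
    for rev in revs:
        # 1. Replace None (or otherwise empty) text with "".
        text = rev.text or ""
        # 2. Keep an existing sha1; otherwise derive one from the text.
        #    Assumption: a hex digest of the UTF-8 bytes is acceptable here.
        sha1 = rev.sha1 or hashlib.sha1(text.encode("utf8")).hexdigest()
        repaired.append(replace(rev, text=text, sha1=sha1))
    return repaired
```

Returning new objects instead of editing in place means `revs = repair_revs(revs)` is the only point where the list changes, which keeps the mutation visible at the call site.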
@ -572,3 +384,1 @@
# wrap user-defined editors in quotes for fread
rev_data.editor = rev.user.text
rev_data.anon = rev.user.id is None
table.add(page.mwpage, list(revs))
Don't think it's necessary to call `list(revs)` here.
@ -90,0 +88,4 @@
self.__revisions: Generator[list[mwxml.Revision]] = self.rev_list()
@staticmethod
def user_text(rev) -> str | None:
Love adding types where you have so far. It's kinda awkward to use the types inconsistently, though. Let's use them everywhere eventually.
@ -0,0 +92,4 @@
class RevisionEditorId(RevisionField[int | None]):
field = pa.field("editorid", pa.int64(), nullable=True)
def extract(self, page: mwtypes.Page, revisions: list[mwxml.Revision]) -> int | None:
I want to support Python 3.9, as that's what is installed on TACC. I don't think it supports this syntax for Union types.
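For context: the `X | Y` union syntax arrived in Python 3.10 (PEP 604). `from __future__ import annotations` would defer the annotation on `extract`, but not the subscript in the class bases, which is evaluated when the class is created. A 3.9-compatible spelling might look like the sketch below. `RevisionField` and `Rev` here are stripped-down hypothetical stand-ins (no pyarrow field), and reading the id from the last revision is an assumption.

```python
from typing import Generic, List, Optional, TypeVar

T = TypeVar("T")


class RevisionField(Generic[T]):
    """Hypothetical minimal base class; the PR's version also carries a pyarrow field."""

    def extract(self, page: object, revisions: List["Rev"]) -> T:
        raise NotImplementedError


class Rev:
    """Hypothetical stand-in for mwxml.Revision."""

    def __init__(self, user_id: Optional[int]) -> None:
        self.user_id = user_id


# Optional[int] is evaluated at class-creation time, so it has to be valid on 3.9.
# Writing RevisionField[int | None] here raises TypeError on Python 3.9.
class RevisionEditorId(RevisionField[Optional[int]]):
    def extract(self, page: object, revisions: List[Rev]) -> Optional[int]:
        # Assumption: the editor id is taken from the last revision in the list.
        return revisions[-1].user_id
```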
@ -0,0 +29,4 @@
@abstractmethod
def extract(self, page: mwtypes.Page, revisions: list[mwxml.Revision]) -> T:
"""
:param page: The page for this set of revisions.
Let's note that when the collapse-revs behavior isn't enabled, we're only passing lists of one revision.
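To illustrate why `extract` always receives a list, here is a small sketch of a generator that yields singleton lists by default and longer lists when collapsing. It assumes that "collapse" means grouping consecutive revisions by the same editor, which this page does not confirm; `Rev` and `rev_lists` are hypothetical names.

```python
from itertools import groupby
from typing import Iterable, Iterator, List, NamedTuple


class Rev(NamedTuple):
    """Hypothetical stand-in for mwxml.Revision."""
    id: int
    user: str


def rev_lists(revisions: Iterable[Rev], collapse: bool) -> Iterator[List[Rev]]:
    if not collapse:
        # Default behavior: every revision arrives as a list of one.
        for rev in revisions:
            yield [rev]
        return
    # Assumed collapse rule: consecutive revisions by the same user form one list.
    for _, run in groupby(revisions, key=lambda r: r.user):
        yield list(run)
```

Under this assumption a column's `extract` can index `revisions[-1]` in both modes without checking which mode is active.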
@ -30,3 +33,3 @@
from dataclasses import dataclass
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.csv as pc
I usually use `pc` for `pyarrow.compute`. Maybe `pacsv` instead?
Thank you Will!