Refactor revision parsing logic to be columnar #1
Reference: collective/mediawiki_dump_tools#1
No description provided.
Use a configurable table of columns which encapsulate per-field logic.
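A minimal sketch of what a "configurable table of columns" could look like, to make the idea concrete. Everything here is hypothetical and stdlib-only: `Revision` stands in for `mwxml.Revision`, and `Column`, `RevId`, `TextChars` and `Table` are illustrative names, not the classes in this PR. Only the `add`/`pop` method names mirror calls that appear in the diff.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass
class Revision:
    """Hypothetical stand-in for mwxml.Revision."""
    id: int
    user_id: Optional[int]
    text: str


class Column(ABC):
    """One output column: a name plus the logic that computes its value."""
    name: str

    @abstractmethod
    def extract(self, revisions: List[Revision]) -> Any:
        ...


class RevId(Column):
    name = "revid"

    def extract(self, revisions: List[Revision]) -> int:
        # Assumption: the last revision in the list represents the group.
        return revisions[-1].id


class TextChars(Column):
    name = "text_chars"

    def extract(self, revisions: List[Revision]) -> int:
        return len(revisions[-1].text)


class Table:
    """Accumulates one value per column for every group of revisions added."""

    def __init__(self, columns: List[Column]) -> None:
        self.columns = columns
        self.data: Dict[str, List[Any]] = {c.name: [] for c in columns}

    def add(self, revisions: List[Revision]) -> None:
        for column in self.columns:
            self.data[column.name].append(column.extract(revisions))

    def pop(self) -> Dict[str, List[Any]]:
        # Hand back the accumulated columns and start a fresh buffer.
        out = self.data
        self.data = {c.name: [] for c in self.columns}
        return out
```

With this shape, choosing which fields to emit comes down to choosing which `Column` instances are passed to `Table`; the parsing loop itself does not change.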
@ -505,3 +360,3 @@
    # skip namespaces not in the filter
    if self.namespace_filter is not None:
        if namespace not in self.namespace_filter:
        if page.mwpage.namespace not in self.namespace_filter:

Curious why this is page.mwpage.namespace now instead of the old logic.
This keeps all data about the page in one object rather than split across both the "page" and "mwpage" fields. All of the columnar functions can then share an interface that requires minimal information, and there is no longer any need to propagate information from mwpage to page.

@ -616,0 +398,4 @@
    regex_matches[k] = []
    regex_matches[k].append(v)
    buffer = table.pop()

Can we give buffer a more descriptive name?

Done!

@ -545,3 +382,1 @@
    if not rev.text:
        rev.text = ""
    # if text exists, we'll check for a sha1 and generate one otherwise
    revs[-1] = rev

This logic does two things: (1) replaces None with "", and (2) fixes some edits from Fandom that didn't come with sha1s. Could we move this to a function so it looks like revs = repair_revs(revs)? I'd like the mutation of revs to be super clear.

Done!
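For illustration, a rough sketch of the kind of repair_revs helper requested above. This is not the code that landed in the PR: `Revision` is a hypothetical stand-in for `mwxml.Revision`, the sketch repairs every revision in the list (the original hunk may only touch the last one), and it assumes a hex SHA-1 digest of the UTF-8 text is an acceptable replacement for a missing sha1.

```python
import hashlib
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Revision:
    """Hypothetical stand-in for mwxml.Revision."""
    text: Optional[str]
    sha1: Optional[str] = None


def repair_revs(revs: List[Revision]) -> List[Revision]:
    """Return repaired copies: None text becomes "", missing sha1 is generated."""
    repaired = []
    for rev in revs:
        text = rev.text if rev.text is not None else ""
        # Assumption: a hex digest of the UTF-8 bytes stands in for the missing sha1.
        sha1 = rev.sha1 or hashlib.sha1(text.encode("utf8")).hexdigest()
        repaired.append(Revision(text=text, sha1=sha1))
    return repaired
```

Returning a new list instead of mutating in place keeps the call site as `revs = repair_revs(revs)`, so the point where revisions change is visible at a glance.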

@ -572,3 +384,1 @@
    # wrap user-defined editors in quotes for fread
    rev_data.editor = rev.user.text
    rev_data.anon = rev.user.id is None
    table.add(page.mwpage, list(revs))

Don't think it's necessary to call list(revs) here.

@ -90,0 +88,4 @@
    self.__revisions: Generator[list[mwxml.Revision]] = self.rev_list()

    @staticmethod
    def user_text(rev) -> str | None:

Love that you've added types where you have so far. It's kind of awkward to use types inconsistently, though. Let's use them everywhere eventually.

@ -0,0 +92,4 @@
    class RevisionEditorId(RevisionField[int | None]):
        field = pa.field("editorid", pa.int64(), nullable=True)

        def extract(self, page: mwtypes.Page, revisions: list[mwxml.Revision]) -> int | None:

I want to support Python 3.9, as that's what is installed on TACC. I don't think it supports this syntax for Union types.
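The `X | Y` union syntax was added in Python 3.10 (PEP 604), so `RevisionField[int | None]` in the class bases raises a TypeError on 3.9, and `from __future__ import annotations` does not help there because base classes are evaluated at runtime. One possible 3.9-compatible spelling uses `typing.Optional`. The sketch below is simplified and hypothetical: it drops the `page` argument and the pyarrow field, and uses plain dicts in place of `mwxml.Revision`.

```python
from abc import ABC, abstractmethod
from typing import Generic, List, Optional, TypeVar

T = TypeVar("T")


class RevisionField(ABC, Generic[T]):
    """Simplified stand-in for the PR's RevisionField base class."""

    @abstractmethod
    def extract(self, revisions: List[dict]) -> T:
        ...


class RevisionEditorId(RevisionField[Optional[int]]):
    # Optional[int] is equivalent to int | None and works on Python 3.9.
    def extract(self, revisions: List[dict]) -> Optional[int]:
        # Assumption: each revision is a dict with an optional "editor_id" key.
        return revisions[-1].get("editor_id")
```

Built-in generics such as `list[mwxml.Revision]` are fine on 3.9 (PEP 585), so only the `|` unions would need to change.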

@ -0,0 +29,4 @@
    @abstractmethod
    def extract(self, page: mwtypes.Page, revisions: list[mwxml.Revision]) -> T:
        """
        :param page: The page for this set of revisions.

Let's note that when the collapse-revs behavior isn't enabled, we're only passing lists of one revision.

@ -30,3 +33,3 @@
    from dataclasses import dataclass
    import pyarrow as pa
    import pyarrow.parquet as pq
    import pyarrow.csv as pc

I usually use pc for pyarrow.compute. Maybe pacsv instead?

Thank you, Will!