Commit Graph

100 Commits

Author SHA1 Message Date
groceryheist
ff96d52cb9 Merge pull request #12 from makoshark/master
substantial changes to wikipedia fetching code
2020-04-01 16:36:56 -07:00
ff5521d44b ignore __pycache__ 2020-04-01 18:23:50 -05:00
b26a2b5a86 fix bug in previous commit
forgot to import digobs module in the scraper script
2020-04-01 18:22:36 -05:00
427eddd141 cleaned up unnecessary files 2020-04-01 18:21:41 -05:00
b457cd726b use the type= feature in argparse
- integrated the type= feature in argparse in all three scripts
- removed some redundant code from the third file
2020-04-01 18:13:02 -05:00
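The argparse type= feature mentioned in this commit can be sketched as follows. This is an illustrative example only; the argument names and date format here are hypothetical, not necessarily the ones used in the actual scripts.

```python
import argparse
from datetime import datetime

def parse_date(s):
    # Passing a callable as type= lets argparse convert and
    # validate the raw string itself, failing with a usage error
    # instead of crashing later in the script.
    return datetime.strptime(s, "%Y%m%d")

parser = argparse.ArgumentParser()
parser.add_argument("--date", type=parse_date,
                    help="query date in YYYYMMDD format")
parser.add_argument("--limit", type=int, default=100,
                    help="maximum number of articles to fetch")

args = parser.parse_args(["--date", "20200401", "--limit", "10"])
```

Using type=int (or a custom parser function) removes the manual int()/strptime() conversion code that would otherwise be repeated in each script.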
17c3f75389 Merge branch 'master' of github.com:CommunityDataScienceCollective/COVID-19_Digital_Observatory 2020-04-01 17:19:33 -05:00
070d23f718 changes in response to code review by nate
- moved some common functions into files
- other smaller changes
2020-04-01 17:16:34 -05:00
34f8b9a23e Merge pull request #14 from aaronshaw/aaronshaw-master
pointing at updated data url, adding explicit NA handling to factor, …
2020-04-01 16:58:02 -05:00
aaronshaw
282588772e pointing at updated data url, adding explicit NA handling to factor, cutting unnecessary call to ggplot2, and updated corresponding output from new data file. May not work while kibo urls are getting resolved 2020-04-01 16:52:22 -05:00
4fe5deb013 Merge branch 'master' of github.com:CommunityDataScienceCollective/COVID-19_Digital_Observatory 2020-04-01 16:42:16 -05:00
d655e1ce93 tweaks to revision export code
- flags were not being exported (e.g., minor, anon)
- broke with hidden/deleted user names
2020-04-01 16:39:53 -05:00
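The two problems this commit describes, flags like minor and anon not being exported, and breakage on hidden or deleted user names, both stem from how the MediaWiki API represents revisions: boolean flags appear as presence-only keys, and the user key is absent entirely when the name is hidden. A defensive sketch (illustrative only, not the repository's actual export code):

```python
def export_revision(rev):
    """Flatten one revision dict (as returned by the MediaWiki API)
    into a row, tolerating hidden/deleted user names."""
    return {
        "revid": rev.get("revid"),
        # Flags such as 'minor' and 'anon' are presence-only keys:
        # they appear (with an empty value) when set, absent otherwise.
        "minor": "minor" in rev,
        "anon": "anon" in rev,
        # 'user' is missing entirely when the name is hidden or deleted,
        # so .get() avoids a KeyError.
        "user": rev.get("user"),
        "userhidden": "userhidden" in rev,
    }

row = export_revision({"revid": 1, "minor": "", "userhidden": ""})
```

The key design point is to never index rev["user"] or rev["minor"] directly, since either key may legitimately be absent.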
3f19805d36 fix bug in rev scraper script
The bug was a break statement, added for debugging, that caused the
script to process only the first article.
2020-04-01 15:49:28 -05:00
95d37cff7a change copy to move in cron scripts 2020-04-01 15:49:02 -05:00
5739d1c404 Merge branch 'master' of github.com:makoshark/COVID-19_Digital_Observatory 2020-04-01 15:18:50 -05:00
141871eda6 add two small shellscripts for automation
- Added two bash scripts usable as cronjobs to automate the production
  of revisions and view data.

These commands automate the process of running the code and copying material.
2020-04-01 15:15:11 -05:00
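The automation this commit describes, run an export script, then move its output into place, can be sketched in Python (the actual scripts are bash; the script and path names here are made up for illustration):

```python
import shutil
import subprocess
import sys
from datetime import date
from pathlib import Path

def run_export(script, outdir):
    """Run one fetch script and move its dated output file into outdir.

    Script name, output filename pattern, and directory layout are all
    hypothetical stand-ins for the real cron job's values."""
    today = date.today().strftime("%Y%m%d")
    outfile = Path(f"digobs_{today}.json")  # file the script writes
    subprocess.run([sys.executable, script, "--output", str(outfile)],
                   check=True)
    outdir = Path(outdir)
    outdir.mkdir(parents=True, exist_ok=True)
    # Move (not copy) so a rerun on the same day does not leave a
    # stale duplicate behind -- see the "change copy to move" commit.
    shutil.move(str(outfile), str(outdir / outfile.name))
```

Wrapped in a cron entry, a pair of calls like this (one for revisions, one for view data) reproduces the two-script setup described above.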
04e00f363b address confusion with date
The timestamps in files should be the day that the exports are done. For
the view data, the query date needs to be the day before but this
shouldn't be the timestamp we use in files, etc.
2020-04-01 15:14:05 -05:00
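The convention this commit settles on, file timestamps use the export day while the pageview query uses the previous day, amounts to a one-day offset. A minimal sketch (function name is illustrative):

```python
from datetime import date, timedelta

def export_and_query_dates(today=None):
    """Return (export_date, query_date) as YYYYMMDD strings.

    The export date stamps the output files; the query date, one day
    earlier, is what the pageview API is asked for, since view counts
    for a day are only complete after that day ends."""
    today = today or date.today()
    query = today - timedelta(days=1)
    return today.strftime("%Y%m%d"), query.strftime("%Y%m%d")

export_day, query_day = export_and_query_dates(date(2020, 4, 1))
```

Keeping the offset in one place avoids the confusion the commit describes, where the query date leaked into file timestamps.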
06d2fd1563 fix bugs with the date stamps 2020-04-01 10:47:33 -05:00
4f8a698c62 Merge pull request #11 from jdfoote/master
Adding a tidyverse example (with very verbose comments)
2020-04-01 10:41:02 -05:00
4e1b7fbdfe fixed typo in debug message 2020-04-01 08:18:05 -07:00
061105b7b4 Merge branch 'master' of github.com:CommunityDataScienceCollective/COVID-19_Digital_Observatory 2020-04-01 07:53:40 -07:00
268f9e1cf3 added gitignore for wikipedia/data directory 2020-04-01 07:52:15 -07:00
784458f206 renamed the wikipedia_views module to wikipedia 2020-04-01 07:51:20 -07:00
6493361fbd added initial version of revision-scraper
Borrows much of the structure from the (patched) version of the
dailyview scraper.
2020-04-01 07:42:38 -07:00
cb26ecabda fixed typo in description of view scraper 2020-04-01 07:42:24 -07:00
5c861cfca4 renamed daily views to make it clear that it's just enwiki 2020-04-01 07:29:45 -07:00
38fdd07b39 changes to a bunch of the wikipedia view code
- Renamed the articles.txt to something more specific

Changes to both scripts:

- Updated filenames to match the new standard
- Reworked the logging code so that it writes to stderr by
  default. Because we can only call logging.basicConfig() once, this
  ended up being a bigger change.
- Caused scripts to output the git commit with each export to track
  which code produced which dataset.
- Caused programs to take files instead of directories as
  output (allows us to run programs more than once a day).

Changes to the wikipedia_views/scripts/fetch_daily_views.py:

- Changed the output so that it emits a sequence of JSON dictionaries
  (one per line), per the standard we agreed to and which matches what
  Twitter, GitHub, and other dumps do. The previous behavior was to
  output a single JSON list object.
- A number of other small changes and tweaks throughout.
2020-04-01 07:15:12 -07:00
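Two of the conventions described in the commit above, logging to stderr via a single logging.basicConfig() call, and emitting one JSON dictionary per line rather than a single JSON list, can be sketched together. This is a minimal illustration under those two assumptions, not the scripts' actual code:

```python
import json
import logging
import sys

# basicConfig() only takes effect the first time it is called, so it
# must be configured once, up front, to send log output to stderr.
logging.basicConfig(stream=sys.stderr, level=logging.INFO,
                    format="%(asctime)s %(levelname)s %(message)s")

def write_json_lines(records, fh):
    """Write one JSON dictionary per line (the agreed-upon format,
    matching Twitter/GitHub dumps) instead of a single JSON list."""
    for record in records:
        print(json.dumps(record), file=fh)
        logging.info("wrote record for %s", record.get("article"))
```

Because each line is an independent JSON object, downstream consumers can stream the file line by line instead of parsing one large list into memory.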
Jeremy Foote
6b05896aa5 Adding a tidyverse example (with very verbose comments) 2020-03-31 22:42:31 -04:00
8bb3db8b46 add examples using the translations data 2020-03-31 16:56:59 -07:00
c8b886364f add documentation for the output files 2020-03-31 16:22:30 -07:00
29ae62c83e create 'latest.csv' to link to the most recent output. 2020-03-31 16:16:36 -07:00
687da1284f Merge branch 'master' of github.com:CommunityDataScienceCollective/COVID-19_Digital_Observatory 2020-03-31 16:01:43 -07:00
603a7b6ec3 update output 2020-03-31 16:01:38 -07:00
74667cf4dc use 'item' instead of 'entity' 2020-03-31 15:34:34 -07:00
3d142377ca rename compile script 2020-03-31 15:27:39 -07:00
55110c7f21 update compile script 2020-03-31 15:27:21 -07:00
4fd516a700 Improve README.md for keywords 2020-03-31 15:25:51 -07:00
98b07b8098 rename 'transliterations' to 'keywords' 2020-03-31 15:15:01 -07:00
20ad09d155 Update README.md
linking to project pages more fully
2020-03-31 17:09:58 -05:00
10a7d915a5 Merge pull request #10 from makoshark/master
stop writing header to one-column list
2020-03-31 12:23:36 -07:00
72bf7bcd37 stop writing header to one-column list
This feels like it's asking for trouble. A description of the contents
of the list is in the filename.
2020-03-31 08:35:23 -07:00
09d171608f reorganize file structure
- move 'input' files to resources
- outputs not meant for downstream go in output/intermediate
- csv outputs for downstream go in output/csv
2020-03-29 21:49:57 -07:00
Kaylea Champion
50f58a3887 migrating to new directory structure 2020-03-29 13:42:01 -05:00
a86c3a97ee Merge pull request #7 from kayleachampion/master
cleanup with merge
2020-03-29 11:39:32 -07:00
Kaylea Champion
317c32cdb5 all march data 2020-03-29 00:19:54 -07:00
Kaylea Champion
3bd1c684df adding a logs dir without adding my log files, assuming those
don't belong in the repo
2020-03-28 23:50:04 -07:00
Kaylea Champion
fa8e977741 new version of this from scrape. no double quotes around articles any
more
2020-03-28 23:47:55 -07:00
Kaylea Champion
4226b45b97 adds a scraper to update the articles file 2020-03-28 23:46:48 -07:00
Kaylea Champion
c7af46f8fb adds in new logging capability 2020-03-28 18:46:35 -07:00
05b8025e15 Merge pull request #9 from aaronshaw/master
minimal analysis example with pageview data
2020-03-28 20:42:40 -05:00
aaronshaw
5dfbe3dab4 minimal analysis example with pageview data 2020-03-28 20:33:23 -05:00