Commit Graph

80 Commits

Author SHA1 Message Date
Benjamin Mako Hill
061105b7b4 Merge branch 'master' of github.com:CommunityDataScienceCollective/COVID-19_Digital_Observatory 2020-04-01 07:53:40 -07:00
Benjamin Mako Hill
268f9e1cf3 added gitignore for wikipedia/data directory 2020-04-01 07:52:15 -07:00
Benjamin Mako Hill
784458f206 renamed the wikipedia_views module to wikipedia 2020-04-01 07:51:20 -07:00
Benjamin Mako Hill
6493361fbd added initial version of revision-scraper
Borrows much of the structure from the (patched) version of the
dailyview scraper.
2020-04-01 07:42:38 -07:00
Benjamin Mako Hill
cb26ecabda fixed typo in description of view scraper 2020-04-01 07:42:24 -07:00
Benjamin Mako Hill
5c861cfca4 renamed daily views to make it clear that it's just enwiki 2020-04-01 07:29:45 -07:00
Benjamin Mako Hill
38fdd07b39 changes to a bunch of the wikipedia view code
- Renamed the articles.txt to something more specific

Changes to both scripts:

- Updated filenames to match the new standard
- Reworked the logging code so that it can write to stderr by
  default. Because we can only call logging.basicConfig() once, this
  ended up being a bigger change.
- Caused scripts to output git commits and export to track which code
  produced which dataset.
- Caused programs to take files instead of directories as
  output (allows us to run programs more than once a day).
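
The logging change described above can be sketched roughly as follows. This is a minimal illustration, not the code from the commit: the function names, the `level` and `logfile` parameters, and the format string are all assumptions. The idea is to gather every option first so that `logging.basicConfig()` only needs to be called once, with stderr as the default destination.

```python
import logging
import sys


def build_logging_kwargs(level="info", logfile=None):
    """Collect all logging options up front (hypothetical helper)."""
    kwargs = {
        "level": getattr(logging, level.upper()),
        "format": "%(asctime)s %(levelname)s %(message)s",
    }
    if logfile is not None:
        kwargs["filename"] = logfile   # log to a file only when asked to
    else:
        kwargs["stream"] = sys.stderr  # default: write to stderr
    return kwargs


def configure_logging(level="info", logfile=None):
    # basicConfig() does nothing if the root logger already has handlers,
    # so every setting goes into this single call.
    logging.basicConfig(**build_logging_kwargs(level, logfile))
```

A script would call `configure_logging()` once, right after parsing its arguments and before emitting any log messages.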

Changes to the wikipedia_views/scripts/fetch_daily_views.py:

- Changed the output so that it is a sequence of JSON dictionaries (one
  per line), as per the standard we agreed to and which is what
  Twitter, GitHub, and other dumps do. The previous behavior was to
  output a single JSON list object.
- A number of other small changes and tweaks throughout.
2020-04-01 07:15:12 -07:00
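
The one-dictionary-per-line output format mentioned in the commit above might look like the sketch below. The helper names and the `article` and `views` fields are hypothetical and are not taken from fetch_daily_views.py; only the standard `json` module is used.

```python
import io
import json


def write_json_lines(records, out):
    """Write each record as one JSON dictionary per line."""
    for record in records:
        out.write(json.dumps(record) + "\n")


def read_json_lines(inp):
    """Read records back one line at a time, skipping blank lines."""
    return [json.loads(line) for line in inp if line.strip()]


# Example with made-up records, written to an in-memory buffer.
buf = io.StringIO()
write_json_lines(
    [{"article": "Example_A", "views": 10},
     {"article": "Example_B", "views": 20}],
    buf,
)
```

Unlike a single JSON list, a file in this format can be appended to and processed line by line without loading the whole dataset.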
8bb3db8b46 add examples using the translations data 2020-03-31 16:56:59 -07:00
c8b886364f add documentation for the output files 2020-03-31 16:22:30 -07:00
29ae62c83e create 'latest.csv' to link to the most recent output. 2020-03-31 16:16:36 -07:00
687da1284f Merge branch 'master' of github.com:CommunityDataScienceCollective/COVID-19_Digital_Observatory 2020-03-31 16:01:43 -07:00
603a7b6ec3 update output 2020-03-31 16:01:38 -07:00
74667cf4dc use 'item' instead of 'entity' 2020-03-31 15:34:34 -07:00
3d142377ca rename compile script 2020-03-31 15:27:39 -07:00
55110c7f21 update compile script 2020-03-31 15:27:21 -07:00
4fd516a700 Improve README.md for keywords 2020-03-31 15:25:51 -07:00
98b07b8098 rename 'transliterations' to 'keywords' 2020-03-31 15:15:01 -07:00
20ad09d155
Update README.md
linking to project pages more fully
2020-03-31 17:09:58 -05:00
10a7d915a5
Merge pull request #10 from makoshark/master
stop writing header to one-column list
2020-03-31 12:23:36 -07:00
Benjamin Mako Hill
72bf7bcd37 stop writing header to one-column list
This feels like it's asking for trouble. Description of the contents
of the list is in the filename.
2020-03-31 08:35:23 -07:00
09d171608f reorganize file structure
- move 'input' files to resources
- outputs not meant for downstream go in output/intermediate
- csv outputs for downstream go in output/csv
2020-03-29 21:49:57 -07:00
Kaylea Champion
50f58a3887 migrating to new directory structure 2020-03-29 13:42:01 -05:00
a86c3a97ee
Merge pull request #7 from kayleachampion/master
cleanup with merge
2020-03-29 11:39:32 -07:00
Kaylea Champion
317c32cdb5 all march data 2020-03-29 00:19:54 -07:00
Kaylea Champion
3bd1c684df adding a logs dir without adding my log files, assuming those don't
belong in repo
2020-03-28 23:50:04 -07:00
Kaylea Champion
fa8e977741 new version of this from scrape. no double quotes around articles any
more
2020-03-28 23:47:55 -07:00
Kaylea Champion
4226b45b97 adds a scraper to update the articles file 2020-03-28 23:46:48 -07:00
Kaylea Champion
c7af46f8fb adds in new logging capability 2020-03-28 18:46:35 -07:00
05b8025e15
Merge pull request #9 from aaronshaw/master
minimal analysis example with pageview data
2020-03-28 20:42:40 -05:00
aaronshaw
5dfbe3dab4 minimal analysis example with pageview data 2020-03-28 20:33:23 -05:00
c0e50fe297
Merge pull request #8 from aaronshaw/master
Update to load data from github url and include 3/28 data in output
2020-03-28 17:38:20 -05:00
aaronshaw
1f5b15f099 regenerated following update to R src that creates this file 2020-03-28 17:31:36 -05:00
aaronshaw
9e0c92242e Loading data directly from github URL. Commenting out commands that assume cloned repository. 2020-03-28 17:30:37 -05:00
Kaylea Champion
7b3062ffb1 Merge branch 'master' of https://github.com/CommunityDataScienceCollective/COVID-19_Digital_Observatory 2020-03-28 14:46:00 -07:00
033149776c
Merge pull request #5 from kayleachampion/master
view data
2020-03-28 14:17:21 -07:00
dd7d968bb6
Merge pull request #1 from CommunityDataScienceCollective/kaylea/master
Some suggested changes.
2020-03-28 14:15:53 -07:00
c690df4852 Merge branch 'kaylea/master' of github.com:CommunityDataScienceCollective/COVID-19_Digital_Observatory into kaylea/master 2020-03-28 14:13:46 -07:00
f5ac92330c Merge branch 'kaylea/master' of github.com:CommunityDataScienceCollective/COVID-19_Digital_Observatory into kaylea/master 2020-03-28 14:13:26 -07:00
1b2bb7d1df Merge branch 'kaylea/master' of github.com:CommunityDataScienceCollective/COVID-19_Digital_Observatory into kaylea/master 2020-03-28 14:12:36 -07:00
ee91df4c04 Read the whole input file before making api calls 2020-03-28 14:12:17 -07:00
24e5590836 Read the whole input file before making api calls 2020-03-28 14:09:28 -07:00
groceryheist
0fb8ac2ed9
Merge pull request #4 from CommunityDataScienceCollective/translations
Transliterations: Use data from google trends and wikidata to find transliterations.
2020-03-28 14:07:04 -07:00
2b56ed26f4 Update transliteration results for 2020-03-28
- renamed results from yesterday into time stamped file
2020-03-28 14:03:16 -07:00
207b1f8b95 Read entire input files before making api calls.
This is nicer style to not hold onto resources for as long.
It will use a bit more memory.
2020-03-28 13:55:52 -07:00
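
The pattern described in the commit above, reading the entire input before making any API calls, could be sketched like this. Both function names and the `fetch` callable are assumptions for illustration; the real scripts and the API they call are not shown in this log.

```python
def load_terms(path):
    """Read every non-empty line, then close the file before any API call."""
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]


def fetch_all(terms, fetch):
    """Call the (hypothetical) API function once per term.

    The input file is already closed by the time this runs, so no file
    handle is held open during slow network requests.
    """
    return {term: fetch(term) for term in terms}
```

The trade-off is the one the commit message names: the full list of terms sits in memory, in exchange for releasing the file handle early.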
282208507a Keep better track of time.
- Add timestamp to transliterations output file.
- Append wikidata search terms instead of overwriting
2020-03-28 13:52:54 -07:00
Kaylea Champion
ed0641ecc7 Merge branch 'master' of https://github.com/CommunityDataScienceCollective/COVID-19_Digital_Observatory
updates my branch with all the master changes so far
2020-03-28 12:21:37 -07:00
Kaylea Champion
cd08294288 trialing new approach 2020-03-28 12:18:01 -07:00
Kaylea Champion
c677d8d70a trialing new approach 2020-03-28 12:17:45 -07:00
e720653a23 typo fix 2020-03-28 10:01:43 -07:00
a9f129f1d6 Merge branch 'translations' of github.com:CommunityDataScienceCollective/COVID-19_Digital_Observatory into translations 2020-03-28 09:58:43 -07:00