Merge branch 'master' of github.com:CommunityDataScienceCollective/COVID-19_Digital_Observatory
This commit is contained in:
commit
061105b7b4
@ -1,8 +1,7 @@
|
||||
# COVID-19 Digital Observatory
|
||||
The COVID-19 Digital Observatory collects, aggregates, and distributes data from social media, search engine results, and Wikipedia to support immediate public health response and social and data science research related to the pandemic.
|
||||
The [COVID-19 Digital Observatory](https://covid19.communitydata.science "Covid-19 Digital Observatory homepage") collects, aggregates, and distributes data from social media, search engine results, and Wikipedia to support immediate public health response and social and data science research related to the pandemic.
|
||||
|
||||
The [community data science collective](https://wiki.communitydata.science/Main_Page "The community data science collective wiki") is in the early stages of building this project. We expect to make rapid progress and to begin releasing code and data soon.
|
||||
The [community data science collective](https://wiki.communitydata.science/Main_Page "The community data science collective wiki") is working with [Pushshift](https://pushshift.io) and others to build this project. We expect to make rapid progress and to release additional code and data soon.
|
||||
|
||||
We eagerly welcome contributors! Please get in touch.
|
||||
Contributors are held to the [code of conduct](code_of_conduct.md "link to code of conduct.md").
|
||||
We eagerly welcome contributors! Please get in touch, submit pull requests, and visit the [project homepage](https://covid19.communitydata.science "Covid-19 Digital Observatory homepage") for more info. Also, please note that contributors are held to the [code of conduct](code_of_conduct.md "link to code of conduct.md").
|
||||
|
||||
|
14
keywords/README.md
Normal file
@ -0,0 +1,14 @@
|
||||
# Keywords
|
||||
|
||||
This code finds trending web searches related to the COVID-19 pandemic using Google Trends (`collect_trends.py`). It then searches for relevant keywords on Wikidata (`wikidata_search.py`) in order to find high-quality translations of important words and phrases (`wikidata_translations.py`). The goal is to support efforts to expand the Observatory to information in many languages beyond English.

We search the Wikidata API for entities in `src/wikidata_search.py` and then make simple SPARQL queries in `src/wikidata_translations.py` to collect labels and aliases for the entities. The labels come with language metadata. This seems to provide a decent initial list of relevant terms across multiple languages.
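
As a rough illustration of the label-and-alias lookup described above, the sketch below builds a SPARQL query string for a list of Wikidata item ids. It is not the code in `src/wikidata_translations.py`: the helper name `build_label_query` and the exact query shape are assumptions made for this example. `rdfs:label` and `skos:altLabel` are the predicates Wikidata uses for labels and aliases, and `LANG(?label)` exposes the language metadata that comes with each label.

```python
def build_label_query(item_ids):
    """Return a SPARQL query selecting labels and aliases for the given items.

    Hypothetical helper for illustration only; item_ids are bare ids
    such as "Q57751738".
    """
    # VALUES restricts ?item to the entities we already found via search.
    values = " ".join(f"wd:{item_id}" for item_id in item_ids)
    return f"""
SELECT ?item ?label (LANG(?label) AS ?langcode) ?isAlt WHERE {{
  VALUES ?item {{ {values} }}
  {{ ?item rdfs:label ?label . BIND(false AS ?isAlt) }}
  UNION
  {{ ?item skos:altLabel ?label . BIND(true AS ?isAlt) }}
}}
""".strip()


query = build_label_query(["Q57751738"])
print(query)
```

The resulting string could be sent to a SPARQL endpoint such as the Wikidata Query Service; that step is left out here so the sketch runs without network access.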
|
||||
|
||||
The output data lives at [covid19.communitydata.science](https://covid19.communitydata.science/datasets/keywords).
|
||||
|
||||
The output files have 4 columns:

- `itemid` links to the Wikidata entity
- `label` is the translation of the relevant keyword
- `langcode` is the [ISO 639-1](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes) code corresponding to the language of the label.
- `is_alt` indicates whether the label is an [alias](https://www.wikidata.org/wiki/Help:Aliases).
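
To make the four-column layout concrete, here is a minimal sketch that groups labels by language using only the Python standard library. The rows in `SAMPLE` are made up for illustration and are not taken from the published dataset; the helper `labels_by_language` is likewise hypothetical, and the sketch assumes `is_alt` is stored as the text `True` or `False`.

```python
import csv
import io
from collections import defaultdict

# Made-up sample rows in the documented four-column layout (not real output data).
SAMPLE = """itemid,label,langcode,is_alt
http://www.wikidata.org/entity/Q57751738,coronavirus,en,False
http://www.wikidata.org/entity/Q57751738,Coronavirus,de,False
http://www.wikidata.org/entity/Q57751738,CoV,en,True
"""


def labels_by_language(csv_text, include_aliases=False):
    """Group labels by langcode, optionally keeping alias rows."""
    grouped = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Assumption: aliases are flagged with the literal string "True".
        is_alias = row["is_alt"] == "True"
        if is_alias and not include_aliases:
            continue
        grouped[row["langcode"]].append(row["label"])
    return dict(grouped)


primary = labels_by_language(SAMPLE)
print(primary)
```

Passing `include_aliases=True` keeps the alias rows as well, which may be useful when building broad keyword lists.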
|
17
keywords/analysis/translations_example.R
Normal file
@ -0,0 +1,17 @@
|
||||
## example reading latest file straight from the server
df <- read.csv("https://covid19.communitydata.science/datasets/keywords/csv/latest.csv")

## make the data more R-friendly (as.logical() also works if read.csv already parsed the column as logical)
df$is.alt <- as.logical(df$is_alt)
df$is_alt <- NULL

## find all translations for coronavirus
coronavirus.itemids <- df[ (tolower(df$label) == "coronavirus") &
                           (df$langcode == 'en')
                          ,"itemid"]

## there are actually 5 item ids. The one referring to the family of viruses is Q57751738
coronavirus.translations <- df[df$itemid == "http://www.wikidata.org/entity/Q57751738",]

## let's only look at non-aliases
print(coronavirus.translations[!coronavirus.translations$is.alt, c("label","langcode")])
|
13
keywords/analysis/translations_example.py
Normal file
@ -0,0 +1,13 @@
|
||||
import pandas as pd

# read the latest dataset
df = pd.read_csv("https://covid19.communitydata.science/datasets/keywords/csv/latest.csv")

# find the item ids whose English label is "coronavirus"
coronavirus_itemids = df.loc[(df.label.str.lower() == "coronavirus") & (df.langcode == "en"), "itemid"]

# there are actually 5 item ids. The one referring to the family of viruses is Q57751738
coronavirus_translations = df.loc[df.itemid == "http://www.wikidata.org/entity/Q57751738"]

# let's only look at unique, non-aliases
print(coronavirus_translations.loc[coronavirus_translations.is_alt == False, ['label','langcode']].drop_duplicates())
|
Can't render this file because it is too large.
|
12536
keywords/output/csv/2020-03-31_wikidata_item_labels.csv
Normal file
File diff suppressed because it is too large
13758
keywords/output/csv/2020-03-31_wikidata_item_labels_1.csv
Normal file
File diff suppressed because it is too large
1
keywords/output/csv/latest.csv
Symbolic link
@ -0,0 +1 @@
|
||||
2020-03-31_wikidata_item_labels_1.csv
|
|
@ -75,3 +75,37 @@ date,term,top
|
||||
2020-03-29,Matthew Faber,17
|
||||
2020-03-29,Dave Grohl,18
|
||||
2020-03-29,Daisy Keech,19
|
||||
date,term,top
|
||||
2020-03-31,Sophie Brussaux,0
|
||||
2020-03-31,Virginia stay-at-home order,1
|
||||
2020-03-31,Furlough,2
|
||||
2020-03-31,SBA disaster loans,3
|
||||
2020-03-31,Schoology,4
|
||||
2020-03-31,MyPillow,5
|
||||
2020-03-31,John Krasinski,6
|
||||
2020-03-31,Modern Warfare 2 Remastered,7
|
||||
2020-03-31,Michigan unemployment,8
|
||||
2020-03-31,Tomie dePaola,9
|
||||
2020-03-31,The Good Doctor,10
|
||||
2020-03-31,Arizona news,11
|
||||
2020-03-31,Instacart strike,12
|
||||
2020-03-31,USNS Comfort,13
|
||||
2020-03-31,Crocs,14
|
||||
2020-03-31,"Texas, unemployment",15
|
||||
2020-03-31,Unemployment NY,16
|
||||
2020-03-31,Arizona stay-at-home order,17
|
||||
2020-03-31,N.C unemployment,18
|
||||
2020-03-31,Unemployment benefits,19
|
||||
date,term,top
|
||||
2020-03-31,Chris Cuomo,0
|
||||
2020-03-31,Andrew Jack,1
|
||||
2020-03-31,David Geffen,2
|
||||
2020-03-31,April fools' pranks,3
|
||||
2020-03-31,Taco Bell free taco,4
|
||||
2020-03-31,Face mask for sale,5
|
||||
2020-03-31,Seoul,6
|
||||
2020-03-31,Myron Rolle,7
|
||||
2020-03-31,DaBaby,8
|
||||
2020-03-31,Carole Baskin's husband,9
|
||||
2020-03-31,Everlane,10
|
||||
2020-03-31,Governor Abbott,11
|
|
@ -476,3 +476,253 @@ coronavirus pandemic covid-19 live world map/count,42200,covid-19 pandemic,2020-
|
||||
world health organization pandemic,39050,covid-19 pandemic,2020-03-29
|
||||
coronavirus update,35750,covid-19 pandemic,2020-03-29
|
||||
covid-19 pandemic unemployment payment,29400,covid-19 pandemic,2020-03-29
|
||||
covid 19,78400,coronavirus,2020-03-31
|
||||
coronavirus mapa,54500,coronavirus,2020-03-31
|
||||
trump coronavirus,51600,coronavirus,2020-03-31
|
||||
wuhan coronavirus,45600,coronavirus,2020-03-31
|
||||
coronavirus worldometer,42500,coronavirus,2020-03-31
|
||||
worldometer coronavirus,42500,coronavirus,2020-03-31
|
||||
nyc coronavirus,38850,coronavirus,2020-03-31
|
||||
coronavirus ultime notizie,37400,coronavirus,2020-03-31
|
||||
coronavirus ohio,32450,coronavirus,2020-03-31
|
||||
coronavirus update india,30200,coronavirus,2020-03-31
|
||||
coronavirus uk cases,28900,coronavirus,2020-03-31
|
||||
coronavirus no brasil,27900,coronavirus,2020-03-31
|
||||
india coronavirus cases,27900,coronavirus,2020-03-31
|
||||
covid-19,26950,coronavirus,2020-03-31
|
||||
coronavirus live update,25000,coronavirus,2020-03-31
|
||||
spain coronavirus,24200,coronavirus,2020-03-31
|
||||
coronavirus live map,23950,coronavirus,2020-03-31
|
||||
coronavirus map live,23600,coronavirus,2020-03-31
|
||||
coronavirus meme,22900,coronavirus,2020-03-31
|
||||
coronavirus update uk,20850,coronavirus,2020-03-31
|
||||
coronavirus news india,20750,coronavirus,2020-03-31
|
||||
decreto coronavirus,20650,coronavirus,2020-03-31
|
||||
coronavirus italien,20150,coronavirus,2020-03-31
|
||||
coronavirus cases in india,19850,coronavirus,2020-03-31
|
||||
morti coronavirus,18850,coronavirus,2020-03-31
|
||||
coronavirus,1063800,covid-19,2020-03-31
|
||||
coronavirus covid-19,1047900,covid-19,2020-03-31
|
||||
covid,637400,covid-19,2020-03-31
|
||||
covid-19 cases,573500,covid-19,2020-03-31
|
||||
covid 19,461000,covid-19,2020-03-31
|
||||
covid-19 virus,367300,covid-19,2020-03-31
|
||||
corona,354700,covid-19,2020-03-31
|
||||
covid-19 map,272800,covid-19,2020-03-31
|
||||
covid-19 symptoms,240850,covid-19,2020-03-31
|
||||
covid-19 updates,233650,covid-19,2020-03-31
|
||||
covid-19 news,225350,covid-19,2020-03-31
|
||||
covid-19 live,201250,covid-19,2020-03-31
|
||||
covid-19 update,199350,covid-19,2020-03-31
|
||||
corona virus,184050,covid-19,2020-03-31
|
||||
us covid-19,172900,covid-19,2020-03-31
|
||||
covid-19 canada,162200,covid-19,2020-03-31
|
||||
what is covid-19,161350,covid-19,2020-03-31
|
||||
covid-19 who,160200,covid-19,2020-03-31
|
||||
who,158300,covid-19,2020-03-31
|
||||
covid-19 world,135350,covid-19,2020-03-31
|
||||
covid-19 test,131350,covid-19,2020-03-31
|
||||
usa covid-19,122850,covid-19,2020-03-31
|
||||
covid-19 cdc,119600,covid-19,2020-03-31
|
||||
cdc,117850,covid-19,2020-03-31
|
||||
covid-19 china,117250,covid-19,2020-03-31
|
||||
covid,1474550,covid19,2020-03-31
|
||||
covid 19,1242600,covid19,2020-03-31
|
||||
coronavirus covid19,880550,covid19,2020-03-31
|
||||
coronavirus,865800,covid19,2020-03-31
|
||||
covid19 cases,539150,covid19,2020-03-31
|
||||
corona,401550,covid19,2020-03-31
|
||||
corona covid19,378950,covid19,2020-03-31
|
||||
virus covid19,362050,covid19,2020-03-31
|
||||
covid19 news,267250,covid19,2020-03-31
|
||||
covid19 update,239700,covid19,2020-03-31
|
||||
map covid19,213600,covid19,2020-03-31
|
||||
covid19 symptoms,211000,covid19,2020-03-31
|
||||
corona virus,199350,covid19,2020-03-31
|
||||
corona virus covid19,196750,covid19,2020-03-31
|
||||
covid19 us,178050,covid19,2020-03-31
|
||||
covid19 who,169050,covid19,2020-03-31
|
||||
who,162400,covid19,2020-03-31
|
||||
what is covid19,162250,covid19,2020-03-31
|
||||
test covid19,155900,covid19,2020-03-31
|
||||
covid19 canada,144900,covid19,2020-03-31
|
||||
italy,142650,covid19,2020-03-31
|
||||
covid19 italy,141050,covid19,2020-03-31
|
||||
covid19 india,140500,covid19,2020-03-31
|
||||
china covid19,136900,covid19,2020-03-31
|
||||
covid19 china,136800,covid19,2020-03-31
|
||||
coronavirus sars-cov-2,778000,sars-cov-2,2020-03-31
|
||||
coronavirus,732150,sars-cov-2,2020-03-31
|
||||
covid-19 sars-cov-2,622800,sars-cov-2,2020-03-31
|
||||
covid-19,558700,sars-cov-2,2020-03-31
|
||||
covid,392100,sars-cov-2,2020-03-31
|
||||
sars,359450,sars-cov-2,2020-03-31
|
||||
virus sars-cov-2,325750,sars-cov-2,2020-03-31
|
||||
covid 19,302350,sars-cov-2,2020-03-31
|
||||
corona,279700,sars-cov-2,2020-03-31
|
||||
corona virus,147050,sars-cov-2,2020-03-31
|
||||
sars-cov-2 vs covid-19,142250,sars-cov-2,2020-03-31
|
||||
what is sars-cov-2,105850,sars-cov-2,2020-03-31
|
||||
sars-cov,86500,sars-cov-2,2020-03-31
|
||||
sars cov 2,80950,sars-cov-2,2020-03-31
|
||||
sars-cov-2 wiki,75050,sars-cov-2,2020-03-31
|
||||
covid19,74900,sars-cov-2,2020-03-31
|
||||
koronawirus,66600,sars-cov-2,2020-03-31
|
||||
cdc,63600,sars-cov-2,2020-03-31
|
||||
sars-cov-2 rna,40300,sars-cov-2,2020-03-31
|
||||
the proximal origin of sars-cov-2,34250,sars-cov-2,2020-03-31
|
||||
covid-19 or sars-cov-2,31550,sars-cov-2,2020-03-31
|
||||
sars-cov-2 vs cod-19,31400,sars-cov-2,2020-03-31
|
||||
pubmed,25750,sars-cov-2,2020-03-31
|
||||
sars-cov-1,25600,sars-cov-2,2020-03-31
|
||||
sars-cov-2 cell entry depends on ace2 and tmprss2 and is blocked by a clinically proven protease inhibitor,22300,sars-cov-2,2020-03-31
|
||||
coronavirus,765800,covid-19 pandemic,2020-03-31
|
||||
covid-19 coronavirus pandemic,741400,covid-19 pandemic,2020-03-31
|
||||
coronavirus pandemic,741000,covid-19 pandemic,2020-03-31
|
||||
who pandemic,313750,covid-19 pandemic,2020-03-31
|
||||
is covid-19 a pandemic,307600,covid-19 pandemic,2020-03-31
|
||||
who covid-19 pandemic,306500,covid-19 pandemic,2020-03-31
|
||||
covid 19,247450,covid-19 pandemic,2020-03-31
|
||||
covid 19 pandemic,224900,covid-19 pandemic,2020-03-31
|
||||
epidemic,183750,covid-19 pandemic,2020-03-31
|
||||
pandemic meaning,159450,covid-19 pandemic,2020-03-31
|
||||
what is a pandemic,133300,covid-19 pandemic,2020-03-31
|
||||
pandemic definition,120900,covid-19 pandemic,2020-03-31
|
||||
covid-19 symptoms,83100,covid-19 pandemic,2020-03-31
|
||||
covid-19 pandemic plan,77650,covid-19 pandemic,2020-03-31
|
||||
pandemic define,64800,covid-19 pandemic,2020-03-31
|
||||
covid-19 live world map/count,57250,covid-19 pandemic,2020-03-31
|
||||
coronavirus pandemic covid-19 live world map/count,56900,covid-19 pandemic,2020-03-31
|
||||
covid19,54950,covid-19 pandemic,2020-03-31
|
||||
covid-19 pandemic unemployment payment,51950,covid-19 pandemic,2020-03-31
|
||||
covid-19 updates,46000,covid-19 pandemic,2020-03-31
|
||||
pandemic vs endemic,40700,covid-19 pandemic,2020-03-31
|
||||
coronavirus update,36850,covid-19 pandemic,2020-03-31
|
||||
when was the last pandemic,26050,covid-19 pandemic,2020-03-31
|
||||
who declared covid-19 pandemic,23050,covid-19 pandemic,2020-03-31
|
||||
when was covid-19 declared a pandemic,8400,covid-19 pandemic,2020-03-31
|
||||
covid 19,80900,coronavirus,2020-03-31
|
||||
trump coronavirus,50550,coronavirus,2020-03-31
|
||||
coronavirus numbers,46250,coronavirus,2020-03-31
|
||||
coronavirus worldometer,45100,coronavirus,2020-03-31
|
||||
wuhan coronavirus,44200,coronavirus,2020-03-31
|
||||
us coronavirus cases,37850,coronavirus,2020-03-31
|
||||
coronavirus in italy,36500,coronavirus,2020-03-31
|
||||
coronavirus ultime notizie,35050,coronavirus,2020-03-31
|
||||
coronavirus update india,33500,coronavirus,2020-03-31
|
||||
uk coronavirus cases,29050,coronavirus,2020-03-31
|
||||
covid-19,27250,coronavirus,2020-03-31
|
||||
coronavirus update live,26850,coronavirus,2020-03-31
|
||||
coronavirus spain,24550,coronavirus,2020-03-31
|
||||
coronavirus map live,24050,coronavirus,2020-03-31
|
||||
coronavirus meme,23600,coronavirus,2020-03-31
|
||||
coronavirus lombardia,22100,coronavirus,2020-03-31
|
||||
coronavirus romania,21800,coronavirus,2020-03-31
|
||||
coronavirus uk update,21400,coronavirus,2020-03-31
|
||||
coronavirus cases in india,21300,coronavirus,2020-03-31
|
||||
decreto coronavirus,21150,coronavirus,2020-03-31
|
||||
coronavirus chine,20200,coronavirus,2020-03-31
|
||||
coronavirus italien,19650,coronavirus,2020-03-31
|
||||
coronavirus österreich,18550,coronavirus,2020-03-31
|
||||
coronavirus death rate,18100,coronavirus,2020-03-31
|
||||
john hopkins coronavirus map,18000,coronavirus,2020-03-31
|
||||
coronavirus,1024700,covid-19,2020-03-31
|
||||
coronavirus covid-19,1024200,covid-19,2020-03-31
|
||||
covid,646350,covid-19,2020-03-31
|
||||
covid-19 cases,574950,covid-19,2020-03-31
|
||||
covid 19,460800,covid-19,2020-03-31
|
||||
covid-19 virus,360650,covid-19,2020-03-31
|
||||
corona,356450,covid-19,2020-03-31
|
||||
covid-19 map,270200,covid-19,2020-03-31
|
||||
covid-19 symptoms,255750,covid-19,2020-03-31
|
||||
covid-19 updates,222850,covid-19,2020-03-31
|
||||
covid-19 news,218900,covid-19,2020-03-31
|
||||
covid-19 live,204350,covid-19,2020-03-31
|
||||
covid-19 update,195050,covid-19,2020-03-31
|
||||
corona virus,180950,covid-19,2020-03-31
|
||||
covid-19 us,173600,covid-19,2020-03-31
|
||||
what is covid-19,165150,covid-19,2020-03-31
|
||||
covid-19 who,156700,covid-19,2020-03-31
|
||||
covid-19 canada,156650,covid-19,2020-03-31
|
||||
canada covid-19,154950,covid-19,2020-03-31
|
||||
who,154750,covid-19,2020-03-31
|
||||
covid-19 world,135550,covid-19,2020-03-31
|
||||
covid-19 test,130350,covid-19,2020-03-31
|
||||
china covid-19,120950,covid-19,2020-03-31
|
||||
covid-19 cdc,119900,covid-19,2020-03-31
|
||||
cdc,115500,covid-19,2020-03-31
|
||||
covid,1560950,covid19,2020-03-31
|
||||
covid 19,1341850,covid19,2020-03-31
|
||||
coronavirus,887550,covid19,2020-03-31
|
||||
coronavirus covid19,872500,covid19,2020-03-31
|
||||
covid19 cases,559500,covid19,2020-03-31
|
||||
corona covid19,395150,covid19,2020-03-31
|
||||
corona,383850,covid19,2020-03-31
|
||||
virus covid19,353200,covid19,2020-03-31
|
||||
covid19 news,274400,covid19,2020-03-31
|
||||
covid19 update,227950,covid19,2020-03-31
|
||||
covid19 map,216950,covid19,2020-03-31
|
||||
covid19 symptoms,203700,covid19,2020-03-31
|
||||
corona virus,190250,covid19,2020-03-31
|
||||
covid19 who,168600,covid19,2020-03-31
|
||||
who,167300,covid19,2020-03-31
|
||||
what is covid19,165600,covid19,2020-03-31
|
||||
us covid19,162000,covid19,2020-03-31
|
||||
canada covid19,156400,covid19,2020-03-31
|
||||
test covid19,154050,covid19,2020-03-31
|
||||
covid19 india,144700,covid19,2020-03-31
|
||||
china covid19,144650,covid19,2020-03-31
|
||||
covid19 italy,139850,covid19,2020-03-31
|
||||
italy,138350,covid19,2020-03-31
|
||||
covid19 world,137150,covid19,2020-03-31
|
||||
covid-19,132400,covid19,2020-03-31
|
||||
coronavirus sars-cov-2,755050,sars-cov-2,2020-03-31
|
||||
coronavirus,718050,sars-cov-2,2020-03-31
|
||||
covid-19,620250,sars-cov-2,2020-03-31
|
||||
sars-cov-2 covid-19,572350,sars-cov-2,2020-03-31
|
||||
covid,418700,sars-cov-2,2020-03-31
|
||||
sars-cov-2 virus,385000,sars-cov-2,2020-03-31
|
||||
sars,330000,sars-cov-2,2020-03-31
|
||||
covid 19,285750,sars-cov-2,2020-03-31
|
||||
corona,280100,sars-cov-2,2020-03-31
|
||||
sars-cov-2 covid 19,247900,sars-cov-2,2020-03-31
|
||||
sars-cov-2 vs covid-19,188550,sars-cov-2,2020-03-31
|
||||
corona virus,121600,sars-cov-2,2020-03-31
|
||||
what is sars-cov-2,109950,sars-cov-2,2020-03-31
|
||||
sars-cov-2 wiki,98100,sars-cov-2,2020-03-31
|
||||
covid19,83200,sars-cov-2,2020-03-31
|
||||
sars-cov,75050,sars-cov-2,2020-03-31
|
||||
sars cov 2,63750,sars-cov-2,2020-03-31
|
||||
sars-cov-2 origin,63700,sars-cov-2,2020-03-31
|
||||
sars-cov-2 rna,62850,sars-cov-2,2020-03-31
|
||||
cdc,57000,sars-cov-2,2020-03-31
|
||||
koronawirus,40550,sars-cov-2,2020-03-31
|
||||
sars-cov-2 genome,37550,sars-cov-2,2020-03-31
|
||||
pubmed,32000,sars-cov-2,2020-03-31
|
||||
the proximal origin of sars-cov-2,31800,sars-cov-2,2020-03-31
|
||||
sars-cov-2 vs cod-19,28700,sars-cov-2,2020-03-31
|
||||
coronavirus pandemic,738550,covid-19 pandemic,2020-03-31
|
||||
coronavirus,722750,covid-19 pandemic,2020-03-31
|
||||
covid-19 coronavirus pandemic,706650,covid-19 pandemic,2020-03-31
|
||||
is covid-19 a pandemic,301100,covid-19 pandemic,2020-03-31
|
||||
covid 19,270450,covid-19 pandemic,2020-03-31
|
||||
who,263600,covid-19 pandemic,2020-03-31
|
||||
who pandemic covid-19,262950,covid-19 pandemic,2020-03-31
|
||||
covid 19 pandemic,262550,covid-19 pandemic,2020-03-31
|
||||
who pandemic,252750,covid-19 pandemic,2020-03-31
|
||||
epidemic,178700,covid-19 pandemic,2020-03-31
|
||||
pandemic meaning,148850,covid-19 pandemic,2020-03-31
|
||||
what is a pandemic,113550,covid-19 pandemic,2020-03-31
|
||||
pandemic definition,113200,covid-19 pandemic,2020-03-31
|
||||
coronavirus pandemic covid-19 live world map/count,67300,covid-19 pandemic,2020-03-31
|
||||
pandemic vs epidemic,64750,covid-19 pandemic,2020-03-31
|
||||
covid-19 updates,59400,covid-19 pandemic,2020-03-31
|
||||
covid-19 symptoms,53850,covid-19 pandemic,2020-03-31
|
||||
covid-19 pandemic unemployment payment,48650,covid-19 pandemic,2020-03-31
|
||||
covid-19 pandemic plan,45850,covid-19 pandemic,2020-03-31
|
||||
coronavirus update,45650,covid-19 pandemic,2020-03-31
|
||||
who declared covid-19 pandemic,43100,covid-19 pandemic,2020-03-31
|
||||
pandemic vs endemic,34900,covid-19 pandemic,2020-03-31
|
||||
when was the last pandemic,29400,covid-19 pandemic,2020-03-31
|
||||
coronavirus pandemic covid-19 live world map count,21350,covid-19 pandemic,2020-03-31
|
||||
psa safe grocery shopping in covid-19 pandemic – updated,15900,covid-19 pandemic,2020-03-31
|
|
@ -476,3 +476,253 @@ coronavirus pandemic covid-19 live world map/count,6,covid-19 pandemic,2020-03-2
|
||||
world health organization pandemic,6,covid-19 pandemic,2020-03-29
|
||||
coronavirus update,5,covid-19 pandemic,2020-03-29
|
||||
covid-19 pandemic unemployment payment,4,covid-19 pandemic,2020-03-29
|
||||
coronavirus update,100,coronavirus,2020-03-31
|
||||
corona,89,coronavirus,2020-03-31
|
||||
coronavirus symptoms,82,coronavirus,2020-03-31
|
||||
coronavirus news,72,coronavirus,2020-03-31
|
||||
coronavirus cases,70,coronavirus,2020-03-31
|
||||
coronavirus uk,51,coronavirus,2020-03-31
|
||||
corona virus,47,coronavirus,2020-03-31
|
||||
coronavirus india,43,coronavirus,2020-03-31
|
||||
coronavirus map,41,coronavirus,2020-03-31
|
||||
coronavirus china,39,coronavirus,2020-03-31
|
||||
italia coronavirus,38,coronavirus,2020-03-31
|
||||
coronavirus france,34,coronavirus,2020-03-31
|
||||
coronavirus italy,31,coronavirus,2020-03-31
|
||||
sintomas coronavirus,30,coronavirus,2020-03-31
|
||||
usa coronavirus,29,coronavirus,2020-03-31
|
||||
coronavirus us,28,coronavirus,2020-03-31
|
||||
coronavirus españa,27,coronavirus,2020-03-31
|
||||
symptoms of coronavirus,25,coronavirus,2020-03-31
|
||||
coronavirus live,24,coronavirus,2020-03-31
|
||||
coronavirus death,23,coronavirus,2020-03-31
|
||||
coronavirus in india,22,coronavirus,2020-03-31
|
||||
coronavirus tips,21,coronavirus,2020-03-31
|
||||
coronavirus latest,18,coronavirus,2020-03-31
|
||||
what is coronavirus,18,coronavirus,2020-03-31
|
||||
coronavirus brasil,18,coronavirus,2020-03-31
|
||||
coronavirus,100,covid-19,2020-03-31
|
||||
coronavirus covid-19,99,covid-19,2020-03-31
|
||||
covid,60,covid-19,2020-03-31
|
||||
covid-19 cases,54,covid-19,2020-03-31
|
||||
covid 19,43,covid-19,2020-03-31
|
||||
covid-19 virus,35,covid-19,2020-03-31
|
||||
corona,33,covid-19,2020-03-31
|
||||
covid-19 map,26,covid-19,2020-03-31
|
||||
covid-19 symptoms,23,covid-19,2020-03-31
|
||||
covid-19 updates,22,covid-19,2020-03-31
|
||||
covid-19 news,21,covid-19,2020-03-31
|
||||
covid-19 live,19,covid-19,2020-03-31
|
||||
covid-19 update,19,covid-19,2020-03-31
|
||||
corona virus,17,covid-19,2020-03-31
|
||||
us covid-19,16,covid-19,2020-03-31
|
||||
covid-19 canada,15,covid-19,2020-03-31
|
||||
what is covid-19,15,covid-19,2020-03-31
|
||||
covid-19 who,15,covid-19,2020-03-31
|
||||
who,15,covid-19,2020-03-31
|
||||
covid-19 world,13,covid-19,2020-03-31
|
||||
covid-19 test,12,covid-19,2020-03-31
|
||||
usa covid-19,12,covid-19,2020-03-31
|
||||
covid-19 cdc,11,covid-19,2020-03-31
|
||||
cdc,11,covid-19,2020-03-31
|
||||
covid-19 china,11,covid-19,2020-03-31
|
||||
covid,100,covid19,2020-03-31
|
||||
covid 19,84,covid19,2020-03-31
|
||||
coronavirus covid19,60,covid19,2020-03-31
|
||||
coronavirus,59,covid19,2020-03-31
|
||||
covid19 cases,37,covid19,2020-03-31
|
||||
corona,27,covid19,2020-03-31
|
||||
corona covid19,26,covid19,2020-03-31
|
||||
virus covid19,25,covid19,2020-03-31
|
||||
covid19 news,18,covid19,2020-03-31
|
||||
covid19 update,16,covid19,2020-03-31
|
||||
map covid19,14,covid19,2020-03-31
|
||||
covid19 symptoms,14,covid19,2020-03-31
|
||||
corona virus,14,covid19,2020-03-31
|
||||
corona virus covid19,13,covid19,2020-03-31
|
||||
covid19 us,12,covid19,2020-03-31
|
||||
covid19 who,11,covid19,2020-03-31
|
||||
who,11,covid19,2020-03-31
|
||||
what is covid19,11,covid19,2020-03-31
|
||||
test covid19,11,covid19,2020-03-31
|
||||
covid19 canada,10,covid19,2020-03-31
|
||||
italy,10,covid19,2020-03-31
|
||||
covid19 italy,10,covid19,2020-03-31
|
||||
covid19 india,10,covid19,2020-03-31
|
||||
china covid19,9,covid19,2020-03-31
|
||||
covid19 china,9,covid19,2020-03-31
|
||||
coronavirus sars-cov-2,100,sars-cov-2,2020-03-31
|
||||
coronavirus,94,sars-cov-2,2020-03-31
|
||||
covid-19 sars-cov-2,80,sars-cov-2,2020-03-31
|
||||
covid-19,72,sars-cov-2,2020-03-31
|
||||
covid,50,sars-cov-2,2020-03-31
|
||||
sars,46,sars-cov-2,2020-03-31
|
||||
virus sars-cov-2,42,sars-cov-2,2020-03-31
|
||||
covid 19,39,sars-cov-2,2020-03-31
|
||||
corona,36,sars-cov-2,2020-03-31
|
||||
corona virus,19,sars-cov-2,2020-03-31
|
||||
sars-cov-2 vs covid-19,18,sars-cov-2,2020-03-31
|
||||
what is sars-cov-2,14,sars-cov-2,2020-03-31
|
||||
sars-cov,11,sars-cov-2,2020-03-31
|
||||
sars cov 2,10,sars-cov-2,2020-03-31
|
||||
sars-cov-2 wiki,10,sars-cov-2,2020-03-31
|
||||
covid19,10,sars-cov-2,2020-03-31
|
||||
koronawirus,9,sars-cov-2,2020-03-31
|
||||
cdc,8,sars-cov-2,2020-03-31
|
||||
sars-cov-2 rna,5,sars-cov-2,2020-03-31
|
||||
the proximal origin of sars-cov-2,4,sars-cov-2,2020-03-31
|
||||
covid-19 or sars-cov-2,4,sars-cov-2,2020-03-31
|
||||
sars-cov-2 vs cod-19,4,sars-cov-2,2020-03-31
|
||||
pubmed,3,sars-cov-2,2020-03-31
|
||||
sars-cov-1,3,sars-cov-2,2020-03-31
|
||||
sars-cov-2 cell entry depends on ace2 and tmprss2 and is blocked by a clinically proven protease inhibitor,3,sars-cov-2,2020-03-31
|
||||
coronavirus,100,covid-19 pandemic,2020-03-31
|
||||
covid-19 coronavirus pandemic,97,covid-19 pandemic,2020-03-31
|
||||
coronavirus pandemic,97,covid-19 pandemic,2020-03-31
|
||||
who pandemic,41,covid-19 pandemic,2020-03-31
|
||||
is covid-19 a pandemic,40,covid-19 pandemic,2020-03-31
|
||||
who covid-19 pandemic,40,covid-19 pandemic,2020-03-31
|
||||
covid 19,32,covid-19 pandemic,2020-03-31
|
||||
covid 19 pandemic,29,covid-19 pandemic,2020-03-31
|
||||
epidemic,24,covid-19 pandemic,2020-03-31
|
||||
pandemic meaning,21,covid-19 pandemic,2020-03-31
|
||||
what is a pandemic,17,covid-19 pandemic,2020-03-31
|
||||
pandemic definition,16,covid-19 pandemic,2020-03-31
|
||||
covid-19 symptoms,11,covid-19 pandemic,2020-03-31
|
||||
covid-19 pandemic plan,10,covid-19 pandemic,2020-03-31
|
||||
pandemic define,8,covid-19 pandemic,2020-03-31
|
||||
covid-19 live world map/count,7,covid-19 pandemic,2020-03-31
|
||||
coronavirus pandemic covid-19 live world map/count,7,covid-19 pandemic,2020-03-31
|
||||
covid19,7,covid-19 pandemic,2020-03-31
|
||||
covid-19 pandemic unemployment payment,7,covid-19 pandemic,2020-03-31
|
||||
covid-19 updates,6,covid-19 pandemic,2020-03-31
|
||||
pandemic vs endemic,5,covid-19 pandemic,2020-03-31
|
||||
coronavirus update,5,covid-19 pandemic,2020-03-31
|
||||
when was the last pandemic,3,covid-19 pandemic,2020-03-31
|
||||
who declared covid-19 pandemic,3,covid-19 pandemic,2020-03-31
|
||||
when was covid-19 declared a pandemic,1,covid-19 pandemic,2020-03-31
|
||||
coronavirus update,100,coronavirus,2020-03-31
|
||||
corona,90,coronavirus,2020-03-31
|
||||
coronavirus symptoms,73,coronavirus,2020-03-31
|
||||
coronavirus cases,71,coronavirus,2020-03-31
|
||||
coronavirus news,69,coronavirus,2020-03-31
|
||||
coronavirus uk,51,coronavirus,2020-03-31
|
||||
corona virus,46,coronavirus,2020-03-31
|
||||
india coronavirus,45,coronavirus,2020-03-31
|
||||
coronavirus map,41,coronavirus,2020-03-31
|
||||
coronavirus italia,38,coronavirus,2020-03-31
|
||||
coronavirus china,36,coronavirus,2020-03-31
|
||||
france coronavirus,33,coronavirus,2020-03-31
|
||||
usa coronavirus,29,coronavirus,2020-03-31
|
||||
italy,28,coronavirus,2020-03-31
|
||||
coronavirus italy,28,coronavirus,2020-03-31
|
||||
coronavirus us,28,coronavirus,2020-03-31
|
||||
coronavirus españa,26,coronavirus,2020-03-31
|
||||
sintomas coronavirus,26,coronavirus,2020-03-31
|
||||
coronavirus live,25,coronavirus,2020-03-31
|
||||
symptoms of coronavirus,23,coronavirus,2020-03-31
|
||||
coronavirus in india,22,coronavirus,2020-03-31
|
||||
coronavirus tips,21,coronavirus,2020-03-31
|
||||
coronavirus latest,19,coronavirus,2020-03-31
|
||||
coronavirus brasil,18,coronavirus,2020-03-31
|
||||
what is coronavirus,18,coronavirus,2020-03-31
|
||||
coronavirus,100,covid-19,2020-03-31
|
||||
coronavirus covid-19,100,covid-19,2020-03-31
|
||||
covid,63,covid-19,2020-03-31
|
||||
covid-19 cases,56,covid-19,2020-03-31
|
||||
covid 19,45,covid-19,2020-03-31
|
||||
covid-19 virus,35,covid-19,2020-03-31
|
||||
corona,35,covid-19,2020-03-31
|
||||
covid-19 map,26,covid-19,2020-03-31
|
||||
covid-19 symptoms,25,covid-19,2020-03-31
|
||||
covid-19 updates,22,covid-19,2020-03-31
|
||||
covid-19 news,21,covid-19,2020-03-31
|
||||
covid-19 live,20,covid-19,2020-03-31
|
||||
covid-19 update,19,covid-19,2020-03-31
|
||||
corona virus,18,covid-19,2020-03-31
|
||||
covid-19 us,17,covid-19,2020-03-31
|
||||
what is covid-19,16,covid-19,2020-03-31
|
||||
covid-19 who,15,covid-19,2020-03-31
|
||||
covid-19 canada,15,covid-19,2020-03-31
|
||||
canada covid-19,15,covid-19,2020-03-31
|
||||
who,15,covid-19,2020-03-31
|
||||
covid-19 world,13,covid-19,2020-03-31
|
||||
covid-19 test,13,covid-19,2020-03-31
|
||||
china covid-19,12,covid-19,2020-03-31
|
||||
covid-19 cdc,12,covid-19,2020-03-31
|
||||
cdc,11,covid-19,2020-03-31
|
||||
covid,100,covid19,2020-03-31
|
||||
covid 19,86,covid19,2020-03-31
|
||||
coronavirus,57,covid19,2020-03-31
|
||||
coronavirus covid19,56,covid19,2020-03-31
|
||||
covid19 cases,36,covid19,2020-03-31
|
||||
corona covid19,25,covid19,2020-03-31
|
||||
corona,25,covid19,2020-03-31
|
||||
virus covid19,23,covid19,2020-03-31
|
||||
covid19 news,18,covid19,2020-03-31
|
||||
covid19 update,15,covid19,2020-03-31
|
||||
covid19 map,14,covid19,2020-03-31
|
||||
covid19 symptoms,13,covid19,2020-03-31
|
||||
corona virus,12,covid19,2020-03-31
|
||||
covid19 who,11,covid19,2020-03-31
|
||||
who,11,covid19,2020-03-31
|
||||
what is covid19,11,covid19,2020-03-31
|
||||
us covid19,10,covid19,2020-03-31
|
||||
canada covid19,10,covid19,2020-03-31
|
||||
test covid19,10,covid19,2020-03-31
|
||||
covid19 india,9,covid19,2020-03-31
|
||||
china covid19,9,covid19,2020-03-31
|
||||
covid19 italy,9,covid19,2020-03-31
|
||||
italy,9,covid19,2020-03-31
|
||||
covid19 world,9,covid19,2020-03-31
|
||||
covid-19,8,covid19,2020-03-31
|
||||
coronavirus sars-cov-2,100,sars-cov-2,2020-03-31
|
||||
coronavirus,95,sars-cov-2,2020-03-31
|
||||
covid-19,82,sars-cov-2,2020-03-31
|
||||
sars-cov-2 covid-19,76,sars-cov-2,2020-03-31
|
||||
covid,55,sars-cov-2,2020-03-31
|
||||
sars-cov-2 virus,51,sars-cov-2,2020-03-31
|
||||
sars,44,sars-cov-2,2020-03-31
|
||||
covid 19,38,sars-cov-2,2020-03-31
|
||||
corona,37,sars-cov-2,2020-03-31
|
||||
sars-cov-2 covid 19,33,sars-cov-2,2020-03-31
|
||||
sars-cov-2 vs covid-19,25,sars-cov-2,2020-03-31
|
||||
corona virus,16,sars-cov-2,2020-03-31
|
||||
what is sars-cov-2,15,sars-cov-2,2020-03-31
|
||||
sars-cov-2 wiki,13,sars-cov-2,2020-03-31
|
||||
covid19,11,sars-cov-2,2020-03-31
|
||||
sars-cov,10,sars-cov-2,2020-03-31
|
||||
sars cov 2,8,sars-cov-2,2020-03-31
|
||||
sars-cov-2 origin,8,sars-cov-2,2020-03-31
|
||||
sars-cov-2 rna,8,sars-cov-2,2020-03-31
|
||||
cdc,8,sars-cov-2,2020-03-31
|
||||
koronawirus,5,sars-cov-2,2020-03-31
|
||||
sars-cov-2 genome,5,sars-cov-2,2020-03-31
|
||||
pubmed,4,sars-cov-2,2020-03-31
|
||||
the proximal origin of sars-cov-2,4,sars-cov-2,2020-03-31
|
||||
sars-cov-2 vs cod-19,4,sars-cov-2,2020-03-31
|
||||
coronavirus pandemic,100,covid-19 pandemic,2020-03-31
|
||||
coronavirus,98,covid-19 pandemic,2020-03-31
|
||||
covid-19 coronavirus pandemic,96,covid-19 pandemic,2020-03-31
|
||||
is covid-19 a pandemic,41,covid-19 pandemic,2020-03-31
|
||||
covid 19,37,covid-19 pandemic,2020-03-31
|
||||
who,36,covid-19 pandemic,2020-03-31
|
||||
who pandemic covid-19,36,covid-19 pandemic,2020-03-31
|
||||
covid 19 pandemic,36,covid-19 pandemic,2020-03-31
|
||||
who pandemic,34,covid-19 pandemic,2020-03-31
|
||||
epidemic,24,covid-19 pandemic,2020-03-31
|
||||
pandemic meaning,20,covid-19 pandemic,2020-03-31
|
||||
what is a pandemic,15,covid-19 pandemic,2020-03-31
|
||||
pandemic definition,15,covid-19 pandemic,2020-03-31
|
||||
coronavirus pandemic covid-19 live world map/count,9,covid-19 pandemic,2020-03-31
|
||||
pandemic vs epidemic,9,covid-19 pandemic,2020-03-31
|
||||
covid-19 updates,8,covid-19 pandemic,2020-03-31
|
||||
covid-19 symptoms,7,covid-19 pandemic,2020-03-31
|
||||
covid-19 pandemic unemployment payment,7,covid-19 pandemic,2020-03-31
|
||||
covid-19 pandemic plan,6,covid-19 pandemic,2020-03-31
|
||||
coronavirus update,6,covid-19 pandemic,2020-03-31
|
||||
who declared covid-19 pandemic,6,covid-19 pandemic,2020-03-31
|
||||
pandemic vs endemic,5,covid-19 pandemic,2020-03-31
|
||||
when was the last pandemic,4,covid-19 pandemic,2020-03-31
|
||||
coronavirus pandemic covid-19 live world map count,3,covid-19 pandemic,2020-03-31
|
||||
psa safe grocery shopping in covid-19 pandemic – updated,2,covid-19 pandemic,2020-03-31
|
|
2997
keywords/output/intermediate/wikidata_search_results.csv
Normal file
File diff suppressed because it is too large
256674
keywords/output/intermediate/wikidata_search_results_from_gtrends.csv
Normal file
File diff suppressed because it is too large
18
keywords/src/compile_translated_phrases.sh
Executable file
@ -0,0 +1,18 @@
|
||||
#!/bin/bash
|
||||
|
||||
# For now these scripts don't accept command line arguments. It's an MVP
|
||||
|
||||
echo "Reading Google trends"
|
||||
python3 collect_trends.py
|
||||
|
||||
echo "Searching for Wikidata items using base_terms.txt"
|
||||
python3 wikidata_search.py ../resources/base_terms.txt --output ../output/intermediate/wikidata_search_results.csv
|
||||
|
||||
echo "Searching for Wikidata items using Google trends"
|
||||
python3 wikidata_search.py ../output/intermediate/related_searches_rising.csv ../output/intermediate/related_searches_top.csv --use-gtrends --output ../output/intermediate/wikidata_search_results_from_gtrends.csv
|
||||
|
||||
echo "Finding translations from Wikidata using sparql"
|
||||
python3 wikidata_translations.py ../output/intermediate/wikidata_search_results_from_gtrends.csv ../output/intermediate/wikidata_search_results.csv --topN 10 20 --output ../output/csv/$(date '+%Y-%m-%d')_wikidata_item_labels.csv
|
||||
|
||||
rm latest.csv
|
||||
ln -s $(ls -tr | tail -n 1) latest.csv
|
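The last two commands of `compile_translated_phrases.sh` refresh a `latest.csv` symlink. As written they run in the script's working directory, and `rm latest.csv` prints an error when the link does not exist yet. A minimal Python sketch of the same idea follows, assuming the intent is to point `latest.csv` at the newest CSV inside a single output directory. The function name `update_latest_symlink` and its parameters are hypothetical and not part of the repository.

```python
import os
from pathlib import Path


def update_latest_symlink(csv_dir, link_name="latest.csv"):
    """Point csv_dir/link_name at the most recently modified CSV in csv_dir.

    Hypothetical helper: returns the file name the link now targets,
    or None when the directory holds no CSV files.
    """
    csv_dir = Path(csv_dir)
    link = csv_dir / link_name

    # Ignore the link itself (and any other symlink) when looking for data files.
    candidates = [
        p for p in csv_dir.glob("*.csv")
        if p.name != link_name and not p.is_symlink()
    ]
    if not candidates:
        return None

    newest = max(candidates, key=lambda p: p.stat().st_mtime)

    # Create the new link under a temporary name, then rename it over the
    # old one so there is no moment at which latest.csv is missing.
    tmp = csv_dir / (link_name + ".tmp")
    if tmp.is_symlink() or tmp.exists():
        tmp.unlink()
    os.symlink(newest.name, tmp)
    os.replace(tmp, link)
    return newest.name
```

Using a relative target (`newest.name`) keeps the link valid if the output directory is moved or checked out elsewhere.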
8
transliterations/src/compile_transliterated_phrases.sh → keywords/src/compile_transliterated_phrases.sh
Executable file → Normal file
@ -5,12 +5,12 @@
|
||||
echo "Reading Google trends"
|
||||
python3 collect_trends.py
|
||||
|
||||
echo "Searching for Wikidata entities using base_terms.txt"
|
||||
echo "Searching for Wikidata items using base_terms.txt"
|
||||
python3 wikidata_search.py ../resources/base_terms.txt --output ../output/intermediate/wikidata_search_results.csv
|
||||
|
||||
echo "Searching for Wikidata entities using Google trends"
|
||||
echo "Searching for Wikidata items using Google trends"
|
||||
python3 wikidata_search.py ../output/intermediate/related_searches_rising.csv ../output/intermediate/related_searches_top.csv --use-gtrends --output ../output/intermediate/wikidata_search_results_from_gtrends.csv
|
||||
|
||||
echo "Finding transliterations from Wikidata using sparql"
|
||||
python3 wikidata_transliterations.py ../output/intermediate/wikidata_search_results_from_gtrends.csv ../output/intermediate/wikidata_search_results.csv --topN 10 20 --output ../output/csv/$(date '+%Y-%m-%d')_wikidata_entity_labels.csv
|
||||
echo "Finding translations from Wikidata using sparql"
|
||||
python3 wikidata_translations.py ../output/intermediate/wikidata_search_results_from_gtrends.csv ../output/intermediate/wikidata_search_results.csv --topN 10 20 --output ../output/csv/$(date '+%Y-%m-%d')_wikidata_item_labels.csv
|
||||
|
@ -1,4 +1,4 @@
|
||||
# generate a list of wikidata entities related to keywords
|
||||
# generate a list of wikidata items related to keywords
|
||||
from os import path
|
||||
from sys import stdout
|
||||
from wikidata_api_calls import search_wikidata, get_wikidata_api
|
||||
@ -30,8 +30,8 @@ class Wikidata_ResultSet:
|
||||
|
||||
|
||||
class Wikidata_Result:
|
||||
# store unique entities found in the search results, the position in the search result, and the date
|
||||
__slots__=['search_term','entityid','pageid','search_position','timestamp']
|
||||
# store unique items found in the search results, the position in the search result, and the date
|
||||
__slots__=['search_term','itemid','pageid','search_position','timestamp']
|
||||
|
||||
def __init__(self,
|
||||
term,
|
||||
@ -39,14 +39,14 @@ class Wikidata_Result:
|
||||
position):
|
||||
|
||||
self.search_term = term.strip()
|
||||
self.entityid = search_result['title']
|
||||
self.itemid = search_result['title']
|
||||
self.pageid = int(search_result['pageid'])
|
||||
self.search_position = int(position)
|
||||
self.timestamp = search_result['timestamp']
|
||||
|
||||
def to_list(self):
|
||||
return [self.search_term,
|
||||
self.entityid,
|
||||
self.itemid,
|
||||
self.pageid,
|
||||
self.search_position,
|
||||
self.timestamp]
|
||||
@ -79,11 +79,11 @@ def trawl_base_terms(infiles, outfile = None, mode='w'):
|
||||
|
||||
## search each of the base terms in wikidata
|
||||
|
||||
# store unique entities found in the search results, the position in the search result, and the date
|
||||
# store unique items found in the search results, the position in the search result, and the date
|
||||
|
||||
if __name__ == "__main__":
|
||||
import argparse
|
||||
parser = argparse.ArgumentParser("Search wikidata for entities related to a set of terms.")
|
||||
parser = argparse.ArgumentParser("Search wikidata for items related to a set of terms.")
|
||||
parser.add_argument('inputs', type=str, nargs='+', help='one or more files to read')
|
||||
parser.add_argument('--use-gtrends', action='store_true', help = 'toggle whether the input is the output from google trends')
|
||||
parser.add_argument('--output', type=str, help='an output file. defaults to stdout')
|
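The `Wikidata_Result` class in this hunk turns one API search hit into a flat row whose fields match the column names read later by `wikidata_translations.py` (`itemid`, `search_position`). The sketch below shows the same mapping with a dataclass. It is an illustration only: `SearchHit`, `from_api`, and `hits_to_csv` are hypothetical names, and the sample values used with it should be treated as made up.

```python
import csv
import io
from dataclasses import astuple, dataclass


@dataclass(frozen=True)
class SearchHit:
    """One Wikidata search hit, flattened for CSV output (hypothetical)."""
    search_term: str
    itemid: str
    pageid: int
    search_position: int
    timestamp: str

    @classmethod
    def from_api(cls, term, search_result, position):
        # Mirrors the field handling shown in the diff: the item id comes
        # from the result's 'title'; ids and positions are cast to int.
        return cls(
            search_term=term.strip(),
            itemid=search_result["title"],
            pageid=int(search_result["pageid"]),
            search_position=int(position),
            timestamp=search_result["timestamp"],
        )


def hits_to_csv(hits):
    """Serialize hits to CSV text with a header row."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["search_term", "itemid", "pageid",
                     "search_position", "timestamp"])
    writer.writerows(astuple(hit) for hit in hits)
    return buf.getvalue()
```

Keeping the header names in one place makes a rename such as `entityid` to `itemid` a single-line change for both the writer and any reader that relies on `csv.DictReader`.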
@ -5,33 +5,34 @@ from json import JSONDecodeError
|
||||
from os import path
|
||||
|
||||
class LabelData:
|
||||
__slots__ = ['entityid','label','langcode','is_alt']
|
||||
__slots__ = ['itemid','label','langcode','is_alt']
|
||||
|
||||
def __init__(self, wd_res, is_alt):
|
||||
obj = wd_res.get('label',None)
|
||||
self.label = obj.get('value',None)
|
||||
self.langcode = obj.get('xml:lang',None)
|
||||
self.entityid = wd_res.get('entity',None).get('value',None)
|
||||
self.itemid = wd_res.get('item',None).get('value',None)
|
||||
self.is_alt = is_alt
|
||||
|
||||
def to_list(self):
|
||||
return [self.entityid,
|
||||
return [self.itemid,
|
||||
self.label,
|
||||
self.langcode,
|
||||
self.is_alt]
|
||||
|
||||
def GetAllLabels(in_csvs, outfile, topNs):
|
||||
|
||||
def load_entity_ids(in_csv, topN=5):
|
||||
def load_item_ids(in_csv, topN=5):
|
||||
with open(in_csv,'r',newline='') as infile:
|
||||
reader = list(csv.DictReader(infile))
|
||||
for row in reader:
|
||||
if int(row['search_position']) < topN:
|
||||
yield row["entityid"]
|
||||
yield row["itemid"]
|
||||
|
||||
ids = set(chain(* map(lambda in_csv, topN: load_entity_ids(in_csv, topN), in_csvs, topNs)))
|
||||
|
||||
labeldata = GetEntityLabels(ids)
|
||||
ids = set(chain(* map(lambda in_csv, topN: load_item_ids(in_csv, topN), in_csvs, topNs)))
|
||||
ids = ids.union(open("../resources/main_items.txt"))
|
||||
|
||||
labeldata = GetItemLabels(ids)
|
||||
|
||||
with open(outfile, 'w', newline='') as of:
|
||||
writer = csv.writer(of)
|
||||
@ -39,7 +40,7 @@ def GetAllLabels(in_csvs, outfile, topNs):
|
||||
writer.writerows(map(LabelData.to_list,labeldata))
|
||||
|
||||
|
||||
def GetEntityLabels(entityids):
|
||||
def GetItemLabels(itemids):
|
||||
|
||||
def run_query_and_parse(query, is_alt):
|
||||
results = run_sparql_query(query)
|
||||
@ -50,7 +51,7 @@ def GetEntityLabels(entityids):
|
||||
if res is not None:
|
||||
res = res.get('bindings',None)
|
||||
if res is None:
|
||||
raise requests.APIError(f"got invalid response from wikidata for {query % entityid}")
|
||||
raise requests.APIError(f"got invalid response from wikidata for {query % itemid}")
|
||||
|
||||
for info in res:
|
||||
yield LabelData(info, is_alt)
|
||||
@ -59,20 +60,20 @@ def GetEntityLabels(entityids):
|
||||
print(e)
|
||||
print(query)
|
||||
|
||||
def prep_query(query, prop, entityids):
|
||||
values = ' '.join(('wd:{0}'.format(id) for id in entityids))
|
||||
def prep_query(query, prop, itemids):
|
||||
values = ' '.join(('wd:{0}'.format(id) for id in itemids))
|
||||
return query.format(prop, values)
|
||||
|
||||
base_query = """
|
||||
SELECT DISTINCT ?entity ?label WHERE {{
|
||||
?entity {0} ?label;
|
||||
VALUES ?entity {{ {1} }}
|
||||
SELECT DISTINCT ?item ?label WHERE {{
|
||||
?item {0} ?label;
|
||||
VALUES ?item {{ {1} }}
|
||||
}}"""
|
||||
|
||||
# we can't get all the entities at once. how about 100 at a time?
|
||||
# we can't get all the items at once. how about 100 at a time?
|
||||
chunksize = 100
|
||||
entityids = (id for id in entityids)
|
||||
chunk = list(islice(entityids, chunksize))
|
||||
itemids = (id for id in itemids)
|
||||
chunk = list(islice(itemids, chunksize))
|
||||
calls = []
|
||||
while len(chunk) > 0:
|
||||
label_query = prep_query(base_query, "rdfs:label", chunk)
|
||||
@ -80,7 +81,7 @@ def GetEntityLabels(entityids):
|
||||
label_results = run_query_and_parse(label_query, is_alt=False)
|
||||
altLabel_results = run_query_and_parse(altLabel_query, is_alt=True)
|
||||
calls.extend([label_results, altLabel_results])
|
||||
chunk = list(islice(entityids, chunksize))
|
||||
chunk = list(islice(itemids, chunksize))
|
||||
|
||||
return chain(*calls)
|
||||
|
||||
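`GetItemLabels` batches item ids 100 at a time and fills the SPARQL template once per batch. The following sketch isolates that batching and templating step without making any network call. The query text is copied from the diff; `chunked` and `build_queries` are hypothetical helper names, and the default `prop` is just the label predicate that appears in the hunk.

```python
from itertools import islice

# Template copied from the diff. Doubled braces are literal braces after
# str.format; {0} is the predicate and {1} the space-separated item list.
BASE_QUERY = """
SELECT DISTINCT ?item ?label WHERE {{
?item {0} ?label;
VALUES ?item {{ {1} }}
}}"""


def chunked(iterable, size):
    """Yield successive lists of at most `size` elements."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def build_queries(itemids, prop="rdfs:label", size=100):
    """Yield one formatted query string per chunk of item ids."""
    for chunk in chunked(itemids, size):
        values = " ".join(f"wd:{itemid}" for itemid in chunk)
        yield BASE_QUERY.format(prop, values)
```

For example, `list(build_queries(ids, size=100))` on 250 ids yields three query strings, the last of which lists 50 items.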
@ -89,13 +90,13 @@ def find_new_output_file(output, i = 1):
|
||||
if path.exists(output):
|
||||
name, ext = path.splitext(output)
|
||||
|
||||
return find_new_output_file(f"{name}_{i}.{ext}", i+1)
|
||||
return find_new_output_file(f"{name}_{i}{ext}", i+1)
|
||||
else:
|
||||
return output
|
||||
|
||||
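The change above drops the extra dot because `path.splitext` already returns the extension with its leading dot. One side effect of the recursive form is that each call re-splits the already suffixed name, so a second collision produces a name like `out_1_2.csv`. A small iterative sketch that always counts from the original name is shown below. `next_free_path` is a hypothetical name, not a function from the repository.

```python
from os import path


def next_free_path(output):
    """Return `output` if unused, else name_1.ext, name_2.ext, and so on."""
    if not path.exists(output):
        return output

    # ext keeps its leading dot (for example ".csv"), so no extra "." is needed.
    name, ext = path.splitext(output)
    i = 1
    while path.exists(f"{name}_{i}{ext}"):
        i += 1
    return f"{name}_{i}{ext}"
```

Like the original, this only checks for existence and does not create the file, so two processes running at the same time could still pick the same name.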
if __name__ == "__main__":
|
||||
import argparse
|
||||
parser = argparse.ArgumentParser("Use wikidata to find transliterations of terms")
|
||||
parser = argparse.ArgumentParser("Use wikidata to find translations of terms")
|
||||
parser.add_argument('inputs', type=str, nargs='+', help='one or more files to read. the inputs are generated by wikidata_search.py')
|
||||
parser.add_argument('--topN', type=int, nargs='+', help='limit number of wikidata search results to use, can pass one arg for each source.')
|
||||
parser.add_argument('--output', type=str, help='an output file. defaults to stdout',default=20)
|
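`GetAllLabels` pairs each input file with its own `--topN` value through a two-iterable `map`, which stops at the shorter sequence, so an input without a matching `--topN` would be skipped silently. The sketch below shows the same pairing with an explicit length check. It works on rows that are already loaded (dicts as produced by `csv.DictReader`); `top_item_ids` and `collect_ids` are hypothetical names.

```python
from itertools import chain


def top_item_ids(rows, top_n):
    """Yield item ids whose zero-based search position is below top_n."""
    for row in rows:
        if int(row["search_position"]) < top_n:
            yield row["itemid"]


def collect_ids(sources, top_ns):
    """Union the top-ranked item ids across sources, one cutoff per source."""
    if len(sources) != len(top_ns):
        raise ValueError(
            f"expected one topN per input, got {len(sources)} inputs "
            f"and {len(top_ns)} topN values"
        )
    return set(chain.from_iterable(
        top_item_ids(rows, n) for rows, n in zip(sources, top_ns)
    ))
```

With the shell script's arguments, the Google Trends results would use a cutoff of 10 and the base-term results a cutoff of 20.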
@ -1,3 +0,0 @@
|
||||
# Transliterations
|
||||
|
||||
This part of the project collects transliterations of key phrases related to COVID-19 using Wikidata. We search the Wikidata API for entities in `src/wikidata_search.py` and then we make simple SPARQL queries in `src/wikidata_transliterations.py` to collect labels and aliases of the entities. The labels come with language metadata. This seems to provide a decent initial list of relevant terms across multiple languages.
|