cdsc_reddit

Author	SHA1	Message	Date
Nate E TeBlunthuis	4c8bd14992	Bugfix (typo)	2020-11-10 13:38:11 -08:00
Nate E TeBlunthuis	39c581bee9	Reuse code for term and author cosine similarity.	2020-11-10 13:18:57 -08:00
Nate E TeBlunthuis	5632a971c6	Refactor tfidf code for code reuse.	2020-11-10 13:18:19 -08:00
Nate E TeBlunthuis	772f3a8fbd	rename 'idf' files to 'tfidf'	2020-11-10 13:16:55 -08:00
Nate E TeBlunthuis	6edd155749	Improvements to idf code	2020-11-10 13:12:11 -08:00
Nate E TeBlunthuis	8b8c45ee2d	Merge branch 'master' of code:cdsc_reddit	2020-11-02 10:40:12 -08:00
Nate E TeBlunthuis	3dc17bd27c	add term_cosine_similarity.py	2020-11-02 10:40:02 -08:00
Nathan TeBlunthuis	0882878166	Add Cosine similarities to README.md	2020-11-02 09:48:10 -08:00
Nathan TeBlunthuis	b50b08a3ea	Update Readme.	2020-11-02 08:42:13 -08:00
Nathan TeBlunthuis	9075a8153c	Merge branch 'master' of code:cdsc_reddit into master	2020-11-01 21:50:44 -08:00
Nathan TeBlunthuis	4c78f2c527	Create README.md	2020-11-01 21:50:27 -08:00
Nate E TeBlunthuis	4ced659d19	Update reddit comments data with daily dumps.	2020-10-03 16:42:22 -07:00
Nate E TeBlunthuis	2740f55915	Compute IDF for terms and authors.	2020-08-23 11:57:55 -07:00
Nate E TeBlunthuis	2d425600a8	Update submissions to parse using the backfill queue.	2020-08-11 22:37:36 -07:00
Nate E TeBlunthuis	c92b50e050	bugfix in checking submission shas	2020-08-11 14:21:54 -07:00
Nate E TeBlunthuis	c0da8f4dbf	Use multiword expressions in tf.	2020-08-10 16:57:46 -07:00
Nate E TeBlunthuis	57951050c0	Finish generating multiword expressions.	2020-08-09 22:43:48 -07:00
Nate E TeBlunthuis	529b7f0511	Bugfix	2020-08-09 02:34:42 -07:00
Nate E TeBlunthuis	2d1c8013f2	Use groupby - joins instead of windows	2020-08-09 00:21:50 -07:00
Nate E TeBlunthuis	f28effe2c3	rename tf_comments part 2.	2020-08-04 13:39:49 -07:00
Nate E TeBlunthuis	39fde9884e	rename tf_reddit_comments.py step1.	2020-08-04 13:39:20 -07:00
Nate E TeBlunthuis	78ab514d6b	Improve tokenization following data. Generate author counts.	2020-08-04 13:24:37 -07:00
Nate E TeBlunthuis	b3ffaaba1d	improve tokenizer.	2020-08-03 22:55:10 -07:00
Nate E TeBlunthuis	ddf2adb8a6	TF reddit comments.	2020-08-03 22:43:57 -07:00
Nate E TeBlunthuis	40be7bedb6	code to sort tf	2020-08-03 17:56:36 -07:00
Nate E TeBlunthuis	c666302b4a	remove is_submitter field from submissions which doesn't exist.	2020-07-09 17:12:14 -07:00
Nate E TeBlunthuis	aa84a7df03	Bugfixes in scripts.	2020-07-07 23:29:36 -07:00
Nate E TeBlunthuis	06fd99e7cd	clean up comments in streaming example.	2020-07-07 12:28:57 -07:00
Nate E TeBlunthuis	7d0e020f9d	update .gitignore	2020-07-07 12:28:44 -07:00
Nate E TeBlunthuis	e22ddf23da	update examples with working streaming	2020-07-07 11:47:17 -07:00
Nate E TeBlunthuis	40d4563770	Build comments dataset similarly to submissions and improve partitioning scheme	2020-07-07 11:45:43 -07:00
Nate E TeBlunthuis	fc6575a287	update .gitignore	2020-07-07 00:58:26 -07:00
Nate E TeBlunthuis	4efd72a916	Script for example of streaming pyarrow.	2020-07-07 00:57:05 -07:00
Nate E TeBlunthuis	90fe976b26	Script to demonstrate reading parquet.	2020-07-07 00:51:40 -07:00
Nate E TeBlunthuis	9cd0954288	Check the shas when we download dumps	2020-07-06 23:31:52 -07:00
Nate E TeBlunthuis	33e088492c	Script to run both parts of submissions_2_parquet.sh	2020-07-06 23:27:18 -07:00
Nate E TeBlunthuis	fd3b615544	Cache before sorting so we don't extract twice.	2020-07-06 22:30:04 -07:00
Nate E TeBlunthuis	4ec9c14247	Move the spark part of submissions_2_parquet to a separate script.	2020-07-06 22:27:34 -07:00
Nate E TeBlunthuis	4eb82d2740	Fix whitespace at top of file.	2020-07-05 23:32:00 -07:00
Nate E TeBlunthuis	34185337c9	Secondary sort for the by_author dataset should be CreatedAt.	2020-07-05 23:29:35 -07:00
Nate E TeBlunthuis	67857a3b05	Create a second dataset sorted by author.	2020-07-05 23:27:05 -07:00
Nate E TeBlunthuis	6d4344355b	Create parquet datasets of reddit submissions from pushshift.	2020-07-05 23:20:17 -07:00
Nate E TeBlunthuis	6dca79a41f	Rename spark script to reflect that it is for comments.	2020-07-03 14:00:36 -07:00
Nate E TeBlunthuis	94c7a74bd9	update .gitignore	2020-07-03 13:55:25 -07:00
Nate E TeBlunthuis	4dd9a210e6	bugfix in retrieving old data and rename file.	2020-07-03 13:54:55 -07:00
Nate E TeBlunthuis	c972d828b3	Script for checking shas for submissions.	2020-07-03 13:35:46 -07:00
Nate E TeBlunthuis	7da18e33ba	Bugfix: use timestamp types Also change the canonical file path.	2020-07-03 11:38:43 -07:00
Nate E TeBlunthuis	db2d6248fc	update the reddit comment dumps	2020-07-03 10:41:13 -07:00
Nate E TeBlunthuis	d05da6441f	Don't clobber old dumps so that we can just download the new ones.	2020-07-03 10:40:43 -07:00
Nate E TeBlunthuis	592d2c7dda	script for getting submissions dumps from pushshift.	2020-07-02 17:40:17 -07:00

51 Commits