| Nate E TeBlunthuis | cd43a94865 | increase iterations and perplexity and early_exaggeration | 2020-11-11 16:55:39 -08:00 |
| Nate E TeBlunthuis | ca6a8f0896 | increase learning rate | 2020-11-11 16:48:41 -08:00 |
| Nate E TeBlunthuis | ed0e1a8235 | Fix bug in tsne. | 2020-11-11 16:43:41 -08:00 |
| Nate E TeBlunthuis | 6baa08889b | git-annex in nathante@mox2.hyak.local:/gscratch/comdata/users/nathante/cdsc-reddit | 2020-11-11 16:39:44 -08:00 |
| Nate E TeBlunthuis | 4447c60265 | split fitting and plotting tsne. | 2020-11-11 16:38:22 -08:00 |
|  | db53c0138a | Add file to plot related subreddits using tsne. | 2020-11-11 16:05:36 -08:00 |
| Nate E TeBlunthuis | 4c8bd14992 | Bugfix (typo) | 2020-11-10 13:38:11 -08:00 |
| Nate E TeBlunthuis | 39c581bee9 | Reuse code for term and author cosine similarity. | 2020-11-10 13:18:57 -08:00 |
| Nate E TeBlunthuis | 5632a971c6 | Refactor tfidf code for code reuse. | 2020-11-10 13:18:19 -08:00 |
| Nate E TeBlunthuis | 772f3a8fbd | rename 'idf' files to 'tfidf' | 2020-11-10 13:16:55 -08:00 |
| Nate E TeBlunthuis | 6edd155749 | Improvements to idf code | 2020-11-10 13:12:11 -08:00 |
| Nate E TeBlunthuis | 8b8c45ee2d | Merge branch 'master' of code:cdsc_reddit | 2020-11-02 10:40:12 -08:00 |
| Nate E TeBlunthuis | 3dc17bd27c | add term_cosine_similarity.py | 2020-11-02 10:40:02 -08:00 |
|  | 0882878166 | Add Cosine similarities to README.md | 2020-11-02 09:48:10 -08:00 |
|  | b50b08a3ea | Update Readme. | 2020-11-02 08:42:13 -08:00 |
|  | 9075a8153c | Merge branch 'master' of code:cdsc_reddit into master | 2020-11-01 21:50:44 -08:00 |
|  | 4c78f2c527 | Create README.md | 2020-11-01 21:50:27 -08:00 |
| Nate E TeBlunthuis | 4ced659d19 | Update reddit comments data with daily dumps. | 2020-10-03 16:42:22 -07:00 |
| Nate E TeBlunthuis | 2740f55915 | Compute IDF for terms and authors. | 2020-08-23 11:57:55 -07:00 |
| Nate E TeBlunthuis | 2d425600a8 | Update submissions to parse using the backfill queue. | 2020-08-11 22:37:36 -07:00 |
| Nate E TeBlunthuis | c92b50e050 | bugfix in checking submission shas | 2020-08-11 14:21:54 -07:00 |
| Nate E TeBlunthuis | c0da8f4dbf | Use multiword expressions in tf. | 2020-08-10 16:57:46 -07:00 |
| Nate E TeBlunthuis | 57951050c0 | Finish generating multiword expressions. | 2020-08-09 22:43:48 -07:00 |
| Nate E TeBlunthuis | 529b7f0511 | Bugfix | 2020-08-09 02:34:42 -07:00 |
| Nate E TeBlunthuis | 2d1c8013f2 | Use groupby - joins instead of windows | 2020-08-09 00:21:50 -07:00 |
| Nate E TeBlunthuis | f28effe2c3 | rename tf_comments part 2. | 2020-08-04 13:39:49 -07:00 |
| Nate E TeBlunthuis | 39fde9884e | rename tf_reddit_comments.py step1. | 2020-08-04 13:39:20 -07:00 |
| Nate E TeBlunthuis | 78ab514d6b | Improve tokenization following data. Generate author counts. | 2020-08-04 13:24:37 -07:00 |
| Nate E TeBlunthuis | b3ffaaba1d | improve tokenizer. | 2020-08-03 22:55:10 -07:00 |
| Nate E TeBlunthuis | ddf2adb8a6 | TF reddit comments. | 2020-08-03 22:43:57 -07:00 |
| Nate E TeBlunthuis | 40be7bedb6 | code to sort tf | 2020-08-03 17:56:36 -07:00 |
| Nate E TeBlunthuis | c666302b4a | remove is_submitter field from submissions which doesn't exist. | 2020-07-09 17:12:14 -07:00 |
| Nate E TeBlunthuis | aa84a7df03 | Bugfixes in scripts. | 2020-07-07 23:29:36 -07:00 |
| Nate E TeBlunthuis | 06fd99e7cd | clean up comments in streaming example. | 2020-07-07 12:28:57 -07:00 |
| Nate E TeBlunthuis | 7d0e020f9d | update .gitignore | 2020-07-07 12:28:44 -07:00 |
| Nate E TeBlunthuis | e22ddf23da | update examples with working streaming | 2020-07-07 11:47:17 -07:00 |
| Nate E TeBlunthuis | 40d4563770 | Build comments dataset similarly to submissions and improve partitioning scheme | 2020-07-07 11:45:43 -07:00 |
| Nate E TeBlunthuis | fc6575a287 | update .gitignore | 2020-07-07 00:58:26 -07:00 |
| Nate E TeBlunthuis | 4efd72a916 | Script for example of streaming pyarrow. | 2020-07-07 00:57:05 -07:00 |
| Nate E TeBlunthuis | 90fe976b26 | Script to demonstrate reading parquet. | 2020-07-07 00:51:40 -07:00 |
| Nate E TeBlunthuis | 9cd0954288 | Check the shas when we download dumps | 2020-07-06 23:31:52 -07:00 |
| Nate E TeBlunthuis | 33e088492c | Script to run both parts of submissions_2_parquet.sh | 2020-07-06 23:27:18 -07:00 |
| Nate E TeBlunthuis | fd3b615544 | Cache before sorting so we don't extract twice. | 2020-07-06 22:30:04 -07:00 |
| Nate E TeBlunthuis | 4ec9c14247 | Move the spark part of submissions_2_parquet to a separate script. | 2020-07-06 22:27:34 -07:00 |
| Nate E TeBlunthuis | 4eb82d2740 | Fix whitespace at top of file. | 2020-07-05 23:32:00 -07:00 |
| Nate E TeBlunthuis | 34185337c9 | Secondary sort for the by_author dataset should be CreatedAt. | 2020-07-05 23:29:35 -07:00 |
| Nate E TeBlunthuis | 67857a3b05 | Create a second dataset sorted by author. | 2020-07-05 23:27:05 -07:00 |
| Nate E TeBlunthuis | 6d4344355b | Create parquet datasets of reddit submissions from pushshift. | 2020-07-05 23:20:17 -07:00 |
| Nate E TeBlunthuis | 6dca79a41f | Rename spark script to reflect that it is for comments. | 2020-07-03 14:00:36 -07:00 |
| Nate E TeBlunthuis | 94c7a74bd9 | update .gitignore | 2020-07-03 13:55:25 -07:00 |