Nate E TeBlunthuis
							
						 
					 | 
					
						
						
						
						
							
						
						
							82d184d9c6
							
						
					 | 
					
						
						
							
							Update code for building similarity matrices.
						
						
						
						
						
					 | 
					
						2020-11-17 12:52:48 -08:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Nate E TeBlunthuis
							
						 
					 | 
					
						
						
						
						
							
						
						
							e794214653
							
						
					 | 
					
						
						
							
							bugfix in completing tfidf similarity matrices.
						
						
						
						
						
					 | 
					
						2020-11-12 11:47:53 -08:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Nate E TeBlunthuis
							
						 
					 | 
					
						
						
						
						
							
						
						
							220a540beb
							
						
					 | 
					
						
						
							
							increase learning rate.
						
						
						
						
						
					 | 
					
						2020-11-11 16:58:39 -08:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Nate E TeBlunthuis
							
						 
					 | 
					
						
						
						
						
							
						
						
							cd43a94865
							
						
					 | 
					
						
						
							
							increase iterations and perplexity and early_exaggeration
						
						
						
						
						
					 | 
					
						2020-11-11 16:55:39 -08:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Nate E TeBlunthuis
							
						 
					 | 
					
						
						
						
						
							
						
						
							ca6a8f0896
							
						
					 | 
					
						
						
							
							increase learning rate
						
						
						
						
						
					 | 
					
						2020-11-11 16:48:41 -08:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Nate E TeBlunthuis
							
						 
					 | 
					
						
						
						
						
							
						
						
							ed0e1a8235
							
						
					 | 
					
						
						
							
							Fix bug in tsne.
						
						
						
						
						
					 | 
					
						2020-11-11 16:43:41 -08:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Nate E TeBlunthuis
							
						 
					 | 
					
						
						
						
						
							
						
						
							6baa08889b
							
						
					 | 
					
						
						
							
							git-annex in nathante@mox2.hyak.local:/gscratch/comdata/users/nathante/cdsc-reddit
						
						
						
						
						
					 | 
					
						2020-11-11 16:39:44 -08:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Nate E TeBlunthuis
							
						 
					 | 
					
						
						
						
						
							
						
						
							4447c60265
							
						
					 | 
					
						
						
							
							split fitting and plotting tsne.
						
						
						
						
						
					 | 
					
						2020-11-11 16:38:22 -08:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					| 
						
					 | 
					
						
						
						
						
							
						
						
							db53c0138a
							
						
					 | 
					
						
						
							
							Add file to plot related subreddits using tsne.
						
						
						
						
						
					 | 
					
						2020-11-11 16:05:36 -08:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Nate E TeBlunthuis
							
						 
					 | 
					
						
						
						
						
							
						
						
							4c8bd14992
							
						
					 | 
					
						
						
							
							Bugfix (typo)
						
						
						
						
						
					 | 
					
						2020-11-10 13:38:11 -08:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Nate E TeBlunthuis
							
						 
					 | 
					
						
						
						
						
							
						
						
							39c581bee9
							
						
					 | 
					
						
						
							
							Reuse code for term and author cosine similarity.
						
						
						
						
						
					 | 
					
						2020-11-10 13:18:57 -08:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Nate E TeBlunthuis
							
						 
					 | 
					
						
						
						
						
							
						
						
							5632a971c6
							
						
					 | 
					
						
						
							
							Refactor tfidf code for code reuse.
						
						
						
						
						
					 | 
					
						2020-11-10 13:18:19 -08:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Nate E TeBlunthuis
							
						 
					 | 
					
						
						
						
						
							
						
						
							772f3a8fbd
							
						
					 | 
					
						
						
							
							rename 'idf' files to 'tfidf'
						
						
						
						
						
					 | 
					
						2020-11-10 13:16:55 -08:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Nate E TeBlunthuis
							
						 
					 | 
					
						
						
						
						
							
						
						
							6edd155749
							
						
					 | 
					
						
						
							
							Improvements to idf code
						
						
						
						
						
					 | 
					
						2020-11-10 13:12:11 -08:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Nate E TeBlunthuis
							
						 
					 | 
					
						
						
						
						
							
						
						
							8b8c45ee2d
							
						
					 | 
					
						
						
							
							Merge branch 'master' of code:cdsc_reddit
						
						
						
						
						
					 | 
					
						2020-11-02 10:40:12 -08:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Nate E TeBlunthuis
							
						 
					 | 
					
						
						
						
						
							
						
						
							3dc17bd27c
							
						
					 | 
					
						
						
							
							add term_cosine_similarity.py
						
						
						
						
						
					 | 
					
						2020-11-02 10:40:02 -08:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					| 
						
					 | 
					
						
						
						
						
							
						
						
							0882878166
							
						
					 | 
					
						
						
							
							Add Cosine similarities to README.md
						
						
						
						
						
					 | 
					
						2020-11-02 09:48:10 -08:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					| 
						
					 | 
					
						
						
						
						
							
						
						
							b50b08a3ea
							
						
					 | 
					
						
						
							
							Update Readme.
						
						
						
						
						
					 | 
					
						2020-11-02 08:42:13 -08:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					| 
						
					 | 
					
						
						
						
						
							
						
						
							9075a8153c
							
						
					 | 
					
						
						
							
							Merge branch 'master' of code:cdsc_reddit into master
						
						
						
						
						
					 | 
					
						2020-11-01 21:50:44 -08:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					| 
						
					 | 
					
						
						
						
						
							
						
						
							4c78f2c527
							
						
					 | 
					
						
						
							
							Create README.md
						
						
						
						
						
					 | 
					
						2020-11-01 21:50:27 -08:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Nate E TeBlunthuis
							
						 
					 | 
					
						
						
						
						
							
						
						
							4ced659d19
							
						
					 | 
					
						
						
							
							Update reddit comments data with daily dumps.
						
						
						
						
						
					 | 
					
						2020-10-03 16:42:22 -07:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Nate E TeBlunthuis
							
						 
					 | 
					
						
						
						
						
							
						
						
							2740f55915
							
						
					 | 
					
						
						
							
							Compute IDF for terms and authors.
						
						
						
						
						
					 | 
					
						2020-08-23 11:57:55 -07:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Nate E TeBlunthuis
							
						 
					 | 
					
						
						
						
						
							
						
						
							2d425600a8
							
						
					 | 
					
						
						
							
							Update submissions to parse using the backfill queue.
						
						
						
						
						
					 | 
					
						2020-08-11 22:37:36 -07:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Nate E TeBlunthuis
							
						 
					 | 
					
						
						
						
						
							
						
						
							c92b50e050
							
						
					 | 
					
						
						
							
							bugfix in checking submission shas
						
						
						
						
						
					 | 
					
						2020-08-11 14:21:54 -07:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Nate E TeBlunthuis
							
						 
					 | 
					
						
						
						
						
							
						
						
							c0da8f4dbf
							
						
					 | 
					
						
						
							
							Use multiword expressions in tf.
						
						
						
						
						
					 | 
					
						2020-08-10 16:57:46 -07:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Nate E TeBlunthuis
							
						 
					 | 
					
						
						
						
						
							
						
						
							57951050c0
							
						
					 | 
					
						
						
							
							Finish generating multiword expressions.
						
						
						
						
						
					 | 
					
						2020-08-09 22:43:48 -07:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Nate E TeBlunthuis
							
						 
					 | 
					
						
						
						
						
							
						
						
							529b7f0511
							
						
					 | 
					
						
						
							
							Bugfix
						
						
						
						
						
					 | 
					
						2020-08-09 02:34:42 -07:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Nate E TeBlunthuis
							
						 
					 | 
					
						
						
						
						
							
						
						
							2d1c8013f2
							
						
					 | 
					
						
						
							
							Use groupby - joins instead of windows
						
						
						
						
						
					 | 
					
						2020-08-09 00:21:50 -07:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Nate E TeBlunthuis
							
						 
					 | 
					
						
						
						
						
							
						
						
							f28effe2c3
							
						
					 | 
					
						
						
							
							rename tf_comments part 2.
						
						
						
						
						
					 | 
					
						2020-08-04 13:39:49 -07:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Nate E TeBlunthuis
							
						 
					 | 
					
						
						
						
						
							
						
						
							39fde9884e
							
						
					 | 
					
						
						
							
							rename tf_reddit_comments.py step1.
						
						
						
						
						
					 | 
					
						2020-08-04 13:39:20 -07:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Nate E TeBlunthuis
							
						 
					 | 
					
						
						
						
						
							
						
						
							78ab514d6b
							
						
					 | 
					
						
						
							
							Improve tokenization following data. Generate author counts.
						
						
						
						
						
					 | 
					
						2020-08-04 13:24:37 -07:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Nate E TeBlunthuis
							
						 
					 | 
					
						
						
						
						
							
						
						
							b3ffaaba1d
							
						
					 | 
					
						
						
							
							improve tokenizer.
						
						
						
						
						
					 | 
					
						2020-08-03 22:55:10 -07:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Nate E TeBlunthuis
							
						 
					 | 
					
						
						
						
						
							
						
						
							ddf2adb8a6
							
						
					 | 
					
						
						
							
							TF reddit comments.
						
						
						
						
						
					 | 
					
						2020-08-03 22:43:57 -07:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Nate E TeBlunthuis
							
						 
					 | 
					
						
						
						
						
							
						
						
							40be7bedb6
							
						
					 | 
					
						
						
							
							code to sort tf
						
						
						
						
						
					 | 
					
						2020-08-03 17:56:36 -07:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Nate E TeBlunthuis
							
						 
					 | 
					
						
						
						
						
							
						
						
							c666302b4a
							
						
					 | 
					
						
						
							
							remove the is_submitter field, which doesn't exist, from submissions.
						
						
						
						
						
					 | 
					
						2020-07-09 17:12:14 -07:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Nate E TeBlunthuis
							
						 
					 | 
					
						
						
						
						
							
						
						
							aa84a7df03
							
						
					 | 
					
						
						
							
							Bugfixes in scripts.
						
						
						
						
						
					 | 
					
						2020-07-07 23:29:36 -07:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Nate E TeBlunthuis
							
						 
					 | 
					
						
						
						
						
							
						
						
							06fd99e7cd
							
						
					 | 
					
						
						
							
							clean up comments in streaming example.
						
						
						
						
						
					 | 
					
						2020-07-07 12:28:57 -07:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Nate E TeBlunthuis
							
						 
					 | 
					
						
						
						
						
							
						
						
							7d0e020f9d
							
						
					 | 
					
						
						
							
							update .gitignore
						
						
						
						
						
					 | 
					
						2020-07-07 12:28:44 -07:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Nate E TeBlunthuis
							
						 
					 | 
					
						
						
						
						
							
						
						
							e22ddf23da
							
						
					 | 
					
						
						
							
							update examples with working streaming
						
						
						
						
						
					 | 
					
						2020-07-07 11:47:17 -07:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Nate E TeBlunthuis
							
						 
					 | 
					
						
						
						
						
							
						
						
							40d4563770
							
						
					 | 
					
						
						
							
							Build comments dataset similarly to submissions and improve partitioning scheme
						
						
						
						
						
					 | 
					
						2020-07-07 11:45:43 -07:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Nate E TeBlunthuis
							
						 
					 | 
					
						
						
						
						
							
						
						
							fc6575a287
							
						
					 | 
					
						
						
							
							update .gitignore
						
						
						
						
						
					 | 
					
						2020-07-07 00:58:26 -07:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Nate E TeBlunthuis
							
						 
					 | 
					
						
						
						
						
							
						
						
							4efd72a916
							
						
					 | 
					
						
						
							
							Script for example of streaming pyarrow.
						
						
						
						
						
					 | 
					
						2020-07-07 00:57:05 -07:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Nate E TeBlunthuis
							
						 
					 | 
					
						
						
						
						
							
						
						
							90fe976b26
							
						
					 | 
					
						
						
							
							Script to demonstrate reading parquet.
						
						
						
						
						
					 | 
					
						2020-07-07 00:51:40 -07:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Nate E TeBlunthuis
							
						 
					 | 
					
						
						
						
						
							
						
						
							9cd0954288
							
						
					 | 
					
						
						
							
							Check the shas when we download dumps
						
						
						
						
						
					 | 
					
						2020-07-06 23:31:52 -07:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Nate E TeBlunthuis
							
						 
					 | 
					
						
						
						
						
							
						
						
							33e088492c
							
						
					 | 
					
						
						
							
							Script to run both parts of submissions_2_parquet.sh
						
						
						
						
						
					 | 
					
						2020-07-06 23:27:18 -07:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Nate E TeBlunthuis
							
						 
					 | 
					
						
						
						
						
							
						
						
							fd3b615544
							
						
					 | 
					
						
						
							
							Cache before sorting so we don't extract twice.
						
						
						
						
						
					 | 
					
						2020-07-06 22:30:04 -07:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Nate E TeBlunthuis
							
						 
					 | 
					
						
						
						
						
							
						
						
							4ec9c14247
							
						
					 | 
					
						
						
							
							Move the spark part of submissions_2_parquet to a separate script.
						
						
						
						
						
					 | 
					
						2020-07-06 22:27:34 -07:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Nate E TeBlunthuis
							
						 
					 | 
					
						
						
						
						
							
						
						
							4eb82d2740
							
						
					 | 
					
						
						
							
							Fix whitespace at top of file.
						
						
						
						
						
					 | 
					
						2020-07-05 23:32:00 -07:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Nate E TeBlunthuis
							
						 
					 | 
					
						
						
						
						
							
						
						
							34185337c9
							
						
					 | 
					
						
						
							
							Secondary sort for the by_author dataset should be CreatedAt.
						
						
						
						
						
					 | 
					
						2020-07-05 23:29:35 -07:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Nate E TeBlunthuis
							
						 
					 | 
					
						
						
						
						
							
						
						
							67857a3b05
							
						
					 | 
					
						
						
							
							Create a second dataset sorted by author.
						
						
						
						
						
					 | 
					
						2020-07-05 23:27:05 -07:00 | 
					
					
						
						
							
							
							
						
					 |