| Author | SHA1 | Message | Date |
| --- | --- | --- | --- |
| Nate E TeBlunthuis | c666302b4a | remove is_submitter field from submissions which doesn't exist. | 2020-07-09 17:12:14 -07:00 |
| Nate E TeBlunthuis | aa84a7df03 | Bugfixes in scripts. | 2020-07-07 23:29:36 -07:00 |
| Nate E TeBlunthuis | 06fd99e7cd | clean up comments in streaming example. | 2020-07-07 12:28:57 -07:00 |
| Nate E TeBlunthuis | 7d0e020f9d | update .gitignore | 2020-07-07 12:28:44 -07:00 |
| Nate E TeBlunthuis | e22ddf23da | update examples with working streaming | 2020-07-07 11:47:17 -07:00 |
| Nate E TeBlunthuis | 40d4563770 | Build comments dataset similarly to submissions and improve partitioning scheme | 2020-07-07 11:45:43 -07:00 |
| Nate E TeBlunthuis | fc6575a287 | update .gitignore | 2020-07-07 00:58:26 -07:00 |
| Nate E TeBlunthuis | 4efd72a916 | Script for example of streaming pyarrow. | 2020-07-07 00:57:05 -07:00 |
| Nate E TeBlunthuis | 90fe976b26 | Script to demonstrate reading parquet. | 2020-07-07 00:51:40 -07:00 |
| Nate E TeBlunthuis | 9cd0954288 | Check the shas when we download dumps | 2020-07-06 23:31:52 -07:00 |
| Nate E TeBlunthuis | 33e088492c | Script to run both parts of submissions_2_parquet.sh | 2020-07-06 23:27:18 -07:00 |
| Nate E TeBlunthuis | fd3b615544 | Cache before sorting so we don't extract twice. | 2020-07-06 22:30:04 -07:00 |
| Nate E TeBlunthuis | 4ec9c14247 | Move the spark part of submissions_2_parquet to a separate script. | 2020-07-06 22:27:34 -07:00 |
| Nate E TeBlunthuis | 4eb82d2740 | Fix whitespace at top of file. | 2020-07-05 23:32:00 -07:00 |
| Nate E TeBlunthuis | 34185337c9 | Secondary sort for the by_author dataset should be CreatedAt. | 2020-07-05 23:29:35 -07:00 |
| Nate E TeBlunthuis | 67857a3b05 | Create a second dataset sorted by author. | 2020-07-05 23:27:05 -07:00 |
| Nate E TeBlunthuis | 6d4344355b | Create parquet datasets of reddit submissions from pushshift. | 2020-07-05 23:20:17 -07:00 |
| Nate E TeBlunthuis | 6dca79a41f | Rename spark script to reflect that it is for comments. | 2020-07-03 14:00:36 -07:00 |
| Nate E TeBlunthuis | 94c7a74bd9 | update .gitignore | 2020-07-03 13:55:25 -07:00 |
| Nate E TeBlunthuis | 4dd9a210e6 | bugfix in retrieving old data and rename file. | 2020-07-03 13:54:55 -07:00 |
| Nate E TeBlunthuis | c972d828b3 | Script for checking shas for submissions. | 2020-07-03 13:35:46 -07:00 |
| Nate E TeBlunthuis | 7da18e33ba | Bugfix: use timestamp types Also change the canonical file path. | 2020-07-03 11:38:43 -07:00 |
| Nate E TeBlunthuis | db2d6248fc | update the reddit comment dumps | 2020-07-03 10:41:13 -07:00 |
| Nate E TeBlunthuis | d05da6441f | Don't clobber old dumps so that we can just download the new ones. | 2020-07-03 10:40:43 -07:00 |
| Nate E TeBlunthuis | 592d2c7dda | script for getting submissions dumps from pushshift. | 2020-07-02 17:40:17 -07:00 |
| Nate E TeBlunthuis | 64e9408a65 | Extract variables from pushshift comment to parquet A spark script | 2020-07-02 14:35:55 -07:00 |