The Tweets Sampling Toolkit contains a set of tools for 1) creating a random sample from massive (100M+) Tweet ID datasets, such as those available in the Social Media Lab's COVID-19 Twitter Pandemic Archive, and 2) performing set operations on Tweet ID datasets, such as intersection, difference, or union.
The main feature of this toolkit is that it's optimized to reliably process extremely large CSV files (multiple gigabytes) with minimal memory use. Specifically, the toolkit reads an input file one line at a time whenever possible instead of loading the whole file into memory.
This toolkit is designed to ingest any text/CSV files consisting of a list of uniformly-sized integers separated by line breaks. The first line is skipped if it's non-numeric. You can find sample input files in the COVID-19 Twitter Pandemic Archive.
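The input format described above can be read with a short streaming loop. The following is a minimal sketch, not the toolkit's own code: the helper names `iter_tweet_ids` and `count_tweet_ids` are hypothetical, and the IDs in the demo are made-up placeholders.

```python
import os
import tempfile
from typing import Iterator


def iter_tweet_ids(path: str) -> Iterator[int]:
    """Yield Tweet IDs from a text/CSV file, reading one line at a time."""
    with open(path, "r", encoding="utf-8") as handle:
        for line_number, line in enumerate(handle):
            value = line.strip()
            if not value:
                continue  # ignore blank lines
            if line_number == 0 and not value.isdecimal():
                continue  # skip a non-numeric header row
            yield int(value)


def count_tweet_ids(path: str) -> int:
    """Count the IDs without ever holding the whole file in memory."""
    return sum(1 for _ in iter_tweet_ids(path))


# Demo with a tiny temporary file: a header row plus three placeholder IDs.
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as tmp:
    tmp.write("id\n1377000000000000001\n1377000000000000002\n1377000000000000003\n")
demo_path = tmp.name
demo_count = count_tweet_ids(demo_path)
os.remove(demo_path)
```

Because the file is consumed through a generator, memory use stays roughly constant regardless of how many IDs the file contains.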
- Install Python 3
- Install tqdm (progress bar utility):
pip3 install tqdm
- Place the file(s) you will be working with in a directory with tweets_sampling.py and external_sort.py.
- For testing purposes, you can download and use one of the COVID-19 related datasets from COVID-19 Twitter Pandemic Archive (Note: In the sample code below, we’re using this file.)
- Import tweets_sampling and create a “file manager” object as shown below:
import tweets_sampling
ifm = tweets_sampling.id_file_manager('april_2021_COVID-19_+_Vaccines_Twitter_Streaming_Dataset.csv')
- Use the following call to confirm the number of Tweet IDs stored in the input file.
ifm.id_count
Use get_random_sample() to produce a new randomly chosen, duplicate-free list of Tweet IDs from an existing text/csv file.
Parameters:
- n: Number of Tweet IDs to include in the sample or, in "percent" mode, the fraction of the input file to sample (for example, 0.2 for 20%)
- output_file: Relative path of a text file to store the sample
- sample_mode: Method for choosing the sample size: either "absolute" (default) or "percent" based on the total number of Tweet IDs in the input file.
Returns: An id_file_manager object linking to the resulting file with the sample data.
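For readers curious how a fixed-size random sample can be drawn from a file that is read one line at a time, reservoir sampling is one common single-pass technique. The sketch below is an illustration under assumptions, not a description of how get_random_sample() is implemented: `reservoir_sample` and `resolve_sample_size` are hypothetical names, truncating the percent-mode size to an integer is a guess, and the sample is duplicate-free only if the input IDs are.

```python
import random
from typing import Iterable, List, Optional


def resolve_sample_size(n: float, total: int, sample_mode: str = "absolute") -> int:
    """Turn n into an ID count; in 'percent' mode n is a fraction of total."""
    if sample_mode == "percent":
        return int(total * n)  # assumption: fractional results are truncated
    if sample_mode == "absolute":
        return int(n)
    raise ValueError(f"unknown sample_mode: {sample_mode!r}")


def reservoir_sample(ids: Iterable[int], n: int, seed: Optional[int] = None) -> List[int]:
    """Pick n items uniformly at random from a stream in a single pass."""
    rng = random.Random(seed)
    reservoir: List[int] = []
    for index, tweet_id in enumerate(ids):
        if index < n:
            reservoir.append(tweet_id)  # fill the reservoir first
        else:
            slot = rng.randint(0, index)  # inclusive on both ends
            if slot < n:
                reservoir[slot] = tweet_id  # replace with probability n / (index + 1)
    return reservoir


sample = reservoir_sample(range(10_000), 300, seed=42)
```

Only the n sampled IDs are held in memory, so the cost depends on the sample size rather than on the size of the input file.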
# Create a sample containing 20% of our original file
percent_sample = ifm.get_random_sample(0.2, 'percent_sample.csv', sample_mode='percent')
percent_sample.id_count
50280
# Create a sample with 300 Tweet IDs from our first sample
absolute_sample = percent_sample.get_random_sample(300, 'absolute_sample.csv')
absolute_sample.id_count
300
Use get_page_samples() to break a large file into multiple smaller files. This is helpful when you need to process a large dataset but don't have the resources to do it all at once.
Parameters:
- page_count: The number of output files to create.
- output_file: The relative path of resulting files. Page numbers are added to the file names (pages.csv becomes pages_0.csv, pages_1.csv, etc.). Page numbers are zero-padded as needed.
Returns: A list of id_file_manager objects, one per resulting file.
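The naming rule for page files can be illustrated with a small helper. This is a sketch under assumptions rather than the toolkit's code: `page_file_names` and `page_sizes` are hypothetical names, the padding width is assumed to follow the largest page number, and placing the leftover IDs on the last page is only a guess about how uneven totals are handled.

```python
import os
from typing import List


def page_file_names(output_file: str, page_count: int) -> List[str]:
    """Build page file names such as pages_0.csv, zero-padded as needed."""
    root, ext = os.path.splitext(output_file)
    width = len(str(page_count - 1))  # digits needed for the largest page number
    return [f"{root}_{page:0{width}d}{ext}" for page in range(page_count)]


def page_sizes(total: int, page_count: int) -> List[int]:
    """Split total IDs into pages; the remainder goes to the last page (assumption)."""
    base, remainder = divmod(total, page_count)
    sizes = [base] * page_count
    sizes[-1] += remainder
    return sizes


names = page_file_names("pages.csv", 5)
padded_names = page_file_names("pages.csv", 12)
```

With 12 pages, this helper pads every page number to two digits, so the files sort correctly by name.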
# Split the large file into 5 subsets
pages = ifm.get_page_samples(5, 'pages.csv')
for page in pages:
    print(f'{page.file_name}: {page.id_count}')
pages_0.csv: 50280
pages_1.csv: 50280
pages_2.csv: 50280
pages_3.csv: 50280
pages_4.csv: 50281
This package also includes several methods for comparing the contents of two datasets and performing set operations such as intersection, difference, or union.
Use get_intersection() to create a file containing only Tweet IDs that are in both of the files.
Parameters:
- file_manager: A file manager object to compare to
- output_file: The relative path of the resulting file.
Returns: An id_file_manager object to interact with the resulting dataset.
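Once two ID lists are sorted, their intersection can be computed by walking both in step, which avoids loading either list into memory. The sketch below shows that general technique; it is not taken from the toolkit, `sorted_intersection` is a hypothetical name, and both inputs are assumed to be sorted in ascending order with no repeated IDs.

```python
from typing import Iterable, Iterator


def sorted_intersection(a: Iterable[int], b: Iterable[int]) -> Iterator[int]:
    """Yield the IDs present in both sorted, duplicate-free streams."""
    iter_a, iter_b = iter(a), iter(b)
    x, y = next(iter_a, None), next(iter_b, None)
    while x is not None and y is not None:
        if x == y:
            yield x  # present in both streams
            x, y = next(iter_a, None), next(iter_b, None)
        elif x < y:
            x = next(iter_a, None)  # advance the stream that is behind
        else:
            y = next(iter_b, None)


common = list(sorted_intersection([1, 3, 5, 7], [3, 4, 5, 8]))
```

Each stream is read exactly once, so the work grows linearly with the combined length of the two inputs.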
# Start by creating two sample files to compare (referred to below as a and b)
a = ifm.get_random_sample(0.3, 'a.csv', sample_mode='percent')
b = ifm.get_random_sample(0.3, 'b.csv', sample_mode='percent')
# Keep only the Tweet IDs that are stored in both a and b
intersection = a.get_intersection(b, 'intersection.csv')
intersection.id_count
22511
Use a.get_difference(b, 'difference.csv') to create a file called 'difference.csv' containing the Tweet IDs that are in a but not in b.
Parameters:
- file_manager: A file manager object to compare to
- output_file: The relative path of the resulting file.
Returns: An id_file_manager object with the resulting dataset.
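A difference over sorted inputs can also be streamed. The following is a sketch of the general idea, not the toolkit's implementation: `sorted_difference` is a hypothetical name, and both inputs are assumed to be sorted in ascending order.

```python
from typing import Iterable, Iterator


def sorted_difference(a: Iterable[int], b: Iterable[int]) -> Iterator[int]:
    """Yield the IDs that are in sorted stream a but not in sorted stream b."""
    iter_b = iter(b)
    y = next(iter_b, None)
    for x in a:
        # Skip every ID in b that is smaller than the current ID from a.
        while y is not None and y < x:
            y = next(iter_b, None)
        if y is None or y != x:
            yield x  # no matching ID in b


only_in_a = list(sorted_difference([1, 3, 5, 7], [3, 4, 5, 8]))
```

For duplicate-free inputs, the size of a equals the size of the intersection plus the size of the difference, which is a handy sanity check on real output files.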
# Get all of the IDs that are in a, but not b
difference = a.get_difference(b, 'difference.csv')
difference.id_count
52909
Use a.get_union(b, 'union.csv') to combine Tweet IDs from two datasets (referenced here as a and b) and save the resulting union in 'union.csv'. The output file will store the Tweet IDs that are in either of the two input files, excluding any duplicates.
Parameters:
- file_manager: An “id_file_manager” object that is linked to the dataset to compare to
- output_file: The relative path of the resulting file.
Returns: An id_file_manager object with the resulting dataset.
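A duplicate-free union of two sorted streams can be built by merging them and dropping repeated neighbours. The sketch below uses the standard library's `heapq.merge` for the merge step; it illustrates the technique only, `sorted_union` is a hypothetical name, and both inputs are assumed to be sorted in ascending order.

```python
import heapq
from typing import Iterable, Iterator


def sorted_union(a: Iterable[int], b: Iterable[int]) -> Iterator[int]:
    """Yield every ID found in either sorted stream, without duplicates."""
    previous = None
    for tweet_id in heapq.merge(a, b):  # lazily merges the sorted inputs
        if tweet_id != previous:
            yield tweet_id
            previous = tweet_id  # equal IDs arrive next to each other


combined = list(sorted_union([1, 3, 5, 7], [3, 4, 5, 8]))
```

Because the merged stream is sorted, comparing each ID with the one before it is enough to remove duplicates.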
# Merge all Tweet IDs from dataset a and b
union = a.get_union(b, 'union.csv')
union.id_count
This toolkit relies on a sort algorithm by @manangandhi7 for union, difference, and intersection operations. Learn more at his blog or the original GitHub repository.
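To give a sense of why sorting matters for these operations, here is a minimal sketch of an external merge sort, the general family of algorithms used to sort data that does not fit in memory. It is not the code in external_sort.py or the algorithm by @manangandhi7: `external_sort_sketch` and its `chunk_size` parameter are hypothetical, and the temporary-file handling is deliberately simple.

```python
import heapq
import os
import tempfile
from itertools import islice
from typing import Iterable, Iterator, List


def _write_sorted_chunk(chunk: List[int]) -> str:
    """Sort one in-memory chunk and write it to a temporary file."""
    chunk.sort()
    fd, path = tempfile.mkstemp(suffix=".txt")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        handle.writelines(f"{value}\n" for value in chunk)
    return path


def _read_ints(path: str) -> Iterator[int]:
    """Stream integers back from a chunk file, one line at a time."""
    with open(path, "r", encoding="utf-8") as handle:
        for line in handle:
            yield int(line)


def external_sort_sketch(ids: Iterable[int], chunk_size: int = 100_000) -> Iterator[int]:
    """Sort a stream using at most chunk_size IDs in memory during the chunking phase."""
    source = iter(ids)
    chunk_paths: List[str] = []
    try:
        while True:
            chunk = list(islice(source, chunk_size))
            if not chunk:
                break
            chunk_paths.append(_write_sorted_chunk(chunk))
        # Merge the sorted chunk files lazily into one sorted stream.
        yield from heapq.merge(*(_read_ints(path) for path in chunk_paths))
    finally:
        for path in chunk_paths:
            os.remove(path)


ordered = list(external_sort_sketch([5, 3, 9, 1, 7, 2], chunk_size=2))
```

Once both datasets are sorted this way, intersection, difference, and union each need only a single sequential pass over the two files.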