If you heard that a dataset of 20 million malware samples was now available online, how would that make you feel? Maybe worried? The truth is that such a dataset is now accessible to the public, but as part of a mission to achieve “open knowledge and understanding cyber threats.”
SoReL-20M, A Dataset of 20 Million Malware Samples
Cybersecurity companies Sophos and ReversingLabs have just released SoReL-20M, a dataset of 20 million Windows Portable Executable (PE) files, including 10 million disarmed malware samples. The purpose of the effort is to improve machine learning models for malware detection.
Thanks to this project, “defenders will be able to anticipate what attackers are doing and be better prepared for their next move,” Sophos said.
“This dataset is the first production scale malware research dataset available to the general public, with a curated and labeled set of samples and security-relevant metadata, which we anticipate will further accelerate research for malware detection via machine learning,” the company added.
How is the dataset content arranged? The samples are divided into several sections, including training, validation, and testing splits, on the basis of first-seen time. According to the researchers, each sample comes with the following details:
1. Features extracted as per the EMBER 2.0 dataset
2. Labels obtained by aggregating both external and Sophos internal sources into a single, high-quality label
3. Sample-per-sample detection metadata, including total number of positive results on ReversingLabs engines, and tags describing important attributes of the samples obtained as per our paper “Automatic Malware Description via Attribute Tagging and Similarity Embedding” https://arxiv.org/abs/1905.06262
4. Complete dumps of file metadata obtained from the pefile library using the dump_dict() method
5. For malware samples, we provide complete binaries, with the OptionalHeader.Subsystem flag and the FileHeader.Machine header value both set to 0 to prevent accidental execution.
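The disarming step in item 5 can be checked directly from a file's raw bytes. The following is a minimal sketch, not the project's own tooling: it uses only Python's standard library, the function names read_disarm_fields and looks_disarmed are hypothetical, and the offsets assume the standard PE layout (the e_lfanew pointer at 0x3C, a 20-byte COFF file header, and the Subsystem field 68 bytes into the optional header for both PE32 and PE32+).

```python
import struct


def read_disarm_fields(data):
    """Return (FileHeader.Machine, OptionalHeader.Subsystem) from raw PE bytes."""
    if data[:2] != b"MZ":
        raise ValueError("missing MZ signature")
    # e_lfanew at offset 0x3C points to the "PE\0\0" signature.
    (e_lfanew,) = struct.unpack_from("<I", data, 0x3C)
    if data[e_lfanew:e_lfanew + 4] != b"PE\x00\x00":
        raise ValueError("missing PE signature")
    # The COFF file header follows the signature; Machine is its first field.
    file_header = e_lfanew + 4
    (machine,) = struct.unpack_from("<H", data, file_header)
    # The optional header starts after the 20-byte file header;
    # Subsystem sits 68 bytes in for both PE32 and PE32+.
    optional_header = file_header + 20
    (subsystem,) = struct.unpack_from("<H", data, optional_header + 68)
    return machine, subsystem


def looks_disarmed(data):
    """True if both fields are zeroed, as described for the SoReL-20M binaries."""
    machine, subsystem = read_disarm_fields(data)
    return machine == 0 and subsystem == 0
```

With both values set to 0, the Windows loader has no valid target architecture or subsystem to work with, which is why the file does not run if someone opens it by accident. The same two fields can also be read through the pefile library mentioned in item 4.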
The researchers also released a set of pre-trained PyTorch models and LightGBM models alongside SoReL-20M. Scripts to load and iterate over the data and test the models are also available.
In truth, this isn’t the first malware dataset gathered for research purposes. EMBER, short for Endgame Malware BEnchmark for Research, was released in 2018 as an open-source benchmark dataset for training malware classifiers.
However, EMBER was too small to allow more than limited experimentation. This is where SoReL-20M comes in, covering 20 million PE samples: 10 million disarmed malware samples, plus extracted features and metadata for a further 10 million benign samples.
Concerns of malicious attempts
Since the malware in the dataset is disarmed, it can’t be executed as is. At the very least, it would be challenging to reconstitute the samples and get them to run, a process that would require specific, sophisticated skills and knowledge. However, it is not entirely impossible that a skilled threat actor could devise techniques to use the samples.
In reality, though, attackers can leverage plenty of other resources to get malware information in less complicated ways. “In other words, this disarmed sample set will have much more value to researchers looking to improve and develop their independent defenses than it will have to attackers,” Sophos added.