
SoReL-20M Dataset of 20 Million Malware Samples Is Now Online

If you heard that a dataset of 20 million malware samples was now available online, how would that make you feel? Maybe worried? The truth is that such a dataset is now accessible to the public, but as part of a mission to promote “open knowledge and understanding cyber threats.”

SoReL-20M, A Dataset of 20 Million Samples

Cybersecurity companies Sophos and ReversingLabs have released SoReL-20M, a dataset of 20 million Windows Portable Executable (PE) files, including 10 million disarmed malware samples. The purpose of the effort is to improve machine learning models for malware detection.

Thanks to this project, “defenders will be able to anticipate what attackers are doing and be better prepared for their next move,” Sophos said.

“This dataset is the first production scale malware research dataset available to the general public, with a curated and labeled set of samples and security-relevant metadata, which we anticipate will further accelerate research for malware detection via machine learning,” the company added.

How is the dataset content arranged? The samples are divided into training, validation, and testing splits on the basis of first-seen time. Each sample comes with the following details:

1. Features extracted as per the EMBER 2.0 dataset
2. Labels obtained by aggregating both external and Sophos internal sources into a single, high-quality label
3. Per-sample detection metadata, including the total number of positive results on ReversingLabs engines, and tags describing important attributes of the samples, generated as described in the researchers' paper “Automatic Malware Description via Attribute Tagging and Similarity Embedding” https://arxiv.org/abs/1905.06262
4. Complete dumps of file metadata obtained from the pefile library using the dump_dict() method
5. For malware samples, complete binaries are provided, with the OptionalHeader.Subsystem flag and the FileHeader.Machine header value both set to 0 to prevent accidental execution.
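
The time-based splitting mentioned above the list can be sketched as follows. This is only an illustration of the idea, not the dataset's own tooling: the function name, the cutoff dates, and the sample records are all hypothetical, and the article does not state the actual boundaries used for SoReL-20M.

```python
from datetime import datetime
from typing import Dict, Iterable, List, Tuple


def split_by_first_seen(
    samples: Iterable[Tuple[str, datetime]],
    train_end: datetime,
    validation_end: datetime,
) -> Dict[str, List[str]]:
    """Assign each (sample_id, first_seen) pair to a split by its timestamp.

    Samples first seen before train_end go to training, those before
    validation_end go to validation, and everything later goes to testing.
    """
    splits: Dict[str, List[str]] = {"train": [], "validation": [], "test": []}
    for sample_id, first_seen in samples:
        if first_seen < train_end:
            splits["train"].append(sample_id)
        elif first_seen < validation_end:
            splits["validation"].append(sample_id)
        else:
            splits["test"].append(sample_id)
    return splits


# Hypothetical records and cutoff dates, made up for this example.
example_samples = [
    ("sample-a", datetime(2018, 3, 1)),
    ("sample-b", datetime(2018, 11, 20)),
    ("sample-c", datetime(2019, 2, 5)),
]
example_splits = split_by_first_seen(
    example_samples,
    train_end=datetime(2018, 10, 1),
    validation_end=datetime(2019, 1, 1),
)
```

Splitting by first-seen time rather than at random means a model is always evaluated on files that appeared after everything it was trained on, which is closer to how a detector is used in practice.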

The researchers also released a set of pre-trained PyTorch and LightGBM models alongside SoReL-20M. Scripts to load and iterate over the data and test the models are also available.

In truth, this isn’t the first dataset gathered for malware research purposes. EMBER, short for Endgame Malware BEnchmark for Research, was released in 2018 as an open benchmark dataset for training malware classifiers.

However, the project’s size was not sufficient, and only limited experimentation was possible with it. This is where SoReL-20M comes in, with its 20 million PE samples, 10 million of which are disarmed malware. For the remaining 10 million benign samples, extracted features and metadata are available.

Concerns of malicious attempts

Since the malware in the dataset is disarmed, it can’t be executed as-is. At the very least, it would be challenging to reconstitute the samples and get them to run, as this would require specific, sophisticated skills and knowledge. However, it is not entirely impossible that a skilled threat actor could devise techniques to use the samples.
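
To make the disarming concrete, here is a minimal sketch of how one might check that a file's FileHeader.Machine and OptionalHeader.Subsystem values are both 0, as the dataset description states. It reads the two fields directly with Python's standard library, using offsets from the public PE format. The function names and the synthetic header builder are hypothetical, and the check assumes a well-formed header; it is not part of the SoReL-20M tooling.

```python
import struct
from typing import Tuple

PE_POINTER_OFFSET = 0x3C   # e_lfanew: file offset of the "PE\0\0" signature
FILE_HEADER_SIZE = 20      # size of the COFF file header
SUBSYSTEM_OFFSET = 68      # Subsystem offset inside the optional header (PE32 and PE32+)


def read_machine_and_subsystem(data: bytes) -> Tuple[int, int]:
    """Return (FileHeader.Machine, OptionalHeader.Subsystem) from raw PE bytes."""
    if data[:2] != b"MZ":
        raise ValueError("missing MZ signature")
    (pe_offset,) = struct.unpack_from("<I", data, PE_POINTER_OFFSET)
    if data[pe_offset:pe_offset + 4] != b"PE\0\0":
        raise ValueError("missing PE signature")
    file_header = pe_offset + 4
    (machine,) = struct.unpack_from("<H", data, file_header)
    optional_header = file_header + FILE_HEADER_SIZE
    (subsystem,) = struct.unpack_from("<H", data, optional_header + SUBSYSTEM_OFFSET)
    return machine, subsystem


def looks_disarmed(data: bytes) -> bool:
    """True if both header values are 0, matching the dataset's description."""
    machine, subsystem = read_machine_and_subsystem(data)
    return machine == 0 and subsystem == 0


def build_synthetic_header(machine: int, subsystem: int) -> bytes:
    """Hypothetical helper: build just enough header bytes to demonstrate the check."""
    pe_offset = 0x40
    size = pe_offset + 4 + FILE_HEADER_SIZE + SUBSYSTEM_OFFSET + 2
    data = bytearray(size)
    data[0:2] = b"MZ"
    struct.pack_into("<I", data, PE_POINTER_OFFSET, pe_offset)
    data[pe_offset:pe_offset + 4] = b"PE\0\0"
    struct.pack_into("<H", data, pe_offset + 4, machine)
    struct.pack_into(
        "<H", data, pe_offset + 4 + FILE_HEADER_SIZE + SUBSYSTEM_OFFSET, subsystem
    )
    return bytes(data)


# 0x8664 (x64) and 3 (Windows console) are ordinary values for a runnable file.
normal_header = build_synthetic_header(machine=0x8664, subsystem=3)
zeroed_header = build_synthetic_header(machine=0, subsystem=0)
```

With both fields zeroed, the Windows loader has no valid target architecture or subsystem to work with, which is why the samples do not run by accident.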

In reality, though, attackers can leverage plenty of other resources to get malware information in less complicated ways. “In other words, this disarmed sample set will have much more value to researchers looking to improve and develop their independent defenses than it will have to attackers,” Sophos added.

Milena Dimitrova

An inspired writer and content manager who has been with SensorsTechForum since the project started. A professional with 10+ years of experience in creating engaging content. Focused on user privacy and malware development, she strongly believes in a world where cybersecurity plays a central role. If common sense makes no sense, she will be there to take notes. Those notes may later turn into articles! Follow Milena @Milenyim
