It’s early 2019, and we’ve already witnessed not one but two record-breaking data dumps. The first, known as Collection #1, consisted of approximately 773 million unique email addresses and 21 million unique passwords, and was reported by Troy Hunt in January.
2,692,818,238 Records in Collection #2-5
The second data collection is even larger than the first one, consisting of 2,692,818,238 records spread across 12,000 files. The data dump was reported by German security website Heise and is 845 GB in size. It’s referred to as Collection #2-5.
Collection #2-5 apparently consists mainly of data from old leaks, but that doesn’t mean it can’t be exploited again. Moreover, the files are now hosted on the Mega file-sharing service, and security researchers say that the data has been downloaded more than 1,000 times. There is even a service, Info Leak Checker, that lets people check whether their data is included in Collection #2-5.
The data dump is quite impressive in size, but most of the stolen data appears to originate from previous data thefts, such as the breaches of Yahoo, LinkedIn, and Dropbox. Security researchers at Wired examined a sample of the data and confirmed that the credentials are indeed valid, but mostly represent passwords from old data leaks.
Researchers at the Hasso Plattner Institute, who created the Info Leak Checker, estimated that 750 million of the credentials weren’t previously included in their database of leaked usernames and passwords. They also found that 611 million of the credentials in Collection #2-5 weren’t included in the Collection #1 data.
Notably, Hasso Plattner Institute researcher David Jaeger believes that some parts of the collection may originate from the automated hacking of smaller, obscure websites to steal their password databases, meaning that a significant chunk of the passwords is being leaked for the first time.
As for Collection #1, one of Hunt’s contacts pointed him to a popular hacking forum where the data was being “socialized”. An image associated with the data showed a root folder named “Collection #1”, so the researcher named the breach after it. The data came from multiple sources and is perhaps “a collection of 2000+ dehashed databases and combos stored by topic”, as described in the forum post where the breach was “advertised”.