In November last year Alexa admitted in a tweet that they had stopped releasing their CSV file with the one million most popular domains.
Members of the Internet measurement and infosec research communities were outraged, surprised and disappointed, since this domain list had become the de facto tool for evaluating the popularity of a domain. In response, Cisco Umbrella (previously OpenDNS) released a free top 1 million list of their own in December the same year. However, by then Alexa had already announced that their “top-1m.csv” file was back up again.
The Alexa list was unavailable for just about a week but this was enough for many researchers, developers and security professionals to make the move to alternative lists, such as the one from Umbrella. This move was perhaps fueled by Alexa saying that the “file is back for now”, which hints that they might decide to remove it again later on.
We’ve been leveraging the Alexa list for quite some time in NetworkMiner and CapLoader in order to do DNS whitelisting, for example when doing threat hunting with Rinse-Repeat. But we haven’t made the move from Alexa to Umbrella, at least not yet.
Malware Domains in the Top 1 Million Lists
Threat hunting expert Veronica Valeros recently pointed out that there are a great many malicious domains in the Alexa top one million list.
I often recommend that analysts use the Alexa list as a whitelist to remove “normal” web surfing from their PCAP dataset when doing threat hunting or network forensics. And, as previously mentioned, both NetworkMiner and CapLoader make use of the Alexa list in order to simplify domain whitelisting. I therefore decided to evaluate just how many malicious domains there are in the Alexa and Umbrella lists.
hpHosts EMD (by Malwarebytes)
- URL: https://hosts-file.net/emd.txt
- Total malicious domains: 154144
| | Alexa | Umbrella |
|---|---|---|
| Whitelisted malicious domains | 1365 | 1458 |
| Percent of malicious domains whitelisted | 0.89% | 0.95% |
Malware Domain Blocklist
- URL: http://www.malwaredomains.com/?page_id=66
- Total malicious domains: 18270
| | Alexa | Umbrella |
|---|---|---|
| Whitelisted malicious domains | 84 | 63 |
| Percent of malicious domains whitelisted | 0.46% | 0.34% |
CyberCrime Tracker
- URL: http://cybercrime-tracker.net/
- Total malicious domains: 7861
| | Alexa | Umbrella |
|---|---|---|
| Whitelisted malicious domains | 15 | 10 |
| Percent of malicious domains whitelisted | 0.19% | 0.13% |
The results presented above indicate that Alexa and Umbrella both contain roughly the same number of malicious domains. The percentages also reveal that using Alexa or Umbrella as a whitelist, i.e. ignoring all traffic to the top one million domains, might result in ignoring up to 1% of the traffic going to malicious domains. I guess this is an acceptable number of false negatives, since techniques like Rinse-Repeat Intrusion Detection aren’t intended to replace traditional intrusion detection systems. Instead, they are meant to be used as a complement in order to hunt down the intrusions that your IDS failed to detect. Working on a reduced dataset containing 99% of the malicious traffic is an acceptable price to pay for having removed all the “normal” traffic going to the one million most popular domains.
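For readers who want to run a similar comparison on their own lists, here is a minimal Python sketch of the overlap calculation. It is not the exact tooling used for the numbers above. It assumes that the top list uses the “rank,domain” layout, that the blocklist has one domain per line (optionally in hosts-file style with a leading IP address), and that a domain counts as whitelisted only on an exact string match. The function names are made up for this example.

```python
import csv
import io


def load_top_list(csv_text):
    """Parse 'rank,domain' rows into a set of lowercase domains."""
    reader = csv.reader(io.StringIO(csv_text))
    return {row[1].strip().lower() for row in reader if len(row) >= 2}


def load_blocklist(text):
    """Parse a blocklist with one domain per line; skip blanks and '#' comments."""
    domains = set()
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Assumption: hosts-file style lines look like "127.0.0.1 bad.example",
        # so the domain is always the last whitespace-separated field.
        domains.add(line.split()[-1].lower())
    return domains


def whitelisted_malicious(top_list, blocklist):
    """Return the blocklisted domains found in the top list, and their share in percent."""
    overlap = top_list & blocklist
    percent = 100.0 * len(overlap) / len(blocklist) if blocklist else 0.0
    return overlap, percent


# Toy data only; replace with the contents of the real files.
top = load_top_list("1,google.com\n2,bad.example\n3,example.org\n")
bad = load_blocklist("# comment\n127.0.0.1 bad.example\nevil.test\n")
overlap, percent = whitelisted_malicious(top, bad)
print(sorted(overlap), f"{percent:.2f}%")
```

Using sets keeps the comparison fast even with a million entries, since the intersection is computed with hash lookups rather than nested loops.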
One significant difference between the two lists is that the Umbrella list contains subdomains (such as www.google.com, safebrowsing.google.com and accounts.google.com) while the Alexa list only contains main domains (like “google.com”). In fact, the Umbrella list contains over 1800 subdomains for google.com alone! This means that the Umbrella list in practice contains fewer main domains compared to the one million main domains in the Alexa list. We estimate that roughly half of the domains in the Umbrella list are redundant if you are only interested in main domains. However, having subdomains can be an asset if you need to match the full domain name rather than just the main domain name.
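The difference between matching full domain names and matching main domains can be illustrated with a short Python sketch. The helper names and the `match_parents` parameter are hypothetical, and the sets below are toy data rather than excerpts from the real lists.

```python
def count_subdomains(domains, main_domain):
    """Count entries that are subdomains of main_domain (the main domain itself is excluded)."""
    suffix = "." + main_domain.lower()
    return sum(1 for d in domains if d.lower().endswith(suffix))


def is_whitelisted(hostname, whitelist, match_parents=True):
    """Check a hostname against a whitelist set of lowercase domains.

    With match_parents=True a hostname also matches if any of its parent
    domains is listed, which suits a list of main domains only.
    With match_parents=False only the full domain name is compared.
    """
    labels = hostname.lower().rstrip(".").split(".")
    if not match_parents:
        return ".".join(labels) in whitelist
    # Try "a.b.example.com", "b.example.com", "example.com"; never the bare TLD.
    for i in range(max(len(labels) - 1, 1)):
        if ".".join(labels[i:]) in whitelist:
            return True
    return False


main_domains_only = {"google.com"}
with_subdomains = {"google.com", "www.google.com", "accounts.google.com"}

print(count_subdomains(with_subdomains, "google.com"))
print(is_whitelisted("mail.google.com", main_domains_only))
print(is_whitelisted("mail.google.com", with_subdomains, match_parents=False))
```

Note that parent matching is only safe when the list holds main domains. If a list happened to include a shared suffix such as “co.uk”, parent matching would treat every domain under that suffix as whitelisted.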
Data Sources used to Compile the Lists
Image: The Alexa Extension for Firefox
The two lists are compiled in different ways, which can be important to be aware of depending on what type of traffic you are analyzing. Alexa primarily receives web browsing data from users who have installed one of Alexa’s many browser extensions (such as the Alexa browser toolbar shown above). They also gather additional data from users visiting web sites that include Alexa’s tracker script.
Cisco Umbrella, on the other hand, compile their data from “the actual world-wide usage of domains by Umbrella global network users”. We’re guessing this means building statistics from DNS queries sent through the OpenDNS service that was recently acquired by Cisco.
This means that the Alexa list might be better suited if you are only analyzing HTTP traffic from web browsers, while the Umbrella list probably is the best choice if you are analyzing non-HTTP traffic or HTTP traffic that isn’t generated by browsers (for example HTTP API communication).
As noted by Greg Ferro, the Umbrella list contains test domains like “www.example.com”. These domains are not present in the Alexa list.
We have also noticed that the Umbrella list contains several domains with non-authorized gTLDs, such as “.home”, “.mail” and “.corp”. The Alexa list, on the other hand, only seems to contain real domain names.
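If such entries are unwanted in a whitelist, they can be filtered out before use. The Python sketch below is only an illustration: the set of TLDs holds just the three examples mentioned above, the reserved example domains are the ones listed in RFC 2606, and the function names are made up. A real filter would need a complete list of valid TLDs.

```python
# Assumption: illustrative sets only, not a complete list of invalid names.
NON_DELEGATED_TLDS = {"home", "mail", "corp"}
RESERVED_EXAMPLE_DOMAINS = {"example.com", "example.net", "example.org"}  # RFC 2606


def is_suspect_entry(domain):
    """Return True for test domains and names ending in a TLD from the set above."""
    name = domain.lower().rstrip(".")
    labels = name.split(".")
    if labels[-1] in NON_DELEGATED_TLDS:
        return True
    return any(
        name == reserved or name.endswith("." + reserved)
        for reserved in RESERVED_EXAMPLE_DOMAINS
    )


def clean_list(domains):
    """Keep only the entries that are not flagged by is_suspect_entry."""
    return [d for d in domains if not is_suspect_entry(d)]


sample = ["google.com", "www.example.com", "printer.home", "homedepot.com"]
print(clean_list(sample))
```

The check compares whole labels, so a real domain like “homedepot.com” is kept even though it starts with “home”.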
Resources and Raw Data
Both the Alexa and Cisco Umbrella top one million lists are CSV files named “top-1m.csv”. The CSV files can be downloaded from these URLs:
- Alexa: http://s3.amazonaws.com/alexa-static/top-1m.csv.zip
- Umbrella: http://s3-us-west-1.amazonaws.com/umbrella-static/top-1m.csv.zip
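Once downloaded, the zipped CSV can be read directly in Python without unpacking it to disk. This is a minimal sketch which assumes that the archive contains a single “top-1m.csv” member with “rank,domain” rows and no header line. The function name `read_top_list` is made up for this example, and the demo builds a tiny archive in memory instead of fetching the real file.

```python
import csv
import io
import zipfile


def read_top_list(zip_bytes, member="top-1m.csv"):
    """Return a list of (rank, domain) tuples from a zipped top list."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        with archive.open(member) as raw:
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            # Skip any row that does not have exactly two columns.
            return [(int(row[0]), row[1]) for row in csv.reader(text) if len(row) == 2]


# Demo with a tiny in-memory archive standing in for the real download.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("top-1m.csv", "1,google.com\n2,youtube.com\n")

for rank, domain in read_top_list(buffer.getvalue()):
    print(rank, domain)
```

To use a real download, pass the raw bytes of the zip file to `read_top_list`, for example the result of reading the file in binary mode.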
The analysis results presented in this blog post are based on top-1m.csv files downloaded from Alexa and Umbrella on March 31, 2017. The malware domain lists were also downloaded from the three respective sources on that same day.
We have decided to share the “false negatives” (malware domains that were present in the Alexa and Umbrella lists) for transparency. You can download the lists with all false negatives from here:
Hands-on Practice and Training
If you want to learn more about how a list of common domains can be used to hunt down intrusions in your network, then please register for one of our network forensics trainings. The next training will be a pre-conference training at 44CON in London.
Posted by Erik Hjelmvik on Monday, 03 April 2017 14:47:00 (UTC/GMT)