Using Amazon Athena to check if a password has been pwned
Author
Hilton D
Date Published
Ever wonder whether a password you use has been used before or, more importantly, whether it is widely known to hackers?
Troy Hunt runs an excellent site called “Have I Been Pwned” that lets you check whether your account details have been compromised in a data breach. This works well if you want to check a dozen or so accounts, but what if you want to check a few thousand, or a few million, passwords?
Amazon Athena is a service that lets you upload a data set to Amazon S3 and then query that data using SQL. The best bit is that you only pay for the data scanned by each query, which makes for cost-effective big data analysis!
Objective
Check a list of 4000 passwords (or hashes of passwords) against the “Have I Been Pwned” password list using Amazon Athena.
Method overview
Download the password hash list
Some simple data wrangling is needed to transform the list into a format that Athena can query. Gzipping the list will also save on storage costs, perhaps at the expense of increased query duration.
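The wrangling step can be sketched in Python roughly as follows. This is a minimal illustration, not the exact procedure used here: it assumes the list has already been extracted to a plain-text file with one `HASH:COUNT` entry per line, and the function names `gzip_hash_list` and `looks_like_hash_line` are made up for this example. Athena can generally read gzip-compressed text files directly when they carry a `.gz` extension.

```python
import gzip
import shutil
from pathlib import Path


def gzip_hash_list(source: Path, target: Path) -> Path:
    """Compress a plain-text HASH:COUNT file into a gzip file, streaming in chunks."""
    # copyfileobj streams the data, so a multi-gigabyte list never sits in memory.
    with open(source, "rb") as src, gzip.open(target, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return target


def looks_like_hash_line(line: str) -> bool:
    """Return True for a 40-character hex SHA-1 hash, a colon, then an integer count."""
    hash_part, sep, count_part = line.strip().partition(":")
    return (
        sep == ":"
        and len(hash_part) == 40
        and all(c in "0123456789ABCDEFabcdef" for c in hash_part)
        and count_part.isdigit()
    )
```

Spot-checking a few lines with `looks_like_hash_line` before compressing is a cheap way to confirm that the file matches the colon-delimited layout the table definition expects.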
Upload the list to an S3 bucket
Create the Athena HIBP table
CREATE EXTERNAL TABLE `pwndpasswords`(
  `hash` string,
  `numoccur` int)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ':'
  MAP KEYS TERMINATED BY 'undefined'
WITH SERDEPROPERTIES (
  'collection.delim'='undefined')
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  's3://qwerty.slicehost.com/'
TBLPROPERTIES (
  'has_encrypted_data'='false',
  'transient_lastDdlTime'='1528802289')
Depending on how many passwords you want to check, you can either query the table directly or load the passwords you want to check into a second table (called leaked in the query below) and use a SQL INNER JOIN to obtain the matches:
SELECT my.hash,
       my.password
FROM leaked AS my
INNER JOIN pwndpasswords
ON my.hash = pwndpasswords.hash
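Both options need SHA-1 hashes of your own passwords. The sketch below shows one way to prepare them in Python, assuming the HIBP list stores upper-case hexadecimal SHA-1 digests, so the join only matches if your hashes use the same case. The helper names (`sha1_upper`, `write_leaked_file`, `direct_lookup_sql`) and the tab-delimited layout of the file behind the `leaked` table are assumptions for illustration, not part of the original setup.

```python
import hashlib
from pathlib import Path
from typing import Iterable, List


def sha1_upper(password: str) -> str:
    """Return the upper-case hex SHA-1 digest of a UTF-8 encoded password."""
    return hashlib.sha1(password.encode("utf-8")).hexdigest().upper()


def write_leaked_file(passwords: Iterable[str], target: Path) -> int:
    """Write one 'HASH<TAB>password' line per password; return the number of lines."""
    written = 0
    with open(target, "w", encoding="utf-8", newline="\n") as out:
        for password in passwords:
            # A tab or newline inside a password would corrupt the delimited rows.
            if any(ch in password for ch in "\t\r\n"):
                raise ValueError("password contains a tab or newline")
            out.write(f"{sha1_upper(password)}\t{password}\n")
            written += 1
    return written


def direct_lookup_sql(hashes: List[str]) -> str:
    """Build a query that looks up a handful of hashes without a second table."""
    if not hashes:
        raise ValueError("at least one hash is required")
    for h in hashes:
        # Accept only 40 upper-case hex characters so nothing else reaches the SQL text.
        if len(h) != 40 or any(c not in "0123456789ABCDEF" for c in h):
            raise ValueError(f"not an upper-case SHA-1 hex digest: {h!r}")
    quoted = ", ".join(f"'{h}'" for h in hashes)
    return f"SELECT hash, numoccur FROM pwndpasswords WHERE hash IN ({quoted})"
```

For a short list, `direct_lookup_sql([sha1_upper("password")])` gives a query you can paste into the Athena console. For thousands of passwords, the file written by `write_leaked_file` would be uploaded to S3 and exposed as the `leaked` table with a tab as the field delimiter. If you would rather not store plaintext passwords in S3, an account identifier could take the place of the password column.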
Result
After a few minutes you will get a list (or .csv file) of hashes that appear in both tables, which you can use to take remedial action such as changing passwords or, heaven forbid, notifying customers that they should change their passwords.
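Acting on the result file can be as simple as the following sketch. It assumes you have downloaded the .csv locally and that it has a header row with `hash` and `password` columns, matching the SELECT list of the join. The function name `load_matches` is hypothetical.

```python
import csv
from pathlib import Path
from typing import Dict


def load_matches(result_csv: Path) -> Dict[str, str]:
    """Map each matched hash to its password, read from a CSV with a header row."""
    with open(result_csv, newline="", encoding="utf-8") as fh:
        reader = csv.DictReader(fh)
        return {row["hash"]: row["password"] for row in reader}
```

The returned dictionary lists every password that needs changing. Its length divided by the number of passwords you checked gives a rough measure of how exposed the list is.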
While this method is probably slower than others (I surmise that GPU-accelerated Hashcat would be quicker), it is without question simple and efficient, in the sense that there is no infrastructure to maintain and no drivers to update.
Another benefit is that you can run multiple queries concurrently (I think Athena limits an account to twenty simultaneous queries by default), which means that multiple analysts can run investigations at the same time!