Downloading your email attachments in bulk using Python

I recently had to download the attachments from a few thousand emails that were sent to me by an automated process at my current client.  These emails contained log reports and were sent to me every hour in preparation for the project.  Microsoft Outlook does not provide you with an elegant solution to download the attachments from multiple emails simultaneously, so I only had two options to get these attachments.  One option would involve a lot of work by downloading these attachments manually with Outlook, while the other would require me to create some code to have the necessary downloading capabilities.  I chose the automated approach, as it would be much quicker than the manual one, it would produce something that my client and I could reuse later, and, as it turns out, it was much more fun!

To me, the most obvious approach was to use Python, a high-level, general-purpose programming language.  It has a large collection of modules, which are files of reusable functions, definitions, and statements for a specific purpose, such as mathematical operations, machine learning, or connecting to email servers.  As such, Python is ideal for automating your boring stuff (a nod to the practical handbook Automate the Boring Stuff with Python).

A brief search on the internet indicated that two modules from the Python standard library were of particular interest.  These are email, to work with email messages (duh!), and imaplib, a client library for the Internet Message Access Protocol version 4 (IMAP4).  IMAP is the protocol a client uses to read and manage the messages stored on an email server (sending mail is handled by a different protocol, SMTP), and it is supported by most popular email services, such as Gmail, Outlook (Office 365) and Yahoo.
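To give a feel for the email module, here is a minimal sketch that parses a raw message and reads two of its headers. The message itself, including the example.com addresses, is made up for illustration; in practice the raw bytes would come from the mail server.

```python
import email
from email import policy

# A hand-written raw message; the addresses and subject are placeholders.
raw_message = (
    b"From: reports@example.com\r\n"
    b"To: me@example.com\r\n"
    b"Subject: Hourly log report\r\n"
    b"\r\n"
    b"See the attached log.\r\n"
)

# policy.default returns a modern EmailMessage object with decoded headers.
mail_ = email.message_from_bytes(raw_message, policy=policy.default)
sender = mail_["From"]
subject = mail_["Subject"]
print(sender, "|", subject)
```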

I identified that I would need three building blocks to enable me to download my attachments.  These would form the main functions in the upcoming code.

  1. A method to connect to a specified email account. For this, I need to have the account name, its password and to know its host.

  2. A method to search for specific emails. Similar to working with Microsoft Outlook, it needs to be able to search on the subject, message sender or switch to a different subfolder of your inbox.

  3. A method to determine whether an email contains attachments and, if it does, to download all of them to my local computer.

The preparatory work was now finished, and it was time for me to do some actual coding!  I started with some sandboxing and testing in a Jupyter notebook, which is an interactive Python environment that lets you add Markdown text to structure and document your code.  Once I was happy with the basic functionality that I had developed and tested, I turned each step into a separate function and collected them in one Python file.  I can then call this functionality from somewhere else, be it Jupyter notebooks or other Python code, while maintaining this personal module separately from my other code.  The centralized code is available on GitHub and it has plenty of room left to add more exciting functionality.

Let me briefly walk through the code to show which functions relate to each of the three main building blocks I defined earlier.  I hope this will convince anyone who is not a Python expert of the simplicity of this language!

  1. The connection is created with the connect_to_inbox function, which uses imaplib functionality to make the connection to the server and log in with the provided credentials. The function returns this connection object, which can be reused later.
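A minimal sketch of what such a connection function could look like is given below. Only the name connect_to_inbox comes from the text above; the parameter names, the use of an SSL connection, and the default port 993 are my assumptions and may differ from the code on GitHub.

```python
import imaplib


def connect_to_inbox(host, username, password, port=993):
    """Open an encrypted IMAP connection and log in.

    Assumption: the server accepts IMAP over SSL on port 993, which is
    the conventional port but is not stated in the original post.
    """
    connection = imaplib.IMAP4_SSL(host, port)
    connection.login(username, password)
    return connection
```

Calling it would look like connect_to_inbox("imap.example.com", "me@example.com", "secret"), where all three values are placeholders. Some providers require an app-specific password rather than the regular account password.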

  2. Two functions are created to retrieve the ids of the emails. The first, get_emailids_inbox, returns the ids of all emails in a specified inbox, where an inbox is a folder in your email account.  The other, get_emailids_query, uses an explicit query on the sender or subject.  Both functions use the search method on the connection we created earlier and return the ids of the matching emails.
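The two search functions could be sketched as follows. The function names are taken from the text; the signatures, the shared _search helper, and the choice to open the folder read-only are assumptions made for this illustration.

```python
def _search(connection, inbox, *criteria):
    """Hypothetical helper: select a folder and run an IMAP search in it."""
    # readonly=True avoids marking messages as read while we browse them.
    status, _ = connection.select(inbox, readonly=True)
    if status != "OK":
        raise RuntimeError(f"Could not open folder {inbox!r}")
    status, data = connection.search(None, *criteria)
    if status != "OK":
        raise RuntimeError("Search failed")
    # The server answers with one space-separated byte string of ids.
    return data[0].split()


def get_emailids_inbox(connection, inbox="INBOX"):
    """Return the ids of all emails in the given folder."""
    return _search(connection, inbox, "ALL")


def get_emailids_query(connection, inbox="INBOX", sender=None, subject=None):
    """Return the ids of emails matching a sender and/or subject."""
    criteria = []
    if sender:
        criteria += ["FROM", f'"{sender}"']
    if subject:
        criteria += ["SUBJECT", f'"{subject}"']
    return _search(connection, inbox, *(criteria or ["ALL"]))
```

Note that the ids returned by search are message sequence numbers within the selected folder, so they are only meaningful while that folder stays selected on the same connection.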

  3. The last step enables the user to get the content and attachments of the email. The function get_emailbody fetches the email message and converts its raw byte payload into a message object that is easy to work with.  This mail_ object can then be inspected to review information on the date, sender, or attachments.  All attachments from the mail_ object are retrieved with the download_attachments_email function and stored in the outputdir_ directory under their original filenames.
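A possible shape for these two functions is sketched below. The names get_emailbody, download_attachments_email, mail_ and outputdir_ come from the text; everything else, including the use of policy.default and the decision to skip attachments without a filename, is an assumption on my part.

```python
import email
import os
from email import policy


def get_emailbody(connection, email_id):
    """Fetch one message and parse its raw bytes into an EmailMessage."""
    status, data = connection.fetch(email_id, "(RFC822)")
    if status != "OK":
        raise RuntimeError(f"Could not fetch email {email_id!r}")
    raw_bytes = data[0][1]  # data[0] is a (header, raw message) tuple
    return email.message_from_bytes(raw_bytes, policy=policy.default)


def download_attachments_email(mail_, outputdir_):
    """Save every attachment of mail_ in outputdir_; return the paths."""
    os.makedirs(outputdir_, exist_ok=True)
    saved = []
    for part in mail_.iter_attachments():
        filename = part.get_filename()
        payload = part.get_payload(decode=True)
        if not filename or payload is None:
            continue  # nothing sensible to write for this part
        # basename() guards against filenames that contain a path.
        path = os.path.join(outputdir_, os.path.basename(filename))
        with open(path, "wb") as handle:
            handle.write(payload)
        saved.append(path)
    return saved
```

In this sketch, two attachments with the same filename would overwrite each other, so a real implementation might prefix the filename with the email id or date.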

After the email handling functionality was centralized, I updated my initial notebook with references to this central Python module.  This final Jupyter notebook now enables me to download all attachments from the automated process at my client in a quick and hassle-free manner.  The downloading procedure is now quite elegant: only a few lines of code and a for-loop over all retrieved email ids.
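The loop over the retrieved ids could be sketched roughly as follows. The wrapper download_all_attachments is a hypothetical name of mine, and the two helper functions are passed in as arguments only so that this snippet runs on its own; in the notebook they would simply be imported from the central module.

```python
def download_all_attachments(connection, email_ids, outputdir_,
                             get_emailbody, download_attachments_email):
    """Download the attachments of every email id; return the saved paths."""
    saved = []
    for email_id in email_ids:
        # Fetch and parse one message, then store its attachments.
        mail_ = get_emailbody(connection, email_id)
        saved.extend(download_attachments_email(mail_, outputdir_))
    return saved
```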

In summary, Python is an elegant tool to work with data, but it can do so much more!

Blog by Bram Buysschaert