The ICAEW Blogs & Forums will no longer be updated with new posts. Your community announcements and articles will now be hosted on icaew.com under their respective community areas. This site and its contents will be closed and made available in an archive at the end of October.
This blog is brought to you by the Excel Community where you can find additional blogs, extended articles and webinar recordings on a variety of Excel related topics. In addition to live training events, Excel Community members have access to a full suite of online training modules from Filtered. There is also an online forum where you can ask questions and share ideas with other community members.
Hello all and welcome back to the Excel Tip of the Week! This week we have a General User level post in which we're making a definitive re-examination of how to remove duplicates in your data using Excel. This was most recently covered in TOTW #98.
How removing duplicates works
All of the tools we are going to look at work in fundamentally the same way, so before we get into them, let's review how they work.
What counts as a duplicate?
Excel counts two cells as being the same if their contents are exactly the same - i.e. the same characters in the same order. Note that, like almost all things in Excel, it is however NOT case-sensitive.
So in this example, the APPLE is removed as it is considered to be identical to Apple.
Some things that look identical might not be
Sometimes you will see things like this:
In this case, one of the "Banana" cells actually contains "Banana ", with a trailing space. This is enough to have Excel count the cells as different, even though they may not appear to be.
If you're looking to remove excess spaces, try the TRIM function - see TOTW #212 for more on that.
Unique values are kept starting from the top
As you can see from the Apple / APPLE example above, the first occurrence of each unique value is kept, and the later ones are removed. The order of items is left unchanged.
This is useful because if e.g. you have some updated contact information for some, but not all, of your clients, you can selectively merge them by:
By following these steps, the older data will be preferentially removed where newer data is available, but old data will be left alone if no newer version exists.
How to remove duplicates
The main way to do it is with the tool on Data => Remove Duplicates:
This is a simple tool to use and will quickly do the job. However, it's worth noting that the process is a bit different for multi-column data. Take this example:
In this case, we want to keep the most recent date for each client, but we want to trim down the entire table. So we would only care about duplicates in the Client ID column. Compare this example, where we are merging orders from two different databases:
Here, we only want to remove duplicates where the entire row is repeated. And you can imagine cases in between these two.
So, when we go to Data => Remove duplicates for multi-column regions, we get this menu:
Here we can choose which columns are important for the removal - as in this example, where we only care about removing duplicate client IDs.
You might notice that the above option only works for removing duplicates from rows. Our first alternative gets around that - the Microsoft 365-only dynamic array function, UNIQUE:
Instead of removing the duplicates from the original, like on the Data tab, this function separately produces the unique values as a spilled array. The full syntax is:
=UNIQUE(array, by column, exactly once)
Array is the row, column, or 2D array to be handled. Note that UNIQUE always considers all the values, unlike Remove Duplicates, so in our above example the latest sales contacts had to be obtained through a lookup.By column is an optional input that can be used to have UNIQUE consider unique columns, rather than the default of unique rows, by entering TRUE.Exactly once is an optional input that can be used to instead return the values which appear exactly once in the range, rather than each distinct item, again by entering TRUE.
There is also a "remove duplicates" option in Power Query; this works effectively identically to the version on the Data tab and only works row-wise.
And that's all there is to say about removing duplicates! Give it a try next time you need to manipulate your data.
Previous post - Audit sampling templates: Traditional formulas TOTW index Next post - Revisiting FREQUENCY