This post discusses my experience of building an Excel import API service.
Who Should Read This Post?
If you are a developer and want to build a tool or an API to read / import data Excel files, this post can be helpful for you. You will find the libraries we researched on and the design considerations when building such an API service.
Why Excel? Why not CSV or Google Sheet?
While usage scenario differs from business to business, if you or your team is tasked with building something to support bulk import. And for some reasons, the format has to be in Excel, not even CSV. In our case, the clients are in finance space, and our software needs to process financial statements in Excel formats.
You may ask: Can we ask clients to convert their files into CSV or Google Sheets?
This is a UX question. For industries or clients that are used to Excel, it’s a tough sell to ask clients to convert many files into CSV formats.
Given the UX consideration, it is natural to think that, we allow clients to upload Excel files and we do the CSV conversion in the backend. If you can take this approach and deal with CSVs in the code from then on, things will be much simpler. There are many posts online that teach you how to import CSV into MySQL / Postgres / SQL Server.
However, the simple CSV conversions approach has its limitations. Excel data can be lost during the conversion, including
- Formulas and references
- Lots of other propritory Excel features
This post has an informative CSV vs. Excel comparison table.
In our case, because clients use some of those Excel features and we couldn’t afford to lose data, so we chose to support reading data directly from Excel files.
We also consider the possibility of relying on Google Sheets, but we didn’t pick it for the following reasons:
- Hard limit on the number of cells.
- Tricky to handle Google Sheets permissions (Our client files store confidential data!)
After we did the research, we started our quest to build an Excel import service.
Open Source libraries for Excel
Our team was mostly Ruby / Python / Elixir devs. And we didn’t have any .NET developers, or anyone familiar with Microsoft stack at that time. So we leaned on using open-source libraries / tools to Excel files.
We found the following libraries
We did some research and finally picked openpyxl, which was the most matrue compared with the rest. openpyxl has support for many Excel features, including validations, sheet protections, which can be handy if we want to build a companion Excel export services in the future.
With openpyxl, reading an Excel file is straightforward. Here’s an example usage:
from openpyxl import load_workbook wb = load_workbook(filename = 'spreadsheet.xlsx') # Loop through sheets for sheet in wb: print(sheet.title) # Read a cell value sheet = wb.active print(sheet['D18'].value) # Read a range cell_range = ws['A1':'C2'] print(cell_range['B2'].value)
Handling Excel Formula
This is a sore point of dealing with Excel files. Excel, as it turns out, usually caches the calculated value on the file. A formula cell has two values, one for the formula string and the other for the cached calculated value. Most Excel lib will only read the cached value for the formula-typed cells. openpyxl is no exception.
This created an issue for us. For some reasons, some files that clients gave us didn’t have the cached calculated values. We suspect it was because those files were programmatically generated and never opened by any MS Excel software. Interestingly, a fresh Excel file generated by openpyxl also has this issue.
To solve this problem, we first tried to find an open source library that can parse and calculate Excel formulas. We found Apache POI which looked promising then. It was written in Java, and so we tried downloading the jar file and ran it on demand whenever users upload a Excel file. But it was very slow. Openning and parsing a file took more than 30 seconds which seriouesly impacted UX.
Finally, we decided POI was not a scalable solution. And we came up with one hacky idea.
Our reasoning was like this, who can do a better job at reading Excel file than MS Excel itself? All we want is all formulas are cached. As long as an Excel file is opened at least once by MS Excel, openpyxl could read the cached values without a problem.
Following this idea, we setup a Windows box with MS Excel installed. What it does was simple: 1) download an Excel file from AWS S3, 2) open in Excel, save it and 3) re-upload to S3.
To our surprise, this approach worked pretty well.
Since our tech stack was written in Python, we used xlwings to communicate with MS Excel for opening files. We had something like below:
if sys.platform != "linux2": import xlwings as xw xw_app = xw.apps.active filename = "example.xlsx" wb = xw_app.books.open(filename) xw_app.calculate() wb.save(filename) xw_app.quit()
This is a question about how users use the Excel rows to express their intent. For imports, uploading an Excel file has a few possible intentions.
1) Pure Replace: Replace the entire database with the content in Excel (rare)
2) Update / Create: Update the rows that exist already in database and create new ones, if not exist (common)
For 1), basically the Excel is the entire database dump. Importing the Excel means it would wipe out the data that don’t exist on the Excel. So this is a rare use case.
2) is a more common use case. Shopify has a similar import function for vendor product bulk upload.
Unique Key Column(s)
When users are working on a wide spreadsheet (aka, a sheet that has many columns), and each row is a variant is of another with mostly similar data, users may not like entering repeated rows.
Take Shopify product upload as an example. Users want to upload a t-shirt with a few variants, with differences only in a few columns, such as size, images, and the rest columns are the same. We need a way to let the users to tell the uploader:
Hey, these three rows are mostly the same and only this X, Y, Z columns are different. I don’t enter a bunch of duplicate values. Please use the first non-empty row as a fallback.
Shopify’s solution is to let users enter a unique key (their term is
Handle), which uniquely identies a product.
You should consider adding such a unique key column(s). Depending on the business domain, it is can be one or many columns combined.
I’ve also seen the tricky case where the data are messy in that, some cells have typos or extra spaces. The users expect the program to be “smart” enough to deduplicate those rows, based on text similarity, which is a lot of work. In such cases, we need to be clear with the business that this is the domain of data clean-up and out of scope of an simple Excel importer.
Often times, a row in the import file maps to multiple objects in the backend. Below is an example:
If we model the data as Person and Address, one row means adding/updating one person and two addresses of different types.
There are some complexities in handling such cases.
Replace or update?
Users usually don’t provide the primary key to the nested columns, so it becomes tricky to deal with existing nested objects. In the above case, imagine John already has a home address on record. It is natural to say that we want to replace the home address with the new one.
But in some cases, there is no order among the nested objects. For example:
A product can have multiple categories and subcategories. We need to decide if:
1) replace all existing nested objects with the new set, or update
2) If it’s update, what’s the key we use to sort the nested objects.
Nested column limit
We need to set a limit on how many nested objects we can set.
In the above Person-Address case, it is straight-forward. Only two addresses are supported: home and billing.
In the latter Product-Category case, in theory, we can handle unlimitted number of categories. If we decide to not impose a cap, it makes the implementation tricky because we need to either 1) group columns into one objects depending on the positions of the column, or 2) add prefixes to the column so that we can recover the group.
Import Templated Files Only
From the beginning, we get the clients to agree that the service only accepts Excel file with data organized in a specific way. This is because Excel, or spreadsheets in general, are optimized for human behaviour. Because it is flexible, it is easy to manipulated in a way that is hard for programs to parse.
Having a template where we can embed hidden, protected sheets and cells help us limit what users can do with the files. As a result, it greatly simplifies the code with less checks and error handling logic.
Alternatives to Consider
We built our service based on open source libraries because 1) we have a budget, and 2) the libraries are good enough for our use case. If you find the existing open source libraries too limitted, or you need support for advanced Excel features, e.g., pivot tables or VBA, buying services from Microsoft is always an option.
Google Sheets is another option. It is good for human collaboration and comes with plug-ins and AppScript that we can build certain UI or automation (easier than programming in VBA).
There is also LibreOffice. We didn’t explore too deep in this topic, so I will refrain from commenting.
Finally, Excel is quite a flexible format, so it is important to understand your business use case, set boundaries and pin down the necessary requirements. Otherwise, the project can easily go out of scope.
We managed to implement a simple Excel import API service that satisfies most of the user requirements. The service is based on the open source libraries, so we didn’t have to pay extra for proprietary services.
Hopefully, you find useful information to help you decide whether/how to build something similar.
Thank you for your read!