
Extracting Data from (Azure) SQL Server Huge Tables in RFC 4180-Compliant CSV Files | by Luca Zavarella | Jan, 2023



(Image from Unsplash)

When a team of Data Scientists from outside your company is hired to implement Machine Learning models, you somehow have to share with them the data to be used for model training. If that team cannot directly access the data persisted in a database, the first option is to extract it from the database into CSV files. Considering that most of the time this data comes in large quantities (5+ GB) and that some fields may contain special characters (the comma, which coincides with the field separator; carriage return and/or newline characters), the usual tools non-developer users rely on for exports may not be adequate and can even run into memory problems.

In this article you will see how to solve the problem of extracting a large amount of data containing special characters from an (Azure) SQL Server database into RFC 4180-compliant CSV files using PowerShell functions.

When you need to extract data from an (Azure) SQL Server database, the first tools that come to mind for the user are SQL Server Management Studio (SSMS) and Azure Data Studio (ADS). This is because both contain simple features that allow you to extract data from a database with a few clicks.

SSMS is the tool par excellence for interfacing with (Azure) SQL Server. Recently, Microsoft has been investing heavily in adding features to ADS to make it the tool of choice for the Microsoft data platform on Azure and beyond. Therefore, when you install the latest versions of SSMS today, the setup also installs ADS behind the scenes.

Any third-party system that involves importing a CSV file to load a dataset must be based on a standard that defines the CSV format. Therefore, before moving on to practical tests, let’s see if there is a standard definition of the CSV format.

RFC 4180 is a standard that formalizes the format used for Comma-Separated Values (CSV) files and the specific Multipurpose Internet Mail Extensions (MIME) type associated with the CSV format (“text/csv”). The contents of this standard can be found here:

As you can see from the definition of the format in the previous link, while the first four points are fairly obvious, the remaining three need to be read carefully:

  • Each field may or may not be enclosed in double quotes (however
    some programs, such as Microsoft Excel, do not use double quotes
    at all). If fields are not enclosed with double quotes, then
    double quotes may not appear inside the fields.
  • Fields containing line breaks (CR/LF), double quotes, and commas
    should be enclosed in double-quotes.
  • If double-quotes are used to enclose fields, then a double-quote
    appearing inside a field must be escaped by preceding it with
    another double quote.

Keeping in mind also the examples given in the link, it is evident that the value of a field will be enclosed with double quotes only when needed. It doesn’t make sense to use double quotes for all the values of a field when only some of the values need them.
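
For instance, a minimal compliant file (with made-up values, loosely following the RFC's own examples) quotes only the fields that need it and doubles any embedded double quote:

id,name,notes
1,Luca,a plain value
2,"Zavarella, Luca","the value contains a comma"
3,Luca Zavarella,"an embedded double quote is doubled: """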

When you need to share information with third-party systems using CSV format files, the following applies:

It’s important that the CSV files you generate from your exports are RFC 4180 compliant to be sure that the files can be read by any external system that provides CSV file import capability.

In order to test how the aforementioned tools extract data in CSV format, let’s create a simple table containing the special characters mentioned in the RFC 4180 standard, plus Unicode characters to ensure the generality of the contents of the text fields.

First, you have to create the extract_test table in your SQL Server instance using the following script:

CREATE TABLE [dbo].[extract_test](
    [id] [int] IDENTITY(1,1) NOT NULL,
    [name] [nvarchar](50) NULL,
    [notes] [nvarchar](100) NULL
)

Then you can add data to this table using the following script:

SET IDENTITY_INSERT [dbo].[extract_test] ON 
GO
INSERT [dbo].[extract_test] ([id], [name], [notes]) VALUES (1, N'Luca', N'let''s add a carriage return
here')
GO
INSERT [dbo].[extract_test] ([id], [name], [notes]) VALUES (2, N'Zavarella, Luca', N'the name contains a comma')
GO
INSERT [dbo].[extract_test] ([id], [name], [notes]) VALUES (3, N'Luca Zavarella', N'here we have a comma and a double quotation mark: ,"')
GO
INSERT [dbo].[extract_test] ([id], [name], [notes]) VALUES (4, N'秋彦', N'this japanese name means "bright prince"')
GO
SET IDENTITY_INSERT [dbo].[extract_test] OFF
GO

As you can see from the contents of the INSERT statements, we have provided all the special characters mentioned in the standard. We have also used Japanese characters, so that we can verify that the CSV file is written correctly using the Unicode character table.

Obviously, the table created here will not be a 5 GB table, but it will contain the special characters needed to test CSV exports. Here is the output of a SELECT in ADS:

Figure 1 — Output of the content of your dummy table on ADS (by the author)

Do not worry about the fact that the carriage return does not show up in the output grid of ADS or SSMS. Given how the INSERT for that row was written, the carriage return is there.

So, let’s try to extract data from this table using SSMS and ADS.

Let’s first try the traditional tool with which we interface with SQL Server, namely SQL Server Management Studio.

Extracting data with SSMS

Once you have opened SSMS and connected to your database instance, right-click on the name of the database hosting the table you’ve just created, go to Tasks and then Export Data:

Figure 2 — Exporting data from a database using SSMS (by the author)

You will be shown an initial screen describing the Extract Data activity. If you go forward, you will be shown this window:

Figure 3 — Selecting a data source from the Export Wizard (by the author)

Select the SQL Server Client data source, enter your server instance name, then choose the authentication to use to log in to the database. In my case, having persisted the test table on an Azure SQL database, I used SQL Server authentication to access my test-sql-bug database, as you can see in Figure 3.

On the next screen of the Wizard you have the option of selecting the export destination. In our case, select Flat File Destination, create a CSV destination file via the Browse button in your preferred folder (remember to select the CSV extension in the Open window that opens after pressing Browse). Remember to check the Unicode flag to make sure you also handle the Japanese characters in our example. After that, select Delimited as the format, leaving the Text qualifier at “<none>”. Also make sure that the “Column names in the first data row” flag is checked. Then press Next:

Figure 4 — Selecting a destination for the data output (by the author)

In the next window select Copy data from one or more tables or views and press Next again.

In the configuration window that appears you can then select the [dbo].[extract_test] table as the Source table or view. For the other options, you can leave everything as is, since the row delimiter (CR/LF) and the column delimiter (comma) are those defined by the RFC 4180 standard. Then press Next:

Figure 5 — Configuring the flat file destination options (by the author)

In the next window keep Run immediately selected and press Finish. A summary window of the selected options will appear. Press Finish again and the extraction will start. When finished, press Close.

If you now try to open the output CSV file with a text editor (not Excel), you will notice the following:

Figure 6 — The output of the SSMS Export Wizard without text qualifier (by the author)

Basically, in this case the Export Wizard extracts the contents of each text field regardless of whether it may contain special characters (comma and carriage return). This means that any carriage return contained in a text field is interpreted as a row delimiter by the system that has to read the file, just as any comma contained in a text field is interpreted as a field delimiter. Unicode characters, on the other hand, have been treated correctly. Therefore, the generated CSV file will not be recognized as correct by any third-party system that needs to import that information.

If you try to repeat the export, this time entering double quotes as the text qualifier, you will get the following:

Figure 7 — The output of the SSMS Export Wizard using double quotes as text qualifier (by the author)

In this case, all extracted values are surrounded by double quotes, including the header. However, this forces an external system that must read the data to consider all numeric values as strings. Moreover, if a value in a text field contains a double quote character, it is not escaped, generating parsing problems for external systems. Therefore, again, the generated CSV file will not be recognized as correct by any third-party system that needs to import this information.

Regarding scalability, there are no problems extracting very large amounts of data, because the Export Wizard uses SQL Server Integration Services (SSIS) as its engine, which is designed to handle huge bulk volumes of data.

Moreover, it may sometimes happen that you need to take action on the data source data types to avoid some errors during export with the Export Wizard, as highlighted in this blog post:

We can conclude this section by stating the following:

Using the SSMS Export Wizard as a tool for extracting data in CSV format from an (Azure) SQL Server database doesn’t guarantee having a format that complies with the standard defined by RFC 4180, with the consequence that the extracted information may not be properly read by an external system.

Instead, let’s see what happens when we use Azure Data Studio to extract the information in CSV format.

Extracting data with ADS

Once Azure Data Studio is open, the first thing to do is add a new connection to your server instance. Note that, starting with recent versions, the Encrypt option is set to True by default. This will not cause connection errors if you connect to an Azure SQL database, but it might if your data source is an on-prem SQL Server. In that case, you can set the option to False.

That said, in order to extract the contents of a table (or view, or query) in ADS, you must first run a SELECT query and display its contents in the results grid. After that, simply press the “Save As CSV” button at the top right of the grid:

Figure 8 — Saving the output of a query in ADS in CSV format (by the author)

An output file selection window will open, allowing you to name the file that will be extracted (in our case ExtractTestADS.csv). As soon as you press the Save button, the contents of the CSV file will be shown directly within ADS:

Figure 9 — The output of ADS in CSV format (by the author)

Wow! The output generated by ADS complies with the RFC 4180 standard to all intents and purposes! Thus, it would seem that ADS is the perfect tool for extracting information in CSV format from an (Azure) SQL database.

However, there is a scalability problem. Since ADS requires that the query output first be exposed in the results grid, this limits its usefulness when dealing with many GB of data. In those cases, holding all that data in a grid takes up so much of the system’s RAM that the application can crash.

We can therefore conclude this section as follows:

ADS’s CSV export procedure guarantees output that conforms to the RFC 4180 standard. However, ADS is suitable for extraction tasks only when the dataset to be exported is rather small. When more than 3–4 GB of data needs to be extracted, ADS may occupy the entire system memory and crash.

In general, we can therefore conclude that:

Unfortunately, the user-friendly features provided by Microsoft’s data platform tools don’t allow you to extract huge amounts of data in CSV format following the RFC 4180 standard.

Let’s try to see if we can achieve our goal through more specific tools known by expert users.

The Bulk Copy Program (BCP) command line utility is used to import large numbers of new rows into SQL Server tables or to export data from tables to data files in a user-specified format. This is the solution that imports or exports data as fast as possible in even very large quantities. Therefore, it has no problem with scalability.

In addition to being installed by default with a standard on-prem SQL Server installation, and in addition to being able to be installed stand-alone on a Windows operating system, the BCP utility can also be used from the Azure cloud shell to interact with an Azure SQL database, as shown in this blog post:

Without going into too much detail, the main problem with BCP is that it doesn’t extract table headers and doesn’t handle double quotes in a simple way out of the box. This is evidenced by the fact that Erland Sommarskog’s reference guide for its use reports a number of workarounds for getting both headers and double quotes, as you can see here:

One of the drawbacks of this approach is that you have to know in advance which fields need double quotes (unless you provide them for all text fields). Generally, I have no way of knowing in advance which fields might need double quotes; I just want to extract the data worry-free. Even if you manage to get headers and double quotes via Erland’s advice, the quotes would be applied to all values in the selected fields. As Erland himself points out:

… the assumption is that the data should always be quoted. If you only want to quote when needed, you will need to handle this in your query, which is outside the scope of this article. All I can say is: good luck. Or more directly: avoid it if you can.

Also, should a quoted field contain a string with both a comma and a double quote, BCP does not escape the double quote by doubling it.
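
For reference, a plain bcp export looks something like the sketch below (server name, credentials and paths are placeholders; -w writes Unicode output and -t sets the comma as the field terminator). Out of the box it produces neither a header row nor any quoting:

bcp "SELECT id, name, notes FROM <your-database-name>.dbo.extract_test" queryout C:\Temp\ExtractTestBCP.csv -S <your-server-name> -U <username> -P <password> -w -t ,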

We can therefore state that:

Using BCP to export data in a CSV format that includes both headers and double quotes is very arcane for the non-expert user. One downside is that you have to know in advance for which fields to provide double quotes. In addition, it would still not result in a format consistent with the RFC 4180 standard.

I will not go into the details of using Microsoft’s other command-line tool called SQLCMD, because the issues are similar to those highlighted in this section.

So what? How to proceed? Since I couldn’t find an application on the Internet that was able to extract data in an RFC 4180-compliant CSV format and at the same time handle very large data masses, the only possible solution was to develop a custom solution that can be easily used even by the non-expert user. Let’s see how this solution works.

The first thing I asked myself when I decided to develop a specific solution for this problem was which programming language to use. The first language that came to mind was definitely Python. I then thought, however, that a standard user approaching the world of automation on a Windows machine may not know Python and would not find it preinstalled on the operating system. That is why the choice fell on PowerShell, which provides, among other things, a specific module for SQL Server.

Issues with the SQL Server PowerShell module

The first attempt I made was to use the SQL Server PowerShell module, which allows SQL Server developers, administrators, and business intelligence professionals to automate database development and server administration.

Specifically, the command I tried to use to send the query needed to retrieve the data from the Azure SQL database was Invoke-Sqlcmd. This cmdlet provides much the same functionality as the sqlcmd.exe command-line utility, often used by automation processes to retrieve information from a SQL Server database. So far, so good. The problem is that Invoke-Sqlcmd persists all query output directly into PowerShell data structures. As you can guess, when the query output takes up more than 3–4 GB, you have the same problem encountered with extraction done in ADS, which is that your system becomes unstable due to excessive RAM consumption.
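
A hedged sketch of that straightforward approach is shown below (server name, credentials, query and output path are placeholders). It works nicely for small tables, but the entire result set is materialized in memory before Export-Csv ever sees it:

# Requires the SqlServer module (Install-Module SqlServer)
Invoke-Sqlcmd -ServerInstance "<your-server-name>" -Database "<your-database-name>" -Username "<username>" -Password "<password>" -Query "SELECT * FROM dbo.extract_test" |
    Export-Csv -Path "C:\Temp\ExtractTestInvokeSqlcmd.csv"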

Therefore, I found it appropriate to directly use ADO.NET objects in PowerShell to try to work around the problem. Let’s see how I used them in this solution.

Batch exporting data to output file

The main idea of my solution is to always use an intermediate data structure (a DataTable) that collects the query data, but only a limited number of rows at a time. Once the maximum capacity of the intermediate data structure is reached, its contents are written to the target file, the structure is emptied, and it is immediately loaded with the next rows of data from the data source:

Figure 10 — Main process of the solution (image by the author)

This process goes on as long as there are new rows to read from the data source.

You might wonder why I used an intermediate DataTable and didn’t implement a direct write stream to the output file through the StreamWriter. The answer lies in the ability to use PowerShell’s Export-Csv cmdlet directly.
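
To make the mechanics concrete, here is a minimal sketch of this batching loop. It is not the module’s actual code: the connection string, query, batch size and output path are placeholders, it assumes PowerShell 7+ with an ADO.NET SQL Server provider (System.Data.SqlClient or Microsoft.Data.SqlClient) available, and the -UseQuotes parameter it relies on is discussed in the next section.

# Hedged sketch of the batch-export idea (not the SqlBulkExport module's actual implementation)
$connectionString = "Server=<your-server-name>;Database=<your-database-name>;User ID=<username>;Password=<password>"
$connection = New-Object System.Data.SqlClient.SqlConnection $connectionString
$connection.Open()

$command = $connection.CreateCommand()
$command.CommandText = "SELECT * FROM dbo.extract_test"
$reader = $command.ExecuteReader()

# Intermediate DataTable whose schema is taken from the data reader
$batchSize = 30000
$dataTable = New-Object System.Data.DataTable
for ($i = 0; $i -lt $reader.FieldCount; $i++) {
    [void]$dataTable.Columns.Add($reader.GetName($i), $reader.GetFieldType($i))
}
$columnNames = $dataTable.Columns | ForEach-Object { $_.ColumnName }

$outputPath = "C:\Temp\output.csv"
$firstBatch = $true

while ($reader.Read()) {
    # Copy the current row from the reader into the DataTable
    $row = $dataTable.NewRow()
    for ($i = 0; $i -lt $reader.FieldCount; $i++) { $row[$i] = $reader.GetValue($i) }
    $dataTable.Rows.Add($row)

    if ($dataTable.Rows.Count -ge $batchSize) {
        # Write the full batch (RFC 4180-compliant quoting), then empty the DataTable and keep reading
        $dataTable.Rows | Select-Object -Property $columnNames |
            Export-Csv -Path $outputPath -UseQuotes AsNeeded -Append:(-not $firstBatch)
        $dataTable.Clear()
        $firstBatch = $false
    }
}

# Flush the last, partially filled batch (if any)
if ($dataTable.Rows.Count -gt 0) {
    $dataTable.Rows | Select-Object -Property $columnNames |
        Export-Csv -Path $outputPath -UseQuotes AsNeeded -Append:(-not $firstBatch)
}

$reader.Close()
$connection.Close()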

Writing data using Export-Csv

One of the goals I set for myself whenever I have to solve a problem is not to reinvent the wheel if there are already convenient solutions that solve it completely or partially. In this case, I figured that by using the Export-Csv cmdlet I could dispense with rewriting all the logic that handles the special characters mentioned in the RFC 4180 standard.

Checking the PowerShell cmdlet guide, I realized that Export-Csv provides the parameters that control the use of double quotes only as of version 7:

Figure 11 — Differences between Export-Csv versions 6 and 7 (image by the author)

Specifically, the UseQuotes parameter accepts the value AsNeeded, whose behavior is defined as follows:

only quote fields that contain a delimiter character, double-quote, or newline character

Basically, it’s what we want in order to meet the requirements of the RFC 4180 standard.

Should you wish to provide double quotes only for certain fields, you can specify them explicitly via the QuoteFields parameter.
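
For example, a quick hedged check at the PowerShell 7 prompt (with a made-up record mimicking row 3 of the dummy table) shows the expected RFC 4180 behavior:

# Quote only what needs quoting, and double any embedded double quote (PowerShell 7+)
[pscustomobject]@{ id = 3; name = 'Luca Zavarella'; notes = 'a comma and a quote: ,"' } |
    ConvertTo-Csv -UseQuotes AsNeeded

# Expected output:
# id,name,notes
# 3,Luca Zavarella,"a comma and a quote: ,"""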

Now there is just a small problem with PowerShell versioning. Keep in mind that Windows 10, Windows 11, and Windows Server 2019 come with version 5.1 of Windows PowerShell (also known as the Desktop edition) preinstalled. In order to use the newer versions of the Export-Csv cmdlet, you must install a newer version of PowerShell (at least PowerShell 7.0), which is, for all intents and purposes, a separate piece of software from Windows PowerShell, based on .NET Core (if you are interested in its evolution over time, you can learn more at this link).

It’s important to emphasize the following:

Since this module was developed for Core versions of PowerShell, it can also be used on Linux and macOS systems.

That said, let’s see how to use this new module.

The new SqlBulkExport module is available on GitHub here:

It provides two functions:

  • Export-SqlBulkCsv: Exports the content of a SQL Server database table, view or query in an RFC 4180-Compliant CSV file. This function supports the export of huge result sets, writing the CSV file content in multiple batches.
  • Export-SqlBulkCsvByPeriod: Exports the content of a SQL Server database table, view or query in multiple RFC 4180-Compliant CSV files, broken down by time period (yearly, monthly or daily), based on the contents of a selected date field. This function supports the export of huge result sets, writing each CSV file content in multiple batches.

Both functions provide the following parameters:

  • ServerName: The SQL Server instance name to connect to.
  • Port: The SQL Server instance port number. By default, it is 1433.
  • DatabaseName: The SQL Server database name to connect to.
  • SchemaName: The database schema of the table or view from which to extract data. By default, it is ‘dbo’.
  • TableViewName: The name of the database table or view from which to extract data.
  • Query: The T-SQL query with which to extract data.
  • User: The username to use to connect to the database.
  • Password: The password of that username.
  • ConnectionTimeout: The connection timeout in seconds. By default, it is 30 seconds.
  • DatabaseCulture: The database culture code (e.g., it-IT). It’s used to extract the decimal separator properly. By default, it is “en-US”.
  • BatchSize: The size (number of rows) of the batches that are written to the output file until there is no more data to extract.
  • OutputFileFullPath: Full path (including filename and csv extension) of the output file.
  • SeparatorChar: Character used to build string separators shown in console.

The Export-SqlBulkCsvByPeriod function provides three more mandatory parameters to be able to partition the result set according to a time period:

  • DateColumnName: The date/time column by which data will be broken down into time periods.
  • StartPeriod: Time period string (allowed formats: “yyyy”, “yyyy-MM”, “yyyy-MM-dd”) representing the period from which to start extracting data (period in question included).
  • EndPeriod: Time period string (allowed formats: “yyyy”, “yyyy-MM”, “yyyy-MM-dd”) representing the period up to which to extract data (period in question included).

It’s evident that the formats used for the two input periods must be consistent with each other.

It’s important to note that extracting multiple CSV files broken down by a time period using the Export-SqlBulkCsvByPeriod function is only possible with a table or view, not a query. If, for example, you have special needs for selecting fields or applying filters to a table, you must first expose a view with this logic in order to then extract multiple CSV files by time period.

Moreover, the Export-SqlBulkCsvByPeriod function requires the string token {} (open and closed curly brackets) within the name of the output CSV file; this token will be replaced by the string associated with the time period of the transactions contained in each CSV file.

Both functions automatically recognize when to connect using Windows authentication or SQL Server authentication based on whether or not the User and Password parameters are passed.
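
For example, a hedged sketch of a Windows-authentication call simply omits those two parameters (server, database, table and path below are placeholders):

Export-SqlBulkCsv -ServerName "<your-server-name>" -DatabaseName "<your-database-name>" -TableViewName "<your-table-or-view-name>" -BatchSize 30000 -OutputFileFullPath "C:\Temp\output.csv"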

Before proceeding with the examples, make sure you have installed the latest version of PowerShell.

Installing the latest PowerShell and SqlBulkExport versions

In order to install the latest version of PowerShell on Windows machines, download and run the 64-bit installer (in our case, version 7.3.0) from this link.

Click Next through all the Setup Wizard windows, then click Finish. You’ll see the PowerShell 7 (x64) prompt installed among your applications:

Figure 12 — PowerShell 7 just installed (image by the author)

Run it and you’ll see the PowerShell prompt ready for your commands:

Figure 13 — PowerShell 7 prompt ready (image by the author)

You can enter $PSVersionTable and press Enter to check that everything is working fine:

Figure 14 — PSVersionTable output (image by the author)

Great! If necessary, you can also install PowerShell on Linux or macOS.

Now you have to download the SqlBulkExport module files:

  1. Go to the SqlBulkExport GitHub repository.
  2. Click on the green Code button on the top right of the file grid and then click Download Zip.
  3. Once your sql-bulk-export-main.zip file is saved on your machine, unzip it and copy its content into the C:\Temp folder (or you can choose your preferred folder). So, your module files will be persisted into the C:\Temp\sql-bulk-export-main folder.

Ok! Now you are ready to try a few examples.

Exporting the content of our dummy table in one CSV file

Let us try extracting the contents of the extract_test table created at the beginning of this article to check its consistency with the RFC 4180 standard. In our case, the table in question is persisted in an Azure SQL database:

  1. Open the PowerShell 7 prompt, enter the cd C:\Temp\sql-bulk-export-main command and press Enter to change your working directory to the module’s one.
  2. Enter the Import-Module -Name ".\SqlBulkExport.psd1" command to import the SqlBulkExport module.
  3. Enter the Export-SqlBulkCsv -ServerName "<your-server-name>" -DatabaseName "<your-database-name>" -User "<username>" -Password "<password>" -TableViewName "extract_test" -BatchSize 30000 -OutputFileFullPath "C:\Temp\ExtractedTestPS.csv" command to export the content of the dummy table into the ExtractedTestPS.csv file in batches of 30K rows. Here is the output:
Figure 15 — Console output of the command that extracts the contents of the dummy table into a CSV file (image by the author)

Here is the content of the output CSV file:

Figure 16 — Dummy table extracted in a CSV file using the SqlBulkExport module (image by the author)

As you can see, the output CSV file content meets the RFC 4180 standard. Because the dummy table used had few rows, only one batch was used for extraction. Let’s now try to extract the contents of a table having a few tens of thousands of rows.

Exporting the content of a table/view in one CSV file

As before, the table we’re going to extract data from is persisted in an Azure SQL database:

  1. Open the PowerShell 7 prompt, enter the cd C:\Temp\sql-bulk-export-main command and press Enter to change your working directory to the module’s one.
  2. Enter the Import-Module -Name ".\SqlBulkExport.psd1" command to import the SqlBulkExport module.
  3. Enter the Export-SqlBulkCsv -ServerName "<your-server-name>" -DatabaseName "<your-database-name>" -User "<username>" -Password "<password>" -TableViewName "<your-table-or-view-name>" -BatchSize 30000 -OutputFileFullPath "C:\Temp\output.csv" command to export the content of a database table (or view) into the output.csv file in batches of 30K rows. Here is the output:
Figure 17 — Console output of the command that extracts the contents of a table/view into a CSV file (image by the author)

As you can see, it took 3 batches of 30K rows to extract the contents of a table of about 74K rows, taking a total of 1 second and 88 milliseconds. Not bad!

Let’s try using a query to export the data.

Exporting the output of a query in one CSV file

In this case we will extract data from the same table as in the previous case, but using a query like SELECT * FROM <table> WHERE <condition>.

  1. Open the PowerShell 7 prompt, enter the cd C:\Temp\sql-bulk-export-main command and press Enter to change your working directory to the module’s one.
  2. Enter the Import-Module -Name ".\SqlBulkExport.psd1" command to import the SqlBulkExport module.
  3. Enter the Export-SqlBulkCsv -ServerName "<your-server-name>" -DatabaseName "<your-database-name>" -User "<username>" -Password "<password>" -Query "SELECT * FROM <your-table-or-view-name> WHERE <condition>" -BatchSize 30000 -OutputFileFullPath "C:\Temp\output.csv" command to export the content of a query result set into the output.csv file in batches of 30K rows. Here is the output:
Figure 18 — Console output of the command that extracts the output of a query into a CSV file (image by the author)

Everything works like a charm! Let us now try exporting the contents of one view to multiple monthly CSV files.

Exporting the content of a table/view in multiple monthly CSV files

Imagine you have a transaction table containing hundreds of thousands of rows per month. There is a group of Data Scientists from outside the company who are assigned to do advanced analysis on the transaction history. For convenience, they ask you to extract a dataset consisting of subsets of the fields available in the table for a couple of months of transactions. Instead of generating a single CSV file, they ask you to provide them with multiple CSV files broken down by month.
Let’s see how to do this thanks to the Export-SqlBulkCsvByPeriod function:

  1. Open the PowerShell 7 prompt, enter the cd C:\Temp\sql-bulk-export-main command and press Enter to change your working directory to the module’s one.
  2. Enter the Import-Module -Name ".\SqlBulkExport.psd1" command to import the SqlBulkExport module.
  3. Enter the Export-SqlBulkCsvByPeriod -ServerName "<your-server-name>" -DatabaseName "<your-database-name>" -User "<username>" -Password "<password>" -TableViewName "<your-table-name>" -DateColumnName "<your-date-column-name>" -StartPeriod "2022-01" -EndPeriod "2022-03" -BatchSize 30000 -OutputFileFullPath "C:\Temp\output_{}.csv" command to export the content of a database table (or view) into multiple monthly CSV files in batches of 30K rows, from January 2022 to March 2022. Here is the output:
Figure 19 — Console output of the command that extracts multiple monthly CSV file (image by the author)

Amazing! You just extracted about 1.5 million rows, broken down into three monthly CSV files, in just 1 minute and 19 seconds!

The need that prompted me to write this article was to extract a large amount of data (3–4+ GB) into one or more CSV files compliant with the RFC 4180 standard.

You have seen how the tools provided by Microsoft (whether they are IDEs, such as SSMS and ADS; whether they are command-line tools, such as BCP) are unable to meet the above need. The only tool that seemed a bit more suitable is ADS, but it cannot extract large amounts of data without crashing. Without mincing words, it is quite embarrassing that to date Microsoft has not yet made a tool available to users that would meet the requirements set forth above.

Not having found software on the Internet that met the above needs, I wrote the SqlBulkExport PowerShell module, which solves the problem, and made it available open source on GitHub under an MIT license. I emphasize that I am not a PowerShell developer, so any input from you that would improve the solution is really welcome!

