Introduction
Advance Analytics is becoming the norm now. Every organization wants to use the power of advanced analytics to predict future outcomes and take proactive corrective measures to remain competitive and win over its competitors. Therefore, it’s now becoming necessary for every organization to exploit the data that they have (or in some cases external data as well) and predict future outcomes to remain competitive by offering their customers what they want, when they want, and where they want. Looking at this potential, SQL Server 2016 brings native support for doing advanced analytics in the database itself using R Services, without moving data across.
In my last article of the series, I talked about what R is, what on-premise advance analytics options from Microsoft are and how to get started using R Services with SQL Server 2016. In this article, we will look into a detailed conceptual execution flow, executing R scripts from SQL Server Management Studio and other R client tools.
How the Execution Flows Conceptually
There are multiple ways that data scientists can write and execute R scripts for developing and building predictive models. Based on preference, they can use any R client tools. When a data scientist uses an R client tool on their local machine, he/she has options to either bring the data to the client workstation for local analysis or use the RevoScaleR APIs to push computations to the SQL Server computer, avoiding costly and insecure data movement across network. RevoScaleR APIs have been optimized to analyze data sets that are too big to fit in memory and to perform computations distributed over several cores or processors (unlike local workstations, this is often the case with server machines, where you have huge memory and multi-core processors). It also supports working with subsets of data for greater scalability. That means most RevoScaleR APIs can operate on smaller chunks of data, and use updating algorithms technic to aggregate intermediate results into the final result. This means your predictive solutions, based on the RevoScaleR APIs, will work with very large datasets and are not bound by the local memory of the workstation.
When SQL Server receives a request to execute R-script, it treats it as an external script and with the support of SQL Server Trusted Launchpad, SQL Server 2016 executes external scripts by external components (in the case of R-script, its R Interpreter). When SQL Server receives an R-scrip for execution, it launches the Advanced Analytics Extensions Launchpad process that enables integration with Microsoft R Open using standard T-SQL statements. Then the R-script is handed over to R Interpreter for execution and the result is returned to the client.
This allows data scientists to explore data and build predictive models from their workstation, keep doing iterations for testing and tuning until a good predictive model is achieved, and SQL Server manages execution of the R-scripts, resources needed for execution, and security.
SQL Server manages execution of the R-scripts, resources needed for execution, and security
After the R-script and model are finalized and ready (once ascertained that it is performing well as per expectation) they can be deployed into production (trained models are stored in binary form, in a single column of type varbinary(max) of the SQL Server table).
SQL Server lets you invoke save trained model using a T-SQL system stored procedure, which you can integrate with existing or new applications. In other words, a database developer or administrator can embed the R-script or model in stored procedures, and invoke the saved trained model from an application, to make predictions on new observations based on new input data passed to the stored procedure.
Embed the R-script or model in stored procedures
Executing R Script on SQL Server from SQL Server Management Studio (SSMS)
SQL Server 2016 brings a new system stored procedure (sp_execute_external_script) to execute scripts written in a supported external language (at this time R is the only supported language) by external runtime from the database engine itself. These are some of the important parameters for this stored procedure; you can get an extensive list here. Before you can start using this stored procedure, you must first enable external scripts execution feature as discussed in my last article.
Parameters |
Description |
@language |
With this parameter, you specify the language for your external script to be executed. For now, supported language is R only. |
@script |
With this parameter, you specify external scripts to be executed externally by SQL Server in supported language by external runtime. |
@input_data_1 |
With this parameter, you specify T-SQL query to read data from SQL Server and pass it as an input to the external script. |
@input_data_1_name |
With this parameter, you specify the name for the input data to the R-script by way of the data-frame variable in R-script, which will hold the result of the input query (specified with @input_data_1). This is an optional parameter; the default value is “InputDataSet”. |
@output_data_1_name |
With this parameter, you specify the name for the output data by referring the data-frame generated inside the R script to be returned to SQL Server upon completion of the stored procedure call. This is an optional parameter; the default value is “OutputDataSet |
@params |
With this parameter, you specify a list of input parameter declarations that will be used in the external script. |
WITH RESULT SETS |
Again, this is an optional clause and you use it when R-script returns a result set to specify the name and the data type of the data frame columns being returned by R Script. This becomes mandatory if you use @output_data_1_name clause to return a data-frame from R script unless you use the INSERT …EXEC statement to write output to a table. |
You might have noticed, there is only one parameter to pass just one dataset from SQL Server to the R script. If you have a scenario where you need more than one dataset from SQL Server in R-script, you can pass one dataset with the above mentioned parameter and for the others you can query using RODBC inside the R-scripts.
The example below passes a dataset or result-set from SQL Server to R-scripts, then inside the R-script execution, summary statistics are being calculated. Next, to return it back to the SQL Server, the output is converted to a data frame. Finally, the result is shown using the WITH RESULT SETS clause.
EXEC sp_execute_external_script @language = N'R' ,@script = N'SalesSummary <-summary(InputDataSet); SalesSummary_df <- data.frame(SalesSummary);' ,@input_data_1 = N'SELECT SUM(FIS.SalesAmount) AS TotalSalesAmount FROM [dbo].[FactInternetSales] FIS INNER JOIN [dbo].[DimGeography] DG ON FIS.SalesTerritoryKey = DG.SalesTerritoryKey WHERE DG.CountryRegionCode = ''US'' GROUP BY DG.StateProvinceCode, DG.StateProvinceName;' , @output_data_1_name = N'SalesSummary_df' WITH RESULT SETS ( ( Col1 varchar(100) null, Col2 varchar(100) null, Col3 varchar(100) null ) );
WITH RESULT SETS results
In the earlier example, we used the @input_data_1 parameter to specify a query to pass a dataset from SQL Server to R-script. By default, this dataset is referred as InputDataSet although you can change the name of this dataset with the @input_data_1_name parameter as shown below. In that case, whatever you specify with the @input_data_1_name parameter, you need to use the same value to refer to the dataset or data-frame inside R-script as shown below. Also, this script shows how you can refer to a variable\column from a data-frame to calculate quantiles using the quantile function in R instead of using all the variables\columns from the data-frame as in the last example for summary statistics.
EXEC sp_execute_external_script @language = N'R' ,@script = N'SalesQuantile <- quantile(inputdata$TotalSalesAmount); SalesQuantile_df <- data.frame(SalesQuantile);' ,@input_data_1_name = N'inputdata' ,@input_data_1 = N'SELECT DG.StateProvinceCode, DG.StateProvinceName, SUM(FIS.SalesAmount) AS TotalSalesAmount FROM [dbo].[FactInternetSales] FIS INNER JOIN [dbo].[DimGeography] DG ON FIS.SalesTerritoryKey = DG.SalesTerritoryKey WHERE DG.CountryRegionCode = ''US'' GROUP BY DG.StateProvinceCode, DG.StateProvinceName;' , @output_data_1_name = N'SalesQuantile_df' WITH RESULT SETS ( ( Quantiles float null ) );
Quantile function results
Earlier examples show how you can pass a dataset back and forth between SQL Server and R runtime during script execution. But there might be few scenarios where you would not be passing any dataset from SQL Server but getting a result-set back from R-script. The below example shows how you can pass back a sample iris dataset from R to SQL Server. In this way you can pass any values back to SQL Server.
EXEC sp_execute_external_script @language = N'R' , @script = N'iris_data <- iris;' , @output_data_1_name = N'iris_data' WITH RESULT SETS (( "Sepal.Length" float not null, "Sepal.Width" float not null, "Petal.Length" float not null, "Petal.Width" float not null, "Species" varchar(100) ));
Passing a sample iris dataset from R to SQL Server
Please note, its recommended to use R client tools during development as they provide more developer productivity features like IntelliSense, debugging, proper runtime execution message reporting, etc. Once you have finalized your model, you can save the finalized trained model in SQL Server and then use a T-SQL stored procedure for scoring on the new data or observations.
Executing R Script on SQL Server with R Tools for Visual Studio
You need to install Microsoft R Client on your machine in order to start using the RevoScaleR package on a local workstation. See “Get Started with Microsoft R Client” for instructions on installing and configuring the RevoScaleR package in Visual Studio. “Welcome to R Tools for Visual Studio Preview!” for documentation and FAQs on R Tools for Visual Studio.
There are a few additional steps required if you need to access SQL Server data, or run R commands from a remote data science workstation. “Set up SQL Server R Services (In-Database)” has details on those steps.
Whether you have experience using Visual Studio or not, you will like the R Tools for Visual Studio extension (free) that brings streamlined Visual Studio IDE experience, tailored for the needs of data scientists or R users. Some of the great features available with R Tools for Visual Studio are below and for complete details you can refer “R Tools for Visual Studio”.
- IntelliSense – IntelliSense is a handy feature that provides instant context-aware help when writing code in Visual Studio. IntelliSense is available in both the R-source code Editor and the R Interactive Window.
- Debugging – Install Microsoft R Open or Microsoft R Server to enjoy enhanced multi-threaded math libs, cluster scale computing, and a high performance CRAN repo with checkpoint capabilities. In other words, you can specify breakpoints in your code and step through it during execution.
- Variable Explorer – The Variable Explorer keeps track of all the objects in your environment and supports drilling down arbitrarily deep. You can click on the magnifying glass to get a tabular view of your data frames.
- Plots – It allows you to use the powerful R libraries such as ggplot2 and ggviz to make beautiful graphs to understand your data and communicate your insights. You can also save visuals in different formats based on your need and preference.
- Versioning – It allows you to create a project based on R programming language for better code management and to maintain versions in either git repository or on Team Foundation Server.
Once you have installed R Tools for Visual Studio, you will be able to see the R Tools menu in the bar as shown below:
R Tools menu
Before you begin, you first need to create a project based on R, which allows you to organize your code along with other features as mentioned above.
Create a project based on R
While working on your model, you can either use CRAN APIs or use RevoScaleR APIs. As discussed earlier, RevoScaleR APIs push computations to the SQL Server computer, avoiding costly and insecure data movement across the network and moreover, they are also optimized to analyze data sets that are too big to fit in local memory and to perform computations distributed over several cores or processors and hence it is recommended to use RevoScaleR APIs. You can find the summary of all those RevoScaleR APIs in “Summary of rx Functions” and a comparison of CRAN APIs and RevoScaleR APIs in “Comparison of Base R and ScaleR Functions”.
LearnRinVS – Microsoft Visual Studio
Executing R Script on SQL Server from R Studio
You need to install Microsoft R Client on your machine in order to start using the RevoScaleR package from R Studio on the local workstation. See “Install Microsoft R Client” for instructions on installing and configuring RevoScaleR package in R Studio.
As a data scientist, it doesn’t matter which R client tool you started with, you can switch from one tool to other by just moving the code as is; no changes are required. For example, in the earlier section on using R Tools for Visual Studio, we created a histogram based on data from SQL Server, the same script can be taken to R Studio and can be executed as is, as shown below.
In the script below, first I set the library path to point to the folder where RevoScaleR package is available and then I load that package with the library function.
# Set the library path to point to the folder where RevoSCaleR package is available # You can check all the current paths defined with .libPaths() command .libPaths() .libPaths(c(.libPaths(), "C:\\Program Files\\Microsoft SQL Server\\MSSQL13.MSSQLSERVER\\R_SERVICES\\library")) # Load RevoScaleR package library(RevoScaleR) # Define the connection string # connStr <- "Driver=SQL Server;Server={YourServerName};Database=AdventureWorksDW2014;Uid=<user_name>;Pwd=<user password>" ## SQL Authentication connStr <- "Driver=SQL Server;Server={YourServerName};Database=AdventureWorksDW2014;Trusted_Connection=Yes;" ## Windows Authentication # Set ComputeContext sqlShareDir <- paste("D:\\MyShare\\", Sys.getenv("USERNAME"), sep = "") sqlWait <- TRUE sqlConsoleOutput <- FALSE cc <- RxInSqlServer(connectionString = connStr, shareDir = sqlShareDir, wait = sqlWait, consoleOutput = sqlConsoleOutput) rxSetComputeContext(cc) sampleDataQuery <- "SELECT DG.CountryRegionCode, DG.EnglishCountryRegionName, SUM(FIS.SalesAmount) AS TotalSalesAmount FROM [dbo].[FactInternetSales] FIS INNER JOIN [dbo].[DimGeography] DG ON FIS.SalesTerritoryKey = DG.SalesTerritoryKey GROUP BY DG.CountryRegionCode, DG.EnglishCountryRegionName" inDataSource <- RxSqlServerData(sqlQuery = sampleDataQuery, connectionString = connStr, colClasses = c(StateProvinceCode = "string", StateProvinceName = "string", TotalSalesAmount = "numeric"), rowsPerRead = 500) rxHistogram( ~ TotalSalesAmount, data = inDataSource, title = "Sales Amount Histogram")
As you can see in the script above and the image below, I am using RxInSqlServer API to Generates a SQL Server compute context using SQL Server R Services and then using the rxSetComputeContext API to change the execution context from local to SQL Server. You can use rxGetComputeContext API to check currently in-effect execution context.
Executing R Script on SQL Server from R Studio
Next I have used RxSqlServerData API to generate a SQL Server data source object with the query I have specified and used the colClasses API parameter to map data types between SQL Server and R (SQL Server and R do not support the same data types and hence it’s obvious there are type conversions taking place when sending data from SQL Server to R and vice versa). See “Working with R Data Types” to learn more about how to map SQL Server data types to R data types.
Finally, I call rxHistogram API and pass a formula and data source to generate the histogram for that.
Conclusion
SQL Server 2016 brings native support to doing advance analytics in the database itself using R Services. In this article, we looked into the details of conceptual execution flow, executing R scripts from SQL Server Management Studio and other R client tools.
Resources
Getting Started with R Services in SQL Server 2016 – Part 1
New Components in SQL Server to Support R Services
Data Science End-to-End Walkthrough
Data Science Deep Dive: Using the RevoScaleR Packages