Articles

Apache Spark in Microsoft Fabric

If you have used Spark in Azure Synapse, prepare to be pleasantly surprised by the compute experience in Microsoft Fabric: Spark compute starts a lot faster because the underlying technology has changed. The Data Engineering and Data Science experiences in Fabric include managed Spark compute which, like previous Spark compute, charges you while it is in use. The difference is that the nodes are reserved for you rather than allocated when you start the compute, so sessions start in 30 seconds or less instead of the roughly 4 minutes of waiting it takes for Azure Synapse compute to start. If you have capacity needs that the default managed Spark compute will not meet, you can always create a custom pool. Custom pools are created in a specific workspace, so you will need Administrator permissions on the workspace to create them. You can also choose to make the new pool the workspace default, so it will be what starts in that workspace.

Writing Spark Code in Fabric

If you are writing code in Spark, the two languages you will most likely be using are Python or Scala, but you could also choose Java, R, or ANSI SQL. Notice that unlike with Azure Synapse, .Net is not included as a language you can use, which is an interesting development. The other thing to keep in mind when writing SQL code in Spark is that you will be writing ANSI SQL, not the T-SQL you use in the SQL endpoints of lakehouses within Fabric. While T-SQL is ANSI compliant, I realized the extent of the differences when trying to use some of the DATEPART T-SQL functions, as the Spark equivalents have underscores, and you use instr instead of T-SQL's CHARINDEX. The differences are minor, and Stack Overflow or Copilot can help you sort them out. Just remember that you may not be able to use the exact same code as in the SQL endpoint, and you will be fine.
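As a rough sketch of the kind of translation involved, here is a hypothetical query written both ways; the Orders table and its columns are made up for illustration, and the Spark version runs through spark.sql() in a PySpark notebook cell.

# T-SQL version (warehouse or SQL endpoint):
#   SELECT DATEPART(year, OrderDate) AS order_year,
#          CHARINDEX('-', OrderNumber) AS dash_position
#   FROM dbo.Orders;

# Spark SQL equivalent in a notebook; Orders is a hypothetical table name.
df = spark.sql("""
    SELECT date_part('year', OrderDate) AS order_year,
           instr(OrderNumber, '-') AS dash_position
    FROM Orders
""")
display(df)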

Lakehouse Integration and Autogenerated Code

Like all Fabric experiences, the lakehouse is an integral part of the coding experience. When you create a new notebook, the first step is to add a lakehouse. Once it has been added, you can drag and drop elements into the notebook and Fabric will write the code for you. The code block shown below was literally created when I dragged the table publicholidays into the notebook.

Autogenerated Spark dataframe using the clicky-draggy method
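For anyone who cannot see the screenshot, the generated cell looks roughly like the following; MyLakehouse is a placeholder for whatever lakehouse is attached to the notebook.

# Roughly what Fabric autogenerates when you drag a table into a notebook cell;
# the lakehouse name is a placeholder.
df = spark.sql("SELECT * FROM MyLakehouse.publicholidays LIMIT 1000")
display(df)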

Generating code with Copilot in Spark

Spark in Fabric includes a library called Chat-magics, which adds AI features you can incorporate into your code if you have Copilot enabled in your tenant.  There are a few administrative steps you need to complete to make that work.  To enable Copilot, the first step is to confirm it is supported in your Fabric tenant, as it is not available everywhere, so check the availability list. You will also need to pay for the feature: Copilot is not part of the free trial, and you will need a Fabric F64 SKU or a P1 capacity to use it. Once you have validated that you can use Copilot, go to the admin settings and enable Copilot in your tenant, as shown below.

Fabric Copilot Admin settings

Once Copilot is enabled and active, you can open it by clicking the Copilot icon on the far right of the screen.  If you don't see it, click the ellipsis, the three-dot menu where Microsoft hides all the cool stuff, and you will find the icon in the dropdown menu.

Chat-magics: Copilot Spark help

Here are five Chat-magics commands designed to help you with your code; a quick example of a Chat-magics cell follows the list.

%%chat – Provides answers for you regarding items in your code, such as variables
%%describe – Provides a summary of the contents of a dataframe
%%code – Explain what code you want written and Copilot will generate it for you
%%add_comments – Most people forget to comment their code, and if this is you, you can have AI generate meaningful comments for you.
%%fix_errors – Using this command, Copilot will try to fix dependency, configuration, and resource allocation errors for you.
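As a sketch only (double-check the exact cell syntax against the Chat-magics documentation, and note that df_orders is just a placeholder dataframe name), a Copilot-assisted cell can be as simple as a magic command followed by a plain-language request:

%%chat
Which columns in the dataframe df_orders contain null values?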

In my next post I will provide examples of how to use Chat-magics commands in Fabric.

Yours Always,

Ginger Grant

Data aficionado et Data Raconteur

 

Exploring Azure Synapse

In the past few months, I have been examining Azure Synapse and what it can do.  When it was first released in November of 2019, the first functionality rolled out was an update of Azure SQL DW.  For this reason, many people think that Synapse is just an improved version of a cloud data warehouse.  Microsoft did improve SQL DW when it moved it to Synapse.  The biggest architectural design change is the separation of storage from compute, a theme in many cloud projects, which allows the compute power to be increased when need dictates and scaled down when computing needs change.  Within Synapse, resources are allocated as pools, and you can define a SQL pool to run a data warehouse and later change the compute to a different size.  You will still need to partition your DW, as large datasets require partitioning to perform well.  Subsequently, Microsoft released Azure Synapse Studio as a container for a larger environment of tools and notebooks to interact with them.

Non-Data Warehouse Elements of Azure Synapse

To me, the more interesting parts of Azure Synapse have nothing to do with data warehouses.  Azure Synapse also gives you the ability to query files stored in Azure Data Lake Gen 2 as if they were SQL tables. This is a great way to analyze large data without first cleaning it up and putting it into a relational environment. Within Synapse you can formulate a query using syntax for selecting parts of files, which provides the ability to look at many files as if they were one. You can also create processes which bring data into your Synapse environment using Orchestration, which anyone who has worked with Azure Data Factory will find very familiar. Synapse also includes the ability to analyze data in Cosmos DB without doing ETL or moving the data at all, using a scalable architecture which will not impact the transactions being processed simultaneously on the same Cosmos DB.
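The SQL-on-files querying described above is built into Synapse itself; as a loose Python illustration of the same "look at many files as one" idea from a Synapse Spark notebook (Spark pools are covered in the next section), here is a sketch with a made-up storage account, container, and folder path:

# Read every parquet file under a year's worth of folders as one dataframe.
# The abfss path and the Region column are hypothetical.
sales = spark.read.parquet("abfss://raw@mydatalake.dfs.core.windows.net/sales/2020/*/*.parquet")
sales.groupBy("Region").count().show()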

Azure Synapse and Spark


By far the most interesting component of Azure Synapse is the Spark connection. Microsoft has added the ability to create Spark pools in Azure Synapse.  To be honest, I was somewhat surprised that this functionality showed up here first and not in Azure Machine Learning, where to use Spark you need to access clusters created in Databricks.  Spark provides the ability to dynamically scale resources when running processes.  This is very handy when writing machine learning code, which can really use the performance improvements Spark brings.  Because this is Microsoft's Spark, you can also write your code in .Net if you like, in addition to the more common Spark languages: Scala, R, or Python.  You can also incorporate the AutoML API created for Azure Machine Learning in R and Python so that you can use the power of Azure to select your algorithm and hyperparameters instead of spending the time doing it yourself.
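To give a feel for the machine learning side, here is a minimal PySpark sketch of the kind of model training a Spark pool handles well; it uses plain Spark ML rather than the AutoML API mentioned above, and it assumes the hypothetical sales dataframe from the earlier sketch has numeric Quantity, Discount, and Revenue columns.

# Minimal Spark ML example (not the Azure AutoML API): assemble numeric features
# and fit a linear regression on a hypothetical sales dataframe.
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

assembler = VectorAssembler(inputCols=["Quantity", "Discount"], outputCol="features")
train_df = assembler.transform(sales).select("features", "Revenue")

model = LinearRegression(featuresCol="features", labelCol="Revenue").fit(train_df)
print(model.coefficients, model.intercept)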

Getting up to Speed with Synapse

There is a lot to learn when it comes to Synapse, as it combines many different components into one environment. As more and more data is migrated to the cloud, it is uniquely designed to handle big data components containing raw data and managed data lakes, as well as more traditional data warehouse needs.  It can also be the place where all of the data is processed, secured, cleaned, and analyzed using machine learning. There is a lot to cover, and since it is new, there are not many places yet where you can learn more about it.  If you are interested in a deep dive on Azure Synapse and how to use it in a modern data warehouse, sign up for my precon at PASS Summit 2020, where I will cover the topic in depth.

 

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur