Salesforce Data Cloud Data Ingestion

Data Cloud Topology

In a related article, I talked about the initial Data Cloud Setup and Provisioning Models that you can configure for Data Cloud.

Once you’ve made that decision, the next step is to understand the high-level data ingestion process of Data Cloud.

Data Ingestion Process Overview

In a nutshell:

  1. Data Cloud can be connected to a wide variety of data sources.
    These sources are connected through Data Streams and, once ingested, they create Data Source Objects.
  2. The Data Source Objects go through transformation, validation and normalisation steps in the Data Cloud Data Lake.
  3. Once normalised, the Data Source Objects automatically create a Data Lake Object, and these types of objects can be mapped to the SF Data Model.
  4. Data Spaces act as partitions, so you can decide how to organise the ingested data and divide it by region/brand/purpose, etc.
  5. Once normalised data is in a Data Space, it can be mapped to the Salesforce Data Model to create a Data Model Object. This is the final version of the data that is leveraged for insights, data models, segments, etc.

Salesforce Data Cloud Objects

There are 3 levels of objects that are created and leveraged in Salesforce Data Cloud Ingestion. Here is a breakdown of each type:

Data Source Object (DSO)

  • Stage 1: raw data, created when a data source is first ingested and a Data Stream is configured.
  • Format: multi-format (e.g: CSV, JSON, etc.); the original schema is preserved; multi-sourced (Mulesoft, Cloud Storage, etc.).
  • Purpose: the original data source and files/transient data of the organization, needed so a Data Lake copy can be created for later mapping.
  • Location: not physically stored. A staging or temporary view of the original, raw data, before it's created physically in the Data Lake as a DLO.

Data Lake Object (DLO)

  • Stage 2: automatically created as a locally stored copy of the DSO, including transformations, validation, normalisation, etc.
  • Format: schema is enforced; categorised into Data Cloud object types (Profile, Engagement, Other); Parquet-formatted Iceberg tables.
  • Purpose: the normalised, validated and transformed version of the DSO so it can be mapped to the SF Data Model. Objects cannot be used unless they're mapped to the Data Model.
  • Location: physically stored in the Data Lake, in its assigned Data Space. This is the core "copy" of the ingested data; data only exists at the DLO level in Data Cloud.

Data Model Object (DMO)

  • Stage 3: DLOs are mapped to a SF 360 Data Model object, creating a harmonized grouping of data that is mapped to the canonical Data Model of Salesforce.
  • Format: a harmonized, grouped view of the DLO, mapped to a data model object; the Type Category is inherited from the mapped DLO; Unified Individuals and Insights are DMOs.
  • Purpose: the mapped version of the DLO so it can be leveraged in Data Cloud features: insights, segments, identity resolution, etc.
  • Location: not physically stored. A DMO is just a "view" of the DLO in the Data Lake, not a copy of the DLO itself. It is not materialised.

Data Sources

There are many data sources you can ingest into Data Cloud. They can be, for instance:

  • A Salesforce Marketing Cloud Account
  • Your Commerce Cloud Instance
  • Salesforce Service Cloud or Sales Cloud Orgs
  • Events from a mobile app
  • META Ads
  • An SFTP Connector
  • Azure Storage
  • API batching
  • A specific Data Set within a Salesforce Marketing Cloud Personalization Account

OOTB Connectors

To make the data ingestion process as easy as possible, Data Cloud comes with out-of-the-box connectors for many Data Sources.

The connectors can be categorised into 3 main types:

  1. Salesforce Connectors
    Usual native Salesforce apps such as Salesforce Marketing Cloud, Service Cloud, etc.
  2. Connector Service
    These connectors leverage SDKs and Ingestion API to expand the sources you can connect (e.g: Mulesoft Anypoint Connector).
  3. Third-Party Connectors
    These integrate Third-Party sources for data to flow into or out of Data Cloud.

Data Bundles

Data Bundles are data sets that import pre-defined objects to be mapped into the SF Data Model easily.

For example, there are Data Bundles for Salesforce Marketing Cloud that include engagement data objects (e.g: sends, clicks, opens, etc).

This way, you don’t have to manually decide which objects and datasets to ingest into Data Cloud and how to best normalise them to map them later on.

Salesforce CRM Data Source Options

One of the core Data Sources you will want to ingest into Data Cloud is a Salesforce Org (e.g: Sales Cloud, Service Cloud, Commerce Cloud, Loyalty, etc).

For these, there are different ingestion options and best practices to keep in mind:

  • 1:1 CRM to DC Org Connection
    You may connect 1 Data Cloud Org to –> 1 SF CRM Org.
    Use Case: the provisioning model is to host Data Cloud on a stand-alone Org, which is then connected to the CRM Org of the organization, with a single line of business and data.
  • N:1 CRM to DC Org Connection
    You can connect N SF CRM Orgs to –> 1 Data Cloud Org.
    Use Case: aggregated data from multiple business lines, CRM orgs or brands is consolidated into one Data Cloud Org.
  • 1:N CRM to DC Org Connection
    You may also connect 1 single SF CRM Org to –> multiple Data Cloud Orgs.
    Use Case: the data from one single Salesforce CRM Org needs to be connected but organised by specific governance criteria (brand, region).

Salesforce Marketing Cloud Data Source Options

The second most common core data source you’ll want to leverage when using Data Cloud is, of course, Salesforce Marketing Cloud.

For Marketing Cloud, there are different ingestion options and an important drawback to mention:

  • 1:1 SFMC to DC Org Connection
    You can connect 1 SFMC Account to –> 1 Data Cloud Org.
    Use Case: Data Cloud is connected to the Enterprise parent level of an SFMC instance so data can be activated and used in all the child Business Units.
  • 1:N SFMC to DC Org Connection
    You can also connect 1 SFMC Account to –> multiple Data Cloud Orgs.
    Use Case: data from different Data Cloud Orgs can be activated in a single SFMC instance and/or in different Business Units, at the regional level, for example.
  • N:1 SFMC to DC Org Connection (currently not available)
    This is the current drawback of this connection type. At the moment, Data Cloud does not allow N:1 connections where a single Data Cloud Org is connected to –> more than one SFMC instance.

MC Personalisation Data Source Options

Finally, another of the most typical data sources you’ll want to connect is MC Personalisation. For this connector, here are the different options available:

  • 1:1 IS Account to Data Cloud Org
    You can connect 1 IS Account to –> 1 Data Cloud Org, and you can also decide whether to connect only 1 Data Set or multiple Data Sets from that IS Account.
    Use Case: only Data Set 1 from the IS Account is to be ingested and activated in the Data Cloud Org, for governance, brand or regional reasons.
  • N:1 IS Accounts to Data Cloud Org
    You may also connect multiple IS Accounts to –> 1 Data Cloud Org, and it’s possible to connect only specific Data Sets from each IS Account.
    Use Case: Data Set 02 from IS Account 1 and Data Set 01 from IS Account 2 are connected to the same Data Cloud Org 1.
  • 1:N IS Account to Multiple Data Cloud Orgs
    Out of the same IS Account, you can connect different Data Sets to –> multiple Data Cloud Orgs.
    However, you cannot connect the same Data Set to –> multiple Data Cloud Orgs.

Data Spaces

Data Cloud Data Spaces are partitions available within Data Cloud so you can organise your ingested data however works best for your organization, without having to purchase multiple Data Cloud Orgs.

Data Spaces can follow a logical pattern such as, for example:

  • By brand
  • By region
  • By purpose

Data Spaces require an additional add-on license. The current limit is 50 Data Spaces, or only 1 Data Space in a Developer Org.

Also, keep in mind that:

  • Every Salesforce Data Cloud Org is created with a “default” Data Space. All your Data Lake Objects will be associated to this default Data Space until you create and configure new ones.
  • Identity Resolution, Data Models, Insights, Einstein Studio AI Models, Data Actions, etc are built within the context of a Data Space.
  • You may associate the same Data Lake Object with multiple Data Spaces for different purposes.
  • Users can only manage and view data within their assigned Data Space.

Creating and Editing Data Spaces

Once you’ve purchased the additional add-on license to have more Data Spaces, follow these steps to create and edit Data Spaces:

  1. Go to Data Cloud Setup > Data Management > Data Spaces
  2. Click New and give a unique name to the Data Space
  3. Configure a Data Space prefix, starting with 1 letter and up to 3 alphanumeric characters.
    E.g: DS01
    Note: once you set the prefix, you cannot change it later. The prefix is used to distinguish between objects mapped in different Data Spaces.
  4. To edit an existing Data Space, go to Data Cloud Setup > Data Management > Data Spaces and click on the data space you’d like to edit.

Adding Data to Data Spaces

Follow these steps to add data to a specific Data Space:

  1. Go to Data Cloud > tab “Data Spaces”
  2. Click on the Data Space you want to add data to
  3. Click Add Data and select the Data Lake Objects
  4. Before finishing, you can decide whether to apply specific Filters to the DLO or just assign it without filters.
    Note: filters typically take the format Object > Column > Operator > Value and can be used, for instance, to segregate data based on brand, region, etc.
    E.g:
    Case > Contact ID > EQUALS > Brand_2

Removing data from Data Spaces

You may decide to remove Data Lake Objects (DLOs) from a given Data Space. To do so, follow these steps:

  1. Data Cloud > tab “Data Spaces”
  2. Select your data space
  3. Search for the DLO and click on the dropdown menu
  4. Click on Delete

Privacy and Consent

Probably the most important aspect of all the data you want to ingest: how to protect and honour customer privacy, data subject rights and consent management.

Data Subject Rights

Data Cloud allows you to honour the data subject rights of your customers, such as those defined by the famous European Union’s General Data Protection Regulation (GDPR).

There are 3 main Data Subject Rights that Data Cloud manages:

  1. Data Deletion or Right To Be Forgotten (RTBF)
  2. Data Access and Export (Portability)
  3. Restriction of Processing (RofP)

To manage all these data subject rights requests, you always need to use the Data Cloud Consent API.

The Data Cloud Consent API reads and writes to the Profile of a customer in Data Cloud. It does so by accessing and updating org-wide consent data (e.g: links between records, values of consent flags, etc).

These are the actions supported by the Consent API, as documented by Salesforce:

  • ShouldForget: the Right To Be Forgotten, permanently deleting PII (Personally Identifiable Information) data and all related records.
  • Processing: restricts the use of Data Cloud processes (e.g: insights, query, segmentations) to process the customer data.
  • Portability: used to allow the customer to have their Data Cloud data exported.

Data Subject Rights Request

To process a Data Subject Rights request, you need to use the Individual ID as the parameter that identifies the record in the Consent API.

If your Unified Individual for a customer named “David” is made up of 3 source versions of David:
Name: David 1
Individual ID: 001

Name: David 2
Individual ID: 002

Name: David 3
Individual ID: 003

Then, you will have to process a Consent API request with each Individual ID: 001, then 002 and finally, 003.
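As an illustrative sketch, here is a minimal Python example that submits a Right To Be Forgotten request for each of the three Individual IDs above. The host, API version and endpoint path are placeholders and should be checked against the official Data Cloud Consent API reference:

import requests

ACCESS_TOKEN = "REPLACE_WITH_OAUTH_TOKEN"            # placeholder token
INSTANCE_URL = "https://your-org.my.salesforce.com"  # placeholder host

# The three source Individual IDs that make up the Unified Individual "David"
individual_ids = ["001", "002", "003"]

for individual_id in individual_ids:
    # Illustrative Consent API call: one request per Individual ID,
    # using the shouldForget (Right To Be Forgotten) action.
    response = requests.post(
        f"{INSTANCE_URL}/services/data/v60.0/consent/action/shouldForget",
        headers={"Authorization": f"Bearer {ACCESS_TOKEN}"},
        params={"ids": individual_id},
    )
    response.raise_for_status()
    print(individual_id, "request accepted:", response.status_code)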

Data Deletion Request

To process a Right To Be Forgotten Data Subject Rights request, you will have to use the Consent API.

This request deletes the individual record of the customer from the Individual DMO and all the related DMOs. This means that all the relevant DMOs must be related to the Individual DMO.

Data Deletion Requests are processed again after 30, 60 and 90 days to make sure that a full deletion is executed.

Restriction of Processing Request

To process a Restriction of Processing request, you also need to use the Consent API.

This request can be processed for Individual and Unified Individual profiles. It restricts all data processing for the Individual and Unified Individual profiles within 24 hours.

Data Portability Request

To process a Data Portability request, you also have to go via the Consent API.

This request can be processed for Individual profiles. The request triggers an export of all the customer data stored in Salesforce Data Cloud for that Individual.

The export is a CSV file sent to the pre-defined AWS S3 bucket. The process may take up to 15 days.

Data Streams

Data Streams are the initial step for data ingestion in Data Cloud. In order to connect any data source, you need to go to Data Cloud and click on Data Streams to create your first one.

Below is an overview of some of the most important Data Streams currently available in Salesforce Data Cloud ingestion:

Salesforce CRM Data Stream

  • Ingest data from CRM across clouds (Service Cloud, Sales Cloud, Loyalty, Commerce Cloud, Omnichannel Inventory, etc).
  • Native, seamless integration thanks to SF proprietary APIs.
  • Pre-Made Data Bundles already mapped to the DC Data Model.

Salesforce Marketing Cloud Data Stream

  • Bring SFMC Data into Data Cloud.
  • You may ingest data from any Data Extension in SFMC, as well as all engagement data (e.g: clicks, opens), etc.
  • Native, seamless integration thanks to SF proprietary APIs.
  • Pre-Made Data Bundles already mapped to the DC Data Model.

Salesforce Commerce Cloud Data Stream

  • Bring Commerce Cloud Data into Data Cloud.
  • You may ingest Order, Related Customer and Catalog Data.
  • Native, seamless integration thanks to SF proprietary APIs.
  • Pre-Made Data Bundles already mapped to the DC Data Model.

Web SDK Data Stream

  • SDK/Tag to collect real-time web events and bring them into Data Cloud.
  • Unified SDK with Personalization enables you to both collect and action the data using the same tag.

Mobile SDK Data Stream

  • Mobile SDK to collect all mobile events, transactions, behaviours, etc and bring them into Data Cloud.
  • Unified SDK with Marketing Cloud enables you to both collect and action the data in push notifications, Journey Builder, in-app messages, etc.

Ingestion API Data Stream

  • Both Bulk and Streaming APIs available.
  • Use this data stream to send data from any application to Data Cloud.
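As a rough sketch of a streaming ingestion call in Python (the tenant host, connector name "website_events" and object name "runner_profiles" are hypothetical, and the exact endpoint shape should be verified against the Ingestion API documentation):

import requests

ACCESS_TOKEN = "REPLACE_WITH_DATA_CLOUD_TOKEN"           # placeholder token
TENANT_URL = "https://your-tenant.c360a.salesforce.com"  # placeholder host

# Illustrative payload: one record for a hypothetical "runner_profiles" object
# defined in an Ingestion API connector called "website_events".
payload = {
    "data": [
        {"customer_id": "001", "email": "david@example.com", "loyalty_tier": "Gold"}
    ]
}

response = requests.post(
    f"{TENANT_URL}/api/v1/ingest/sources/website_events/runner_profiles",
    headers={"Authorization": f"Bearer {ACCESS_TOKEN}"},
    json=payload,
)
response.raise_for_status()
print("Accepted:", response.status_code)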

Ingestion Personalization Data Stream

  • Native Integration to bring data from MC Personalization into Data Cloud.
  • You may ingest anonymous and known data.
  • Multiple options available with Data Sets: 1:1, N:1, 1:N.
  • Allows ingestion of all types of events in MC Personalization.

Ingestion Amazon S3 Data Stream

  • Allows you to ingest any S3 bucket data of any system into Data Cloud.
  • Automatic detection of data types, delimiters, date times, etc.
  • You may transform ingested data.
  • Custom refresh schedule (every 5, 10 or 30 minutes, hourly, daily, weekly, monthly).
  • Wildcard strings available for date time and file name matching.
  • Compression available (ZIP, GZ).
  • High-water mark for updates to only new records.
  • Seamless User Interface and quick integration.
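For instance, a hypothetical S3 file name pattern could use a wildcard so each daily extract is picked up by the same data stream (the file name is illustrative):
E.g:
orders_export_*.csv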

Ingestion Google Cloud Storage Data Stream

  • Allows you to ingest data from any system via Google Cloud Storage.
  • Automatic detection of data types, delimiters, date times, etc.
  • You may transform ingested data.
  • Custom refresh schedule (hourly, weekly, monthly).
  • Wildcard strings available for date time and file name matching.
  • High-water mark for updates to only new records.
  • Seamless User Interface and quick integration.

Ingestion Mulesoft Data Stream

  • Native Integration to bring data from Mulesoft into Data Cloud.
  • Bulk and Streaming APIs for ingestion.
  • Mulesoft expands connectivity with +250 OOTB connectors.
  • Easily build custom integrations.
  • API First Strategy and reduced time to value for use cases.

BYOL Data Federation

BYOL stands for Bring Your Own Lake.

The BYOL Data Federation is a capability of Salesforce Data Cloud that allows you to get direct access to your Partner Data.

Let’s see how this works:

General Data Storage in Data Cloud

Data Source > Data Stream > DSO > DLO

As we’ve explained above, when you want to ingest data from a Data Source, a Data Stream is configured in Data Cloud. This creates a temporary, non-materialised object called a Data Source Object (DSO).

When this happens, a Data Lake Object is automatically created off-core, on the Data Lake of Data Cloud, as soon as you configure the DSO. The DLO is a reference to that physically stored, real data.

That physical data is stored in the Data Cloud Data Lake, which is not on Salesforce core; it’s a partner data lake.

BYOL Partner Data Storage in Data Cloud

Ok, so that’s the usual behaviour for most data ingestion.

However, you may have important business data stored in a partner Data Lake, such as Snowflake, where there is already a “real, physical copy” of your company data. There is no need to create a data ingestion, DSO, etc. and store it again in Data Cloud (e.g: due to governance, security, etc).

What BYOL means is that you can directly query the data stored in Snowflake, without having to create a copy in Data Cloud.

The access is near real-time and zero-copy.

Zero-copy means the data is never copied into Data Cloud: you query it directly from Snowflake, and you can then map it, harmonize it and use it in insights, etc. as if it were native to Data Cloud.

Here are the 4 BYOL Data Federation partners currently available:

  • Amazon Redshift
  • Databricks
  • Google BigQuery
  • Snowflake

Unstructured Data in Data Cloud

Unstructured data is data that does not have a consistent and specific format. It is therefore data that cannot be stored in a relational database.

Examples of unstructured data:

  • an audio file
  • chat transcripts of sales agents’ conversations with customers
  • a PDF file
  • Knowledge Articles

How can you leverage this type of data, you might ask?

Well, Salesforce Data Cloud allows you to ingest unstructured data so that you can lay the foundation for your generative AI, analytics, etc.

The recommended steps for a Use Case using unstructured data are:

  1. External, unstructured data is connected to Data Cloud. There are 2 ways:
    external blob storage –> this creates a UDLO (Unstructured Data Lake Object)
    data stream from a DC connector –> this creates a structured Data Lake Object
  2. Configure a search index for the UDMOs or DMOs.
    This search index allows you to search and query that data.
  3. Then, you perform vector search queries from different apps such as Tableau, Prompt Builder or Einstein Copilot.

Data Ingestion Timings

Apart from what data to ingest, it’s important to decide how much data to ingest and how often this will happen.

Below is a table with the current Data Ingestion Timings, as shared by Salesforce documentation here:
https://help.salesforce.com/s/articleView?id=sf.c360_a_data_stream_schedule.htm&type=5

B2C Commerce

| Data Stream | Data Type | Refresh Mode | Refresh Schedule | Lookback |
|---|---|---|---|---|
| BundleProduct | Other | Full Refresh | Daily | 30 days |
| GoodsProduct | Other | Full Refresh | Daily | 30 days |
| MasterProduct | Other | Full Refresh | Daily | 30 days |
| ProductCatalog | Other | Full Refresh | Daily | 30 days |
| ProductCategory | Other | Full Refresh | Daily | 30 days |
| ProductCategoryProduct | Other | Full Refresh | Daily | 30 days |
| ProductOption | Other | Full Refresh | Daily | 30 days |
| ProductOptionValue | Other | Full Refresh | Daily | 30 days |
| ProductProductOption | Other | Full Refresh | Daily | 30 days |
| SalesOrder | Engagement | Upsert | Hourly | 30 days |
| SalesOrderCustomer | Profile | Full Refresh | Hourly | 30 days |

Cloud File Storage – AWS S3, GCS, Azure

| Data Stream | Data Type | Refresh Mode | Refresh Schedule | Lookback |
|---|---|---|---|---|
| AWS S3 | N/A | Upsert or Full Refresh | 5 min + | None |
| GCS | N/A | Upsert or Full Refresh | 5 min + | None |
| Azure | N/A | Upsert or Full Refresh | 5 min + | None |

Salesforce Marketing Cloud

| Data Stream | Data Type | Refresh Mode | Refresh Schedule | Lookback |
|---|---|---|---|---|
| Standard | Standard | Upsert or Full Refresh | Daily | 90 days |
| Standard Data Engagements | Standard Data Engagements | Upsert or Full Refresh | Hourly | 90 days |
| Data Extension Full Extract | Data Extension Full Extract | Upsert or Full Refresh | Daily | 90 days |
| Data Extension Delta Extract | Data Extension Delta Extract | Upsert or Full Refresh | Hourly | 90 days |

Salesforce CRM

| Data Stream | Refresh Mode | Refresh Schedule | Lookback | Details |
|---|---|---|---|---|
| SF CRM | Full Refresh | Every other week | No Limit | On top of the bi-weekly refresh, a full refresh also runs when: a Data Stream is created; a column is added or removed; >= 600k deletion records are detected |
| SF CRM | Upsert (incremental) | Every 10 mins | No Limit | Starts after a full refresh; if a field is deleted, it's processed in the next incremental refresh; refresh times may vary |

Salesforce MC Personalization

| Data Stream | Data Type | Refresh Mode | Refresh Schedule | Lookback |
|---|---|---|---|---|
| Users | Profile | Upsert | 15 mins | 0 days |
| Catalog Events | Events / Engagement | Insert | 2 mins | 0 days |
| Cart & CartLineItem events | Events / Engagement | Insert | 2 mins | 0 days |
| Order & OrderLineItem Events | Events / Engagement | Insert | 2 mins | 0 days |
| Custom engagement events | Events / Engagement | Insert | 2 mins | 0 days |

Ingestion API

| Data Stream | Data Type | Refresh Mode | Refresh Schedule | Lookback |
|---|---|---|---|---|
| Bulk Ingest API | Batch | Upsert | Daily, Weekly, Monthly | N/A |
| Streaming Ingest API | Streaming | Upsert | 15 mins | N/A |

Mobile and Web SDK

| Data Stream | Data Type | Refresh Schedule |
|---|---|---|
| Engagement SDK | Mobile | User Profiles: Hourly; Engagement: 15 mins |
| Salesforce Interactions SDK | Web | User Profiles: Hourly; Engagement: 15 mins |

Mulesoft (ingestion API)

| Data Stream | Data Type | Refresh Schedule |
|---|---|---|
| Ingestion API Connector | External System | 15 minutes |

Data Object Type Categories

When you set up a Data Stream in Data Cloud, you are required to select an Object Type Category for that Data Stream.

There are 3 Object Type Categories:

    1. PROFILE
      This data is geared towards segments, populations, etc. Any data you will want to segment by, use as the base for your segmentation, etc.
      Examples:
      Profile Attributes
      Party Identification
    2. ENGAGEMENT
      This category covers time-oriented data. An Event Time field is required for this category.
      Examples:
      Order
      Case
    3. OTHER
      This type of data does not fit into either of the other two categories. It can support either of them; for instance, engagement-like data that does not have an immutable date field.
      Examples:
      Price Book
      Catalog
      Lookups

Data Field Types

See below a detailed list of the currently available Data Field Types for Data Streams and data ingestion:

  • Text
    Stores a string of any type of text. It accepts both single-byte and multibyte characters, when supported by the locale.
    - Do not use colons (:) or single quotes (') in field values.
    - For text fields like file name, directory name, etc., ensure ASCII is used; UTF-8 might lead to errors.
    - Strings that contain only quotes "" or contain no value are treated as "empty".
  • Boolean
    Accepted values are: "true", "false", blank.
    This type of field cannot be used as a Primary Key, Record Modified field, etc.
  • Email
    Stores an email address value.
    It works exactly like the Text field type in Data Cloud, so it accepts any text value that can be inserted into a Text. Data Cloud will not validate the format of this field type's value.
  • Number
    Stores number values with a fixed scale of 18 and a precision of 38.
    Precision = total number of digits, irrespective of the decimal point location.
    Scale = total number of digits to the right of the decimal point.
    Important: a NULL value is stored if a number value is non-numerical or does not fit into the field's allowed range.
  • Percent
    Stores a percentage value. It works like the Number data type, accepting only numeric values.
    - A Percent data type can be used as a Primary Key, but not as an engagement date field or internal organization field.
    - Formula functions that accept numeric values can also take percent values.
    - As it's modeled on the Number data type, values such as 25% or 19.2% are not accepted.
  • Phone
    Stores a phone number.
    It works exactly like the Text field type in Data Cloud, so it accepts any text value that can be inserted into a Text. Data Cloud will not validate the format of this field type's value.
    - A Phone data type can be used as a Primary Key, but not as an engagement date field.
    - Formula functions that accept Text data type values can also take Phone data type values.
  • URL
    Stores URL values.
    It works exactly like the Text field type in Data Cloud, so it accepts any text value that can be inserted into a Text.
    - Data Cloud will not validate, parse or interpret the URL value.
    - A URL data type can be used as a Primary Key, but not as an engagement date field.
    - Formula functions that accept Text data type values can also take URL data type values.
  • Date
    Stores a calendar date value, excluding the time and/or time zone parts.
    If the data source includes a time or time zone part, the time part is ignored when ingested.
    - If a value cannot be parsed as a date, a NULL value is stored.
    - Example of dropping the time part: 2024-11-21 10:30:00 UTC --> 2024-11-21
  • Datetime
    Stores a calendar date and time of day.
    For the value to be valid, the date must include both the time and time zone parts. If these are not included, 00:00:00 UTC is used as the default value.
    - If a value cannot be parsed as a datetime, a NULL value is stored.
    - Abbreviated time zones such as CST are not valid; they are not supported by the ISO 8601 standard.
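Because Datetime values need an explicit time zone offset (abbreviations like CST are not valid), it can help to normalise timestamps before ingestion. A minimal Python sketch, with an illustrative raw value:

from datetime import datetime, timezone

# Hypothetical raw timestamp coming from a source system without an offset
raw = "2024-11-21 10:30:00"

# Parse it and attach an explicit UTC offset so it can be ingested as a valid Datetime
parsed = datetime.strptime(raw, "%Y-%m-%d %H:%M:%S").replace(tzinfo=timezone.utc)

print(parsed.isoformat())  # 2024-11-21T10:30:00+00:00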

Formula Fields

During the Data Stream configuration process, you might want to add extra fields.

Formula Fields are custom fields you add to the Data Stream, either hard-coded or derived from the values of other fields.

Formulas available follow this statement pattern:
CONCAT(value1, value2, …)
ISEMPTY(value)
IF(condition,resultIfTrue,resultIfFalse)
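For example, a hypothetical formula that builds a full name and falls back to a default when the last name is empty (the field names are illustrative, and exact quoting rules should be checked in the formula field reference):
IF(ISEMPTY(LastName), "Unknown", CONCAT(FirstName, " ", LastName))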

How to create a Formula Field

  1. Click on New Data Stream
  2. On that page, click New Formula Field
  3. Provide the Field Label (this is the display name)
  4. Provide the Field API name (the programmatic reference)
  5. Provide the Field Return Type (e.g: boolean, text, phone, number, percent, etc)
  6. Use the Syntax Editor to write your formula
  7. Validate both the Syntax and the Return Values using the Formula Testing Panel

Source Fields

It might be the case that you configured a new Data Stream and later on, you need to add new source fields from the raw Data Source.

You can do so at any time. Follow these steps:

  1. Go to the tab “Data Streams” and select your Data Stream
  2. Click on Add Source Fields
  3. For GCS and AWS S3 data streams, you can decide whether to Add Fields Manually or Add Discovered Fields.
  4. If the fields are added manually, enter the name of the file where the source schema is and click Verify. Data Cloud will show you all the available source fields.
  5. For Discovered Fields, Data Cloud will show you a Field Last Detected column so you can assess when a field was last detected and whether it’s relevant for you.
  6. Review the suggested Data Type for each of the fields you’re adding.
  7. Save.

Primary Keys and Data Lineage Fields

When working with Data Cloud, it’s crucial to work with the correct Primary Keys for each Data Source.

Deciding the PKs is part of the Data Discovery phase that should be carried out initially, before you even activate Data Cloud.

A Primary Key is the smallest field (or combination of fields) that uniquely identifies a row in an object.

E.g:
Name: John
Customer ID: 001

Name: John
Customer ID: 002

Were it not for the field Customer ID, which differs for each row, it would be impossible to know which John we’re talking about.

Primary Keys can also be Composite Keys, which means you identify unique rows using more than one column or field, combined into a single concatenated unique field.

E.g:
OrderID: 001
OrderLineItem: PFG

OrderID: 001
OrderLineItem: TDE

Composite Key: OrderID_OrderLineItem –> 001_PFG

If we’re looking at the Orders and different products (normally called “Line Items”) that John bought, the field OrderID would not be enough for us to uniquely identify an item.

We need the extra OrderLineItem field as Primary Key, so the Composite Key is a combination of the OrderID and the OrderLineItem fields, concatenated.

How to choose a Primary Key

Here are some best practices and things to avoid when choosing a Primary Key field:

  • Uniqueness:
    The PK must be a unique value.
  • Minimality:
    Try to use as few fields as possible. The more fields/values, the more confusing it can get.
  • Stability:
    If the PK changes over time, it may lead to issues. Choose a type of value that will be permanent.
  • Broad Scope:
    If you sell books and your PK is BookID, but you might also sell videogames in the future, BookID doesn’t seem like the best field for a Primary Key.
  • Data Available at time of entry:
    If, in an airline database, your Primary Key is a Composite Key of both FlightID and Departure_Date but the last field does not get populated when a client first registers, you’ll have issues.
  • Special Value Cases:
    Always ask about all the possible values that the field can get. E.g: are there flights without a FlightID value? Is it possible for a book not to have a BookID in our data set?
  • Not-Nullable Fields:
    The field or fields you choose to use as a PK must be not-nullable, so they always contain a value.
  • Composite Keys:
    If there is no single field that uniquely identifies a row, you may concatenate (with a formula) the values from two fields to create what’s called a Composite Key.
    E.g:
    CustomerID_EmailAddress
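    For instance, a hypothetical formula field for that composite key (field names are illustrative):
    CONCAT(CustomerID, "_", EmailAddress)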

Data Lineage Fields

When working in Data Cloud, you will have many levels of sources, connected objects, etc. Data Cloud comes with some fields that help identify the source or lineage of the data:

  1. Data Source
    This is a reference ID, also called external source ID, that identifies the name of the source.
    E.g: SFMC_6650912
    This means the source is SFMC and the Org ID is the number concatenated.
  2. Data Source Object
    This is a reference ID to identify which object the data comes from.
    E.g: SFMC_6650912_Lead
    This means the data is from SFMC, the Org ID is 6650912 and the source object is Lead.
  3. Internal Organization
    This is a reference ID to the Business Unit or any other organizational unit ID that the source has.
    In our example of SFMC, it would be the Business Unit MID.

Data Transforms

There are 2 main methods for Data Transformation in Salesforce Data Cloud:

  1. Batch Transformation
  2. Streaming Transformation

Batch Data Transforms

Batch Data Transforms allow you to set up an ongoing series of data operations to update and transform the data of a target Data Lake Object (DLO).

  • It works with a Visual Editor where you have nodes and drag and drop features to set up your transforms.
  • It allows you to perform a Full Refresh of the data, in bulk.
  • Once the first bulk run is over, it can either run Manually or On a Schedule.
  • It can have both a DLO or a DMO as source.
  • It accepts multiple source objects (e.g: joins).

To Create a Batch Data Transform:

  1. Go to the tab “Data Transforms” from Data Cloud
  2. Click New
  3. Select Batch Data Transform
  4. Use the visual editor to create your transform

Streaming Data Transforms

Unlike Batch Data Transforms, Streaming Data Transforms allow you to transform data in near real-time, as a continuous streaming process.

Streaming Data Transforms work like this:

  1. Read a record in source Data Lake Object 1.
  2. Reshape or transform the record data.
  3. Write one or more records to Data Lake Object 2.

Visually, from left to right, the order of steps can be explained like this:
Data Stream –> Source DLO –> Data Transform –> Target DLO

  • Streaming Data Transforms work with one record at a time.
  • They can only be used with DLOs, not DMOs.
  • They transform the data as it gets ingested.

To create a Streaming Data Transform, follow these steps:

  1. First, create a Target DLO: Data Lake Objects tab > New
  2. Enter Name and API Name.
  3. Select Category for the DLO.
  4. Once Active, go to Data Transforms tab > New
  5. Fill in required fields (e.g: label, target DLO, etc).
  6. In the Expression window, insert the SQL statement of your transformation.
  7. Click on Check Syntax to validate and save your data transform.
  8. The Streaming Data Transform starts immediately after being saved.
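As an illustrative sketch only (the DLO API names, field names and exact SQL function support below are hypothetical and should be verified in your org), the SQL statement of a streaming transform could look roughly like this:

SELECT
  src.order_id__c AS order_id__c,
  UPPER(src.country_code__c) AS country_code__c,
  CONCAT(src.order_id__c, '_', src.line_item_id__c) AS order_line_key__c
FROM Orders_Raw__dll AS src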

Summary – Data Cloud Ingestion

Data Cloud Data Ingestion is probably one of the most crucial aspects of using Data Cloud.

Here’s a recap of everything you need to account for when setting it up:

  1. Define Data Strategy
    Consider consent, data spaces, data sources, Primary Keys, connectors needed, etc.
  2. Select Data Sources
    Configure connectors or authenticate a new data source file, etc.
  3. Create Data Streams
    Choose a pre-defined bundle or select objects manually.
  4. Confirm Data Source Object Schema
    Data types, Primary keys, API names, etc.
  5. Apply Data Transforms
    Create Formula fields if needed, hardcoded or derived.
  6. Assign Data Space for the target DLO
    Assign a Data Space and decide if filters are needed based on data strategy.
  7. Configure Updates to Data Stream
    Refresh mode, schedule, etc.