
Data Center Site Selection

Combine network requirements with geospatial analysis

Data Center Site Selection (DCSS) is a tool that combines network requirements with geospatial analysis to identify the best location for a new data center. DCSS takes into account factors such as network connectivity, power availability, and environmental risks to help organizations make informed decisions about where to locate their data centers.

So, we need to decide where to build or use data centers, and we want to know how much that will cost. For that, we need to know how much it costs to connect each demand point to each center. This creates a massive number of combinations. Luckily, we can formulate this as a mathematical model and let algorithms find the best solution for us.

By using DCSS, organizations can ensure that their data centers are located in the optimal place to meet their specific needs.

Let’s build it!

We can separate the method to build this solution into a few main categories:

  1. Data Sourcing
  2. Data Processing
  3. Graph Creation
  4. Model Formulation
  5. Visualization

Each of these typically requires specialized knowledge and tools. However, with the help of open source libraries and cloud computing, we can build this solution with ease.

Data Sourcing

The first step in building our solution is to source the data, including data on network connectivity, power availability, and environmental risks. There are many possible sources for this type of data: government agencies, private companies, open data repositories, and many more.

But for the purpose of this project, we will use the following data sources:

  • The Azure US data center locations with ExpressRoute
  • The US Boundary Polygon from Github user @scdoshi

Data Processing

Now we want to use some web scraping to get the data and transform it into GeoDataFrames.

This is possible using the requests package, combined with some regex after inspecting the raw data. Also, GeoPandas can read directly from a URL!

We want to get the locations from the strings and use geocoding to get the coordinates. We can use the geopy package to do this.

import re

import pandas as pd
import requests
from geopy.geocoders import Nominatim


def add_lat_lon(df):
    # Geocode each city name with Nominatim. Note: geocode() returns None
    # for unresolvable names, and Nominatim rate-limits heavy use.
    geolocator = Nominatim(user_agent="fused")
    df['location'] = df.apply(lambda x: geolocator.geocode(x.city), axis=1)

    df['lat'] = df['location'].apply(lambda x: x.latitude)
    df['lon'] = df['location'].apply(lambda x: x.longitude)

    return df


# Pull the raw ExpressRoute locations table from the Azure docs
data = requests.get('https://raw.githubusercontent.com/MicrosoftDocs/azure-docs/88e1ce95875f10a8b634e2fc471d660963a12074/includes/expressroute-azure-regions-geopolitical-region.md')

# Extract the North America row and split out the individual city names
locations = re.findall('North America.*', data.text)[0].split('|')[-2].split('<br/>')
# Collapse second sites (names ending in "2") into their base city and dedupe
locations = set([l.replace('2', '') for l in locations])

dc_df = pd.DataFrame(locations, columns=['city'])
dc_df = add_lat_lon(dc_df)

Then we convert everything into H3 hexagons.

import geopandas as gpd
import h3


def get_h3s():
    # Read the US boundary polygon straight from GitHub
    usa = gpd.read_file('https://raw.githubusercontent.com/scdoshi/us-geojson/master/geojson/nation/US.geojson')
    polygons = list(list(x.geoms) for x in usa.geometry)[0]

    # Keep the largest polygon (the contiguous US), dropping the small islands
    polygons.sort(key=lambda x: x.area)
    coords = polygons[-1].exterior.coords[:]

    # GeoJSON stores (lon, lat); h3 expects (lat, lng), hence the reversal
    coords = list(c[:2][::-1] for c in coords)
    poly = h3.LatLngPoly(coords)  # h3 >= 4.0 API

    return set(h3.h3shape_to_cells(poly, res=3))


dc_df['h3_3'] = dc_df.apply(lambda x: h3.latlng_to_cell(x.lat, x.lon, 3), axis=1)
us_h3s = get_h3s()

Now that the data environment is set up, we can start to build the graph. Using H3 as the interface has the added benefit that we can ingest any future data sets and scale the granularity of the model too.

Graph Creation

The next step is to create a graph that represents the network connectivity between the data center locations. This graph will be used to model the network connectivity between the data centers and to identify the optimal location for a new data center. We can use the NetworkX library to create and manipulate the graph.

It is also worth thinking about the kind of data structure we want when formulating the model. There are many choices, but the graph library abstracts this away; for large-scale models there are more efficient representations, such as edge lists or adjacency matrices.
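To make the trade-off concrete, here is the same kind of network held as a plain edge list and as an adjacency (cost) matrix, in pure Python (the node indices and costs are made up for illustration):

```python
INF = float('inf')

# edge list: one (source, target, cost) tuple per edge; compact for sparse graphs
edge_list = [(0, 1, 100.0), (0, 2, 100.0), (3, 1, 100.0), (3, 2, 100.0)]

# adjacency matrix: adj[s][e] holds the cost, INF where no edge exists;
# O(1) lookup at the price of O(n^2) memory
n = 4
adj = [[INF] * n for _ in range(n)]
for s, e, cost in edge_list:
    adj[s][e] = cost
```

The edge list wins for sparse national-scale networks; the matrix only pays off when most node pairs are actually connected.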

By taking the H3 indices we know which cells are next to one another. We separate the demand locations from the data center locations and create a graph with edges between them, along with data about costs and capacities.
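As a rough sketch of that edge-building step in pure Python (the coordinates, names, and haversine-distance-as-cost are all stand-ins; in the real pipeline the points come from the H3 cells and the cost from actual connectivity pricing):

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(a, b):
    # great-circle distance between two (lat, lon) points, in km
    lat1, lon1, lat2, lon2 = map(radians, (*a, *b))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(h))

# hypothetical demand cells and data center cells as (lat, lon) centroids
demand_points = {'d1': (40.71, -74.01), 'd2': (41.88, -87.63)}
data_centers = {'c1': (39.04, -77.49), 'c2': (41.25, -95.93)}

# edge dict in the shape the graph builder expects: {(data_center, demand): cost}
edges = {
    (dc, d): round(haversine_km(dc_xy, d_xy), 1)
    for dc, dc_xy in data_centers.items()
    for d, d_xy in demand_points.items()
}
```

Each candidate center gets an edge to each demand point, so the dict has |C| × |D| entries, which is exactly the shape the assignment model below consumes.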

Next we create a toy graph that will let us go to the next step.

    # test_data
    d_graph = nx.DiGraph()
    d_graph.add_edge(0, 1, cost=100.)
    d_graph.add_edge(0, 2, cost=100.)
    d_graph.add_edge(3, 1, cost=100.)
    d_graph.add_edge(3, 2, cost=100.)
    d_graph.nodes[0]['cost'] = 10000.
    d_graph.nodes[0]['capacity'] = 1.
    d_graph.nodes[3]['cost'] = 10000.
    d_graph.nodes[3]['capacity'] = 1.

From this definition we can start to define the function that will create the graph from the data.

import networkx as nx


def create_assignment_graph(edges):
    # edges: {(data_center, demand): cost}
    d_graph = nx.DiGraph()

    data_centers = set()
    for (s, e), cost in edges.items():
        d_graph.add_edge(s, e, cost=cost)
        data_centers.add(s)

    # every data center gets the same build cost and capacity for now
    for one in data_centers:
        d_graph.nodes[one]['cost'] = 20_000
        d_graph.nodes[one]['capacity'] = 1000

    return d_graph

Now that the graph is created, we can start to formulate the model.

Model Formulation

The model is a simple assignment problem. We want to assign the demand locations to the data center locations in a way that minimizes the total cost. The cost is a combination of the network connectivity cost and the cost of building a new data center. The model can be formulated as follows:

Here’s the math, which looks great thanks to MathJax and Hugo.

\[ \begin{array}{ll} \min & \sum_{i \in C} \sum_{j \in D} c_{ij} x_{ij} + \sum_{i \in C} o_i z_i \\ \text{s.t.} & \sum_{i \in C} x_{ij} = 1 \quad \forall j \in D \\ & \sum_{j \in D} x_{ij} - p_i z_i \leq 0 \quad \forall i \in C \\ & L \leq \sum_{i \in C} z_{i} \leq U \\ & x_{ij} \in \{0, 1\} \quad \forall i \in C, j \in D \\ & z_i \in \{0, 1\} \quad \forall i \in C \\ \end{array} \] \[ \begin{aligned} & \text{where:} & \text{ } \\ & D \text{ is the set of demand locations} \\ & C \text{ is the set of data center locations} \\ & U \text{ is the max number of data center locations} \\ & L \text{ is the min number of data center locations} \\ & c_{ij} \text{ is the cost of connecting demand location } j \text{ to data center location } i \\ & o_i \text{ is the cost of building a data center at location } i \\ & p_i \text{ is the capacity of the data center at location } i \\ & x_{ij} \text{ is a binary variable indicating whether demand location } j \text{ is assigned to data center location } i \\ & z_i \text{ is a binary variable indicating whether a data center is built at location } i \end{aligned} \]
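To sanity-check the formulation, we can brute-force the toy instance from the Graph Creation section: enumerate every open/close decision z and every assignment x, keep the feasible combinations, and take the cheapest. A minimal sketch in pure Python (the costs and capacities mirror the toy graph; L=1 and U=2 are assumed bounds):

```python
from itertools import product

# toy instance from the Graph Creation section: hubs 0 and 3, demands 1 and 2
cands = [0, 3]
demands = [1, 2]
edge_cost = {(0, 1): 100.0, (0, 2): 100.0, (3, 1): 100.0, (3, 2): 100.0}
open_cost = {0: 10000.0, 3: 10000.0}
capacity = {0: 1, 3: 1}
L, U = 1, 2  # assumed bounds on the number of open hubs

best_cost, best_plan = float('inf'), None
for z in product([0, 1], repeat=len(cands)):
    if not L <= sum(z) <= U:
        continue
    open_hubs = [c for c, zi in zip(cands, z) if zi]
    # assign every demand to one open hub, respecting capacities
    for assign in product(open_hubs, repeat=len(demands)):
        if any(assign.count(c) > capacity[c] for c in open_hubs):
            continue
        cost = sum(edge_cost[c, j] for c, j in zip(assign, demands))
        cost += sum(open_cost[c] for c in open_hubs)
        if cost < best_cost:
            best_cost, best_plan = cost, dict(zip(demands, assign))

print(best_cost, best_plan)  # → 20200.0 {1: 0, 2: 3}
```

With unit capacities, a single hub can never serve both demand points, so the capacity constraint forces both hubs open; this is exactly what the solver should reproduce at scale.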

Transforming this into code is relatively simple with the help of the highspy library, which gets us free and open-source access to HiGHS, one of the latest and greatest mathematical optimization solvers.

HiGHS is based on the high performance dual revised simplex solver for LP developed by Qi Huangfu, the novel interior point solver for LP developed by Lukas Schork, the active set QP solver written by Michael Feldmeier, and the branch-and-cut MIP solver written by Leona Gottwald. The project is managed by Julian Hall, and Ivet Galabova continues to develop and maintain the underlying software engineering.

We can use highspy to define our mathematical optimization model as follows:

import highspy
import numpy as np

h = highspy.Highs()
inf = highspy.kHighsInf

# `graph` is the assignment graph built above; demands are sinks, hubs are sources
tasks = {j: f'{j}' for j in graph.nodes() if graph.out_degree[j] == 0}
print(f'{len(tasks)} demand locations')
cands = {i: f'{i}' for i in graph.nodes() if graph.in_degree[i] == 0}
print(f'{len(cands)} hub locations')
edges = {(s, e): f"{s}_{e}" for s, e in list(graph.edges())}

# binary assignment variables x and hub-open indicators z
edge_assi = int_var_dict(h, edges, 'x', lb=0, ub=1)
c_nodes = int_var_dict(h, cands, 'z', variable_index_count=len(edge_assi), lb=0, ub=1)

# Creating task assignment constraints

for j in tasks:
    h.addRow(
        1, 1, len(graph.in_edges(j)),
        np.array([edge_assi[f'x_{i[0]}_{j}'] for i in graph.in_edges(j)]),
        np.array([1 for _ in graph.in_edges(j)])
    )
    for s, e, data in graph.in_edges(j, data=True):
        h.changeColCost(edge_assi[f'x_{s}_{e}'], data['cost'])

# Creating indicator constraints so a hub must be opened before serving demand

for i in cands:
    h.addRow(
        -inf, 0, len(graph.out_edges(i)) + 1,
        [edge_assi[f'x_{i}_{j[1]}'] for j in graph.out_edges(i)] + [c_nodes[f'z_{i}']],
        [1 for _ in graph.out_edges(i)] + [-graph.nodes[i]['capacity']]
    )

# Bound the number of open hubs: L = 1, U = 4
h.addRow(
    1, 4, len(c_nodes),
    [c_nodes[f'z_{c}'] for c in cands], [1 for _ in c_nodes])

for c in cands:
    h.changeColIntegrality(c_nodes[f'z_{c}'], highspy.HighsVarType.kInteger)
    h.changeColCost(c_nodes[f'z_{c}'], graph.nodes[c]['cost'])

Now the model can be solved, and we need to visualize the results. This is where the final step comes in.
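After the solver runs, the open hubs and assignments can be read back from the solution vector. A sketch of that readback, demonstrated on a hand-built value list (in highspy the values would come from h.getSolution().col_value after h.run(); treat that attribute path as an assumption):

```python
def read_solution(col_value, edge_assi, c_nodes, tol=0.5):
    # binary variables: treat anything above tol as selected
    open_hubs = [name for name, idx in c_nodes.items() if col_value[idx] > tol]
    assignments = [name for name, idx in edge_assi.items() if col_value[idx] > tol]
    return open_hubs, assignments

# hand-built example: four x columns followed by two z columns
edge_assi = {'x_0_1': 0, 'x_0_2': 1, 'x_3_1': 2, 'x_3_2': 3}
c_nodes = {'z_0': 4, 'z_3': 5}
col_value = [1.0, 0.0, 0.0, 1.0, 1.0, 1.0]
print(read_solution(col_value, edge_assi, c_nodes))
# → (['z_0', 'z_3'], ['x_0_1', 'x_3_2'])
```

The tol threshold guards against the small fractional noise MIP solvers sometimes leave on binary variables.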

Visualization

The final step in building this solution is to visualize the results.

Actually, I visualized throughout the whole development process, dynamically, using Fused.

This lets you see the data as you code, which can accelerate the dev process. It also lets you share the code and data with others.

This end-to-end solution is available as a UDF!

This means you can dynamically run the code and see the results in the browser. Change a variable for cost, capacity or data source and it reevaluates in seconds!

This is then integrated with Mapbox and Deck to create the visualization you saw at the top of the page.

And of course this UDF becomes an API, which you can integrate with your own data and systems.

Conclusion

In conclusion, Data Center Site Selection (DCSS) is a powerful solution that combines network requirements with geospatial analysis to help organizations make informed decisions about where to locate their data centers. By using this tool, organizations can ensure that their data centers are located in the optimal place to meet their specific needs. With the help of open source libraries and cloud computing, this solution can be built with relative ease.

The next step is to layer in business requirements and other data sources to make the model more accurate. These could include factors such as energy costs, carbon targets, network resilience, labor costs, tax incentives, and real estate prices. By incorporating these additional factors, Data Center Site Selection can become an even more powerful tool for organizations making data center location decisions.

Contact us at info@pozibl.com to learn more about how we can help you build this solution for your organization.