Graph machine learning (GraphML) is a hot topic in machine learning. Data often fits better in graph form than as tabular data. Graphs allow you to express the relationships between data points (this person knows that person, this person read this book).
At The Trade Desk I joined the AI Lab, which focuses on exploring cutting-edge AI and ML techniques to address business needs. However, I didn’t join as an AI expert; I joined as a software engineer focused on building the tooling that allows researchers to do their research with internet-scale data.
When we started working with GraphML we found that the tooling was lacking. Everything from data ingestion, to graph building, management, and storage, right through to graph algorithms breaks down once you get to billions of nodes and tens of billions of edges. For example, there are over 200M active internet domains. Imagine graphing the traffic between them.
This talk explores our experiences of working with graphs at this scale. We cover multiple aspects, such as:
- Defining graphs (nodes, edges, attributes)
- Transforming and ingesting raw data into graph form
- Executing graph ML algorithms (e.g. FastRP) at scale
- Making it all scalable enough for production workloads, not just one-off experiments
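To make the algorithm step concrete, here is a toy, in-memory sketch of FastRP (node embeddings via random projection) using dense NumPy matrices. This is an illustrative assumption of the technique, not the talk's production implementation, which operates on distributed sparse data; the normalization scheme and iteration weights chosen here are likewise assumptions for the sake of the example.

```python
import numpy as np

def fastrp(adj, dim=4, weights=(0.0, 1.0, 1.0), seed=0):
    """Toy FastRP sketch: embed each node into `dim` dimensions.

    adj: dense (n, n) adjacency matrix (symmetric for an undirected graph).
    weights: contribution of each successive propagation step to the
    final embedding (illustrative values, not tuned).
    """
    rng = np.random.default_rng(seed)
    n = adj.shape[0]
    # Very sparse random projection matrix with entries in {-1, 0, +1}.
    proj = rng.choice([-1.0, 0.0, 1.0], size=(n, dim), p=[1/6, 2/3, 1/6])
    proj *= np.sqrt(3)
    # Degree-normalize the adjacency matrix (random-walk normalization).
    deg = adj.sum(axis=1, keepdims=True)
    a_hat = adj / np.clip(deg, 1, None)
    emb = np.zeros((n, dim))
    cur = proj
    for w in weights:
        cur = a_hat @ cur   # one more hop of neighborhood propagation
        emb += w * cur      # weighted sum of the propagation steps
    return emb

# Toy graph: a 4-node path 0-1-2-3.
adj = np.zeros((4, 4))
for u, v in [(0, 1), (1, 2), (2, 3)]:
    adj[u, v] = adj[v, u] = 1.0

emb = fastrp(adj, dim=4)
print(emb.shape)  # (4, 4): one embedding row per node
```

At billions of nodes, none of these dense matrices fit on one machine, which is exactly where the engineering challenges in this talk begin.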
One of the goals is to provide the tools not just to build one graph, but to allow researchers to define and experiment with different graphs and different algorithms as easily as possible. This is not ‘build a tool to generate this’; it is more ‘build a toolkit that allows people to build things like this’.
Every aspect has provided interesting challenges, from data loads taking orders of magnitude too long, to overloading databases, to scaling data processing to hundreds of nodes. We will discuss the power and limitations of tools such as graph databases and Spark, as well as our solutions to these challenges.
Whilst covering a lot of ground, this talk is accessible to all and will focus on the engineering challenges.