Assessing credit card risk with RDFox rules
Every year, credit card fraud causes massive losses for banks, businesses and their customers, and prevention is a constant race between programmers and criminals. With the rise in popularity of online shopping, we have seen a steep increase in so-called “card-not-present” fraud, where credit card data (including the security code) is stolen and used without the physical card ever leaving the owner’s wallet.
To counter this, a number of companies are implementing additional checks in their payment processes, like sending one-time passwords to shoppers’ phones, but the extra steps can be a deterrent for potential customers, which hurts the store’s bottom line. The other method of prevention is to identify and flag transactions that do not align with the card owner’s previous habits.
This article showcases a novel approach to preventing credit card fraud. RDFox and Data Lens have partnered up to demonstrate their unique capabilities which can be utilised to transform and analyse credit card data and prevent fraud.
1. RDFox
RDFox is a knowledge graph and semantic reasoning engine. As a highly optimised in-memory solution, RDFox allows us to work with very large data sets without sacrificing speed. The flexible triplestore structure and incremental reasoning make it the perfect technology for this use case. Utilising its unique rules system we can ensure that our calculations take any new input into account.
2. Data Lens
We use Data Lens to make building knowledge graphs in RDFox much simpler and faster. The Data Lens platform can build knowledge graphs from any source database or data format with no engineering, just configuration.
Suppose we are a credit card provider and collect a variety of data about our UK customers and their transactions. For each card we issue, we keep track of who the owner is and what other cards they have. When the customer makes a transaction, we check what country and city it took place in, what device was used and what vendor was involved. Soon we amass enough information to start assessing whether given behaviour is unusual for a given card-holder.
To assess whether given behaviour is unusual, we need to analyse the data. At present the data exists in two separate sources: a CSV file and a JSON file. We use Data Lens to read both sources, transform the data into RDF, and insert it into RDFox by following these steps:
Configure a Structured File Data Lens to transform the CSV data
Configure a Structured File Data Lens to transform the JSON data
Configure a Data Lens Writer to write the data to RDFox after conversion
Data Lens uses RML to configure the mapping between the source data format (in this case CSV or JSON) and the target data format (RDF). RML (RDF Mapping Language) is a language for expressing customised mappings from heterogeneous data structures and serializations to the RDF data model.
Now that we have turned the data into RDF, linked it, and inserted it into RDFox using Data Lens, our data looks like this:
For this demonstration we have used a simple example. In real life RDFox can manage far more complex data sets which reflect the bigger picture.
Using reasoning, we can compute a risk score for each transaction. If a transaction’s score is above a specified threshold, it is flagged for further investigation.
For the purposes of this article, we have chosen four factors that contribute to a transaction’s overall score: device use history, amount transferred, vendor trustworthiness and location plausibility. To compute the value for each factor we use RDFox rules. These are more efficient than INSERT queries: when new triples are added to the store, rules are triggered automatically and evaluated incrementally, so we do not waste time re-deriving previously inferred triples.
First, to make things just a bit easier for ourselves, we add a direct link from a transaction to the person who made it, as well as to the previous transaction made by that person. This will make other rules simpler to write and quicker to materialise.
RDFox uses a powerful extension of the declarative rule language Datalog. An example rule that creates the link described above would look like this:
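For readers less familiar with Datalog, the same linking logic can be sketched procedurally in Python. This is only an analogue of what such a rule derives, not RDFox syntax. The `Transaction` class, its field names and the `card_owner` lookup are assumptions made for illustration and do not come from the original data set.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, List, Optional


@dataclass
class Transaction:
    tx_id: str
    card_id: str
    timestamp: datetime
    person_id: Optional[str] = None    # derived link: transaction -> person
    previous_id: Optional[str] = None  # derived link: transaction -> previous transaction


def link_transactions(transactions: List[Transaction],
                      card_owner: Dict[str, str]) -> None:
    """Fill in person_id and previous_id in place.

    card_owner maps a (hypothetical) card id to the id of the person who owns it.
    The previous transaction is tracked per person, across all of their cards.
    """
    last_seen: Dict[str, str] = {}  # person id -> id of their latest transaction so far
    for tx in sorted(transactions, key=lambda t: t.timestamp):
        tx.person_id = card_owner[tx.card_id]
        tx.previous_id = last_seen.get(tx.person_id)
        last_seen[tx.person_id] = tx.tx_id
```

Unlike this loop, which has to be re-run over the whole list, a rule in the store keeps the derived links up to date as new transactions arrive.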
Each time a device is used, we save important information, such as whether the device has previously been used to make payments from a given account and whether it has been used within the UK. For every transaction, we also calculate the average amount the person spent in one go during the previous month.
N.B. The dataset that we are using for this demonstration is limited to the UK; however, the same strategy can be used elsewhere.
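As a rough illustration of these history checks, here is a minimal Python sketch. The `Payment` record and its fields are hypothetical, and "the previous month" is assumed here to mean the 30 days before the transaction, which the text does not specify.

```python
from datetime import datetime, timedelta
from typing import List, NamedTuple, Optional


class Payment(NamedTuple):
    person_id: str
    device_id: str
    country: str        # country where the payment took place, e.g. "GB"
    amount: float
    timestamp: datetime


def device_used_by_person(history: List[Payment], person_id: str,
                          device_id: str, before: datetime) -> bool:
    """True if this person paid with this device at any time before `before`."""
    return any(p.person_id == person_id and p.device_id == device_id
               and p.timestamp < before for p in history)


def device_used_in_uk(history: List[Payment], device_id: str,
                      before: datetime) -> bool:
    """True if this device was used for any UK payment before `before`."""
    return any(p.device_id == device_id and p.country == "GB"
               and p.timestamp < before for p in history)


def average_previous_month(history: List[Payment], person_id: str,
                           before: datetime,
                           window_days: int = 30) -> Optional[float]:
    """Average amount per payment in the window before `before`.

    Returns None when the person made no payments in that window.
    The 30-day window is an assumption of this sketch.
    """
    start = before - timedelta(days=window_days)
    amounts = [p.amount for p in history
               if p.person_id == person_id and start <= p.timestamp < before]
    return sum(amounts) / len(amounts) if amounts else None
```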
Then we determine the risk scores for all the factors using the following naive approach:
Device risk: If the device has never been used by a given person, we add 5. If the device has never been used in the UK, we add another 5
Vendor risk: For every transaction, we find what vendor it was made with and their trust rating (1, 2, 3, 4 or 5). We then assign the transaction a vendor risk value equal to that rating (since in our data set a rating of 5 means that the merchant is untrustworthy)
Amount risk: We take either 50 or the average of the previous month’s transactions, whichever is higher. We divide the transaction amount by this number and then apply ATAN to it in order to bound the result. Finally, we multiply this by a factor of 10/π (so that the maximum possible value is 5)
Location risk: If a given transaction has a preceding one, we find the time and location (latitude and longitude) of both and compute the angle between those two points along the Earth’s surface. We multiply this angle by 30 to get approximately 1/210 of the actual distance travelled (as 210 x 30 = 6300 is roughly the radius of the Earth in kilometres). We chose this number because 210kph is a sensible baseline speed, e.g. a high speed train travels at 1.5x this speed, a plane at 4x this speed, and we don’t want to flag transactions made right before and after a trip. If the resulting value is greater than 5, we add another 100 to indicate that this situation is not physically possible.
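The four factors above can be sketched in plain Python as follows. This is an illustration of the arithmetic rather than the RDFox rules themselves, and all function and parameter names are made up for the example. One point is an assumption: the description does not spell out how the elapsed time enters the location score, so this sketch divides the scaled distance by the hours elapsed between the two transactions, which turns the value into a multiple of 210kph. The angle is computed with the haversine formula, which is one common choice.

```python
import math
from typing import Optional


def device_risk(used_by_person_before: bool, used_in_uk_before: bool) -> int:
    """5 points for a device new to this person, 5 more if never used in the UK."""
    risk = 0
    if not used_by_person_before:
        risk += 5
    if not used_in_uk_before:
        risk += 5
    return risk


def vendor_risk(trust_rating: int) -> int:
    """The rating is used directly: in this data set 5 means untrustworthy."""
    if trust_rating not in (1, 2, 3, 4, 5):
        raise ValueError("trust rating must be an integer from 1 to 5")
    return trust_rating


def amount_risk(amount: float, monthly_average: Optional[float]) -> float:
    """atan bounds the ratio below pi/2, so the result stays below 5."""
    baseline = max(50.0, monthly_average if monthly_average is not None else 0.0)
    return math.atan(amount / baseline) * 10 / math.pi


def central_angle(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Angle in radians between two points on a sphere (haversine formula)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * math.asin(min(1.0, math.sqrt(a)))


def location_risk(lat1: float, lon1: float, lat2: float, lon2: float,
                  hours_elapsed: float) -> float:
    """Implied travel speed as a multiple of 210kph, plus 100 if above 5."""
    if hours_elapsed <= 0:
        raise ValueError("hours_elapsed must be positive")
    scaled_distance = central_angle(lat1, lon1, lat2, lon2) * 30  # ~ km / 210
    # ASSUMPTION: dividing by elapsed hours is this sketch's reading of the text.
    risk = scaled_distance / hours_elapsed
    if risk > 5:
        risk += 100  # faster than roughly 1050kph: not physically plausible
    return risk
```

Under this reading, a transaction amount equal to the baseline gives an amount risk of 2.5, and a trip from London to Paris spread over three hours stays well below the cut-off of 5.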
All of this can be achieved using RDFox’s advanced reasoning engine. Finally, we sum all the risk factor values for a given transaction and save that as its total risk. We ensure that the rule for that works even if not all the factors are present. This will help us avoid failure if we work with incomplete data.
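The summation step, including the tolerance for missing factors, might look like this in Python. The factor names and the use of `None` for a factor that could not be computed are assumptions of this sketch.

```python
from typing import Mapping, Optional

# Hypothetical names for the four factors described in the article.
FACTORS = ("device", "vendor", "amount", "location")


def total_risk(factors: Mapping[str, Optional[float]]) -> float:
    """Sum whichever factor values are present.

    A factor that is absent from the mapping, or present as None,
    simply contributes nothing instead of causing a failure.
    """
    values = (factors.get(name) for name in FACTORS)
    return float(sum(v for v in values if v is not None))
```

For example, a first-ever transaction has no preceding one and therefore no location risk, but it still receives a total from the remaining factors.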
We can use the RDFox web console and its exploration feature to more closely examine the data. Let us take a look at the transaction with the highest total risk score:
The most noticeable thing about this data is that the person’s previous transaction happened in the United States not even 20 minutes before and now they are in Germany, which is not possible. We can safely assume that this transaction is fraudulent.
But where should we put the threshold for flagging transactions?
Since this is an artificially generated data set, we happen to know exactly which transactions we are looking for: there are a total of 43 fraudulent ones. We pick 15 as a reasonable risk threshold, and all transactions with a score higher than that are flagged. Only one legitimate transaction has a total risk score of over 15, so with this threshold we get a false positive rate of less than 7%.
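A minimal Python sketch of the flagging step is shown below. The function names are hypothetical, and measuring false positives as the share of flagged transactions that are legitimate is only one possible definition, chosen here for illustration.

```python
from typing import Dict, Set


def flag_transactions(scores: Dict[str, float],
                      threshold: float = 15.0) -> Set[str]:
    """Return the ids of all transactions scoring strictly above the threshold."""
    return {tx_id for tx_id, score in scores.items() if score > threshold}


def share_of_false_flags(flagged: Set[str], known_fraud: Set[str]) -> float:
    """Fraction of flagged transactions that are not in the known-fraud set."""
    if not flagged:
        return 0.0
    return len(flagged - known_fraud) / len(flagged)
```

Because the data set is generated, the set of known fraudulent transactions is available and different thresholds can be compared in this way before settling on one.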
Transaction risk scores on a logarithmic scale (Image generated by the author)
Moreover, if we look at that one legitimate transaction, we find that it follows directly after a fraudulent one (hence its high risk score caused by the impossible speed of travel required to get from where the previous transaction was made to where this one took place), so all the transactions we flag in this way are tied to accounts that have recently been compromised.
To sum up…
By transforming disparate data sources into RDF using Data Lens, gathering that information in a graph database and utilising RDFox’s powerful reasoning capabilities, we can detect and flag fraudulent credit card transactions based on complex relationships in our data. With the right rules and risk score definitions (these could be fine-tuned using machine learning) we can greatly reduce the losses caused by this type of crime. RDFox is perfectly suited for this type of application because its unmatched speed allows it to effortlessly process large amounts of information and the reasoning engine ensures we do not have to worry about keeping our data consistent.
Team and Resources
Oxford Semantic Technologies
The team behind Oxford Semantic Technologies started working on RDFox in 2011 at the Computer Science Department of the University of Oxford with the conviction that flexible and high-performance reasoning was a possibility for data intensive applications without jeopardising the correctness of the results. RDFox is the first market-ready knowledge graph designed from the ground up with reasoning in mind. Oxford Semantic Technologies is a spin out of the University of Oxford and is backed by leading investors including Samsung Venture Investment Corporation (SVIC), Oxford Sciences Innovation (OSI) and Oxford University’s investment arm (OUI). The author is proud to be a member of this team.
Data Lens
The team behind Data Lens started implementing knowledge graph solutions in 2010, and has gone on to complete production-ready, knowledge-graph-driven systems for a wide range of industries. Data Lens provides consulting services and tooling which greatly accelerate the process of building your knowledge graph, showing business value from graph analysis in record time.
Thanks to Felicity Mulford, Bernardo Cuenca Grau and especially to Diana Marks who wrote the majority of this blog on behalf of RDFox