ML UDF’s in Actian Data Platform, VectorH, and Vector – Part 2
Actian Corporation
August 6, 2020
In the first part of this two-part article, we created the model. In this installment, we will create the UDF’s in database.
We can convert the model to Python, and I used m2cgen to transpile the sklearn model to Python model
I will be making 3 types of UDF’s,
- Python UDF.
- JavaScript UDF.
- Saving the data to JSON and then classification using JSON data.
Python UDF
Following 1 line of code will give us a model in Python.
import m2cgen as m2c code = m2c.export_to_python(clf) The output we get is : def score(input): return [0.2834483510261381 + input[0] * 0.4156346233598447 + input[1] * 1.6410466066307934 + input[2] * -2.4486206310542746 + input[3] * -1.0775840518519904, 1.271492823301712 + input[0] * 0.7334652955571552 + input[1] * -2.0527321763924102 + input[2] * 0.4049181379167901 + input[3] * -1.4276343377888174, -1.403879236024429 + input[0] * -1.8886197150361799 + input[1] * -1.701005052157171 + input[2] * 2.803528886728411 + input[3] * 2.6749639949986195] Since m2cgen needs an array as input we give input as a string and split it to float values. input = [float(i) for i in lst.split(",")]
Linear Regression and Logistic Regression Maths
The formula for linear regression is;
Y=Wo+ X1W1+ X2W2+ X3W3+ ……
Linear regression is designed more for predicting numerical values than a classification problem.
However for logistic regression, we would like an output between 0 and 1, for that, we have to use a nonlinear function called sigmoid which is;
H(X) =1/(1 + e^ (-y))
The output is interpreted in terms of probabilities.
We have 4 input values and m2cgen easily gives us a linear equation with our inputs , coefficient and intercepts calculated by the model. We apply the sigmoid function on the equation shown above and it gives us one vs rest classification.
for i in range(0,len(var0)): arr1.append( 1 / (1 + math.exp(-var0[i]))) print(arr1)
After applying sigmod, we get an array. As our UDF is scalar, we need to modify the code to return the index of array, which has maximum probability. We just add code to return index with maximum value;
max_val=arr1[0] for i in range(0,len(arr1)): if arr1[i] > max_val: index=i max_val=arr1[i] return index+1 clubbing it all together makes our function as; import math def score (lst): input = [float(i) for i in lst.split(",")] var0= [((((0.24739569243110213) + ((input[0]) * (0.3677840447489154))) + ((input[1]) * (1.4151458935607388))) + ((input[2]) * (-2.124321635674482))) + ((input[3]) * (-0.9309638143559119)), ((((0.7060488675827682) + ((input[0]) * (0.6404296785789872))) + ((input[1]) * (-1.7889249557223028))) + ((input[2]) * (0.4056580513021318))) + ((input[3]) * (-1.106884750746711)), ((((-0.9529159159570318) + ((input[0]) * (-1.5615947756178696))) + ((input[1]) * (-1.1894433955845047))) + ((input[2]) * (2.1944766675309997))) + ((input[3]) * (2.0100453163309537))] print(var0) index=0 arr1=[] for i in range(0,len(var0)): arr1.append( 1 / (1 + math.exp(-var0[i]))) print(arr1) max_val=arr1[0] for i in range(0,len(arr1)): if arr1[i] > max_val: index=i max_val=arr1[i] return index+1 Our UDF structure is ready, we just need to add CREATE OR REPLACE FUNCTION for the syntax to be correct to be added to database. udf='''CREATE OR REPLACE FUNCTION Iris_classifier(lst VARCHAR(100) not null) return (int not null) AS LANGUAGE PYTHON SOURCE=' def score (lst): input = [float(i) for i in lst.split(",")] var0= [((((0.24739569243110213) + ((input[0]) * (0.3677840447489154))) + ((input[1]) * (1.4151458935607388))) + ((input[2]) * (-2.124321635674482))) + ((input[3]) * (-0.9309638143559119)), ((((0.7060488675827682) + ((input[0]) * (0.6404296785789872))) + ((input[1]) * (-1.7889249557223028))) + ((input[2]) * (0.4056580513021318))) + ((input[3]) * (-1.106884750746711)), ((((-0.9529159159570318) + ((input[0]) * (-1.5615947756178696))) + ((input[1]) * (-1.1894433955845047))) + ((input[2]) * (2.1944766675309997))) + ((input[3]) * (2.0100453163309537))] print(var0) index=0 max_val=var0[0] for i in range(0,len(var0)): if var0[i] > max_val: index=i max_val=var0[i] return index+1 return score (lst)' '''
Our UDF is ready, and we just have to add it to the database.
cursor.execute(udf) conn.commit()
The UDF is added to the database and can be used for classification of the data, notice the input is added as a string, since m2cgen expects string as input to the model.
predict='''select CASE WHEN species='Iris-setosa' THEN '1' WHEN species='Iris-versicolor' THEN '2' ELSE '3' END as species, Iris_classifier (sepallengthcm || ',' || sepalwidthcm || ',' || petallengthcm || ',' || petalwidthcm ) as classify from iris''' predict1=pd.read_sql(predict, conn) predict1.head()
JavaScript UDF
For transpiling the JavaScript UDF, we just need to provide the following:
code = m2c.export_to_javascript(clf) output generated is function score(input) { return [((((0.2901789521827987) + ((input[0]) * (0.4467535618836661))) + ((input[1]) * (1.5843519667681565))) + ((input[2]) * (-2.409947591791464))) + ((input[3]) * (-1.0736156286007468)), ((((0.8330817501826279) + ((input[0]) * (1.0259003852575856))) + ((input[1]) * (-2.4327046722797347))) + ((input[2]) * (0.48745117088772905))) + ((input[3]) * (-1.8329094419137872)), ((((-1.141764377255606) + ((input[0]) * (-2.35933886039274))) + ((input[1]) * (-1.2090666108474617))) + ((input[2]) * (2.981435002839768))) + ((input[3]) * (2.99871035615134))]; }
Here, I am returning the index for maximum value and not using sigmoid function as if don’t use sigmoid it works on the maximum likelihood estimations.
We just need to change it as per UDF qualification for Javascript.
udf_JS ='''create function iris_lr_javascript1(lst VARCHAR(100) not null) return (int not null) AS LANGUAGE JAVASCRIPT SOURCE=' function score(lst) { var input = lst.split(",").map(function(item) { return parseFloat(item); }); nums=[((((0.2901789521827987) + ((input[0]) * (0.4467535618836661))) + ((input[1]) * (1.5843519667681565))) + ((input[2]) * (-2.409947591791464))) + ((input[3]) * (-1.0736156286007468)), ((((0.8330817501826279) + ((input[0]) * (1.0259003852575856))) + ((input[1]) * (-2.4327046722797347))) + ((input[2]) * (0.48745117088772905))) + ((input[3]) * (-1.8329094419137872)), ((((-1.141764377255606) + ((input[0]) * (-2.35933886039274))) + ((input[1]) * (-1.2090666108474617))) + ((input[2]) * (2.981435002839768))) + ((input[3]) * (2.99871035615134))]; var index = 0; for (var i = 0; i < nums.length; i++) { index = nums[i] > nums[index] ? i : index; } return index+1; } return score(lst)' ''' cursor.execute(udf_JS) conn.commit()
And our UDF is added in database; likewise, we can use this for classification as we did for the Python UDF.
UDF with JSON data
VectorH 6.0 and Vector 6.0 also support the JSON datatype. So we can convert the model into JSON data, later to be used with our model.
import json model_param = {} model_param['coef'] = clf.coef_.tolist() model_param['intercept'] =clf.intercept_.tolist() b = json.dumps(model_param) print(b) {"coef": [[0.4467535618836661, 1.5843519667681565, -2.409947591791464, -1.0736156286007468], [1.0259003852575856, -2.4327046722797347, 0.48745117088772905, -1.8329094419137872], [-2.35933886039274, -1.2090666108474617, 2.981435002839768, 2.99871035615134]], "intercept": [0.2901789521827987, 0.8330817501826279, -1.141764377255606]} jsonstr='''create function iris_lr_json(lst VARCHAR(100) not null) return (int not null) AS LANGUAGE python SOURCE=' import json def score1 (lst): input = [float(i) for i in lst.split(",")] json_data = "{"coef": [[0.4156346233598447, 1.6410466066307934, -2.4486206310542746, -1.0775840518519904], [0.7334652955571552, -2.0527321763924102, 0.4049181379167901, -1.4276343377888174], [-1.8886197150361799, -1.701005052157171, 2.803528886728411, 2.6749639949986195]], "intercept": [0.2834483510261381, 1.271492823301712, -1.403879236024429]}" dictjson = json.loads(json_data) # raise Exception("it is %s" % str(dictjson)) var0=[dictjson["intercept"][0]+ input[0]*dictjson["coef"][0][0] + input[1]*dictjson["coef"][0][1]+input[2]*dictjson["coef"][0][2] + input[3]*dictjson["coef"][0][3], dictjson["intercept"][1]+ input[0]*dictjson["coef"][1][0] +input[1]*dictjson["coef"][1][1]+ input[2]*dictjson["coef"][1][2] + input[3]*dictjson["coef"][1][3], dictjson["intercept"][2]+ input[0]*dictjson["coef"][2][0] + input[1]*dictjson["coef"][2][1]+input[2]*dictjson["coef"][2][2] + input[3]*dictjson["coef"][2][3]] #print(var0) index=0 max_val=var0[0] for i in range(0,len(var0)): if var0[i] > max_val: index=i max_val=var0[i] return index+1 return score (lst)' ''' cursor.execute(jsonstr) conn.commit()
UDF can be called simply by the query, as shown in Python UDF.
Detailed code can be found at https://github.com/ActianCorp/MachineLearningUDFs/blob/master/Vector-UDF.ipynb
In case you have any questions, feel free to reach me at vidisha.sharma@actian.com
Conclusion
The attraction of on-database UDF’s for ML scoring is the ability to score data on database at break-neck speed. The model and data are in the same place, reducing data movement. With all operations running on database, your machine learning models will run extremely fast using the Vector X100 engine. Python is one of the most used languages for ML development, so it will be very easy to transition and manage Python and JavaScript models to on database Scalar UDFs for batch scoring and near-real-time scoring use cases.
To learn more about the capabilities of all the Actian products, visit our website.
Subscribe to the Actian Blog
Subscribe to Actian’s blog to get data insights delivered right to you.
- Stay in the know – Get the latest in data analytics pushed directly to your inbox
- Never miss a post – You’ll receive automatic email updates to let you know when new posts are live
- It’s all up to you – Change your delivery preferences to suit your needs