Summary: Veeva OpenData Explorer is a new web-based portal for accessing data on approximately 16 million healthcare professionals (HCPs), healthcare organizations (HCOs), and their affiliations spanning 34 countries. The open API simplifies integration of Veeva OpenData with third-party applications and services so companies can leverage their customer data where they need it. With these latest innovations, Veeva is giving customers greater choice in how they use Veeva OpenData and making it even easier to access accurate customer data.

Summary: Netflix has open-sourced Metaflow, an internally developed tool for building and managing Python-based data science projects. Metaflow addresses the entire data science workflow, from prototype to model deployment, and provides built-in integrations with AWS cloud services.

Summary: Unlike in the banking or pharmaceutical industries, where regulations compel companies to have good governance in place, in industries such as publishing and telecom data governance often seems complicated and theoretical. That’s according to Sara Willovit, Product Data Governance at Becton Dickinson.

Summary: Data can be anywhere. Companies store data in the cloud, in data warehouses, in data lakes, on old mainframes, in applications, on drives, even on paper spreadsheets. Every day we create 2.5 quintillion bytes of data, and there are no signs of this slowing down anytime soon.

Summary: To build an effective learning model, you must understand the quality issues that exist in the data and how to detect and deal with them. In general, data quality issues fall into four major categories.

Summary: The picture below represents the machine learning and data mining process in general. Data cleaning and feature extraction are the most tedious tasks, but you need to be good at them to make your model more accurate.

Summary: Bayesian Target Encoding is a feature engineering technique used to map categorical variables into numeric variables. The Bayesian framework requires only minimal updates as new data is acquired and is thus well-suited for online learning. Furthermore, the Bayesian approach makes choosing and interpreting hyperparameters intuitive. I developed this technique in the recent Avito Kaggle Competition, where my team and I took 14th place out of 1,917 teams. We found that Bayesian target encoding outperforms the built-in categorical encoding provided by the LightGBM package.
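To make the idea concrete, here is a minimal sketch of one way such an encoder could look. It is not the author's implementation: it assumes a binary target and a Beta prior (a Beta-Bernoulli model), and the class name `BetaTargetEncoder` and its parameters `prior_alpha` and `prior_beta` are hypothetical. Each category is encoded as the posterior mean of its target rate, and new data only increments two counters per category, which is what makes the approach convenient for online updates.

```python
from collections import defaultdict


class BetaTargetEncoder:
    """Hypothetical Beta-Bernoulli target encoder for a binary target.

    Assumption: the target is 0/1 and every category shares the same
    Beta(prior_alpha, prior_beta) prior over its positive rate.
    """

    def __init__(self, prior_alpha=1.0, prior_beta=1.0):
        self.prior_alpha = prior_alpha
        self.prior_beta = prior_beta
        # category -> [number of positives, number of negatives]
        self.counts = defaultdict(lambda: [0, 0])

    def update(self, categories, targets):
        """Fold new observations into the per-category counts."""
        for category, target in zip(categories, targets):
            if target:
                self.counts[category][0] += 1
            else:
                self.counts[category][1] += 1

    def encode(self, category):
        """Return the posterior mean of the positive rate for a category."""
        # .get avoids creating an entry for categories never seen in training.
        positives, negatives = self.counts.get(category, (0, 0))
        alpha = self.prior_alpha + positives
        beta = self.prior_beta + negatives
        return alpha / (alpha + beta)


encoder = BetaTargetEncoder(prior_alpha=1.0, prior_beta=1.0)
encoder.update(["a", "a", "a", "b"], [1, 1, 0, 0])
print(encoder.encode("a"))       # 0.6: two positives and one negative on a Beta(1, 1) prior
print(encoder.encode("unseen"))  # 0.5: falls back to the prior mean
```

In this sketch the prior parameters act as pseudo-counts, so a larger `prior_alpha + prior_beta` pulls rare categories more strongly toward the prior mean.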

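Summary: In deep learning, apart from assembling the model, the most important thing is the parameters inside it, that is, the weights. Every model needs initial values for them before it starts learning. Some people may think the initial values can simply be random or set to zero, and I thought so too when I first started. For simple applications or methods (where the objective function is convex) this may work, but for a neural network a good initialization makes learning far more efficient, while a poor initialization or a non-convex objective can lead the network to learn poor results.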
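As an illustration of what a "good initial value" can mean in practice, the sketch below contrasts all-zero weights with Xavier/Glorot uniform initialization. The summary does not say which schemes the article covers, so Glorot initialization is an assumption chosen as one common example, and the function names are hypothetical. Only the Python standard library is used.

```python
import math
import random


def zeros_init(fan_in, fan_out):
    """All-zero weights: every unit in the layer starts out identical."""
    return [[0.0] * fan_out for _ in range(fan_in)]


def glorot_uniform(fan_in, fan_out, seed=None):
    """Xavier/Glorot uniform initialization.

    Weights are drawn from U(-limit, limit) with
    limit = sqrt(6 / (fan_in + fan_out)), which gives each weight a
    variance of 2 / (fan_in + fan_out).
    """
    rng = random.Random(seed)
    limit = math.sqrt(6.0 / (fan_in + fan_out))
    return [
        [rng.uniform(-limit, limit) for _ in range(fan_out)]
        for _ in range(fan_in)
    ]


weights = glorot_uniform(fan_in=64, fan_out=32, seed=0)
flat = [w for row in weights for w in row]
mean = sum(flat) / len(flat)
variance = sum((w - mean) ** 2 for w in flat) / len(flat)
print(f"empirical variance: {variance:.4f}")
print(f"target variance:    {2.0 / (64 + 32):.4f}")
```

With all-zero weights, every unit in a layer computes the same output and receives the same gradient, so the units cannot learn different features. Scaling random weights by the layer's fan-in and fan-out is one way to break that symmetry while keeping activation variance roughly stable across layers.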

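Summary: Feature engineering is a series of engineering steps applied to raw data to refine it into features that serve as inputs to algorithms and models. It is a process of representing and presenting data. In practice, feature engineering mainly removes impurities and redundancy from the raw data and designs more efficient features to describe the relationship between the problem being solved and the predictive model.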
