Data Science Concepts

  • hash function: any function that can be used to map data of arbitrary size to data of fixed size. One use is a data structure called a hash table, widely used in computer software for rapid data lookup. Hash functions accelerate table or database lookups and are also used to detect duplicated records in a large file (see the hashing sketch after this list).
  • O(n): an example of big O notation, which classifies algorithms according to how their running time or space requirements grow as the input size grows. In analytic number theory, big O notation is often used to express a bound on the difference between an arithmetical function and a better-understood approximation. A search-cost comparison is sketched after this list.
  • Model Selection Techniques (sketched after this list):
    • Probabilistic Measures: Scoring a model by its in-sample performance penalized by its complexity (e.g., AIC, BIC).
    • Resampling Methods: Splitting the data into sub-train and sub-test sets and scoring by the mean performance over repeated runs (e.g., cross-validation).
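
The hashing sketch referenced above: a minimal Python example in which a set (backed by a hash table) gives average O(1) membership tests, which is how hashing speeds up lookup and duplicate detection. The records are made-up examples.

```python
# Minimal sketch: hashing for fast lookup and duplicate detection.
# The records below are hypothetical examples.
records = ["alice@example.com", "bob@example.com", "alice@example.com"]

seen = set()          # backed by a hash table
duplicates = []
for record in records:
    if record in seen:            # average O(1) membership test via the record's hash
        duplicates.append(record)
    else:
        seen.add(record)

print(hash("alice@example.com"))  # arbitrary-size input mapped to a fixed-size integer
print(duplicates)                 # ['alice@example.com']
```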
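
The search-cost comparison referenced above: a rough Python sketch contrasting an O(n) linear scan with an O(log n) binary search. The data size is arbitrary.

```python
from bisect import bisect_left

def linear_search(items, target):
    """O(n): the worst case scans every element."""
    for i, x in enumerate(items):
        if x == target:
            return i
    return -1

def binary_search(sorted_items, target):
    """O(log n): halves the search space at each step (requires sorted input)."""
    i = bisect_left(sorted_items, target)
    if i < len(sorted_items) and sorted_items[i] == target:
        return i
    return -1

data = list(range(1_000_000))
print(linear_search(data, 999_999))   # touches on the order of n elements
print(binary_search(data, 999_999))   # touches on the order of log2(n) ~ 20 elements
```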
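
A sketch of the two model selection approaches, assuming scikit-learn and synthetic data: AIC as a probabilistic measure (using the Gaussian-error form n*ln(RSS/n) + 2k) and 5-fold cross-validation as a resampling method.

```python
# Sketch contrasting a probabilistic measure (AIC) with a resampling method
# (k-fold cross-validation) for choosing between two model complexities.
# Synthetic data; the AIC formula assumes Gaussian errors.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(200, 1))
y = 1.5 * X[:, 0] + rng.normal(scale=0.5, size=200)

for degree in (1, 5):
    features = PolynomialFeatures(degree).fit_transform(X)
    model = LinearRegression().fit(features, y)

    # Probabilistic measure: AIC = n*ln(RSS/n) + 2k, trading fit against complexity.
    rss = np.sum((y - model.predict(features)) ** 2)
    n, k = len(y), features.shape[1]
    aic = n * np.log(rss / n) + 2 * k

    # Resampling method: mean score over repeated train/test splits (5-fold CV).
    cv_score = cross_val_score(LinearRegression(), features, y, cv=5).mean()

    print(f"degree={degree}  AIC={aic:.1f}  CV R^2={cv_score:.3f}")
```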

Techniques:

  • Resampling (bootstrap and permutation-test sketches appear at the end of this section):
    • Estimating the precision of sample statistics (medians, variances, percentiles) by using subsets of available data (jackknifing) or drawing randomly with replacement from a set of data points (bootstrapping)
    • Exchanging labels on data points when performing significance tests (permutation tests, also called exact tests, randomization tests, or re-randomization tests)
    • Validating models by using random subsets (bootstrapping, cross-validation)
  • Shrinkage (a ridge-regression sketch appears at the end of this section):
    • The general observation that, in regression analysis, a fitted relationship performs less well on a new data set than on the data set used for fitting; in particular, the value of the coefficient of determination ‘shrinks’.
    • A general class of estimators, or estimation effects, in which a naive or raw estimate is improved by combining it with other information (see shrinkage estimator).
  • Dimension Reduction (Dimensionality Reduction): the process of reducing the number of random variables under consideration by obtaining a set of principal variables (a PCA sketch appears at the end of this section).
  • Data Augmentation: a technique used to increase the amount of training data by applying transformations such as rotation, scaling, or flipping to existing data (sketched at the end of this section).
  • Artifact: an intermediate result in the data science development process. In the data science workflow, an artifact can be a model, a chart, a statistic, a dataframe, or a feature function.
  • Data Pipeline: a series of steps, often automated, that transforms raw data into useful information or products (a pipeline sketch appears at the end of this section).
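
Resampling sketch referenced above: bootstrapping the precision of a sample median and a simple permutation test, using NumPy on synthetic data.

```python
# Sketch of two resampling ideas: bootstrapping the precision of a sample median,
# and a permutation test for a difference in group means. The data are synthetic.
import numpy as np

rng = np.random.default_rng(0)
sample = rng.exponential(scale=2.0, size=100)

# Bootstrapping: draw with replacement, recompute the statistic many times.
boot_medians = np.array([
    np.median(rng.choice(sample, size=sample.size, replace=True))
    for _ in range(2000)
])
print("bootstrap SE of median:", boot_medians.std())

# Permutation test: shuffle labels to build the null distribution of the statistic.
group_a = rng.normal(0.0, 1.0, size=50)
group_b = rng.normal(0.5, 1.0, size=50)
observed = group_b.mean() - group_a.mean()
pooled = np.concatenate([group_a, group_b])
null_diffs = []
for _ in range(2000):
    permuted = rng.permutation(pooled)
    null_diffs.append(permuted[50:].mean() - permuted[:50].mean())
p_value = np.mean(np.abs(null_diffs) >= abs(observed))
print("permutation p-value:", p_value)
```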
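
Shrinkage sketch: ridge regression as a shrinkage estimator, compared with ordinary least squares; scikit-learn and synthetic data assumed.

```python
# Sketch of a shrinkage estimator: ridge regression pulls coefficients toward zero
# relative to ordinary least squares, which often improves out-of-sample performance.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))
y = X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=2.0, size=50)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)

print("OLS   coefficient norm:", np.linalg.norm(ols.coef_))
print("Ridge coefficient norm:", np.linalg.norm(ridge.coef_))  # smaller: the estimates shrink
```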
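
Dimension reduction sketch: PCA projecting ten correlated variables onto two principal components; scikit-learn and synthetic data assumed.

```python
# Sketch of dimension reduction with PCA: project correlated features onto a
# smaller set of principal variables.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))                    # 2 underlying factors
X = latent @ rng.normal(size=(2, 10)) + 0.1 * rng.normal(size=(200, 10))

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)                      # 10 variables -> 2 principal variables
print(X_reduced.shape)                                # (200, 2)
print(pca.explained_variance_ratio_)                  # most variance kept by 2 components
```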
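
Data augmentation sketch: flips and rotations of a toy array standing in for an image; real pipelines typically use an image library, omitted here.

```python
# Sketch of data augmentation on a toy "image": each transformation of an existing
# example yields an additional training example with the same label.
import numpy as np

image = np.arange(16).reshape(4, 4)   # stand-in for a real image

augmented = [
    image,
    np.fliplr(image),        # horizontal flip
    np.flipud(image),        # vertical flip
    np.rot90(image),         # 90-degree rotation
    np.rot90(image, k=2),    # 180-degree rotation
]
print(len(augmented), "examples generated from 1 original")
```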
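
Data pipeline sketch: one possible realization using scikit-learn's Pipeline; the scaler-plus-classifier steps are illustrative, not prescriptive.

```python
# Sketch of an automated data pipeline: each step transforms the data before the
# final estimator produces the useful output.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

pipeline = Pipeline([
    ("scale", StandardScaler()),                    # step 1: standardize raw features
    ("model", LogisticRegression(max_iter=1000)),   # step 2: fit the classifier
])
pipeline.fit(X, y)
print(pipeline.score(X, y))
```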