Saturday, December 10, 2022

Statistics Notes

Statistics
  • Measuring the center of a dataset: 1. Mean is sensitive to outliers; it is beneficial to use the mean if the dataset members are close to each other, i.e. the variation in the dataset is minimal. When the mean is used for one minute/hour rollups (???) of measurements, it can lead to spike erosion, because the data is averaged out over a large period of time. The sample mean is represented by x-bar and the population mean by the Greek letter mu. 2. Median: on the other hand, the median of the sorted dataset is not sensitive to outliers, so it is a good measure of center if your dataset suffers from outliers; the median is less sensitive to the data than the mean. 3. Mode: the highest-frequency value; it is mainly used for categorical or proportional data.
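
A minimal Scala sketch of these three measures of center, assuming a small hypothetical dataset with one outlier (the object name and values are illustrative only):

// Minimal sketch of mean, median and mode for a small hypothetical dataset.
object CenterSketch {
  def main(args: Array[String]): Unit = {
    val xs = Vector(2.0, 3.0, 3.0, 5.0, 8.0, 21.0) // 21.0 plays the role of an outlier

    val mean = xs.sum / xs.size

    // Median: middle value of the sorted data (average of the two middle values for even sizes).
    val sorted = xs.sorted
    val median =
      if (sorted.size % 2 == 1) sorted(sorted.size / 2)
      else (sorted(sorted.size / 2 - 1) + sorted(sorted.size / 2)) / 2

    // Mode: the most frequent value (mainly useful for categorical data).
    val mode = xs.groupBy(identity).maxBy(_._2.size)._1

    println(s"mean = $mean, median = $median, mode = $mode") // mean = 7.0, median = 4.0, mode = 3.0
  }
}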

  • Measuring the tail of a dataset: 1. Quantiles are a flexible tool that offer an alternative to the classical summary statistics and are less susceptible to outliers. The (empirical) cumulative distribution function CDF(y) of a dataset X, at a value y, is the fraction of samples that are at or below the value y.

  • Measuring variation or dispersion in a dataset: 1. Standard deviation 2. Mean absolute deviation / median absolute deviation 3. Percentiles or quartiles 4. Interquartile range (IQR): the difference between the 75th and 25th percentiles of the sorted data; it is less susceptible to outliers.
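
A minimal Scala sketch of standard deviation, median absolute deviation (MAD) and IQR for a small hypothetical dataset; the quantile rule used here is a simple nearest-rank approximation, not a canonical definition:

// Minimal sketch of dispersion measures for a small hypothetical dataset.
object DispersionSketch {
  // Nearest-rank style quantile, illustrative only.
  def quantile(xs: Seq[Double], q: Double): Double = {
    val sorted = xs.sorted
    sorted(((sorted.size - 1) * q).round.toInt)
  }

  def median(xs: Seq[Double]): Double = quantile(xs, 0.5)

  def main(args: Array[String]): Unit = {
    val xs = Vector(4.0, 7.0, 9.0, 10.0, 12.0, 15.0, 40.0)

    val mean  = xs.sum / xs.size
    val stdev = math.sqrt(xs.map(x => (x - mean) * (x - mean)).sum / (xs.size - 1))

    // MAD: median of the absolute deviations from the median.
    val mad = median(xs.map(x => math.abs(x - median(xs))))

    // IQR: 75th percentile minus 25th percentile of the sorted data.
    val iqr = quantile(xs, 0.75) - quantile(xs, 0.25)

    println(f"stdev = $stdev%.2f, MAD = $mad%.2f, IQR = $iqr%.2f")
  }
}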

  • The coefficient of variation is defined as the ratio of the data’s standard deviation to its mean. We use this measure frequently when attempting to compare means and spreads across populations that exist at different scales.

  • The z-score tells us how far a single data value is from the mean, measured in standard deviations. Replacing each value with its z-score standardizes the data without losing the information in the original values. It is a very effective way of normalizing data that exist on very different scales, and of putting data in the context of their mean.
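
A minimal Scala sketch of z-score standardization, assuming two hypothetical series on different scales:

// Minimal sketch of z-score normalization for hypothetical values on two scales.
object ZScoreSketch {
  def zScores(xs: Seq[Double]): Seq[Double] = {
    val mean  = xs.sum / xs.size
    val stdev = math.sqrt(xs.map(x => (x - mean) * (x - mean)).sum / (xs.size - 1))
    // How many standard deviations each value lies from the mean.
    xs.map(x => (x - mean) / stdev)
  }

  def main(args: Array[String]): Unit = {
    val heightsCm = Seq(150.0, 160.0, 170.0, 180.0, 190.0)
    val weightsKg = Seq(50.0, 60.0, 65.0, 80.0, 95.0)
    // After standardization both series live on the same scale.
    println(zScores(heightsCm).map(z => f"$z%.2f").mkString(", "))
    println(zScores(weightsKg).map(z => f"$z%.2f").mkString(", "))
  }
}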

  • Assumption : Many statistical tests and hypotheses require the underlying data to come from a normally distributed population.

  • A desirable property is robustness against outliers. A single faulty measurement should not change a rough description of the dataset.

  • Central limit theorem

  • It is the observed phenomenon that even if the data are not normally distributed, the sample statistics computed from repeated samples are approximately normally distributed.
  • The sampling distribution is the distribution of a sample statistic, such as the mean, over repeated sampling. A sample statistic's sampling distribution is more bell-shaped than the actual data distribution, and the larger the sample size, the more bell-shaped it becomes.
  • It is a major contributor to hypothesis testing and confidence intervals.
  • The central limit theorem states that the sampling distribution (the distribution of point estimates) will approach a normal distribution as the size of the samples taken increases. A sampling distribution is a distribution of several point estimates.
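
A minimal Scala sketch of the idea, assuming a skewed exponential population with mean 1: sample means from repeated samples cluster around the population mean, and their spread (the standard error) shrinks as the sample size grows. Names and sizes are illustrative only.

// Minimal central-limit-theorem sketch over a skewed (exponential) population.
import scala.util.Random

object CltSketch {
  def main(args: Array[String]): Unit = {
    val rng = new Random(42)
    def exponential(): Double = -math.log(1.0 - rng.nextDouble()) // skewed population, mean 1

    def sampleMean(n: Int): Double = Seq.fill(n)(exponential()).sum / n

    for (n <- Seq(2, 10, 50)) {
      val means  = Seq.fill(5000)(sampleMean(n))
      val mean   = means.sum / means.size
      val stdErr = math.sqrt(means.map(m => (m - mean) * (m - mean)).sum / (means.size - 1))
      // The spread of the sampling distribution shrinks roughly as 1/sqrt(n).
      println(f"sample size $n%3d: mean of means = $mean%.3f, standard error = $stdErr%.3f")
    }
  }
}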

  • Standard error measures the variability of a sample statistic.

  • confidence interval:

  • It covers the central region of the sampling distribution of a statistic. It signifies that further sampling will likely (with some stated percentage) produce a sample statistic that lies in the same central region. More generally, a 95% confidence interval around a sample estimate should, on average, contain a similar sample estimate 95% of the time. The less data you have, and the more confident you want to be about the true estimate, the wider the confidence interval you have to choose. A confidence interval expresses how variable a sample estimate might be. A confidence interval is a range of values, based on a point estimate, that contains the true population parameter at some confidence level. A confidence level does not represent a “probability of being correct”; instead, it represents the frequency with which the obtained interval will be accurate.
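
A minimal Scala sketch of a bootstrap percentile confidence interval for the mean, using a small hypothetical dataset (values and names are illustrative only):

// Minimal bootstrap percentile confidence interval for the mean.
import scala.util.Random

object BootstrapCI {
  def main(args: Array[String]): Unit = {
    val rng  = new Random(7)
    val data = Vector(12.0, 15.0, 9.0, 21.0, 14.0, 18.0, 11.0, 16.0, 13.0, 19.0)

    def mean(xs: Seq[Double]): Double = xs.sum / xs.size

    // Resample with replacement and recompute the statistic of interest.
    val bootMeans = Vector.fill(10000) {
      mean(Vector.fill(data.size)(data(rng.nextInt(data.size))))
    }.sorted

    // 95% interval: take the 2.5th and 97.5th percentiles of the bootstrap means.
    val lo = bootMeans((0.025 * bootMeans.size).toInt)
    val hi = bootMeans((0.975 * bootMeans.size).toInt)
    println(f"sample mean = ${mean(data)}%.2f, 95%% CI = [$lo%.2f, $hi%.2f]")
  }
}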

  • Hypothesis tests are a framework to determine whether the observed sample data deviate from what would be expected from the population itself. A hypothesis test generally looks at two opposing hypotheses about a population, called the null hypothesis and the alternative hypothesis. The null hypothesis is the statement being tested and is the default assumption. The alternative hypothesis is the statement that opposes the null hypothesis. Our test will tell us which hypothesis we should trust and which we should reject.

  • Correlation simply quantifies the degree to which variables change together.

  • Reasons for correlation b/w two variables
    • Causation is the idea that one variable actually determines the value of another. A confounding factor is a third random variable that can drive the apparent causation between two correlated variables.
    • Coincidence: coincidence shows that correlation does not imply causation. When we ignore confounding variables, correlation can become very misleading.
  • Simpson’s paradox shows there might be another confounding variable which can break the hypothesis of two correlated variables by revealing an anti-correlation. The main takeaway from Simpson’s paradox is that we should not unduly give causational power to correlated variables. There might be confounding variables that have to be examined.

  • Correlation coefficient for bivariate analysis

  • It describes the relation between two variables. The relationship might range from linear to polynomial, depending on the level of fit, which can be found by trying different models on test data.
  • The hypothesized relationship between two variables must be tested further. We will need to use more sophisticated statistical methods and machine learning algorithms to solidify these assumptions and hypotheses.
  • Its value lies between -1 and 1.
  • The correlation coefficient is also sensitive to outliers.

  • A correlation matrix is a matrix representation of the correlation coefficients of two or more random variables, e.g. for X and Y: {{1.00 0.67} {0.67 1.00}}. Note that the correlation coefficient of a variable with itself is 1, since it is perfectly correlated with itself.
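
A minimal Scala sketch of the Pearson correlation coefficient and the resulting 2x2 correlation matrix, for two hypothetical series:

// Minimal Pearson correlation sketch for two hypothetical series.
object CorrelationSketch {
  def pearson(x: Seq[Double], y: Seq[Double]): Double = {
    require(x.size == y.size && x.nonEmpty)
    val mx  = x.sum / x.size
    val my  = y.sum / y.size
    val cov = x.zip(y).map { case (a, b) => (a - mx) * (b - my) }.sum
    val sdX = math.sqrt(x.map(a => (a - mx) * (a - mx)).sum)
    val sdY = math.sqrt(y.map(b => (b - my) * (b - my)).sum)
    cov / (sdX * sdY)
  }

  def main(args: Array[String]): Unit = {
    val x = Seq(1.0, 2.0, 3.0, 4.0, 5.0)
    val y = Seq(2.1, 3.9, 6.2, 8.1, 9.8)
    val r = pearson(x, y)
    // 2x2 correlation matrix: the diagonal is always 1 (a variable with itself).
    println(f"[[1.00, $r%.2f], [$r%.2f, 1.00]]")
  }
}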

  • Sampling

  • Big data use cases arise not only when the data is big in quantity but also when it is sparse enough.
  • Two main types of resampling technique: bootstrapping and permutation tests. In a permutation test, multiple datasets are combined and reshuffled; datasets are then drawn again in the same sizes as the originals, and the statistic of interest is computed again (see the sketch after this list).
  • The number of degrees of freedom enters the calculation to standardize test statistics so that they can be compared to reference distributions, e.g. the t-distribution and F-distribution. The concept of degrees of freedom also lies behind the factoring of categorical variables into n-1 indicator or dummy variables when doing regression (to avoid multicollinearity).
  • Sampling with replacement: choosing an element from the population for the sample while also putting a copy of the element back into the population so it can be chosen again.
  • Sample bias is when bad data is chosen that does not represent all members of the population equally or proportionally. It results in wrong predictions. Stratified sampling is used to counter sample bias: given a population with homogeneous subgroups, sampling in proportion to each stratum is what stratified sampling entails.
  • Error bias is another form of bias, in which the data is incorrect due to an incorrect sampling process.
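
A minimal Scala sketch of a permutation test for the difference in means between two hypothetical groups (values and names are illustrative only):

// Minimal permutation test: pool the data, reshuffle, re-split, recompute.
import scala.util.Random

object PermutationTest {
  def main(args: Array[String]): Unit = {
    val rng    = new Random(1)
    val groupA = Vector(23.0, 25.0, 21.0, 30.0, 28.0)
    val groupB = Vector(19.0, 22.0, 20.0, 18.0, 24.0)

    def mean(xs: Seq[Double]): Double = xs.sum / xs.size
    val observed = mean(groupA) - mean(groupB)

    val pooled = groupA ++ groupB
    val diffs = Vector.fill(10000) {
      val shuffled = rng.shuffle(pooled)
      val (a, b)   = shuffled.splitAt(groupA.size)
      mean(a) - mean(b)
    }

    // p-value: fraction of permuted differences at least as extreme as the observed one.
    val pValue = diffs.count(d => math.abs(d) >= math.abs(observed)).toDouble / diffs.size
    println(f"observed diff = $observed%.2f, permutation p-value = $pValue%.3f")
  }
}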

  • Student’s t-distribution: it is like a normal distribution but thicker and longer in the tails. It is extensively used to depict sampling distributions of sample statistics.

  • For events that occur at a constant rate per unit of time or space, the Poisson distribution can be more appropriate.
  • In the constant-rate scenario, if we want to model the time or space between two subsequent events, we can use the exponential distribution. An event rate that changes over time can be modelled with the Weibull distribution.

  • Statistical hypothesis testing was invented as a way to protect researchers from crediting random chance with significance. Null hypothesis testing takes A/B testing a step further. The null hypothesis attributes the result to chance, hence the name “null”. The alternative hypothesis, the opposite of the null, is what we want to prove. Hypothesis tests are a framework to determine whether the observed sample data deviate from what would be expected from the population itself. Our test will tell us which hypothesis we should trust and which we should reject.

  • A result is statistically significant when it is beyond the realm of chance variation.
  • The p-value is the probability that results as extreme as those observed could occur given the null hypothesis model. The p-value does not measure whether a given hypothesis is true, and it alone should not be considered proof of a hypothesis.
  • A Type 1 error is considering an effect real (rejecting the null hypothesis) when it is purely due to chance. A Type 2 error is failing to detect an effect (not rejecting the null hypothesis) even though it is real.

  • The chi-square test is used with count data to see how well they fit an expected distribution. The chi-square statistic measures the dispersion between expected and observed counts. The chi-square distribution is skewed, with a long tail to the right.
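
A minimal Scala sketch of the chi-square statistic for hypothetical count data against a uniform expected distribution:

// Minimal chi-square statistic sketch for hypothetical counts.
object ChiSquareSketch {
  def main(args: Array[String]): Unit = {
    val observed = Seq(48.0, 35.0, 15.0, 22.0) // hypothetical counts per category
    val total    = observed.sum
    val expected = Seq.fill(observed.size)(total / observed.size) // uniform expectation

    // chi-square = sum over categories of (observed - expected)^2 / expected
    val chiSq = observed.zip(expected).map { case (o, e) => (o - e) * (o - e) / e }.sum
    println(f"chi-square statistic = $chiSq%.2f with ${observed.size - 1} degrees of freedom")
  }
}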

Regression metrics: - Root mean squared error (RMSE) is the square root of the mean of the squared residuals. - Residual standard error (RSE) is the same as RMSE but has an additional degrees-of-freedom adjustment in the denominator. - R2 (R-squared) is the metric for seeing how well the data fit the model: R2 = 1 - sum((actual y - predicted y)^2) / sum((actual y - mean y)^2). - The t-statistic, coefficient / standard error(coefficient), tells the significance level of a coefficient; the higher the t-value, the more significant the predictor. - Stepwise regression is a way to automatically determine which variables should be included in the model. - Confidence intervals quantify uncertainty around regression coefficients; prediction intervals quantify uncertainty in individual predictions. - Pitfalls: A. Correlation of multivariate predictors: when predictor variables are highly correlated with each other (e.g. houseSize and noOfBedroom in a regression to estimate house price), it is very difficult to interpret the individual coefficients. When predictor variables are perfectly or near-perfectly correlated with each other, the regression is very difficult to compute; this is called multicollinearity and is equivalent to including the same predictor multiple times in the regression equation. B. Confounding variables: if we forget to include an important variable (e.g. location in the house-price equation) in the regression, this can lead to unstable predictions. C. Main effects (predictor variables) often have interactions, which should also be included in the regression model.
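
A minimal Scala sketch of RMSE and R-squared for hypothetical actual vs. predicted values:

// Minimal sketch of RMSE and R-squared for hypothetical predictions.
object RegressionMetrics {
  def main(args: Array[String]): Unit = {
    val actual    = Seq(3.0, 5.0, 7.0, 9.0, 11.0)
    val predicted = Seq(2.8, 5.3, 6.6, 9.4, 10.9)

    val n         = actual.size
    val residuals = actual.zip(predicted).map { case (y, yHat) => y - yHat }

    // RMSE: square root of the mean squared residual.
    val rmse = math.sqrt(residuals.map(r => r * r).sum / n)

    // R-squared: 1 - (residual sum of squares) / (total sum of squares).
    val meanY = actual.sum / n
    val rss   = residuals.map(r => r * r).sum
    val tss   = actual.map(y => (y - meanY) * (y - meanY)).sum
    val r2    = 1 - rss / tss

    println(f"RMSE = $rmse%.3f, R2 = $r2%.3f")
  }
}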

Regression Diagnostics: A. Standardized residuals, calculated by dividing each residual by the standard error, can be used to detect outliers. Outliers can also be used for anomaly, fraud and accident detection. B. A record/value which changes the regression formula significantly is known as an influential value/observation; such an observation has high leverage on the regression equation. Hat values and Cook’s distance can be used to measure influence. With a small number of observations, removing an influential value can lead to a better-fitting regression model, while in a large amount of data you will rarely see influential observations. C. Heteroskedasticity: your model is suffering from heteroskedasticity if the variance of the residuals is larger in some ranges of the data; it should be constant across the range of the data. It means the regression model has missed something or is incomplete. D. Partial residual plots can be used to qualitatively inspect the relationship between an individual predictor variable and the output variable. The relationship can be non-linear, which would mean a non-linear regression model should be used.

Friday, April 8, 2022

DistributedSystem

NoSql

NoSql Database types

  1. Document Based
  2. Column based. Use cases: Write - large numbers of small updates; Read - sequential reads. Example: HBase

NoSql schema designing

  1. When to have multiple collections in NoSQL: - If the objects you are going to embed may be accessed in an isolated way (it makes sense to access them out of the document context), you have a reason for not embedding. - If the array of embedded objects may grow in an unbounded way, you have another reason for not embedding. Embedding a one-to-many relationship on the “one” side can help in saving extra queries, but the gain can quickly turn into a loss if those embedded objects are updated very frequently.

  2. Three basic schema designs for a One-to-N relationship in NoSQL:

  - Embed the N side if the cardinality is one-to-few and there is no need to access the embedded objects outside the context of the parent object.
  - Use an array of references to the N-side objects if the cardinality is one-to-many or if the N-side objects can be queried independently of the 1 side.
  - Use a reference to the One side in the N-side objects if the cardinality is one-to-squillions (large, indefinite size).

What to choose - sql or nosql

  1. NoSQL, when to use: - Schema structure: if the data in your application has a document-like structure (i.e. a tree with one-to-many relationships where typically the entire tree is loaded at once), then it is probably a good idea to use the document model; the relational technique of shredding (splitting the document into multiple tables) can lead to a cumbersome schema and unnecessarily complicated application code. - Flexible and evolving schema. - Size: suitable for big volumes of data. - Compliance: suitable where eventual consistency can be tolerated. - A JSON schema has better locality than a multi-table schema.

cons: 1. Many-to-one and many-to-many relationships are only weakly supported. As projects get bigger they tend to have more use cases, and sub-objects in a document start to be queried independently of the main object. As soon as these use cases require many-to-many and many-to-one queries, the data no longer fits well in a JSON schema, which leads to breaking the hierarchical model (JSON) into a relational model. 2. Querying a small piece of data from a big document will fetch the whole document. 3. An update to some field that changes the document size requires the whole document to be rewritten.

  1. SQL, when to use: - Size: RDBMSs are at their best when performing intensive read/write operations on small or medium-sized data sets. - Consistency: when strong consistency is needed. - Compliance: use cases that require strict ACID compliance, e.g. finance, banking, e-commerce etc. - Schema structure: if the schema is consistent and does not change much, and the data size is limited.

cons: Does not scale well horizontally because of the ACID rules.

Notes: For highly interconnected data the document model is awkward, the relational model is acceptable, and graph models are the most natural.

There is an implicit schema because the application needs some kind of structure, but it is not enforced by the database. A more accurate term is schema-on-read (the structure of the data is implicit and only interpreted when the data is read), in contrast to schema-on-write (the traditional approach of relational databases, where the schema is explicit and the database ensures all written data conforms to it).

Schema-on-read advantages: Case 1: there are many different types of objects and it is not practical to put each type of object in its own table. Case 2: the structure of the data is determined by external systems over which you have no control and which may change at any time.

Famous Non sql Database

Cassandra: records are sharded based on partition keys; within the same partition key, records are sorted based on a clustering key. BigTable: it combines multiple files into a single block to store on disk and is very efficient at reading small amounts of data. HDFS/GlusterFS: distributed file storage systems; suggested for storing video binaries.

Bloom filter

  • It gives a definite answer when it reports that a key is not present; when it reports that a key is present, this may be a false positive, i.e. the key might not actually be present.
  • Given a master set of N bits, each key is passed to K hash functions; each function returns a bit position, which is then set to mark the presence of the key.
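
A minimal Bloom filter sketch in Scala; the bit-array size, number of hashes, and the double-hashing scheme derived from hashCode are illustrative assumptions, not a production design:

// Minimal Bloom filter sketch (illustrative, not production-grade).
import scala.collection.mutable

class BloomFilter(numBits: Int, numHashes: Int) {
  private val bits = mutable.BitSet.empty

  // Derive K bit positions per key from two base hashes (double hashing).
  private def positions(key: String): Seq[Int] = {
    val h1 = key.hashCode
    val h2 = key.reverse.hashCode
    (0 until numHashes).map(i => (((h1 + i * h2) % numBits) + numBits) % numBits)
  }

  def add(key: String): Unit = positions(key).foreach(bits += _)

  // false => definitely absent; true => possibly present (false positives possible).
  def mightContain(key: String): Boolean = positions(key).forall(bits.contains)
}

object BloomFilterDemo {
  def main(args: Array[String]): Unit = {
    val bf = new BloomFilter(numBits = 1024, numHashes = 3)
    Seq("alpha", "beta", "gamma").foreach(bf.add)
    println(bf.mightContain("alpha")) // true (was added)
    println(bf.mightContain("delta")) // most likely false, i.e. definitely absent; true would be a false positive
  }
}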

CAP theorem

It states that any networked shared-data system can have at most two of three desirable properties: 1. Consistency - every node will serve the latest copy of the data. 2. High availability - any non-failing node will reply to a request within a certain amount of time. 3. Partition tolerance - the system will continue to function even in case of a network partition.

As a consequence of the CAP theorem, in practice we categorize distributed systems in the following ways: 1. CA system: since this particular system needs to be consistent, in case of a network partition the whole system stops working, as all the nodes need to serve the latest data. It is not a coherent design for any distributed application. e.g. RDBMS

  1. CP system: this kind of system is very similar to a CA system, but in the case of a network partition a node will retry indefinitely (losing availability - see the definition) until the client times out. e.g. BigTable, HBase

  2. AP system: these systems will continue to serve stale data in case of a network partition without compromising availability, but the system will not be consistent. e.g. Cassandra, MongoDB etc. An AP system is also called BASE, meaning Basically Available, Soft-state and Eventually consistent.

An extension to the CAP theorem is the PACELC theorem, where PAC is from the CAP theorem and says that in case of a Partition (P) the system can choose either Availability (A) or Consistency (C), and ELC means that Else (E), when there is no partition, the system can choose either Latency (L) or Consistency (C). ???

Sunday, March 21, 2021

Plants [In progress]

Table 1. Plants Maintenance

| Plant name | Light | Soil | Water | Native | Remark |
| Portulaca oleracea | Part to full sun | Dry soil | 250 ml twice a week | Asia | It is a weed, so keep it away from other plants. More info: https://www.gardeningknowhow.com/edible/herbs/purslane/edible-purslane-herb.htm |
| Polka dot | Any light; best color in lower-light situations | | Indoors, once a month | Madagascar | More info: https://www.gardeningknowhow.com/houseplants/polka-dot-plant/growing-polka-dot-plants.htm |
| Money plant | Any light | Well-drained soil (some sand mixed in) | Once a week; needs dryness between waterings; spray leaves in winter | French Polynesia islands | More info: https://www.fnp.com/article/how-to-take-care-of-money-plants |


Sunday, August 2, 2020

Generalization of list operations using foldr

Introduction:

List is a generic or polymorphic type, meaning it can take many shapes like List[Int], List[String] etc. To provide strong type checking, a statically typed language’s compiler has to know the type of data the list will contain. Generally we define a list by mentioning the contained type in the definition itself, or the contained type is inferred by the compiler if it is not mentioned. e.g.

val a:List[Int] = List(1,2,3,4) //defined with contained type

val b = List(1,2,3,4) //defined without contained type

The List type also provides many operations which behave uniformly for many contained types. Examples include length, map, filter etc. Each of these operations acts on list elements one at a time. Below we will see how these operations can be defined using a more generalized scheme that is provided to us by one other list operation called foldr.

List Operations length, map, filter

List operations usually apply their respective algorithm to each member of the list one at a time, and recursively move to the next element until they reach the end of the list. Following are the definitions of some common operations.

def map[A, B](fn: A => B, ls: List[A]): List[B] = {
        ls match {
                case Nil => Nil
                case h :: t => fn(h) :: map(fn, t)
        }
}

def filter[A](fn: A => Boolean, ls: List[A]): List[A] = {
        ls match {
                case Nil => Nil
                case h :: t => if (fn(h)) h :: filter(fn, t) else filter(fn, t)
        }
}

def length(ls: List[_]): Int = {
        ls match {
                case Nil => 0
                case _ :: t => 1 + length(t)
        }
}

If we observe these operations closely, they use the same common pattern.

  • Recursion scheme: each operation recurses over the list, calling itself until it reaches the end of the list.

Following are the variable parts in the above operations.

  1. Computation and aggregation: in the case of a non-empty list, each operation applies some function (the fn argument in map and filter) to the head of the list; in the case of length we can assume a function which ignores its input and returns the constant 1. After computing, we do some kind of aggregation, which helps in assembling the final result: in the case of length it is addition (+), and in the case of filter and map it is the cons (::) operator of List.

  2. The value returned for the empty list (the base case).

If we abstract out the above two variable parts as arguments, we can generalise further and obtain a more abstract function. It will have a signature like the one below.

def operateAndAggregateOnList[A, B](compAgg: (A, B) => B, zero: B, ls: List[A]): B

But we do not need that, as the List class already has a similar operation called foldr. Following is the definition of foldr.

def foldr[A, B](fn: (A, B) => B, zero: B, ls: List[A]): B = {
        ls match {
                case Nil => zero
                case h :: t => fn(h, foldr(fn, zero, t))
        }
}

As you can see, the foldr definition looks very similar to the operations map, filter and length, except that we have abstracted out the base-case return value and the compute-and-aggregate function. We can now define map, length and filter in terms of foldr.

def map[A, B](fn: A => B, ls: List[A]): List[B] = foldr[A, List[B]]((a, b) => fn(a) :: b, Nil, ls)

def filter[A](fn: A => Boolean, ls: List[A]): List[A] = foldr[A, List[A]]((a, b) => if (fn(a)) a :: b else b, Nil, ls)

def length[A](ls: List[A]): Int = foldr[A, Int]((_, b) => 1 + b, 0, ls)
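
A quick sanity check of these foldr-based definitions (hypothetical sample values, e.g. pasted into the Scala REPL after the definitions above):

val xs = List(1, 2, 3, 4)

map((x: Int) => x * 2, xs)          // List(2, 4, 6, 8)
filter((x: Int) => x % 2 == 0, xs)  // List(2, 4)
length(xs)                          // 4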

concat in terms of foldr

Last, I want to show you one more function, concat, which concatenates two lists. The general scheme is the same in this case as well.

def concat[A](firstLs: List[A], secLs: List[A]): List[A] = {
        firstLs match {
                case Nil => secLs
                case h :: t => h :: concat(t, secLs)
        }
}

Can we define concat in terms of foldr? Of course we can. In foldr argument terminology, the zero argument is the second list secLs, and fn is (A, List[A]) => List[A].

def concat[A](firstLs: List[A], secLs: List[A]): List[A] = foldr[A, List[A]]((a, b) => a :: b, secLs, firstLs)

conclusion

foldr gives an abstraction over the computation-and-aggregation scheme on a list. As we have seen, it is a generalized function and can be used to define other list functions which share the common list traversal pattern.

