Posts

Box plot is super easy

 

Levenshtein distance

In information theory, linguistics and computer science, the Levenshtein distance is a string metric for measuring the difference between two sequences. Informally, the Levenshtein distance between two words is the minimum number of single-character edits (insertions, deletions or substitutions) required to change one word into the other. It is named after the Soviet mathematician Vladimir Levenshtein, who considered this distance in 1965. Levenshtein distance may also be referred to as edit distance, although that term may also denote a larger family of distance metrics known collectively as edit distance. It is closely related to pairwise string alignments.
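The definition above can be sketched with the classic dynamic-programming approach. This is a minimal illustration, not a tuned implementation; the function name `levenshtein` is chosen here for the example, and every insertion, deletion, and substitution is assumed to cost 1.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits turning a into b."""
    # Keep the shorter string as b so each row stays as small as possible.
    if len(a) < len(b):
        a, b = b, a

    # previous[j] = distance between the empty prefix of a and b[:j].
    previous = list(range(len(b) + 1))

    for i, char_a in enumerate(a, start=1):
        # Turning a[:i] into the empty string takes i deletions.
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,                      # delete char_a
                    current[j - 1] + 1,                   # insert char_b
                    previous[j - 1] + substitution_cost,  # substitute or match
                )
            )
        previous = current

    return previous[-1]


print(levenshtein("kitten", "sitting"))  # 3
```

Only two rows of the table are kept at a time, so memory grows with the length of the shorter string rather than with the product of both lengths.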

Differences between Hadoop and Spark?

The key difference between Hadoop MapReduce and Spark lies in the approach to processing: Spark can process data in memory, while Hadoop MapReduce has to read from and write to disk. As a result, processing speed differs significantly, and Spark may be up to 100 times faster.

scikit-learn random state in splitting dataset

random_state, as the name suggests, is used to initialize the internal random number generator, which decides how the data is split into train and test indices. Fixing it makes it possible to check and validate results when the code is run multiple times. Setting random_state to a fixed value guarantees that the same sequence of random numbers is generated each time you run the code. Unless some other source of randomness is present in the process, the results will be the same on every run, which helps in verifying the output.
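The idea can be illustrated with the standard library alone. The sketch below is not scikit-learn's implementation; `split_indices`, `test_fraction`, and `seed` are hypothetical names chosen for this example, with `seed` playing the role that `random_state` plays in scikit-learn's `train_test_split`.

```python
import random


def split_indices(n_samples, test_fraction=0.25, seed=None):
    """Shuffle the indices 0..n_samples-1 and split them into train and test."""
    # A dedicated generator seeded with a fixed value always yields
    # the same shuffle; seed=None gives a different shuffle each run.
    rng = random.Random(seed)
    indices = list(range(n_samples))
    rng.shuffle(indices)

    n_test = int(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]


# Same seed, same split, no matter how many times the code runs.
first_split = split_indices(10, seed=42)
second_split = split_indices(10, seed=42)
print(first_split == second_split)  # True
```

Calling `train_test_split` with the same integer `random_state` behaves analogously: repeated runs produce the same train and test sets.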

What is the difference between "inplace=True" and "inplace=False"?

Both inplace=True and inplace=False perform an operation on the data, but they differ in what happens to the original and in what is returned. When inplace=True is used, the operation modifies the data directly and nothing is returned (the call returns None): df.some_operation(inplace=True). When inplace=False is used, the original data is left unchanged and a modified copy is returned, so the result has to be assigned: df = df.some_operation(inplace=False).
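A short pandas sketch makes the difference concrete. It assumes pandas is installed and uses `sort_values` as a stand-in for "some operation"; the column name `score` and the sample values are made up for the example.

```python
import pandas as pd

df = pd.DataFrame({"score": [3, 1, 2]})

# inplace=False (the default): df is untouched, a sorted copy is returned.
sorted_copy = df.sort_values("score", inplace=False)
original_order = df["score"].tolist()   # df still holds the original order

# inplace=True: df itself is modified and the call returns None.
result = df.sort_values("score", inplace=True)

print(original_order)
print(sorted_copy["score"].tolist())
print(result)
print(df["score"].tolist())
```

A common mistake follows from this: writing `df = df.sort_values("score", inplace=True)` replaces `df` with `None`, because the in-place call has nothing to return.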

Why is data important?

Data are characteristics or information, usually numerical, that are collected through observation. In a more technical sense, data are a set of values of qualitative or quantitative variables about one or more persons or objects, while a datum (singular of data) is a single value of a single variable. Although the terms "data" and "information" are often used interchangeably, these terms have distinct meanings. In some popular publications, data are sometimes said to be transformed into information when they are viewed in context or in post-analysis. In academic treatments of the subject, however, data are simply units of information. Data are employed in scientific research, business management (e.g., sales data, revenue, profits, stock price), finance, governance (e.g., crime rates, unemployment rates, literacy rates), and in virtually every other form of human organizational activity (e.g., censuses of the number of homeless people by non-profit organizations...

Machine Learning

Machine learning (ML) is the study of computer algorithms that improve automatically through experience. It is seen as a subset of artificial intelligence. Machine learning algorithms build a mathematical model based on sample data, known as "training data", in order to make predictions or decisions without being explicitly programmed to do so. Machine learning algorithms are used in a wide variety of applications, such as email filtering and computer vision, where it is difficult or infeasible to develop conventional algorithms to perform the needed tasks. Machine learning is closely related to computational statistics, which focuses on making predictions using computers. The study of mathematical optimization delivers methods, theory and application domains to the field of machine learning. Data mining is a related field of study, focusing on exploratory data analysis through unsupervised learning. In its application across business problems, machine learning is also refer...