This is not really an answer to your question, essentially because there isn't (currently) a question in your post, but it is too long for a comment.

Your statement that

> A co-ordinate transformation is linear map from a vector to itself with a change of basis.

is muddled and ultimately incorrect. Take some real vector space $V$ of dimension $n$ and two bases $\beta$ and $\gamma$ for $V$. Each of these bases can be used to establish a representation map $r_\beta:\mathbb R^n\to V$, given by
$$r_\beta(v)=\sum_{j=1}^nv_j e_j$$
if $v=(v_1,\ldots,v_n)$ and $\beta=\{e_1,\ldots,e_n\}$. The coordinate transformation is **not** a linear map from $V$ to itself. Instead, it is the map
$$r_\gamma^{-1}\circ r_\beta:\mathbb R^n\to\mathbb R^n,\tag 1$$
and takes coordinates to coordinates.
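To make (1) concrete, here is a minimal sketch in Python. It assumes $V=\mathbb R^2$ and two bases I have made up for illustration; the helpers `matmul`, `inv2` and `apply` are hypothetical names, not from any library. With the basis vectors stored as the columns of matrices `B` and `G`, the map $r_\beta$ is multiplication by `B`, and the coordinate transformation is the matrix $G^{-1}B$.

```python
from fractions import Fraction

def matmul(A, B):
    """Product of two matrices given as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def inv2(M):
    """Exact inverse of a 2x2 matrix."""
    (a, b), (c, d) = M
    det = Fraction(a * d - b * c)
    return [[d / det, -b / det], [-c / det, a / det]]

def apply(M, v):
    """Matrix-vector product."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

# Columns are the basis vectors, written in the standard basis of R^2.
B = [[1, 1],
     [0, 1]]   # beta  = {(1, 0), (1, 1)}  (illustrative choice)
G = [[2, 0],
     [0, 1]]   # gamma = {(2, 0), (0, 1)}  (illustrative choice)

# Matrix of r_gamma^{-1} o r_beta : R^n -> R^n
change = matmul(inv2(G), B)

coords_beta = [3, 2]
vector = apply(B, coords_beta)             # r_beta(coords_beta), an element of V
coords_gamma = apply(change, coords_beta)  # coordinates of the same vector in gamma

# Both coordinate tuples represent one and the same element of V.
assert apply(G, coords_gamma) == vector
```

Nothing here moves the vector itself: `vector` is the same element of $V$ throughout, and only the tuple of numbers describing it changes.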

Now, to go to the heart of your confusion, it should be stressed that **covectors are not members of $V$**; as such, the representation maps do not apply to them directly in any way. Instead, they belong to the *dual space* $V^\ast$, which I'm hoping you're familiar with. (In general, I would strongly discourage you from reading texts that pretend to lay down the law on the distinction between vectors and covectors without talking at length about the dual space.)

The dual space is the vector space of all linear functionals from $V$ into its scalar field:
$$V^*=\{\varphi:V\to\mathbb R:\varphi\text{ is linear}\}.$$
This has the same dimension as $V$, and any basis $\beta$ has a unique dual basis $\beta^*=\{\varphi_1,\ldots,\varphi_n\}$ characterized by $\varphi_i(e_j)=\delta_{ij}$. Since it is a different basis to $\beta$, it is not surprising that the corresponding representation map is different.
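As a small worked example (the basis is my own choice, purely for illustration), take $V=\mathbb R^2$ with $\beta=\{e_1,e_2\}$, where $e_1=(1,0)$ and $e_2=(1,1)$. The dual basis is then
$$\varphi_1(x,y)=x-y,\qquad \varphi_2(x,y)=y,$$
as one can check directly:
$$\varphi_1(e_1)=1,\quad \varphi_1(e_2)=1-1=0,\quad \varphi_2(e_1)=0,\quad \varphi_2(e_2)=1.$$
Note that $\varphi_1$ is not the functional $(x,y)\mapsto x$ that is dual to $e_1$ in the standard basis, even though $e_1$ is unchanged: each element of the dual basis depends on the whole of $\beta$.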

To lift the representation map to the dual vector space, one needs the notion of the adjoint of a linear map. As it happens, there is in general no way to lift a linear map $L:V\to W$ to a map from $V^*$ to $W^*$; instead, one needs to reverse the arrow. Given such a map, a functional $f\in W^*$ and a vector $v\in V$, there is only one combination which makes sense, which is $f(L(v))$. The mapping $$v\mapsto f(L(v))$$ is a linear mapping from $V$ into $\mathbb R$, and it's therefore in $V^*$. It is denoted by $L^*(f)$, and defines the action of the adjoint $$L^*:W^*\to V^*.$$
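The reversal of the arrow is easy to see in a short sketch. The following Python snippet assumes $V=\mathbb R^2$, $W=\mathbb R^3$, and a matrix `M` and functional `c` that I picked arbitrarily; `adjoint` and `apply` are hypothetical helper names. It builds $L^*(f)$ literally as the composition $v\mapsto f(L(v))$ and checks that, in components, this amounts to multiplying by the transpose of `M`.

```python
def apply(M, v):
    """Matrix-vector product with M given as a list of rows."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def adjoint(L):
    """Return L*, which sends a functional f on W to the functional f o L on V."""
    return lambda f: (lambda v: f(L(v)))

M = [[1, 2],
     [0, 1],
     [3, 0]]       # illustrative matrix of L : R^2 -> R^3
c = [1, -1, 2]     # illustrative components of f in (R^3)*

L = lambda v: apply(M, v)
f = lambda w: sum(ci * wi for ci, wi in zip(c, w))

pulled_back = adjoint(L)(f)   # L*(f), a functional on R^2

# Components of L*(f) are its values on the standard basis of R^2.
components = [pulled_back([1, 0]), pulled_back([0, 1])]

# In components, the adjoint acts through the transposed matrix.
M_T = [list(col) for col in zip(*M)]
assert components == apply(M_T, c)
```

The function `adjoint` consumes a functional on $W$ and returns one on $V$, which is exactly the direction $L^*:W^*\to V^*$.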

If you apply this to the representation maps on $V$, you get the adjoints $r_\beta^*:V^*\to\mathbb R^{n,*}$, where the latter is canonically equivalent to $\mathbb R^n$ because it has a canonical basis. The inverse of this map, $(r_\beta^*)^{-1}$, is the representation map $r_{\beta^*}:\mathbb R^n\cong\mathbb R^{n,*}\to V^*$. This is the origin of the 'inverse transpose' rule for transforming covectors.

To get the transformation rule for covectors between two bases, you need to string two of these together:
$$
\left((r_\gamma^*)^{-1}\right)^{-1}\circ(r_\beta^*)^{-1}=r_\gamma^*\circ (r_\beta^*)^{-1}:\mathbb R^n\to \mathbb R^n,
$$
which is very different to the one for vectors, (1).
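A numerical sketch may help here. It assumes $V=\mathbb R^2$ with two bases I chose for illustration, stored as the columns of `B` and `G`; all helper names are hypothetical. The vector rule is the matrix $A=G^{-1}B$, the covector rule is its inverse transpose $G^{T}(B^{T})^{-1}$, and the pairing of components comes out the same in both bases.

```python
from fractions import Fraction

def matmul(A, B):
    """Product of two matrices given as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def inv2(M):
    """Exact inverse of a 2x2 matrix."""
    (a, b), (c, d) = M
    det = Fraction(a * d - b * c)
    return [[d / det, -b / det], [-c / det, a / det]]

def transpose(M):
    return [list(col) for col in zip(*M)]

def apply(M, v):
    """Matrix-vector product."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def pair(f, v):
    """Action of a covector on a vector, both given by components."""
    return sum(fi * vi for fi, vi in zip(f, v))

B = [[1, 1], [0, 1]]   # beta  = {(1, 0), (1, 1)}  (illustrative)
G = [[2, 0], [0, 1]]   # gamma = {(2, 0), (0, 1)}  (illustrative)

A = matmul(inv2(G), B)        # vectors:   r_gamma^{-1} o r_beta
A_cov = transpose(inv2(A))    # covectors: inverse transpose of A

# Same map, written as r_gamma^* o (r_beta^*)^{-1}
assert A_cov == matmul(transpose(G), inv2(transpose(B)))

v_beta = [3, 2]   # components of a vector in beta
f_beta = [1, 4]   # components of a covector in the dual basis of beta

v_gamma = apply(A, v_beta)
f_gamma = apply(A_cov, f_beta)

# The scalar f(v) does not care which basis was used to compute it.
assert pair(f_beta, v_beta) == pair(f_gamma, v_gamma)
```

If both sets of components were transformed with the same matrix `A`, the final assertion would fail for these bases, which is one practical way to see that the two rules really are different.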

Still think that vectors and covectors are the same thing?

Addendum

Let me, finally, address another misconception in your question:

> An inner product is between elements of the same vector space and not between two vector spaces, it is not how it is defined.

Inner products are indeed defined by taking both inputs from the same vector space. Nevertheless, it is still perfectly possible to define a bilinear form $\langle \cdot,\cdot\rangle:V^*\times V\to\mathbb R$ which takes one covector and one vector to give a scalar; it is simply the action of the former on the latter:
$$\langle\varphi,v\rangle=\varphi(v).$$
This bilinear form is always guaranteed and presupposes strictly *less* structure than an inner product. This is the 'inner product' which reads $\varphi_j v^j$ in Einstein notation.

Of course, this does relate to the inner product structure $ \langle \cdot,\cdot\rangle_\text{I.P.}$ on $V$ when there is one. Having such a structure enables one to identify vectors and covectors in a canonical way: given a vector $v$ in $V$, its corresponding covector is the linear functional
$$
\begin{align}
i(v)=\langle v,\cdot\rangle_\text{I.P.} : V&\longrightarrow\mathbb R \\
w&\mapsto \langle v,w\rangle_\text{I.P.}.
\end{align}
$$
By construction, both bilinear forms are canonically related, so that the 'inner product' $\langle\cdot,\cdot\rangle$ between $i(v)\in V^*$ and $w\in V$ is exactly the same as the inner product $\langle\cdot,\cdot\rangle_\text{I.P.}$ between $v\in V$ and $w\in V$. That use of language is perfectly justified.
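In components this identification is the familiar 'lowering of an index'. Assuming a basis $\beta=\{e_1,\ldots,e_n\}$ with dual basis $\{\varphi_1,\ldots,\varphi_n\}$, and writing $g_{jk}=\langle e_j,e_k\rangle_\text{I.P.}$ (notation I am introducing only for this example), a vector $v=\sum_j v_je_j$ is sent to
$$i(v)=\sum_{k}\Big(\sum_j v_j\,g_{jk}\Big)\varphi_k,$$
because $i(v)(e_k)=\langle v,e_k\rangle_\text{I.P.}=\sum_j v_j\,g_{jk}$. For instance, in Euclidean $\mathbb R^2$ with $e_1=(1,0)$ and $e_2=(1,1)$ one has
$$g=\begin{pmatrix}1&1\\1&2\end{pmatrix},$$
so the vector with coordinates $(3,2)$ is sent to the covector with components $(3+2,\,3+4)=(5,7)$. The components of $v$ and of $i(v)$ only coincide when the basis is orthonormal.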

Addendum 2, on your question about the gradient.

I should really try and convince you at this point that the transformation laws are in fact enough to show something is a covector. (The way the argument goes is that one can define a linear functional on $V$ via the form in $\mathbb R^{n,*}$ given by the components, and the transformation laws ensure that this form in $V^*$ is independent of the basis; alternatively, given the components $f_\beta,f_\gamma\in\mathbb R^n$ with respect to two bases, the representation maps give the forms $r_{\beta^*}(f_\beta)$ and $r_{\gamma^*}(f_\gamma)$ in $V^*$, and the two are equal because of the transformation laws.)

However, there is indeed a deeper reason for the fact that the gradient is a covector. Essentially, it is to do with the fact that the equation
$$df=\nabla f\cdot dx$$
does not actually need a dot product; instead, it relies on the simpler structure of the dual-primal bilinear form $\langle \cdot,\cdot\rangle$.

To make this precise, consider an arbitrary function $T:\mathbb R^n\to\mathbb R^m$. The derivative of $T$ at $x_0$ is defined to be the (unique) linear map $dT_{x_0}:\mathbb R^n\to\mathbb R^m$ such that
$$
T(x)=T(x_0)+dT_{x_0}(x-x_0)+o(|x-x_0|),
$$
if it exists. For a scalar function $f:\mathbb R^n\to\mathbb R$ (the case $m=1$), the gradient is exactly this map, $df_{x_0}$; it was *born* as a linear functional, whose coordinates over *any* basis are $\frac{\partial f}{\partial x_j}$ to ensure that the multi-dimensional chain rule,
$$
df=\sum_j \frac{\partial f}{\partial x_j}d x_j,
$$
is satisfied. To make things easier for undergraduates who are fresh out of 1D calculus, this linear map is most often 'dressed up' as the corresponding vector, which is uniquely obtainable through the Euclidean structure, and whose action must therefore go back through that Euclidean structure to get to the original $df$.
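This can be checked by hand or with a few lines of Python. The sketch below assumes a quadratic test function `f` and a linear change of coordinates $x=Au$, both chosen by me for illustration; the helper names are hypothetical. The partial derivatives are computed by central differences, which are exact for quadratics when done in rational arithmetic. The components $\partial f/\partial u_j$ come out as $A^{T}$ applied to $\partial f/\partial x_j$, whereas displacements obey $dx=A\,du$, so that $df$ is the same number in both coordinate systems without any dot product being involved.

```python
from fractions import Fraction

def apply(M, v):
    """Matrix-vector product."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def transpose(M):
    return [list(col) for col in zip(*M)]

def pair(f, v):
    """Action of a covector on a vector, both given by components."""
    return sum(fi * vi for fi, vi in zip(f, v))

def partials(func, point):
    """Partial derivatives by central differences (exact for quadratics)."""
    h = Fraction(1, 2)
    result = []
    for j in range(len(point)):
        up, down = list(point), list(point)
        up[j] += h
        down[j] -= h
        result.append((func(up) - func(down)) / (2 * h))
    return result

def f(x):
    """Illustrative scalar function on R^2."""
    return x[0] ** 2 + 3 * x[0] * x[1]

A = [[1, 1],
     [0, 1]]   # illustrative linear change of coordinates, x = A u

def f_in_u(u):
    """The same function expressed in the u coordinates."""
    return f(apply(A, u))

u0 = [Fraction(2), Fraction(1)]
x0 = apply(A, u0)

grad_x = partials(f, x0)        # components of df in the x coordinates
grad_u = partials(f_in_u, u0)   # components of df in the u coordinates

# Covector rule (chain rule): the components transform with A^T ...
assert grad_u == apply(transpose(A), grad_x)

# ... while displacements transform as dx = A du, so df(dx) is invariant.
du = [1, -1]
dx = apply(A, du)
assert pair(grad_x, dx) == pair(grad_u, du)
```

Had the partial derivatives been the components of a vector, they would transform like `du` (with $A^{-1}$), and the last assertion would not hold for this `A`.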

This post imported from StackExchange Physics at 2014-03-28 17:09 (UCT), posted by SE-user Emilio Pisanty